Align audio buffers for the benefit of SIMD #316

Open
Enyium opened this issue Jan 25, 2023 · 1 comment

Comments

Enyium commented Jan 25, 2023

I understand that anyone may allocate an audio buffer and pass it to a filter's GetAudio() function. But does AviSynth+ itself ever allocate audio buffers, or is it always the program making the initial request (the player, for instance) that allocates them?

If it's AviSynth+, could it start aligning audio buffers to the widest SIMD register width currently available (i.e., 512 bits or 64 bytes) and also make the buffer size a multiple of that amount? This would minimize the remainder handling ("rest operations") in algorithms where SIMD can be used for audio.
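
To illustrate the request, here is a minimal sketch of such an allocation, assuming C++17 aligned `operator new`. The names `kSimdBytes`, `round_up_to_simd`, `AudioBuffer`, and `alloc_audio_buffer` are hypothetical and not part of the AviSynth+ API; whether AviSynth+ allocates audio buffers at all is exactly the open question above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumption: 64 bytes = 512 bits, the widest SIMD register width mentioned above.
constexpr std::size_t kSimdBytes = 64;

// Round a byte count up to the next multiple of kSimdBytes.
constexpr std::size_t round_up_to_simd(std::size_t n) {
    return (n + kSimdBytes - 1) / kSimdBytes * kSimdBytes;
}

// Deleter matching the aligned operator new used below.
struct AlignedDelete {
    void operator()(std::uint8_t* p) const noexcept {
        ::operator delete(p, std::align_val_t{kSimdBytes});
    }
};

// Hypothetical buffer type: owning pointer plus the padded size in bytes.
struct AudioBuffer {
    std::unique_ptr<std::uint8_t[], AlignedDelete> data;
    std::size_t size_bytes;
};

// Allocate a buffer whose start address and size are both multiples of 64 bytes.
inline AudioBuffer alloc_audio_buffer(std::size_t sample_count,
                                      std::size_t bytes_per_sample) {
    assert(bytes_per_sample > 0);
    std::size_t padded = round_up_to_simd(sample_count * bytes_per_sample);
    if (padded == 0) padded = kSimdBytes;  // keep even an empty request usable
    void* raw = ::operator new(padded, std::align_val_t{kSimdBytes});
    return AudioBuffer{
        std::unique_ptr<std::uint8_t[], AlignedDelete>(
            static_cast<std::uint8_t*>(raw)),
        padded};
}
```

With a buffer like this, a SIMD loop could process `size_bytes / 64` full vectors and never needs a scalar tail, at the cost of up to 63 padding bytes per buffer.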

It should then be documented that AviSynth+ does this, but that buffers from requesters other than AviSynth+ may be neither SIMD-aligned nor a multiple of the SIMD width in size, and that a filter can't know who is making the request.
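
Because a filter could not rely on the alignment, it would still have to check at runtime. The following is a rough sketch of such a check, assuming a 64-byte SIMD width; `SimdSplit` and `split_for_simd` are hypothetical names for illustration. It divides an arbitrary buffer into an unaligned head, an aligned body whose size is a multiple of 64 bytes, and a tail.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte (512-bit) SIMD width, as in the proposal above.
constexpr std::size_t kSimdWidth = 64;

// Byte counts of the three regions of a buffer.
struct SimdSplit {
    std::size_t head;  // bytes before the first 64-byte-aligned address
    std::size_t body;  // aligned bytes, a multiple of 64
    std::size_t tail;  // leftover bytes after the last full vector
};

// Split a buffer of unknown origin into head/body/tail for SIMD processing.
inline SimdSplit split_for_simd(const void* buf, std::size_t bytes) {
    const auto addr = reinterpret_cast<std::uintptr_t>(buf);
    std::size_t head = (kSimdWidth - addr % kSimdWidth) % kSimdWidth;
    if (head > bytes) head = bytes;  // buffer ends before the first aligned address
    const std::size_t rest = bytes - head;
    const std::size_t body = rest / kSimdWidth * kSimdWidth;
    const std::size_t tail = rest - body;
    assert(head + body + tail == bytes);
    return SimdSplit{head, body, tail};
}
```

For a buffer that is already aligned and sized as proposed, `head` and `tail` both come out as 0, so the scalar fallback paths are simply never taken.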

DTL2020 commented Feb 19, 2023

The same would be good for frame buffers: start the first scanline at an address that satisfies the alignment required by the widest SIMD unit of the current machine, and pad the end of each scanline so that every scanline starts at an aligned address. Scanline SIMD processing then needs no additional prologue at the start of each scanline and can use the largest data word size available on the current SIMD hardware.
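
The padding rule can be sketched as follows, assuming a 64-byte alignment target. `kRowAlign`, `padded_pitch`, and `row_address` are hypothetical helpers for illustration and not AviSynth+ functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte alignment, matching 512-bit SIMD loads and stores.
constexpr std::size_t kRowAlign = 64;

// Distance in bytes between the starts of two consecutive scanlines:
// the visible row size rounded up to the next multiple of kRowAlign.
constexpr std::size_t padded_pitch(std::size_t row_bytes) {
    return (row_bytes + kRowAlign - 1) / kRowAlign * kRowAlign;
}

// Address of scanline y. If the base address and the pitch are both multiples
// of kRowAlign, every scanline start is aligned as well.
inline std::uintptr_t row_address(std::uintptr_t base, std::size_t pitch,
                                  std::size_t y) {
    assert(base % kRowAlign == 0 && pitch % kRowAlign == 0);
    return base + y * pitch;
}
```

For example, a row of 1918 bytes would get a pitch of 1920 bytes, so each scanline carries 2 bytes of padding and starts on a 64-byte boundary.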

Software written by beginners could then be simpler (some beginners asked for an HBD and optimization example in the https://forum.doom9.org/showthread.php?t=184726 thread).

From what I have read in Stack Overflow answers and comments, an unaligned load that splits a cache line still carries a significant penalty on AVX-512 systems.

There is also a second, less obvious effect of set-associative caches with a very low number of ways (about 4 to 8 for L1/L2 and sometimes about 12 to 16 for L3). When processing 'blocks' of several scanlines with strided loads/stores, it is possible to hit a 'cache set overload' condition: once all ways of a set are in use, the cache effectively stops working for that access pattern, and the software pays a large penalty for constantly reloading data from very slow host RAM. So when allocating frame buffers and choosing the end-of-scanline padding, it may be good to check whether the current processing use case, with its 'multi-stream' memory load pattern, is safe enough against hitting the cache-set overload condition.

The classic use case for this issue is mvtools->MDegrainN with (tr/2) > ways in the cache and page-aligned (4 KB, or 2/4 MB large pages) allocation of frame buffers and MV buffers. The solution is to add a shift of one cache line size, at least random or better ordered (if possible), to the start addresses of buffers that are accessed 'in parallel'. With multi-column access across scanlines, this may also be possible for large enough frame sizes such as UHD.

The addresses that map to a single cache set are fixed for the CPU the code is running on and can be calculated from the cache size, the cache line size, and the number of ways. So in general, the data mapping of current CPU caches is designed in such a way that only linear reading of addresses can fill the whole cache (all sets and all ways).
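
The set calculation mentioned above can be sketched like this. It is a simplified model that assumes the set index is taken directly from the address bits above the cache line offset; real CPUs may index by physical address or hash the index (commonly for L3), so treat the numbers as illustrative. `CacheGeometry`, `num_sets`, `set_index`, and `staggered_offset` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Simplified cache description (assumed model, not queried from the CPU).
struct CacheGeometry {
    std::size_t size_bytes;  // total cache size
    std::size_t line_bytes;  // cache line size
    std::size_t ways;        // associativity
};

// Number of sets = size / (line size * ways).
inline std::size_t num_sets(const CacheGeometry& g) {
    assert(g.line_bytes > 0 && g.ways > 0);
    return g.size_bytes / (g.line_bytes * g.ways);
}

// Set an address maps to in this simplified model.
inline std::size_t set_index(const CacheGeometry& g, std::uintptr_t addr) {
    return (addr / g.line_bytes) % num_sets(g);
}

// Ordered shift for the i-th buffer accessed in parallel: one cache line per
// buffer, wrapping around after all sets have been used once.
inline std::size_t staggered_offset(const CacheGeometry& g, std::size_t i) {
    return (i % num_sets(g)) * g.line_bytes;
}
```

Under this model, a 32 KB, 8-way cache with 64-byte lines has 64 sets, so addresses 4096 bytes apart land in the same set. Page-aligned buffers read at the same offset would therefore compete for only 8 ways, while shifting buffer `i` by `staggered_offset(g, i)` moves each stream to a different set.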
