Align audio buffers for the benefit of SIMD #316

Enyium · 2023-01-25T23:30:01Z

I understand that anyone may allocate an audio buffer and pass it to a filter's GetAudio() function. But does AviSynth+ allocate audio buffers itself, or does the player (or whatever program makes the initial request) allocate them?

If it's AviSynth+, could it start to align audio buffers to the maximum bit width of SIMD operations currently available (i.e., 512 bits or 64 bytes) and also make the buffer size divisible by that amount? This would minimize "rest operations" in algorithms where SIMD can be used for audio.

It should then be documented that AviSynth+ does this, but that buffers from requesters other than AviSynth+ may be neither SIMD-aligned nor sized to a corresponding multiple, and that a filter can't know who made the request.
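To make the request concrete, here is a minimal sketch of what such an allocation could look like in C++17. All names (`kSimdAlign`, `RoundUp`, `AllocAudioBuffer`, `IsSimdAligned`) are hypothetical and not part of the AviSynth+ API, and float samples are assumed only for illustration. The runtime check covers the case where a filter cannot know who allocated the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Assumption: 64 bytes (512 bits) is the widest SIMD register to cater for.
constexpr std::size_t kSimdAlign = 64;

// Round n up to the next multiple of `multiple`.
constexpr std::size_t RoundUp(std::size_t n, std::size_t multiple) {
    return (n + multiple - 1) / multiple * multiple;
}

// Deleter matching the aligned form of operator new used below.
struct AlignedDelete {
    void operator()(float* p) const noexcept {
        ::operator delete(static_cast<void*>(p), std::align_val_t{kSimdAlign});
    }
};
using AlignedAudioBuffer = std::unique_ptr<float[], AlignedDelete>;

// Hypothetical allocator: the start address is 64-byte aligned and the
// byte size is padded up to a multiple of 64, so a SIMD loop needs no tail.
inline AlignedAudioBuffer AllocAudioBuffer(std::size_t sampleCount, std::size_t channels) {
    const std::size_t bytes = RoundUp(sampleCount * channels * sizeof(float), kSimdAlign);
    void* raw = ::operator new(bytes, std::align_val_t{kSimdAlign});
    return AlignedAudioBuffer(static_cast<float*>(raw));
}

// A filter that receives a buffer from an unknown requester can test it
// at runtime and fall back to an unaligned code path if the check fails.
inline bool IsSimdAligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % kSimdAlign == 0;
}
```

A filter's GetAudio() could call `IsSimdAligned(buf)` once per request and pick the aligned or the unaligned loop accordingly.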


DTL2020 · 2023-02-19T16:42:06Z

The same would be good for frame buffers: start the first scanline at an address that meets the alignment required by the current SIMD machine, and pad the end of each scanline so that every scanline starts at an aligned address. Scanline SIMD processing then needs no additional prologue at the start of each scanline, even with the largest data word size available on the current SIMD machine.
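As a rough illustration of the scanline padding, the pitch (bytes from the start of one scanline to the start of the next) can be rounded up to the alignment. This is only a sketch: `kFrameAlign` and `AlignedPitch` are hypothetical names, and 64 bytes is an assumed alignment, not a statement about what AviSynth+ uses.

```cpp
#include <cstddef>

// Assumption: 64-byte alignment, matching a 512-bit SIMD register.
constexpr std::size_t kFrameAlign = 64;

// Pitch in bytes for one scanline, padded so that every scanline starts
// on an aligned address, provided the first scanline is aligned too.
constexpr std::size_t AlignedPitch(std::size_t width, std::size_t bytesPerSample) {
    const std::size_t rowBytes = width * bytesPerSample;
    // kFrameAlign is a power of two, so the mask rounds up to a multiple of it.
    return (rowBytes + kFrameAlign - 1) & ~(kFrameAlign - 1);
}
```

For example, a 720-pixel-wide 8-bit plane would get a pitch of 768 bytes, leaving 48 bytes of padding per scanline.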

So software for beginners may be simpler (some beginners requested an HBD and optimization example in the thread https://forum.doom9.org/showthread.php?t=184726).

As I read in Stack Overflow comments, unaligned loads (cache line splits) still carry a significant penalty on AVX-512 systems.

There is also a second, less obvious property of set-associative caches with a very low number of ways (about 4 to 8 for L1/L2 and sometimes about 12 to 16 for L3). When processing 'blocks' with strided loads/stores across several scanlines, it is possible to hit a 'cache set overload' condition: once all ways of a set are in use, the cache effectively stops working for those addresses, and the software pays a large penalty for constantly reloading from very slow host RAM. So when allocating frame buffers and choosing the end-of-scanline padding, it may be good to check whether the current processing use case, with its 'multi-stream' memory access pattern, is safe enough against hitting the cache-set overload condition. The classic case of this issue is mvtools->MDegrainN with (tr/2) > ways in the cache and frame buffers and MV buffers allocated page-aligned (4 KB pages, or 2/4 MB large pages).

The solution is to add a shift of one cache line size, at least random or better ordered (if possible), to the start addresses of buffers that are accessed 'in parallel'. With multi-column access across scanlines, this may also be possible for large enough frame sizes such as UHD. The addresses that map to a single cache set are fixed for the CPU the code runs on and can be calculated from the cache size, the cache line size and the number of ways. So in general, the data mapping of current CPU caches is designed in such a way that only linear reading of addresses can fill the whole cache (all sets and all ways).
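The set mapping mentioned above can be sketched with a simplified model, assuming plain modulo indexing on the address. Real CPUs may index on physical addresses or use hashed indexing, especially for L3. `CacheGeometry`, `NumSets`, `SetIndex` and `StaggerOffset` are hypothetical names, and the geometry in the usage note is only an example.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified cache description; values must come from the target CPU.
struct CacheGeometry {
    std::size_t sizeBytes;  // total cache size
    std::size_t lineBytes;  // cache line size
    std::size_t ways;       // associativity
};

// Number of sets = size / (line size * ways).
constexpr std::size_t NumSets(const CacheGeometry& c) {
    return c.sizeBytes / (c.lineBytes * c.ways);
}

// Set that an address maps to under plain modulo indexing (an assumption).
constexpr std::size_t SetIndex(const CacheGeometry& c, std::uintptr_t addr) {
    return (addr / c.lineBytes) % NumSets(c);
}

// Ordered shift: buffer k starts k cache lines later, so buffers that would
// otherwise all start in the same set are spread over consecutive sets.
constexpr std::size_t StaggerOffset(const CacheGeometry& c, std::size_t bufferIndex) {
    return (bufferIndex % NumSets(c)) * c.lineBytes;
}
```

With an example geometry of 32 KiB, 64-byte lines and 8 ways, there are 64 sets, so all 4 KB page-aligned start addresses fall into set 0. Adding `StaggerOffset(c, k)` to the start of buffer k moves it to set k under this model.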
