inlining on x86_64 #3

JvanKatwijk · 2025-01-22T10:42:12Z

While getting to know the package I was running the "run_simple" example.
The scalar version works out of the box; however, the SSE version gives inlining errors such as the ones shown below.
From the description I could not figure out what I am doing wrong.

I run Linux (Fedora 41) on a laptop with an x86_64 AMD Ryzen processor.

The reason I was looking at the examples is that I want to replace the Viterbi handling in my Qt-DAB software (which now uses spiral code) with
an instance of your library, which would make the code much cleaner.

Any suggestion would be welcome.

jan

/usr/lib/gcc/x86_64-redhat-linux/14/include/tmmintrin.h:215:1: error: inlining failed in call to ‘always_inline’ ‘__m128i _mm_abs_epi16(__m128i)’: target specific option mismatch
215 | _mm_abs_epi16 (__m128i __X)
| ^~~~~~~~~~~~~
/home/jan/ViterbiDecoderCpp/include/viterbi/x86/viterbi_decoder_sse_u16.h:95:38: note: called from here
95 | error = _mm_abs_epi16(error);
| ~~~~~~~~~~~~~^~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/include/tmmintrin.h:215:1: error: inlining failed in call to ‘always_inline’ ‘__m128i _mm_abs_epi16(__m128i)’: target specific option mismatch
215 | _mm_abs_epi16 (__m128i __X)
| ^~~~~~~~~~~~~

williamyang98 · 2025-01-27T05:43:39Z

Reproduction

CMake configuration command: cmake -B build
CMake compile command: cmake --build build

Explanation

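It's probable that the architecture flags set for the C++ compiler by default weren't enough to enable _mm_abs_epi16, which is an SSSE3 instruction. Here's a Stack Overflow post with a similar issue: https://stackoverflow.com/questions/43128698/inlining-failed-in-call-to-always-inline-mm-mullo-epi32-target-specific-opti. By default gcc on x86_64 only enables instruction sets up to SSE2, which doesn't include those instructions; this is done to support as many processors as possible.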
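As a rough way to see which instruction sets the compiler has been allowed to use, GCC and Clang predefine macros such as __SSSE3__ whenever the matching -m flag (or a suitable -march) is active. The sketch below only illustrates that idea; SimdBuildFlags and simd_build_flags are hypothetical names and not part of ViterbiDecoderCpp.

```cpp
#include <cassert>

// Hypothetical helper type: which SIMD levels this translation unit was
// compiled for. This reflects compiler flags, not the CPU it later runs on.
struct SimdBuildFlags {
    bool sse2;
    bool ssse3;   // needed for _mm_abs_epi16 from tmmintrin.h
    bool sse4_1;
    bool avx2;
};

SimdBuildFlags simd_build_flags() {
    SimdBuildFlags f = {false, false, false, false};
#ifdef __SSE2__
    f.sse2 = true;
#endif
#ifdef __SSSE3__
    f.ssse3 = true;
#endif
#ifdef __SSE4_1__
    f.sse4_1 = true;
#endif
#ifdef __AVX2__
    f.avx2 = true;
#endif
    // With GCC and Clang, enabling SSE4.1 also enables the older SSSE3 level.
    assert(!f.sse4_1 || f.ssse3);
    return f;
}
```

If ssse3 comes back false for the file that includes the SSE decoder header, the "target specific option mismatch" error above is the expected result.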

Possible fix

1. Manually setting build flags

If you are configuring CMake with gcc, the command could look like this to enable all SIMD intrinsics your CPU supports:
cmake . -B build -DCMAKE_CXX_FLAGS="-march=native"

The -march=native option compiles with all features available on your CPU, which might include up to AVX512.
This might be a really bad idea if you distribute the binary and your CPU has AVX512, since the final binary will include AVX512 instructions that don't work on most processors more than a few years old (users will get illegal instruction errors if they run it on a less capable machine).

If you only want to compile for SSEx, then SSE4.1 seems like a good choice, since it also enables the SSSE3 instructions:
cmake . -B build -DCMAKE_CXX_FLAGS="-msse4.1"

This should support the vast majority of CPUs made after 2010.

2. Using CMakePresets.json

CMake also allows you to create compilation presets to store these inside a JSON file, which might be preferred if you plan on supporting multiple compilers. (Right now the CMakePresets.json in the /examples folder uses -march=native for local and CI testing).

ViterbiDecoderCpp/examples/CMakePresets.json

Line 35 in 1758604

"CMAKE_CXX_FLAGS_INIT": "-ffast-math -march=native",

You can configure cmake with a preset by running:
cmake . -B build --preset gcc, where gcc can be replaced by the other presets in the file.

Additional notes

Patch fix for run_simple

Just submitted a fix for a compilation error in the run_simple.cpp example if you tried to uncomment the other decoders. (36de065)

Use of decoder in your own project

Also I made my own DAB SDR decoder after seeing your qt-dab project a few years ago, it's what got me interested enough to pursue any of this so thank you very much 😄!!!.

Here's the Viterbi decoder being used in my project if you need a code snippet:
https://github.com/williamyang98/DAB-Radio/blob/fc672043e35105f85d47a4766269565dbafdbc12/src/dab/algorithms/dab_viterbi_decoder.cpp

It uses #defines from simd_flags.h to select between the different decoders at compile time.

Replacing compile time based decoder with runtime dispatch

If you want a single executable that has decoders for SSE2, AVX, AVX2 then you need something to dynamically switch between them at runtime to use the fastest supported decoder. This should be a better user experience since users get the fastest decoder and only need 1 download link. I personally have not been able to figure out how to do this, but here are some links if you want to try it:

GCC multi-versioning: Built-in feature to dispatch variants on a per-function basis. Might also be available in clang, which is supported on Windows?
google/cpu_features: Produces CPU flags which can be checked at runtime.
gnuradio/volk: Does runtime dispatch after code generation.

CMake targets are generated by Python scripts. arch.xml is parsed to get the flags for each kernel variant.
Each kernel variant is compiled as an object file with different SIMD flags (this way SSE2 kernels don't get compiled with AVX2 instructions, etc.).

CMake configure flags

Ninja build file where kernels have different compiler flags
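For the GCC multi-versioning option mentioned above, a minimal sketch could look like the following. It assumes GCC on x86_64 Linux with glibc (the target_clones attribute relies on ifunc support there); on any other toolchain the macro expands to nothing and only the plain version is built. MULTIVERSIONED and sum_abs_error are hypothetical names chosen for illustration, and the loop is a stand-in kernel, not code from the decoder.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: only enable cloning where GCC + glibc ifunc support is known.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && defined(__GLIBC__)
#define MULTIVERSIONED __attribute__((target_clones("avx2", "sse4.1", "default")))
#else
#define MULTIVERSIONED
#endif

// The compiler emits one clone of this function per listed target and picks
// the best one supported by the CPU when the program is loaded.
MULTIVERSIONED
uint32_t sum_abs_error(const int16_t* a, const int16_t* b, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr));
    uint32_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const int d = static_cast<int>(a[i]) - static_cast<int>(b[i]);
        total += static_cast<uint32_t>(d < 0 ? -d : d);
    }
    return total;
}
```

This only helps for code the compiler can auto-vectorise; hand-written intrinsic kernels still need a separate variant per instruction set.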

JvanKatwijk · 2025-01-27T19:23:14Z

Thanks for the detailed answer. In the meantime I was already experimenting with additional command line parameters for the compilation. Using -march=x86-64-v2 seems to work fine. I took "inspiration" from the use of your header-only library in your DAB-Radio and managed to get that running as well. With the parameter -march=x86-64-v2 it compiles and runs fine on my laptop with a Ryzen 5000 series processor and on a laptop with an Intel Core i5 processor, both with Linux and (cross-compiled) for both 32-bit and 64-bit Windows.

Since Qt-DAB is used mainly on Windows computers, it is important to be able to distribute something that runs on normal PCs and laptops. I am not an expert in CPUs, but it seems that SSE4 has been available in all Intel and AMD processors for at least a decade.

The reason I wanted to replace the spiral version is simple: that code is quite messy. The Qt-DAB project is something I started just to get some feeling for digital radio, and I still use it as a vehicle to experiment a little with programming styles and choice of algorithms, so I am always interested in "other" approaches and other algorithms.

I quite often profile the software, and of course the Viterbi decoder implementation is not the major cycle eater. When running a number of times with the same (file) input and the same service selected, for a period of 2 minutes, I found that the scalar version uses approx. 14% of the processor time, which is roughly 2.5 seconds. With the SSE option enabled, this reduces to approx. 2.5% and a CPU time of 0.34 seconds (again measured over a 2-minute period).

Since I am using your library, I've taken the liberty of adding a reference to the library page in the "about" text of Qt-DAB.

Anyway, thanks for the detailed answer; it inspires me to continue to improve the implementation.

Best
jan


JvanKatwijk · 2025-01-29T17:43:24Z

Triggered by your response, I looked into some gcc documentation and found that gcc provides quite a number of "built-in" functions to detect the CPU and its properties. So, I added a few lines of code

#ifdef __ARCH_X86__
__builtin_cpu_init ();
int has_avg2 = __builtin_cpu_supports ("avx2") != 0 ? AVX_SUPPORT : 0;
int has_sse4 = __builtin_cpu_supports ("sse4.1") != 0 ? SSE_SUPPORT : 0;
cpuSupport = has_avg2 + has_sse4;
#endif

and dispatched the "right" decoder element in the Viterbi decoder. It seems to work on my Linux development box and, when cross-compiled for Windows, I can see the selected support. (I read somewhere that avx2 is somewhat "stronger" than sse4, so the dispatch order is avx2 first, then sse, then scalar.)

best
jan
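The dispatch order described above (avx2 first, then sse, then scalar) can be sketched roughly as follows. This is an illustration and not the actual Qt-DAB code: the flag values, DecoderKind, detect_cpu_support and select_decoder are assumptions. Only __builtin_cpu_init and __builtin_cpu_supports are real GCC/Clang built-ins, and they are guarded so the sketch falls back to scalar on other compilers or architectures.

```cpp
#include <cassert>

// Hypothetical bit flags; the real AVX_SUPPORT / SSE_SUPPORT values may differ.
enum CpuSupport : unsigned {
    SCALAR_ONLY = 0,
    SSE_SUPPORT = 1u << 0,
    AVX_SUPPORT = 1u << 1
};

enum class DecoderKind { Scalar, Sse, Avx };

// Ask the running CPU what it supports (x86 with GCC or Clang only).
unsigned detect_cpu_support() {
    unsigned support = SCALAR_ONLY;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   support |= AVX_SUPPORT;
    if (__builtin_cpu_supports("sse4.1")) support |= SSE_SUPPORT;
#endif
    return support;
}

// Prefer the widest instruction set that is available.
DecoderKind select_decoder(unsigned support) {
    assert((support & ~(SSE_SUPPORT | AVX_SUPPORT)) == 0);
    if (support & AVX_SUPPORT) return DecoderKind::Avx;
    if (support & SSE_SUPPORT) return DecoderKind::Sse;
    return DecoderKind::Scalar;
}
```

A caller would run select_decoder(detect_cpu_support()) once at start-up and keep the result, rather than querying the CPU for every decoded frame.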


williamyang98 · 2025-01-30T03:45:07Z

I've had issues with the final executable containing AVX/AVX2 code in places other than the decoder, which causes an illegal instruction error on machines that only support SSE3 or lower. As long as you compile only the decoder with AVX2 instructions (maybe in a separate object file with special compilation flags: -mavx2 for gcc/clang, /arch:AVX2 for msvc) and the rest of your code with SSE3 instructions, the final executable should be able to run on both older and newer machines. Best of luck.
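A related technique on GCC and Clang, different from the separate object file approach described above, is to mark only the SIMD kernel with a function-level target attribute, so the rest of the translation unit keeps the baseline flags. The sketch below is a hedged illustration using _mm_abs_epi16 (the intrinsic from the original error); abs_i16, abs_i16_ssse3, abs_i16_scalar and HAVE_X86_TARGET_ATTR are hypothetical names, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>
#define HAVE_X86_TARGET_ATTR 1

// Only this function is compiled with SSSE3 enabled; no -mssse3 flag needed.
__attribute__((target("ssse3")))
static void abs_i16_ssse3(const int16_t* in, int16_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {  // 8 x int16 per 128-bit register
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_abs_epi16(v));
    }
    for (; i < n; ++i)            // leftover elements
        out[i] = static_cast<int16_t>(std::abs(static_cast<int>(in[i])));
}
#endif

// Baseline version that runs on any CPU.
static void abs_i16_scalar(const int16_t* in, int16_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = static_cast<int16_t>(std::abs(static_cast<int>(in[i])));
}

// Calls the SSSE3 kernel only if the running CPU reports support for it.
void abs_i16(const int16_t* in, int16_t* out, std::size_t n) {
    assert(n == 0 || (in != nullptr && out != nullptr));
#ifdef HAVE_X86_TARGET_ATTR
    __builtin_cpu_init();
    if (__builtin_cpu_supports("ssse3")) {
        abs_i16_ssse3(in, out, n);
        return;
    }
#endif
    abs_i16_scalar(in, out, n);
}
```

MSVC has no equivalent attribute (its intrinsics are always available), so there the separate object file with /arch:AVX2 remains the way to isolate such code.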


JvanKatwijk · 2025-01-30T19:47:52Z

I asked a few colleagues to test it on their machines. I read somewhere that avx2 has been in processors for the past 10 years; anyway, it is an interesting experiment.

best
jan

