inlining on x86_64 #3

Open
JvanKatwijk opened this issue Jan 22, 2025 · 5 comments

@JvanKatwijk

To get some understanding of the package I was running the "run_simple" example.
The scalar version works out of the box; however, the SSE version gives inlining errors such as those shown below.
From the description I could not figure out what I am doing wrong.

I run Linux (Fedora 41) on a laptop with an X86_64 AMD Ryzen processor

The reason I was looking at the examples is that I want to replace the viterbi handling in my Qt-DAB software - now using spiral code - by
an instance of your library which would make the code much cleaner,

Any suggestion would be welcome

jan

/usr/lib/gcc/x86_64-redhat-linux/14/include/tmmintrin.h:215:1: error: inlining failed in call to ‘always_inline’ ‘__m128i _mm_abs_epi16(__m128i)’: target specific option mismatch
215 | _mm_abs_epi16 (__m128i __X)
| ^~~~~~~~~~~~~
/home/jan/ViterbiDecoderCpp/include/viterbi/x86/viterbi_decoder_sse_u16.h:95:38: note: called from here
95 | error = _mm_abs_epi16(error);
| ~~~~~~~~~~~~~^~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/include/tmmintrin.h:215:1: error: inlining failed in call to ‘always_inline’ ‘__m128i _mm_abs_epi16(__m128i)’: target specific option mismatch
215 | _mm_abs_epi16 (__m128i __X)
| ^~~~~~~~~~~~~

@williamyang98
Owner

williamyang98 commented Jan 27, 2025

Reproduction

  1. CMake configuration command: cmake -B build
  2. CMake compile command: cmake --build build

Explanation

It's probable that the default architecture flags set for the C++ compiler weren't enough to enable _mm_abs_epi16, which is an SSSE3 intrinsic. Here's a stackoverflow post with a similar issue. By default gcc on x86_64 only targets up to SSE2, which doesn't include SSSE3 instructions; this is done to support as many processors as possible.
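A quick way to check which instruction sets the compiler was actually allowed to use is to look at the macros it predefines. This is only a sketch: the function names are made up for illustration, and it assumes GCC or Clang, which define `__SSE2__`, `__SSSE3__`, `__SSE4_1__` and `__AVX2__` when the matching `-m...` or `-march=...` flags are active.

```cpp
#include <string>
#include <vector>

// Lists the x86 SIMD extensions enabled for this translation unit,
// based on the macros GCC/Clang predefine from the -m... / -march=... flags.
// (Function names here are illustrative, not part of any library.)
std::vector<std::string> enabled_simd_features() {
    std::vector<std::string> features;
#if defined(__SSE2__)
    features.push_back("SSE2");
#endif
#if defined(__SSSE3__)
    features.push_back("SSSE3");  // needed for _mm_abs_epi16
#endif
#if defined(__SSE4_1__)
    features.push_back("SSE4.1");
#endif
#if defined(__AVX2__)
    features.push_back("AVX2");
#endif
    return features;
}

// True only if SSSE3 code generation was enabled at compile time.
bool ssse3_enabled_at_compile_time() {
#if defined(__SSSE3__)
    return true;
#else
    return false;
#endif
}
```

If "SSSE3" is missing from the list when compiling with the default flags, that would be consistent with the "target specific option mismatch" error in the report.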


Possible fix

1. Manually setting build flags

If you are configuring cmake with gcc the command could look like this to enable all SIMD intrinsics.
cmake . -B build -DCMAKE_CXX_FLAGS="-march=native"

  • -march=native option compiles with all features available on your CPU which might include up to AVX512.
  • Might be a really bad idea if your CPU has AVX512 since the final binary will include AVX512 instructions that don't work on most processors more than a few years old (you will get illegal instruction errors if a user runs this on their less capable machine).

If you only want to compile for SSEx, then SSE4.1 seems like a good choice since it also enables SSSE3 instructions.
cmake . -B build -DCMAKE_CXX_FLAGS="-msse4.1"

  • Should support the vast majority of CPUs made after 2010.

2. Using CMakePresets.json

CMake also allows you to create compilation presets to store these inside a JSON file, which might be preferred if you plan on supporting multiple compilers. (Right now the CMakePresets.json in the /examples folder uses -march=native for local and CI testing).

"CMAKE_CXX_FLAGS_INIT": "-ffast-math -march=native",

You can configure cmake with a preset by running:
cmake . -B build --preset gcc, where gcc can be replaced by the other presets in the file.

Additional notes

Patch fix for run_simple

Just submitted a fix for a compilation error in the run_simple.cpp example if you tried to uncomment the other decoders. (36de065)

Use of decoder in your own project

Also I made my own DAB SDR decoder after seeing your qt-dab project a few years ago, it's what got me interested enough to pursue any of this so thank you very much 😄!!!.

Here's the Viterbi decoder being used in my project if you need a code snippet:
https://github.com/williamyang98/DAB-Radio/blob/fc672043e35105f85d47a4766269565dbafdbc12/src/dab/algorithms/dab_viterbi_decoder.cpp

  • It uses the #defines from simd_flags.h to select between the different decoders at compile time.
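As a rough idea of what compile-time selection can look like, here is a minimal sketch. The decoder structs are hypothetical stand-ins (not the library's real classes), and it keys off the compiler's predefined macros rather than whatever simd_flags.h actually defines.

```cpp
#include <string>

// Hypothetical stand-ins for the decoder classes; the real library types differ.
struct ScalarDecoder { static const char* name() { return "scalar"; } };
struct SseDecoder    { static const char* name() { return "sse"; } };
struct AvxDecoder    { static const char* name() { return "avx2"; } };

// Pick the widest decoder the compiler flags allow. This is decided once,
// at compile time, so the binary contains only the selected variant.
#if defined(__AVX2__)
using SelectedDecoder = AvxDecoder;
#elif defined(__SSE4_1__)
using SelectedDecoder = SseDecoder;
#else
using SelectedDecoder = ScalarDecoder;
#endif

std::string selected_decoder_name() {
    return SelectedDecoder::name();
}
```

The trade-off is that a binary built this way with -mavx2 will not run on a machine without AVX2.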

Replacing compile time based decoder with runtime dispatch

If you want a single executable that has decoders for SSE2, AVX, AVX2 then you need something to dynamically switch between them at runtime to use the fastest supported decoder. This should be a better user experience since users get the fastest decoder and only need 1 download link. I personally have not been able to figure out how to do this, but here are some links if you want to try it:

  1. GCC multi-versioning: Built-in feature to dispatch variants on a per-function basis. Might also be available in clang, which is supported on Windows?

  2. google/cpu_features: Produces CPU flags which can be checked at runtime.

  3. gnuradio/volk: Does runtime dispatch after code generation.

  • CMake targets are generated by python scripts. arch.xml is parsed to get the flags for each kernel variant.
  • Compiles each kernel variant as an object file with different SIMD flags (this way SSE2 kernels don't get compiled with AVX2 instructions, etc...)
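For the first option, GCC's function multi-versioning, a minimal sketch might look like the following. It is untested against the decoder itself: `sum_abs_i16` and the `MULTIVERSIONED` macro are made-up names, and the guard is deliberately conservative (GCC on x86_64 Linux with glibc, where ifunc-based dispatch is available). Elsewhere the attribute is dropped and only a single version is built.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// target_clones asks GCC to build one copy of the function per listed target
// and to pick the best one for the running CPU when the program loads.
// Conservative guard (an assumption): enable it only where GCC's ifunc-based
// dispatch is known to work.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && \
    defined(__linux__) && defined(__GLIBC__)
#define MULTIVERSIONED __attribute__((target_clones("avx2", "sse4.1", "default")))
#else
#define MULTIVERSIONED
#endif

// Illustrative kernel: sum of absolute values of 16-bit samples.
// The body is plain C++; each clone is compiled with its own instruction set.
MULTIVERSIONED
std::int64_t sum_abs_i16(const std::int16_t* data, std::size_t n) {
    std::int64_t total = 0;
    for (std::size_t i = 0; i < n; ++i) {
        total += std::abs(static_cast<int>(data[i]));
    }
    return total;
}
```

This only helps for code the compiler can vectorise on its own; kernels written with explicit intrinsics would still need one hand-written variant per instruction set.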

[Image: CMake configure flags]

[Image: Ninja build file where kernels have different compiler flags]

@JvanKatwijk
Author

JvanKatwijk commented Jan 27, 2025 via email

@JvanKatwijk
Author

JvanKatwijk commented Jan 29, 2025 via email

@williamyang98
Owner

I've had issues with the final executable containing AVX/AVX2 code in places other than the decoder, which causes an illegal instruction error on SSE3 and less machines. As long as you only compile the decoder with AVX2 instructions (maybe in a separate object file with special compilation flags: -mavx2 for gcc/clang, /arch:AVX2 for msvc), and the rest of your code with SSE3 instructions, the final executable should be able to run on older and newer machines. Best of luck.

Triggered by your response, I looked into some gcc documentation and
found that gcc provides quite a number of "built-in" functions to detect
the cpu and its properties.
So, I added a few lines of code
#ifdef ARCH_X86
    __builtin_cpu_init ();
    // AVX_SUPPORT and SSE_SUPPORT are flag constants defined elsewhere
    int has_avx2 = __builtin_cpu_supports ("avx2")   != 0 ? AVX_SUPPORT : 0;
    int has_sse4 = __builtin_cpu_supports ("sse4.1") != 0 ? SSE_SUPPORT : 0;
    cpuSupport = has_avx2 + has_sse4;
#endif

and dispatched the "right" decoder element in the viterbi decoder. It
seems to work on my Linux development box and, when cross-compiled for
windows, I can see the selected support. (I read somewhere that avx2 is
somewhat "stronger" than sse4, so the dispatch order is avx2 first, then
sse, then scalar.)
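The dispatch order described here (avx2 first, then sse, then scalar) could be sketched roughly as follows. The flag values and the `DecoderKind`, `detect_cpu_support` and `select_decoder` names are assumptions for illustration, not taken from Qt-DAB or the library; the GCC/Clang builtins are guarded so the code still compiles on other targets.

```cpp
// Assumed flag values; the real AVX_SUPPORT / SSE_SUPPORT constants may differ.
enum CpuSupport : unsigned {
    SSE_SUPPORT = 1u << 0,
    AVX_SUPPORT = 1u << 1
};

enum class DecoderKind { Scalar, Sse, Avx2 };

// Queries the running CPU using the GCC/Clang builtins (x86 only).
// On other compilers or architectures it reports no SIMD support.
unsigned detect_cpu_support() {
    unsigned support = 0;
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   support |= AVX_SUPPORT;
    if (__builtin_cpu_supports("sse4.1")) support |= SSE_SUPPORT;
#endif
    return support;
}

// Dispatch order: AVX2 first, then SSE, then the scalar fallback.
DecoderKind select_decoder(unsigned support) {
    if (support & AVX_SUPPORT) return DecoderKind::Avx2;
    if (support & SSE_SUPPORT) return DecoderKind::Sse;
    return DecoderKind::Scalar;
}
```

Calling `select_decoder(detect_cpu_support())` once at startup and storing the result keeps the check out of the decoding loop.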

best
jan

On Mon 27 Jan 2025 at 20:22, jan van katwijk @.***> wrote:

@JvanKatwijk
Author

JvanKatwijk commented Jan 30, 2025 via email
