inlining on x86_64 #3
Comments
Thanks for the detailed answer. In the meantime I was already experimenting
with additional command line parameters for the compilation.
Using -march=x86-64-v2 seems to work fine.
I took "inspiration" from the use of your header-only library in your
DAB-Radio and managed to get that running as well.
With the parameter -march=x86-64-v2 it compiles and runs fine on my
laptop with a Ryzen 5000 series processor
and on a laptop with an Intel Core i5 processor, both with Linux and (cross
compiled) for both 32-bit and 64-bit Windows.
Since Qt-DAB is used mainly on Windows computers, it is important to be
able to distribute something that runs on normal PCs and laptops.
I am not an expert in CPUs, but it seems that SSE4 has been available in all
Intel and AMD processors for at least a decade.
The reason I wanted to replace the spiral version is simple: that code is
quite messy.
The Qt-DAB project is something I started just to get some feeling for
digital radio, and I still use it as a vehicle to
experiment a little with programming styles and choice of algorithms, so I
am always interested in "other" approaches and
other algorithms.
I quite often profile the software, and of course the Viterbi decoder
implementation is not the major cycle eater.
When running a number of times with the same (file) input and the same
service selected, for a period of 2 minutes, I found
the scalar version uses approx. 14% of the processor time, which is roughly
2.5 seconds;
running with the SSE option enabled, it reduces to approx. 2.5% and a CPU time
of 0.34 seconds (again measured over a 2-minute period).
Since I am using your library, I've taken the liberty of adding a reference
to the library page in the "about" text of Qt-DAB.
Anyway, thanks for the detailed answer. It inspires me to continue to
improve the implementation.
Best
jan
On Mon 27 Jan 2025 at 06:44, William Yang wrote:
Reproduction
1. CMake configuration command: cmake -B build
2. CMake compile command: cmake --build build
image.png (view on web)
<https://github.com/user-attachments/assets/1f633c11-8608-4f2d-ac9b-55ccf2370307>
Explanation
The default architecture flags for the C++ compiler probably weren't
enough to enable _mm_abs_epi16, which is an SSSE3 intrinsic. Here's a
Stack Overflow post with a similar issue
<https://stackoverflow.com/questions/43128698/inlining-failed-in-call-to-always-inline-mm-mullo-epi32-target-specific-opti>.
By default gcc on x86-64 targets only SSE2, which doesn't include those
instructions; this is done to support as many processors as possible.
image.png (view on web)
<https://github.com/user-attachments/assets/2125d353-caa9-4b7c-a96d-b6792ea0dbe1>
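To make the "target specific option mismatch" concrete, here is a minimal sketch, assuming GCC or Clang on x86. The function names (abs_i16_ssse3, abs_i16_scalar, abs_i16x8) are hypothetical and not part of the library. The target("ssse3") attribute enables SSSE3 for one function only, so _mm_abs_epi16 can be inlined there even when the file is built with default flags; calling the intrinsic from a function without that attribute (and without -mssse3 or similar on the command line) is what triggers the error on gcc.

```cpp
#include <cstdint>
#include <cstdlib>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>

// SSSE3 is enabled for this one function only, so _mm_abs_epi16 can be
// inlined here even if the rest of the file targets plain SSE2.
__attribute__((target("ssse3")))
static void abs_i16_ssse3(const int16_t* in, int16_t* out) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    v = _mm_abs_epi16(v);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), v);
}
#endif

// Portable fallback that needs no special compiler flags.
static void abs_i16_scalar(const int16_t* in, int16_t* out) {
    for (int i = 0; i < 8; i++) {
        out[i] = static_cast<int16_t>(std::abs(in[i]));
    }
}

// Absolute value of 8 int16 lanes (hypothetical helper for illustration).
void abs_i16x8(const int16_t* in, int16_t* out) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // Only take the SSSE3 path if the running CPU reports the feature.
    if (__builtin_cpu_supports("ssse3")) {
        abs_i16_ssse3(in, out);
        return;
    }
#endif
    abs_i16_scalar(in, out);
}
```

Passing -mssse3, -msse4.1 or a suitable -march value on the command line has the same effect for the whole translation unit, which is what the fixes in this comment do.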
Possible fix
1. Manually setting build flags
If you are configuring cmake with *gcc*, the command could look like this
to enable all SIMD intrinsics your CPU supports:
cmake . -B build -DCMAKE_CXX_FLAGS="-march=native"
- The -march=native option compiles with all features available on your
CPU, which might include up to AVX512.
- This might be a really bad idea if your CPU has AVX512, since the final
binary may include AVX512 instructions that don't work on most processors
more than a few years old (users will get illegal instruction errors if
they run it on a less capable machine).
If you only want to compile for SSEx, then SSE4.1 seems like a good
choice, since it also includes the SSSE3 instructions.
cmake . -B build -DCMAKE_CXX_FLAGS="-msse4.1"
- Should support the vast majority of CPUs made after 2010.
2. Using CMakePresets.json
CMake also allows you to store these compilation settings as presets in
a JSON file, which might be preferred if you plan on supporting multiple
compilers. (Right now the CMakePresets.json in the /examples folder uses
-march=native for local and CI testing.)
https://github.com/williamyang98/ViterbiDecoderCpp/blob/175860412203084ef6b9571ced1a07db56848b4d/examples/CMakePresets.json#L35C1-L35C61
You can configure cmake with a preset by running:
cmake . -B build --preset gcc
where gcc can be replaced by any of the other presets in the file.
Additional notes
Patch fix for run_simple
Just submitted a fix for a compilation error in the run_simple.cpp
example that occurred if you tried to uncomment the other decoders (36de065).
Use of decoder in your own project
Also, I made my own DAB SDR decoder
<https://github.com/williamyang98/DAB-Radio> after seeing your qt-dab
project <https://github.com/JvanKatwijk/qt-dab/tree/master> a few years
ago. It's what got me interested enough to pursue any of this, so thank you
very much 😄!!!
Here's the Viterbi decoder being used in my project if you need a code
snippet:
https://github.com/williamyang98/DAB-Radio/blob/fc672043e35105f85d47a4766269565dbafdbc12/src/dab/algorithms/dab_viterbi_decoder.cpp
- It uses #defines from simd_flags.h to select between the different
decoders at compile time.
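As a rough illustration of compile-time selection, the sketch below picks a decoder kind from the macros that gcc and clang predefine when the corresponding -m flags are active (__AVX2__, __SSE4_1__). It does not reproduce simd_flags.h; DecoderKind, kCompiledDecoder and compiled_decoder_name are hypothetical names used only for this example.

```cpp
#include <string>

// Hypothetical enumeration of the decoder variants discussed in this thread.
enum class DecoderKind { Scalar, SSE, AVX2 };

// The choice is fixed when the file is compiled: it depends only on which
// instruction sets the compiler flags enabled, not on the machine that
// eventually runs the binary.
#if defined(__AVX2__)
constexpr DecoderKind kCompiledDecoder = DecoderKind::AVX2;
#elif defined(__SSE4_1__)
constexpr DecoderKind kCompiledDecoder = DecoderKind::SSE;
#else
constexpr DecoderKind kCompiledDecoder = DecoderKind::Scalar;
#endif

// Human-readable name of the variant that was compiled in.
std::string compiled_decoder_name() {
    switch (kCompiledDecoder) {
        case DecoderKind::AVX2: return "avx2";
        case DecoderKind::SSE:  return "sse4.1";
        default:                return "scalar";
    }
}
```

With default gcc flags on x86-64 this would report "scalar", with -msse4.1 or -march=x86-64-v2 "sse4.1", and with -mavx2 "avx2". The drawback is the one discussed in the next section: a binary built this way contains exactly one variant.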
Replacing compile time based decoder with runtime dispatch
If you want a single executable that has decoders for SSE2, AVX and AVX2, then
you need something that switches between them dynamically at runtime to use the
fastest supported decoder. This should be a better user experience, since
users get the fastest decoder and only need 1 download link. I personally
have not been able to figure out how to do this, but here are some links if
you want to try it:
1. GCC multi-versioning <https://lwn.net/Articles/691932/>: Built-in
feature to dispatch variants on a per-function basis. Might also be
available in clang, which is supported on Windows?
2. google/cpu_features
<https://github.com/google/cpu_features/tree/ba4bffa86cbb5456bdb34426ad22b9551278e2c0?tab=readme-ov-file#codesample>:
Produces CPU flags which can be checked at runtime.
3. gnuradio/volk <https://github.com/gnuradio/volk/tree/main>: Does
runtime dispatch after code generation.
- CMake targets are generated by python scripts. archs.xml
<https://github.com/gnuradio/volk/blob/444951ee754f5c6e01c357a58ec6eb01cab8f943/gen/archs.xml>
is parsed to get the flags for each kernel variant.
- Compiles each kernel variant as an object file with different SIMD
flags (this way SSE2 kernels don't get compiled with AVX2 instructions,
etc.)
*CMake configure flags*
image.png (view on web)
<https://github.com/user-attachments/assets/24bb30aa-7b8c-413a-815d-0e1e55758904>
*Ninja build file where kernels have different compiler flags*
image.png (view on web)
<https://github.com/user-attachments/assets/365d80bc-129c-4eff-a55b-ae8039c25ede>
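A minimal sketch of option 1 (GCC function multi-versioning), assuming gcc on x86-64 Linux with glibc, where the target_clones attribute and the ifunc mechanism it relies on are available. The function sum_abs_error and the MULTIVERSION macro are hypothetical; on other compilers or platforms the macro expands to nothing and a single ordinary function is built. The kernel is plain C++, so any SIMD in the clones comes from the compiler's auto-vectorizer at higher optimization levels, not from intrinsics.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: target_clones is only enabled for gcc on x86-64 Linux/glibc.
// Elsewhere the attribute is omitted and the function is compiled once.
#if defined(__GNUC__) && !defined(__clang__) && defined(__x86_64__) && \
    defined(__linux__) && defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "sse4.1", "default")))
#else
#define MULTIVERSION
#endif

// gcc emits one copy of this function per listed target plus a resolver
// that picks the best copy for the running CPU when the program loads.
MULTIVERSION
uint32_t sum_abs_error(const int16_t* a, const int16_t* b, std::size_t n) {
    uint32_t total = 0;
    for (std::size_t i = 0; i < n; i++) {
        const int32_t d = static_cast<int32_t>(a[i]) - static_cast<int32_t>(b[i]);
        total += static_cast<uint32_t>(d < 0 ? -d : d);
    }
    return total;
}
```

Callers just call sum_abs_error; the selection is invisible to them. Whether this carries over to a Windows cross-compile is an open question in this thread and is not something the sketch settles.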
--
Jan van Katwijk
|
Triggered by your response, I looked into some gcc documentation and
found that gcc provides quite a number of "built-in" functions to detect
the CPU and its properties.
So, I added a few lines of code
#ifdef __ARCH_X86__
	__builtin_cpu_init ();
	// Each flag is non-zero only if the CPU reports the feature at runtime
	int has_avx2 = __builtin_cpu_supports ("avx2") != 0 ? AVX_SUPPORT : 0;
	int has_sse4 = __builtin_cpu_supports ("sse4.1") != 0 ? SSE_SUPPORT : 0;
	cpuSupport = has_avx2 + has_sse4;
#endif
and dispatched the "right" decoder element in the Viterbi decoder. It
seems to work on my Linux development box and, when cross compiled for
Windows,
I can see the selected support. (I read somewhere that avx2 is somewhat
"stronger" than sse4, so the dispatch order is avx2 first, then sse, then
scalar.)
best
jan
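The dispatch order described in this comment could be sketched as follows. This is an illustration only, not the Qt-DAB code: the numeric values of SSE_SUPPORT and AVX_SUPPORT are assumptions (the comment does not show them), detect_cpu_support and select_decoder are hypothetical names, and the decoders are represented by strings instead of real decoder objects.

```cpp
#include <string>

// Flag values are hypothetical; they only need to be distinct bits.
constexpr int SSE_SUPPORT = 1;
constexpr int AVX_SUPPORT = 2;

// Build the support bitmask from the gcc/clang CPU detection built-ins.
// On non-x86 targets or other compilers this simply returns 0 (scalar).
int detect_cpu_support() {
    int support = 0;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx2"))   support |= AVX_SUPPORT;
    if (__builtin_cpu_supports("sse4.1")) support |= SSE_SUPPORT;
#endif
    return support;
}

// Dispatch order from the comment: AVX2 first, then SSE, then scalar.
std::string select_decoder(int support) {
    if (support & AVX_SUPPORT) return "avx2";
    if (support & SSE_SUPPORT) return "sse";
    return "scalar";
}
```

A caller would run detect_cpu_support() once at startup and construct the matching decoder from the result of select_decoder().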
|
I've had issues with the final executable containing AVX/AVX2 code in places other than the decoder, which causes an illegal instruction error on machines that support only SSE3 or less. As long as you compile only the decoder with AVX2 instructions (maybe in a separate object file with special compilation flags: -mavx2 for gcc/clang, /arch:AVX2 for msvc), and the rest of your code with SSE3 instructions, the final executable should be able to run on both older and newer machines. Best of luck.
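A single-file approximation of this advice, assuming gcc or clang on x86: instead of a separate object file built with -mavx2, the sketch uses the target("avx2") function attribute so that AVX2 instructions can only be generated inside one function, which is then called only after a runtime check. The names add_u16_scalar, add_u16_avx2 and add_u16 are hypothetical, and the kernel (a plain 16-bit add) is a stand-in for the decoder.

```cpp
#include <cstddef>
#include <cstdint>

// Baseline version: built with whatever flags the rest of the program uses.
static void add_u16_scalar(const uint16_t* a, const uint16_t* b,
                           uint16_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = static_cast<uint16_t>(a[i] + b[i]);
    }
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <immintrin.h>

// AVX2 code generation is confined to this function by the attribute.
__attribute__((target("avx2")))
static void add_u16_avx2(const uint16_t* a, const uint16_t* b,
                         uint16_t* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 16 <= n; i += 16) {  // 16 lanes of 16 bits per 256-bit register
        const __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a + i));
        const __m256i vb = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out + i),
                            _mm256_add_epi16(va, vb));
    }
    for (; i < n; i++) {  // remaining elements
        out[i] = static_cast<uint16_t>(a[i] + b[i]);
    }
}
#endif

// Entry point: the AVX2 kernel is reached only if the CPU reports AVX2.
void add_u16(const uint16_t* a, const uint16_t* b, uint16_t* out, std::size_t n) {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    if (__builtin_cpu_supports("avx2")) {
        add_u16_avx2(a, b, out, n);
        return;
    }
#endif
    add_u16_scalar(a, b, out, n);
}
```

The separate-object-file approach from the comment achieves the same isolation at the build-system level and also works with msvc, which does not support this attribute.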
|
I asked a few colleagues to test it on their machines. I read somewhere that
AVX2 has been in processors for the past 10 years;
anyway, it is an interesting experiment.
best
jan
|
While getting some understanding of the package I was running the "run_simple" example.
The scalar version works out of the box; however, the SSE version gives inlining errors such as those given below.
From the description I could not figure out what I am doing wrong.
I run Linux (Fedora 41) on a laptop with an x86_64 AMD Ryzen processor.
The reason I was looking at the examples is that I want to replace the Viterbi handling in my Qt-DAB software, which now uses spiral code, with
an instance of your library, which would make the code much cleaner.
Any suggestion would be welcome.
jan
/usr/lib/gcc/x86_64-redhat-linux/14/include/tmmintrin.h:215:1: error: inlining failed in call to ‘always_inline’ ‘__m128i _mm_abs_epi16(__m128i)’: target specific option mismatch
215 | _mm_abs_epi16 (__m128i __X)
| ^~~~~~~~~~~~~
/home/jan/ViterbiDecoderCpp/include/viterbi/x86/viterbi_decoder_sse_u16.h:95:38: note: called from here
95 | error = _mm_abs_epi16(error);
| ~~~~~~~~~~~~~^~~~~~~
/usr/lib/gcc/x86_64-redhat-linux/14/include/tmmintrin.h:215:1: error: inlining failed in call to ‘always_inline’ ‘__m128i _mm_abs_epi16(__m128i)’: target specific option mismatch
215 | _mm_abs_epi16 (__m128i __X)
| ^~~~~~~~~~~~~