FFTW with threads #48

Closed
ka9q opened this issue Dec 31, 2024 · 6 comments

Comments

@ka9q

ka9q commented Dec 31, 2024

I see that by default dumphfdl builds with FFTW's internal multithreading option enabled, with 4 threads specified. Have you benchmarked this?

I also use FFTW heavily to perform fast convolution in ka9q-radio, and I found that internal multithreading didn't buy me much, at least with the huge FFTs I use (e.g., 1,620,000 points). Although it reduced the clock time required to perform a single FFT, the overall CPU utilization went up. Since I already perform a lot of independent FFTs in parallel in separate application threads I found it's better (for me) to have FFTW use only a single thread. Since I'm running lots of parallel copies of dumphfdl (fed by ka9q-radio channel threads) I went into my copy of fft_fftw.c and changed the number of threads to 1, leaving multithreading enabled.

I did this because of a gotcha. Wisdom files written with threads = 1 are NOT compatible with those written with multithreading completely turned off. This means you can't share a system-wide wisdom file (e.g., /etc/fftw/wisdomf) unless everybody agrees to use the same FFTW thread settings. There's a per-application wisdom file, but it doesn't look like you're using one. If you like, I could have it create one in /var/lib/hfdl/wisdom and send a pull request. I've already placed systable.conf in that directory, as this is the standard place in Linux to hold application-specific data files.

@dg9bja

dg9bja commented Jan 1, 2025

Thank you for that information. I am running 8 instances (72 frequencies) of dumphfdl at a sample rate of 192000, fed by a Red Pitaya (hpsdr). Applying it reduced the CPU usage on my virtual machine.

@szpajder
Owner

szpajder commented Jan 2, 2025 via email

@ka9q
Author

ka9q commented Jan 9, 2025

So leave threading enabled but make the number of threads a runtime option. Don't turn them off completely, as that will generate wisdom files incompatible with programs that do thread. Setting the number of threads to 1 has essentially the same effect while allowing sharing of a system wisdom file with programs that want to use more.

6 Ms/s seemed high, but then I realized you're doing your own multichannel downconversion internally. I am currently running a separate copy of dumphfdl for every channel, 106 in total, with ka9q-radio doing the downconversion. This works well, and certainly creates a lot of parallelism, but I need to compare total CPU use against fewer but wider channels, perhaps one per band. I use a 12 ks/s IQ input for each SSB signal (8 ks/s didn't work), which means I need a higher total sample rate for a bunch of nearby HFDL channels than for one wider channel covering them all.

BTW, if you need an analytic signal you can create one with a half-plane filter using fast convolution. This is how I do it: start with a real-input FFT to create a complex spectrum with Hermitian symmetry (the negative-frequency half is the complex-conjugate mirror image of the positive half). Then remove the negative frequencies (with windowing to prevent time-domain ripples) and convert back to the time domain with a complex-output inverse FFT. This would permit feeding dumphfdl with a conventional SSB receiver.

KA9Q-radio uses fast convolution with a shared forward FFT to implement a multichannel digital downconverter. Even at an A/D input sample rate of 64.8 Ms/s (all of HF) or 129.6 Ms/s (HF-6m) I still use single-threaded FFTs, though I give the option to run several threads each performing independent FFTs, which is faster than multithreading individual FFTs. Usually even this isn't necessary; I have a NUC with an i5-8260U @ 1.60GHz doing 1.62 megapoint real-input FFTs 50 times/sec while using only ~40% of a single core. FFTW is amazing.

@ka9q
Author

ka9q commented Jan 10, 2025

I just ran the experiment with per-band dumphfdl fed from ka9q-radio. It's faster than one per channel, but not dramatically so. I'm using 12 bands with a total sample rate of 1.388 Ms/s vs 106 individual channels @ 12 ks/s each = 1.272 Ms/s. I guess it works both ways.

@szpajder
Owner

> So leave threading enabled but make the number of threads a runtime option. Don't turn them off completely, as that will generate wisdom files incompatible with programs that do thread. Setting the number of threads to 1 has essentially the same effect while allowing sharing of a system wisdom file with programs that want to use more.

I've added --fft-threads <n> command line option to the unstable branch. The default is 4, as before.
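For readers curious what such an option involves, below is a rough sketch of parsing a `--fft-threads <n>` flag with `getopt_long`. It is not the actual dumphfdl implementation: the function name `parse_fft_threads` and the macro `FFT_THREADS_DEFAULT` are invented for this example, and only the option name and the default of 4 come from the comment above.

```c
#include <errno.h>
#include <getopt.h>
#include <limits.h>
#include <stdlib.h>

#define FFT_THREADS_DEFAULT 4   /* default stated in the thread */

/* Hypothetical helper: returns the thread count requested with
 * --fft-threads, FFT_THREADS_DEFAULT if the option is absent, or -1 if
 * the option is malformed or an unknown option is seen. */
int parse_fft_threads(int argc, char **argv)
{
    static const struct option opts[] = {
        { "fft-threads", required_argument, NULL, 't' },
        { NULL, 0, NULL, 0 }
    };
    int threads = FFT_THREADS_DEFAULT;
    int c;

    optind = 1;  /* rescan from the first argument on every call */
    opterr = 0;  /* leave error reporting to the caller */

    while ((c = getopt_long(argc, argv, "", opts, NULL)) != -1) {
        if (c != 't')
            return -1;  /* unknown option or missing argument */

        char *end;
        errno = 0;
        long v = strtol(optarg, &end, 10);
        if (errno != 0 || end == optarg || *end != '\0' || v < 1 || v > INT_MAX)
            return -1;  /* not a positive integer */
        threads = (int)v;
    }
    return threads;
}
```

The parsed value would then be handed to FFTW's `fftw_plan_with_nthreads()` (or the single-precision `fftwf_` variant) after `fftw_init_threads()` has been called. Rejecting 0 rather than treating it as "threading off" keeps threading initialized, so a run with `--fft-threads 1` still produces wisdom compatible with threaded programs, which is the point made earlier in this issue.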

@ka9q
Author

ka9q commented Jan 12, 2025

Thanks, that will do it for me!

ka9q closed this as completed Jan 12, 2025