Hyperthreading, L2/L3 Cache sizes, CPU C States, NUMA #91
Matti @mattip, we have an AMD EPYC here, and I tested on that. I tested np.add with a float32 array plus a scalar, with the output buffer preallocated: np.add(a, 5, out=c). At 1 million rows, one thread is TWICE as fast as two threads. This penalty is unexpected. From my testing the cutoff was between 2.0 and 2.2 million rows. The AMD EPYC has a 16MB cache, which holds both the input and output buffers. Why is AMD so poor when a second thread runs? I don't know.. something about their cache design. Intel acts differently.. usually a small penalty for an additional thread. This is more complicated because we could do an add like a = b + c where b and c have not been used in a while and are outside the cache. Then additional threads would help, but we don't know a priori that b and c are not in the L3 cache. So we could detect AMD and play it safe, engaging threading only for larger array lengths.
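For reference, a minimal sketch of the kind of measurement described above (the array length and repeat count are just illustrative choices, not the exact benchmark used):

```python
import numpy as np
from timeit import timeit

n = 1_000_000  # around the length where one thread beat two on EPYC
a = np.arange(n, dtype=np.float32)
c = np.empty_like(a)  # preallocated output buffer, as in the test above

# Preallocating `out` keeps allocation cost out of the measurement,
# so the timing reflects the add loop and cache behavior.
t = timeit(lambda: np.add(a, 5, out=c), number=100)
print(f"np.add, n={n}: {t / 100 * 1e6:.1f} us per call")
```

Running this at a range of lengths (and with the library's thread count forced to 1 vs. 2) is how a cutoff like the 2.0-2.2 million row figure would show up.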
Checked in code this morning to rework threading a little. It tries to divide the work into pools and channels. When working on the first thread pool, it will divide the array into 16384 chunks.
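The chunked division of work might look roughly like the sketch below. This is an illustration only: it interprets the chunk figure as a fixed element count per chunk (an assumption), and it uses a plain ThreadPoolExecutor rather than the checked-in pools-and-channels design.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

CHUNK = 16384  # elements per chunk -- an assumed interpretation

def parallel_add(a, scalar, out, pool):
    # Split the flat array into fixed-size chunks and hand each slice
    # to the pool. NumPy ufuncs release the GIL, so the chunks can
    # actually run concurrently. Slices are views, so each worker
    # writes directly into its piece of `out`.
    futures = [pool.submit(np.add, a[i:i + CHUNK], scalar, out[i:i + CHUNK])
               for i in range(0, len(a), CHUNK)]
    for f in futures:
        f.result()  # propagate any worker exception

a = np.arange(100_000, dtype=np.float32)
out = np.empty_like(a)
with ThreadPoolExecutor(max_workers=2) as pool:
    parallel_add(a, 5, out, pool)
```

As discussed above, whether two workers beat one depends on whether the chunks are already resident in cache.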
At work, in BIOS we disable hyperthreading and turn off the power saving CPU C states. If we have multiple NUMA nodes, we sometimes run one process on NUMA node 0, and the other on NUMA node 1.
At home when testing, hyperthreading is usually enabled and the CPU power-saving states are enabled. Modern processors often have a Turbo mode, which only kicks in after a certain amount of activity, possibly complicating performance testing.
Whether an entire array lives inside the L2 or L3 cache may further complicate performance testing.
To see if you have hyperthreading on ("pip install psutil"), compare the logical and physical core counts. If the numbers are different, hyperthreading is turned on.
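With psutil installed, the check is a two-liner:

```python
import psutil

logical = psutil.cpu_count(logical=True)    # includes hyperthread siblings
physical = psutil.cpu_count(logical=False)  # physical cores only

print(f"logical={logical} physical={physical}")
if logical != physical:
    print("hyperthreading appears to be ON")
else:
    print("hyperthreading appears to be OFF")
```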
To see cache sizes and other information:
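On Linux, per-level cache sizes are exposed through sysfs; a minimal sketch (Linux-only; on Windows this would come via GetLogicalProcessorInformation, and `lscpu` also reports the same data):

```python
import os

def cache_info(base="/sys/devices/system/cpu/cpu0/cache"):
    """Read (level, type, size) for each cache index in a sysfs-style tree."""
    results = []
    if not os.path.isdir(base):
        return results  # non-Linux or sysfs unavailable
    for entry in sorted(os.listdir(base)):
        d = os.path.join(base, entry)
        if not entry.startswith("index"):
            continue
        def read(name):
            with open(os.path.join(d, name)) as f:
                return f.read().strip()
        results.append((read("level"), read("type"), read("size")))
    return results

for level, kind, size in cache_info():
    print(f"L{level} {kind}: {size}")
```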
On my home computer hyperthreading is turned on. I am in the process of writing code (completed for Windows, now working on Linux) to detect hyperthreading, read L1/L2/L3 cache sizes, read NUMA information, and change the way threading works based on this.
If hyperthreading is turned on, we will start as many threads as there are physical (not logical) cores, and set thread affinity to every other core to avoid clashing. So far in testing, this does appear to speed up calculations.
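The every-other-core pinning could be sketched like this on Linux. Note the assumption in the stride: it keeps workers on distinct physical cores only if sibling hyperthreads are enumerated consecutively, which is common but not guaranteed (the authoritative mapping is in /sys/devices/system/cpu/cpu*/topology/thread_siblings_list).

```python
import os
import threading

def pin_to_cpu(cpu_id):
    # Linux-only: pid 0 means "the calling thread", so this restricts
    # just this worker thread to one logical CPU.
    os.sched_setaffinity(0, {cpu_id})

def worker(cpu_id):
    pin_to_cpu(cpu_id)
    # ... numeric work on this thread's chunk goes here ...

# Stride of 2 over logical CPUs approximates "one thread per physical
# core" when hyperthread siblings are adjacent -- an assumption.
threads = [threading.Thread(target=worker, args=(cpu,))
           for cpu in range(0, os.cpu_count(), 2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```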
In the future, if we detect multiple NUMA nodes, we can give the user the option of which NUMA node to run on.
I am not sure yet how knowledge of cache sizes might help, but it will probably help determine how many threads to wake up for a given array size.
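One hypothetical way cache sizes could feed into the thread-count decision, tying back to the EPYC result above where a second thread hurt when the working set fit in cache. Everything here (the heuristic, the default 16MB L3, the scaling rule) is an assumption for illustration, not the library's actual policy:

```python
def threads_for(nbytes, l3_bytes=16 * 2**20, max_threads=8):
    # If the working set (inputs + output) fits in L3, use one thread:
    # a second thread was observed to *halve* throughput on AMD EPYC.
    if nbytes <= l3_bytes:
        return 1
    # Otherwise scale the thread count with how far the data spills
    # past L3, capped at the number of physical cores available.
    return min(max_threads, max(2, nbytes // l3_bytes))
```

So a 1MB add would stay single-threaded, while a 64MB add would wake several threads.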