Hyperthreading, L2/L3 Cache sizes, CPU C States, NUMA #91

Open
tdimitri opened this issue Nov 17, 2020 · 2 comments

@tdimitri
Collaborator

At work, in BIOS we disable hyperthreading and turn off the power-saving CPU C-states. If we have multiple NUMA nodes, we sometimes run one process on NUMA node 0 and another on NUMA node 1.

At home when testing, hyperthreading is usually enabled and CPU power-saving states are enabled. Modern processors sometimes have a turbo mode, which only kicks in after a certain amount of sustained activity, possibly complicating performance testing.

Whether an entire array lives inside the L2 or L3 cache may further complicate performance testing.

To see if you have hyperthreading on (pip install psutil first):

In [1]: import psutil
In [2]: psutil.cpu_count(logical=False)
Out[2]: 6
In [3]: psutil.cpu_count(logical=True)
Out[3]: 12

If the numbers are different, hyperthreading is turned on.

To see cache sizes and other information:

pip install py-cpuinfo
In [1]: import cpuinfo
In [2]: cpuinfo.get_cpu_info()
In [3]: cpuinfo.get_cpu_info()['l2_cache_size']
Out[3]: 1572864
In [4]: cpuinfo.get_cpu_info()['l3_cache_size']
Out[4]: 15728640

On my home computer hyperthreading is turned on. I am in the process of writing code (completed for Windows, now working on Linux) to detect hyperthreading, read L1/L2/L3 cache sizes, read NUMA information, and change the way threading works based on this.

If hyperthreading is turned on, we will start as many threads as there are physical (not logical) cores and set a thread affinity to every other logical core to avoid clashing. So far in testing, this does appear to speed up calculations; a rough sketch of the pinning idea is below.
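The sketch assumes Linux and a hypothetical layout where even-numbered logical CPUs land on distinct physical cores; the real mapping varies by machine, and this is only an illustration of the idea, not the code mentioned above (on Windows the Win32 affinity APIs would be used instead).

import os
import psutil

def pin_to_physical_cores():
    # Assumption: logical CPUs 0, 2, 4, ... sit on distinct physical cores.
    physical = psutil.cpu_count(logical=False)
    every_other = set(range(0, 2 * physical, 2))
    # Linux-only call; restricts the calling thread to those CPUs.
    os.sched_setaffinity(0, every_other)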

In the future, if we detect NUMA nodes, we can give the option of which NUMA node to run on.
I am not sure yet how knowledge of cache sizes might help, but it will probably help determine how many threads to wake up for a given array size.
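For the NUMA piece, one way to discover nodes and their CPUs on Linux is to read sysfs; a minimal sketch, assuming the usual /sys/devices/system/node layout:

import glob
import os

def numa_nodes():
    # Each /sys/devices/system/node/nodeN/cpulist holds e.g. "0-31,64-95".
    nodes = {}
    for path in sorted(glob.glob('/sys/devices/system/node/node[0-9]*')):
        node = int(os.path.basename(path)[4:])
        with open(os.path.join(path, 'cpulist')) as f:
            cpus = []
            for part in f.read().strip().split(','):
                lo, _, hi = part.partition('-')
                cpus.extend(range(int(lo), int(hi or lo) + 1))
        nodes[node] = cpus
    return nodes

Running on a chosen node could then be a matter of passing that node's CPU list to os.sched_setaffinity (or launching under numactl).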

@tdimitri
Collaborator Author

Matti @mattip, we have an AMD EPYC here, and I tested on that.

I tested np.add with a float32 array plus a scalar, with a preallocated output buffer: np.add(a, 5, out=c).
I used one additional worker thread.

At 1 million row length, one thread is TWICE as fast as two threads. This penalty is unexpected.
At 4 million row length, two threads are 60% faster than one thread.

From my testing the cutoff was between 2 million and 2.2 million rows. This AMD EPYC has a 16MB cache, and at 2 million float32 rows the input and output buffers are 8MB each, 16MB in total.

Why is AMD so poor when a second thread runs? I don't know; something about their cache design. Intel acts differently, with usually only a small penalty for an additional thread.
It appears that on AMD, once the L3 cache is blown, additional threads do help. However, when the data fits inside the L3, additional threads cause a larger penalty than expected.

This is more complicated because we could do an add like a = b + c where b and c have not been used in a while and are outside the cache. Then additional threads would help, but we don't know a priori whether b and c are in the L3 cache.

So we could detect AMD and play it safe, engaging threading only for larger array lengths. A rough sketch of that heuristic is below.
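This is only an illustration under assumptions: the function name and max_threads default are mine, and the l3_cache_size / brand_raw keys are the ones py-cpuinfo reports on this setup (l3_cache_size in bytes, as in the earlier output).

import cpuinfo

_info = cpuinfo.get_cpu_info()
_L3_BYTES = _info.get('l3_cache_size', 16 * 1024 * 1024)   # bytes
_IS_AMD = 'AMD' in _info.get('brand_raw', '')

def suggested_threads(*arrays, max_threads=4):
    # Play it safe on AMD: if all input/output buffers together fit in the
    # L3 cache, stay single-threaded; otherwise hand off to the thread pool.
    total_bytes = sum(a.nbytes for a in arrays)
    if _IS_AMD and total_bytes <= _L3_BYTES:
        return 1
    return max_threads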

@tdimitri
Collaborator Author

tdimitri commented Dec 1, 2020

Checked in code this morning to rework threading a little. It tries to divide the work into pools and channels.
The first pool has 3 worker threads + main thread => 4 total threads. Each successive pool has 4 more worker threads.
For simple math, it will by default just activate the first thread pool. On Linux and Mac it will not pin the threads; on Windows it will. On Windows it will also detect hyperthreading; on Linux I have to add that code (likely this week).

When working on the first thread pool, it will divide the array into chunks of 16,384 elements.
The first chunk goes to the main thread, the second chunk to the first worker thread, and so on.
So the main thread handles chunks 0, 4, 8, 12, ...
The first worker thread handles chunks 1, 5, 9, 13, ...
If a thread has completed its assigned chunks, it will scavenge the other threads' chunks looking for work.
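In Python terms, the chunk assignment plus scavenging looks roughly like this. It is only a sketch (the real code is native, and the claiming here uses a plain lock); the chunk and pool sizes are the ones described above.

import threading

CHUNK = 16_384      # elements per chunk
NUM_THREADS = 4     # main thread + 3 workers in the first pool

def parallel_unary(func, a, out):
    nchunks = (len(a) + CHUNK - 1) // CHUNK
    taken = [False] * nchunks
    lock = threading.Lock()

    def claim(i):
        # Atomically claim chunk i so scavenging threads never repeat work.
        with lock:
            if taken[i]:
                return False
            taken[i] = True
            return True

    def run(i):
        s = slice(i * CHUNK, min((i + 1) * CHUNK, len(a)))
        func(a[s], out=out[s])

    def worker(tid):
        # Round-robin pass: thread tid takes chunks tid, tid+4, tid+8, ...
        for i in range(tid, nchunks, NUM_THREADS):
            if claim(i):
                run(i)
        # Scavenge pass: pick up any chunks not yet claimed by other threads.
        for i in range(nchunks):
            if claim(i):
                run(i)

    workers = [threading.Thread(target=worker, args=(t,))
               for t in range(1, NUM_THREADS)]
    for w in workers:
        w.start()
    worker(0)        # the main thread works as thread 0
    for w in workers:
        w.join()
    return out

Usage would be something like parallel_unary(lambda x, out: np.sin(x, out=out), a, c); because NumPy releases the GIL inside its ufunc inner loops, the slices can actually run concurrently.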
This method works better with the L1/L2 caches; below is a speed comparison on the AMD EPYC.
Simple add is now 2.5x faster and sin is 7x faster:

In [1]: import pnumpy as pn
In [2]: import numpy as np
In [3]: a=np.arange(1_000_000, dtype=np.float32)
In [4]: c=a+5
In [5]: %timeit np.add(a,5, out=c)
115 µs ± 749 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [6]: pn.initialize()
In [7]: %timeit np.add(a,5, out=c)
44.2 µs ± 781 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [8]: %timeit np.sin(a)
1.31 ms ± 8.19 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [9]: pn.thread_disable()
In [10]: %timeit np.sin(a)
9.45 ms ± 48 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [11]: pn.cpustring()
Out[11]: '**CPU: AMD EPYC 7742 64-Core Processor                  AVX2:1  BMI2:1 0x7ef8320b 0x178bfbff 0x219c91a9 0x00400004'
