
smp: add a function that barriers memory prefault work #2608

Open
wants to merge 1 commit into base: master
Conversation

tomershafir
Contributor

@tomershafir tomershafir commented Jan 6, 2025

Currently, memory prefault logic is internal and seastar doesn't provide much control to users. To improve the situation, I suggest providing a barrier for the prefault threads. This allows users to:

* Prefer predictable low latency and high throughput from the start of request serving, at the cost of a startup delay that depends on machine characteristics and application-specific requirements. An example is a fixed-capacity on-prem database setup, where slower startup can be tolerated: from the user's perspective, inconsistency (such as latency spikes) generally cannot be tolerated.
* Similarly, improve scheduling decisions, such as running less critical tasks while prefault is in progress.
* Reliably test the prefault logic, improving reliability and users' trust in seastar.
* Release memory_prefaulter::_worker_threads early, removing their overhead, rather than only at exit.

I tested locally. If you approve this change, I will submit a prefault test next.
@tomershafir tomershafir marked this pull request as ready for review January 6, 2025 14:33
@avikivity
Member

Did you observe latency impact from the prefault threads? It was written carefully not to have latency impact, but it's of course possible that some workloads suffer.

@tomershafir
Contributor Author

As you described in #1702, page faults can cause deviation, and following that example, there can be ~25 seconds during which latency is variably higher.

@avikivity
Member

As you described in #1702, page faults can cause deviation, and following up the example, there can be 25sec where latency is variably higher.

I said nothing about latency being higher there.

We typically run large machines with a few vcpus not assigned to any shards, and the prefault threads run with low priority.

@tomershafir
Contributor Author

tomershafir commented Jan 7, 2025

There are 2 aspects:

  1. Page faults

In the previous comment, I meant page fault latency. Page faults can unpredictably cause high latency until the prefaulter finishes.

Regarding page fault measurement, it seems I cannot reliably measure it in my environment.

  2. Prefault threads competition

I tried to informally isolate the wall-time overhead of the prefault threads:

I have a test app that performs file I/O and processes memory buffers repeatedly. I used an Ubuntu Orbstack VM with 1 NUMA node, 10 cores, and --memory=14G (effectively a small NUMA node), and a small input to make the overhead most visible.

  • With --lock-memory=1 and without waiting, the chrono wall time of the actual work is significantly higher than with --lock-memory=0 (~1800ms vs ~600ms).
  • When waiting before doing the actual work, the overhead is removed.
  • When building seastar without the prefault code and running with --lock-memory=1, I don't see the overhead.

@tomershafir
Contributor Author

By default seastar uses all vcpus, which makes sense for resource efficiency.

Also, do you free specific vcpus, like one per NUMA node, matching the granularity of the prefault threads?

@avikivity
Member

By default seastar uses all vcpus, which makes sense for resource efficiency.

Also, do you free specific vcpus? Like one per numa node, the granularity of prefault threads.

1 in 8, with NUMA awareness. They're allocated for kernel network processing. See perftune.py.

@tomershafir
Contributor Author

Nice. Let me know if this change makes sense to you.

@tomershafir
Contributor Author

@avikivity ping

@tomershafir
Contributor Author

I also tried to simulate perftune with 1 free vcpu (--cpuset=0-8 given the above setup), and I still observe the overhead, though it is smaller (~1600ms).
