
introduce std.heap.SmpAllocator #22808

Merged: 13 commits, Feb 8, 2025
Conversation

andrewrk (Member) commented on Feb 7, 2025:

An allocator designed for the ReleaseFast optimization mode with multi-threading enabled.

This allocator is a singleton; it uses global state and only one should be instantiated for the entire process.

This is a "sweet spot" - the implementation is about 200 lines of code and yet competitive with glibc performance.

Basic Design

Each thread gets a separate freelist; however, the data must remain recoverable after a thread exits. We do not directly learn when a thread exits, so occasionally one thread must attempt to reclaim another thread's resources.

Above a certain size, allocations are memory mapped directly, with no storage of allocation metadata. This works because the implementation refuses any resize that would move an allocation from the small category to the large category or vice versa.
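
As a sketch of that rule (max_small_len is a hypothetical boundary for illustration, not the PR's actual constant):

const max_small_len = 64 * 1024; // assumed boundary; not the PR's actual value

// A resize request is refused whenever it would move an allocation
// across the small/large boundary, because large allocations are
// memory mapped directly and carry no metadata to migrate.
fn crossesSizeCategory(old_len: usize, new_len: usize) bool {
    return (old_len <= max_small_len) != (new_len <= max_small_len);
}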

Each allocator operation checks the thread identifier from a threadlocal variable to determine which metadata in the global state to access, and attempts to acquire its lock. This will usually succeed without contention, unless another thread has been assigned the same id. In the case of such contention, the thread moves on to the next thread metadata slot and repeats the process of attempting to obtain the lock.

Limiting the thread-local metadata array to the CPU count ensures that, as threads are created and destroyed, they cycle through the full set of freelists (see the sketch below).
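
To make the design concrete, here is a minimal sketch of that slot-acquisition loop. Slot, slots, thread_index, and lockSlot are illustrative names for this write-up, not SmpAllocator's actual identifiers:

const std = @import("std");

const max_thread_count = 128; // assumed upper bound on metadata slots

const Slot = struct {
    mutex: std.Thread.Mutex = .{},
    // freelists, slab state, etc. live here in the real allocator
};

var slots: [max_thread_count]Slot = [1]Slot{.{}} ** max_thread_count;
threadlocal var thread_index: u32 = 0;

fn lockSlot(cpu_count: u32) *Slot {
    var index = thread_index;
    while (true) {
        const slot = &slots[index];
        // Usually uncontended: a live thread settles on its own slot.
        // On contention, rotate to the next slot; this is also how an
        // exited thread's freelists get adopted by surviving threads.
        if (slot.mutex.tryLock()) {
            thread_index = index;
            return slot;
        }
        index += 1;
        if (index >= cpu_count) index = 0;
    }
}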

Performance Data Points

This is building hello world with glibc vs SmpAllocator:

  • master branch (0.14.0-dev.3145+6a6e72fff) stage3/bin/zig build -p glibc -Doptimize=ReleaseFast -Dno-lib -Dforce-link-libc
  • this branch, stage3/bin/zig build -p SmpAllocator -Doptimize=ReleaseFast -Dno-lib, which now uses SmpAllocator rather than DebugAllocator with this build configuration
Benchmark 1 (24 runs): glibc/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           211ms ± 9.91ms     193ms …  237ms          4 (17%)        0%
  peak_rss           73.2MB ±  708KB    71.9MB … 74.3MB          0 ( 0%)        0%
  cpu_cycles         1.16G  ± 9.10M     1.14G  … 1.18G           0 ( 0%)        0%
  instructions       2.32G  ± 81.4K     2.32G  … 2.32G           1 ( 4%)        0%
  cache_references   86.5M  ±  299K     86.1M  … 87.3M           2 ( 8%)        0%
  cache_misses       7.77M  ± 85.3K     7.62M  … 7.90M           0 ( 0%)        0%
  branch_misses      7.11M  ± 33.1K     7.05M  … 7.21M           1 ( 4%)        0%
Benchmark 2 (24 runs): SmpAllocator/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           208ms ± 7.30ms     196ms …  224ms          0 ( 0%)          -  1.3% ±  2.4%
  peak_rss           79.1MB ±  817KB    77.8MB … 81.2MB          1 ( 4%)        💩+  8.0% ±  0.6%
  cpu_cycles         1.15G  ± 16.9M     1.12G  … 1.18G           0 ( 0%)          -  0.8% ±  0.7%
  instructions       2.22G  ± 28.1K     2.22G  … 2.22G           0 ( 0%)        ⚡-  4.1% ±  0.0%
  cache_references   82.8M  ±  407K     82.1M  … 84.1M           1 ( 4%)        ⚡-  4.3% ±  0.2%
  cache_misses       7.93M  ± 96.6K     7.74M  … 8.12M           0 ( 0%)        💩+  2.1% ±  0.7%
  branch_misses      7.35M  ± 23.6K     7.30M  … 7.40M           0 ( 0%)        💩+  3.4% ±  0.2%

A particularly allocation-heavy ast-check:

Benchmark 1 (32 runs): glibc/bin/zig ast-check ../lib/compiler_rt/udivmodti4_test.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           156ms ± 6.58ms     151ms …  173ms          4 (13%)        0%
  peak_rss           45.0MB ± 20.9KB    45.0MB … 45.1MB          1 ( 3%)        0%
  cpu_cycles          766M  ± 10.2M      754M  …  796M           0 ( 0%)        0%
  instructions       3.19G  ± 12.7      3.19G  … 3.19G           0 ( 0%)        0%
  cache_references   4.12M  ±  498K     3.88M  … 6.13M           3 ( 9%)        0%
  cache_misses        128K  ± 2.42K      125K  …  134K           0 ( 0%)        0%
  branch_misses      1.14M  ±  215K      925K  … 1.43M           0 ( 0%)        0%
Benchmark 2 (34 runs): SmpAllocator/bin/zig ast-check ../lib/compiler_rt/udivmodti4_test.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           149ms ± 1.87ms     146ms …  156ms          1 ( 3%)        ⚡-  4.9% ±  1.5%
  peak_rss           39.6MB ±  141KB    38.8MB … 39.6MB          2 ( 6%)        ⚡- 12.1% ±  0.1%
  cpu_cycles          750M  ± 3.77M      744M  …  756M           0 ( 0%)        ⚡-  2.1% ±  0.5%
  instructions       3.05G  ± 11.5      3.05G  … 3.05G           0 ( 0%)        ⚡-  4.5% ±  0.0%
  cache_references   2.94M  ± 99.2K     2.88M  … 3.36M           4 (12%)        ⚡- 28.7% ±  4.2%
  cache_misses       48.2K  ± 1.07K     45.6K  … 52.1K           2 ( 6%)        ⚡- 62.4% ±  0.7%
  branch_misses       890K  ± 28.8K      862K  … 1.02M           2 ( 6%)        ⚡- 21.8% ±  6.5%

Building the self-hosted compiler:

Benchmark 1 (3 runs): glibc/bin/zig build -Dno-lib -p trash
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          12.2s  ± 99.4ms    12.1s  … 12.3s           0 ( 0%)        0%
  peak_rss            975MB ± 21.7MB     951MB …  993MB          0 ( 0%)        0%
  cpu_cycles         88.7G  ± 68.3M     88.7G  … 88.8G           0 ( 0%)        0%
  instructions        188G  ± 1.40M      188G  …  188G           0 ( 0%)        0%
  cache_references   5.88G  ± 33.2M     5.84G  … 5.90G           0 ( 0%)        0%
  cache_misses        383M  ± 2.26M      381M  …  385M           0 ( 0%)        0%
  branch_misses       368M  ± 1.77M      366M  …  369M           0 ( 0%)        0%
Benchmark 2 (3 runs): SmpAllocator/fast/bin/zig build -Dno-lib -p trash
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          12.2s  ± 49.0ms    12.2s  … 12.3s           0 ( 0%)          +  0.0% ±  1.5%
  peak_rss            953MB ± 3.47MB     950MB …  957MB          0 ( 0%)          -  2.2% ±  3.6%
  cpu_cycles         88.4G  ±  165M     88.2G  … 88.6G           0 ( 0%)          -  0.4% ±  0.3%
  instructions        181G  ± 6.31M      181G  …  181G           0 ( 0%)        ⚡-  3.9% ±  0.0%
  cache_references   5.48G  ± 17.5M     5.46G  … 5.50G           0 ( 0%)        ⚡-  6.9% ±  1.0%
  cache_misses        386M  ± 1.85M      384M  …  388M           0 ( 0%)          +  0.6% ±  1.2%
  branch_misses       377M  ±  899K      377M  …  378M           0 ( 0%)        💩+  2.6% ±  0.9%

more performance data points

How to use it

Put something like this in your main function:

const std = @import("std");
const builtin = @import("builtin");
const native_os = builtin.os.tag;

var debug_allocator: std.heap.DebugAllocator(.{}) = .init;

pub fn main() !void {
    const gpa, const is_debug = gpa: {
        if (native_os == .wasi) break :gpa .{ std.heap.wasm_allocator, false };
        break :gpa switch (builtin.mode) {
            .Debug, .ReleaseSafe => .{ debug_allocator.allocator(), true },
            .ReleaseFast, .ReleaseSmall => .{ std.heap.smp_allocator, false },
        };
    };
    defer if (is_debug) {
        _ = debug_allocator.deinit();
    };
    _ = gpa; // pass gpa to the application's allocations
}

Follow-up issues

@andrewrk added the release notes label (This PR should be mentioned in the release notes.) on Feb 7, 2025.

Commit messages from the PR timeline:

  • An allocator intended to be used in -OReleaseFast mode when multi-threading is enabled; no special handling of wasi and windows is needed, since nothing more than page alignment is requested.
  • In main, this allocator is now chosen by default when compiling without libc in ReleaseFast or ReleaseSmall and not targeting WebAssembly.
  • Rotate a couple of times before resorting to mapping more memory.
  • getCpuCount: it was always returning max_cpu_count.
  • Slab length reduced to 64K; track freelist lengths with u8s; on free(), rotate if the freelist length exceeds max_freelist_len. This prevents memory leakage in the scenario where one thread only allocates and another thread only frees (a sketch follows this list).
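
A hedged sketch of that free() rule. FreeSlot, freelists, freelist_lens, and max_freelist_len are hypothetical bookkeeping names, not the PR's actual identifiers:

const size_class_count = 16; // assumed number of small size classes
const max_freelist_len = 16; // assumed cap per size class

const Node = struct { next: ?*Node };

const FreeSlot = struct {
    freelists: [size_class_count]?*Node = [1]?*Node{null} ** size_class_count,
    freelist_lens: [size_class_count]u8 = [1]u8{0} ** size_class_count, // u8s per the commit above
};

// Push a freed block onto this slot's freelist and report whether the
// caller should rotate to the next slot, so that a thread that only
// frees does not grow a freelist that no allocating thread drains.
fn pushFree(slot: *FreeSlot, class: usize, node: *Node) bool {
    node.next = slot.freelists[class];
    slot.freelists[class] = node;
    slot.freelist_lens[class] +|= 1;
    return slot.freelist_lens[class] > max_freelist_len;
}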
One review comment was left on the cpu-count caching code:

const cpu_count = @atomicLoad(u32, &global.cpu_count, .unordered);
if (cpu_count != 0) return cpu_count;
const n: u32 = @min(std.Thread.getCpuCount() catch max_thread_count, max_thread_count);
return if (@cmpxchgStrong(u32, &global.cpu_count, 0, n, .monotonic, .monotonic)) |other| other else n;
Member: could be an atomicStore, unless you expect Thread.getCpuCount() to return different results on different threads.
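
For illustration, the suggested variant might look like this. A sketch only, assuming the same surrounding global struct and max_thread_count as the snippet above; this is not the code that was merged:

fn getCpuCount() u32 {
    const cached = @atomicLoad(u32, &global.cpu_count, .unordered);
    if (cached != 0) return cached;
    const n: u32 = @min(std.Thread.getCpuCount() catch max_thread_count, max_thread_count);
    // Benign race: if every thread computes the same n, a plain atomic
    // store is enough and the cmpxchg round-trip can be dropped.
    @atomicStore(u32, &global.cpu_count, n, .unordered);
    return n;
}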

A second review comment was left on the slot-locking loop:

}
const cpu_count = getCpuCount();
assert(cpu_count != 0);
while (true) {
Member: At some point, this should probably use t.mutex.lock(); otherwise this is a spinlock. Maybe after the first for (0..cpu_count) pass of tryLocks?
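
Sketched with the same illustrative names as the lockSlot example earlier (hypothetical, not the merged code), the suggestion might look like:

fn lockSlotBlocking(cpu_count: u32) *Slot {
    const start = thread_index;
    var index = start;
    // One full pass of non-blocking attempts across all slots.
    for (0..cpu_count) |_| {
        const slot = &slots[index];
        if (slot.mutex.tryLock()) {
            thread_index = index;
            return slot;
        }
        index += 1;
        if (index >= cpu_count) index = 0;
    }
    // Every slot was contended: block on one mutex rather than spin.
    const slot = &slots[start];
    slot.mutex.lock();
    return slot;
}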
