
introduce std.heap.SmpAllocator #22808

Merged: 13 commits, Feb 8, 2025
Conversation

andrewrk (Member) commented on Feb 7, 2025:

An allocator designed for the ReleaseFast optimization mode with multi-threading enabled.

This allocator is a singleton; it uses global state and only one should be instantiated for the entire process.

This is a "sweet spot" - the implementation is about 200 lines of code and yet competitive with glibc performance.

Basic Design

Each thread gets a separate freelist; however, the data must remain recoverable after a thread exits. We do not directly learn when a thread exits, so occasionally one thread must attempt to reclaim another thread's resources.

Above a certain size, allocations are memory mapped directly, with no storage of allocation metadata. This works because the implementation refuses any resize that would move an allocation from the small category to the large category or vice versa.
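
As a sketch of that rule (max_small_len is a hypothetical boundary for illustration, not the PR's actual constant):

const max_small_len = 64 * 1024; // assumed boundary; not the PR's actual value

// A resize request is refused whenever it would move an allocation
// across the small/large boundary, because large allocations are
// memory mapped directly and carry no metadata to migrate.
fn crossesSizeCategory(old_len: usize, new_len: usize) bool {
    return (old_len <= max_small_len) != (new_len <= max_small_len);
}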

Each allocator operation checks the thread identifier from a threadlocal variable to determine which metadata in the global state to access, and attempts to acquire its lock. This will usually succeed without contention, unless another thread has been assigned the same id. In the case of such contention, the thread moves on to the next thread metadata slot and repeats the process of attempting to obtain the lock.

Limiting the thread-local metadata array to the CPU count ensures that, as threads are created and destroyed, they cycle through the full set of freelists (see the sketch below).
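
To make the design concrete, here is a minimal sketch of that slot-acquisition loop. Slot, slots, thread_index, and lockSlot are illustrative names for this write-up, not SmpAllocator's actual identifiers:

const std = @import("std");

const max_thread_count = 128; // assumed upper bound on metadata slots

const Slot = struct {
    mutex: std.Thread.Mutex = .{},
    // freelists, slab state, etc. live here in the real allocator
};

var slots: [max_thread_count]Slot = [1]Slot{.{}} ** max_thread_count;
threadlocal var thread_index: u32 = 0;

fn lockSlot(cpu_count: u32) *Slot {
    var index = thread_index;
    while (true) {
        const slot = &slots[index];
        // Usually uncontended: a live thread settles on its own slot.
        // On contention, rotate to the next slot; this is also how an
        // exited thread's freelists get adopted by surviving threads.
        if (slot.mutex.tryLock()) {
            thread_index = index;
            return slot;
        }
        index += 1;
        if (index >= cpu_count) index = 0;
    }
}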

Performance Data Points

This is building hello world with glibc vs SmpAllocator:

  • master branch (0.14.0-dev.3145+6a6e72fff) stage3/bin/zig build -p glibc -Doptimize=ReleaseFast -Dno-lib -Dforce-link-libc
  • this branch, stage3/bin/zig build -p SmpAllocator -Doptimize=ReleaseFast -Dno-lib, which now uses SmpAllocator rather than DebugAllocator with this build configuration
Benchmark 1 (24 runs): glibc/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           211ms ± 9.91ms     193ms …  237ms          4 (17%)        0%
  peak_rss           73.2MB ±  708KB    71.9MB … 74.3MB          0 ( 0%)        0%
  cpu_cycles         1.16G  ± 9.10M     1.14G  … 1.18G           0 ( 0%)        0%
  instructions       2.32G  ± 81.4K     2.32G  … 2.32G           1 ( 4%)        0%
  cache_references   86.5M  ±  299K     86.1M  … 87.3M           2 ( 8%)        0%
  cache_misses       7.77M  ± 85.3K     7.62M  … 7.90M           0 ( 0%)        0%
  branch_misses      7.11M  ± 33.1K     7.05M  … 7.21M           1 ( 4%)        0%
Benchmark 2 (24 runs): SmpAllocator/bin/zig build-exe ../test/standalone/simple/hello_world/hello.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           208ms ± 7.30ms     196ms …  224ms          0 ( 0%)          -  1.3% ±  2.4%
  peak_rss           79.1MB ±  817KB    77.8MB … 81.2MB          1 ( 4%)        💩+  8.0% ±  0.6%
  cpu_cycles         1.15G  ± 16.9M     1.12G  … 1.18G           0 ( 0%)          -  0.8% ±  0.7%
  instructions       2.22G  ± 28.1K     2.22G  … 2.22G           0 ( 0%)        ⚡-  4.1% ±  0.0%
  cache_references   82.8M  ±  407K     82.1M  … 84.1M           1 ( 4%)        ⚡-  4.3% ±  0.2%
  cache_misses       7.93M  ± 96.6K     7.74M  … 8.12M           0 ( 0%)        💩+  2.1% ±  0.7%
  branch_misses      7.35M  ± 23.6K     7.30M  … 7.40M           0 ( 0%)        💩+  3.4% ±  0.2%

A particularly allocation-heavy ast-check:

Benchmark 1 (32 runs): glibc/bin/zig ast-check ../lib/compiler_rt/udivmodti4_test.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           156ms ± 6.58ms     151ms …  173ms          4 (13%)        0%
  peak_rss           45.0MB ± 20.9KB    45.0MB … 45.1MB          1 ( 3%)        0%
  cpu_cycles          766M  ± 10.2M      754M  …  796M           0 ( 0%)        0%
  instructions       3.19G  ± 12.7      3.19G  … 3.19G           0 ( 0%)        0%
  cache_references   4.12M  ±  498K     3.88M  … 6.13M           3 ( 9%)        0%
  cache_misses        128K  ± 2.42K      125K  …  134K           0 ( 0%)        0%
  branch_misses      1.14M  ±  215K      925K  … 1.43M           0 ( 0%)        0%
Benchmark 2 (34 runs): SmpAllocator/bin/zig ast-check ../lib/compiler_rt/udivmodti4_test.zig
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           149ms ± 1.87ms     146ms …  156ms          1 ( 3%)        ⚡-  4.9% ±  1.5%
  peak_rss           39.6MB ±  141KB    38.8MB … 39.6MB          2 ( 6%)        ⚡- 12.1% ±  0.1%
  cpu_cycles          750M  ± 3.77M      744M  …  756M           0 ( 0%)        ⚡-  2.1% ±  0.5%
  instructions       3.05G  ± 11.5      3.05G  … 3.05G           0 ( 0%)        ⚡-  4.5% ±  0.0%
  cache_references   2.94M  ± 99.2K     2.88M  … 3.36M           4 (12%)        ⚡- 28.7% ±  4.2%
  cache_misses       48.2K  ± 1.07K     45.6K  … 52.1K           2 ( 6%)        ⚡- 62.4% ±  0.7%
  branch_misses       890K  ± 28.8K      862K  … 1.02M           2 ( 6%)        ⚡- 21.8% ±  6.5%

Building the self-hosted compiler:

Benchmark 1 (3 runs): glibc/bin/zig build -Dno-lib -p trash
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          12.2s  ± 99.4ms    12.1s  … 12.3s           0 ( 0%)        0%
  peak_rss            975MB ± 21.7MB     951MB …  993MB          0 ( 0%)        0%
  cpu_cycles         88.7G  ± 68.3M     88.7G  … 88.8G           0 ( 0%)        0%
  instructions        188G  ± 1.40M      188G  …  188G           0 ( 0%)        0%
  cache_references   5.88G  ± 33.2M     5.84G  … 5.90G           0 ( 0%)        0%
  cache_misses        383M  ± 2.26M      381M  …  385M           0 ( 0%)        0%
  branch_misses       368M  ± 1.77M      366M  …  369M           0 ( 0%)        0%
Benchmark 2 (3 runs): SmpAllocator/fast/bin/zig build -Dno-lib -p trash
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          12.2s  ± 49.0ms    12.2s  … 12.3s           0 ( 0%)          +  0.0% ±  1.5%
  peak_rss            953MB ± 3.47MB     950MB …  957MB          0 ( 0%)          -  2.2% ±  3.6%
  cpu_cycles         88.4G  ±  165M     88.2G  … 88.6G           0 ( 0%)          -  0.4% ±  0.3%
  instructions        181G  ± 6.31M      181G  …  181G           0 ( 0%)        ⚡-  3.9% ±  0.0%
  cache_references   5.48G  ± 17.5M     5.46G  … 5.50G           0 ( 0%)        ⚡-  6.9% ±  1.0%
  cache_misses        386M  ± 1.85M      384M  …  388M           0 ( 0%)          +  0.6% ±  1.2%
  branch_misses       377M  ±  899K      377M  …  378M           0 ( 0%)        💩+  2.6% ±  0.9%

more performance data points

How to use it

Put something like this in your main function:

const std = @import("std");
const builtin = @import("builtin");
const native_os = builtin.os.tag;

var debug_allocator: std.heap.DebugAllocator(.{}) = .init;

pub fn main() !void {
    const gpa, const is_debug = gpa: {
        if (native_os == .wasi) break :gpa .{ std.heap.wasm_allocator, false };
        break :gpa switch (builtin.mode) {
            .Debug, .ReleaseSafe => .{ debug_allocator.allocator(), true },
            .ReleaseFast, .ReleaseSmall => .{ std.heap.smp_allocator, false },
        };
    };
    defer if (is_debug) {
        _ = debug_allocator.deinit();
    };
    _ = gpa; // pass gpa to the application's allocations
}

Follow-up issues

@andrewrk added the release notes label (This PR should be mentioned in the release notes.) on Feb 7, 2025.

Commit messages from the PR timeline:

  • An allocator intended to be used in -OReleaseFast mode when multi-threading is enabled; no special handling of wasi and windows is needed, since nothing more than page alignment is requested.
  • In main, this allocator is now chosen by default when compiling without libc in ReleaseFast or ReleaseSmall and not targeting WebAssembly.
  • Rotate a couple of times before resorting to mapping more memory.
  • getCpuCount: it was always returning max_cpu_count.
  • Slab length reduced to 64K; track freelist lengths with u8s; on free(), rotate if the freelist length exceeds max_freelist_len. This prevents memory leakage in the scenario where one thread only allocates and another thread only frees (a sketch follows this list).
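
A hedged sketch of that free() rule. FreeSlot, freelists, freelist_lens, and max_freelist_len are hypothetical bookkeeping names, not the PR's actual identifiers:

const size_class_count = 16; // assumed number of small size classes
const max_freelist_len = 16; // assumed cap per size class

const Node = struct { next: ?*Node };

const FreeSlot = struct {
    freelists: [size_class_count]?*Node = [1]?*Node{null} ** size_class_count,
    freelist_lens: [size_class_count]u8 = [1]u8{0} ** size_class_count, // u8s per the commit above
};

// Push a freed block onto this slot's freelist and report whether the
// caller should rotate to the next slot, so that a thread that only
// frees does not grow a freelist that no allocating thread drains.
fn pushFree(slot: *FreeSlot, class: usize, node: *Node) bool {
    node.next = slot.freelists[class];
    slot.freelists[class] = node;
    slot.freelist_lens[class] +|= 1;
    return slot.freelist_lens[class] > max_freelist_len;
}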
One review comment was left on the cpu-count caching code:

const cpu_count = @atomicLoad(u32, &global.cpu_count, .unordered);
if (cpu_count != 0) return cpu_count;
const n: u32 = @min(std.Thread.getCpuCount() catch max_thread_count, max_thread_count);
return if (@cmpxchgStrong(u32, &global.cpu_count, 0, n, .monotonic, .monotonic)) |other| other else n;
Member: could be an atomicStore, unless you expect Thread.getCpuCount() to return different results on different threads.
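
For illustration, the suggested variant might look like this. A sketch only, assuming the same surrounding global struct and max_thread_count as the snippet above; this is not the code that was merged:

fn getCpuCount() u32 {
    const cached = @atomicLoad(u32, &global.cpu_count, .unordered);
    if (cached != 0) return cached;
    const n: u32 = @min(std.Thread.getCpuCount() catch max_thread_count, max_thread_count);
    // Benign race: if every thread computes the same n, a plain atomic
    // store is enough and the cmpxchg round-trip can be dropped.
    @atomicStore(u32, &global.cpu_count, n, .unordered);
    return n;
}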

A second review comment was left on the slot-locking loop:

}
const cpu_count = getCpuCount();
assert(cpu_count != 0);
while (true) {
Member: At some point, this should probably use t.mutex.lock(); otherwise this is a spinlock. Maybe after the first for (0..cpu_count) pass of tryLocks?
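
Sketched with the same illustrative names as the lockSlot example earlier (hypothetical, not the merged code), the suggestion might look like:

fn lockSlotBlocking(cpu_count: u32) *Slot {
    const start = thread_index;
    var index = start;
    // One full pass of non-blocking attempts across all slots.
    for (0..cpu_count) |_| {
        const slot = &slots[index];
        if (slot.mutex.tryLock()) {
            thread_index = index;
            return slot;
        }
        index += 1;
        if (index >= cpu_count) index = 0;
    }
    // Every slot was contended: block on one mutex rather than spin.
    const slot = &slots[start];
    slot.mutex.lock();
    return slot;
}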
