Metal compute shader implementation #26

cryptocode · 2020-11-21T21:31:15Z

With OpenCL being deprecated on macOS, having a Metal implementation may eventually be necessary/convenient. Here's an initial implementation, which is opt-in through a cargo feature. I think the shader implementation is in good shape and can be reused in other language bindings. The Rust code can be improved, but it's a start.

Benchmark

Below is sample benchmark output from a 16" MacBook Pro. I get similar results with 1M threads (work items) on OpenCL and 16K threads per thread group on Metal:

$ cargo build --release --features metalsdk
$ target/release/nano-work-server -g 0:0:16384

Metal SDK device 0: AMD Radeon Pro 5500M  (selected)
Metal SDK device 1: Intel(R) UHD Graphics 630
Configured for the live network with threshold fffffff800000000
Ready to receive requests on [::1]:7076
Benchmarking 10 samples at difficulty fffffff800000000 (1x)
Threads per thread group: 1024
Benchmark finished in 6601ms , average 660ms / sample
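
For readers unfamiliar with the "(1x)" in the output above: it presumably denotes the difficulty multiplier relative to the base threshold. A minimal sketch of that relationship, assuming the multiplier formula from the Nano documentation, (2^64 - base) / (2^64 - difficulty), is shown here. The function name is hypothetical and is not taken from the work server's source.

```rust
/// Hypothetical helper (not from the work server): computes the difficulty
/// multiplier of `difficulty` relative to `base`, assuming the formula
/// (2^64 - base) / (2^64 - difficulty).
pub fn difficulty_multiplier(base: u64, difficulty: u64) -> f64 {
    // 0 - x with wrapping arithmetic yields 2^64 - x for any non-zero x.
    let base_distance = 0u64.wrapping_sub(base) as f64;
    let difficulty_distance = 0u64.wrapping_sub(difficulty) as f64;
    base_distance / difficulty_distance
}

/// Parses a 16-digit hex threshold such as "fffffff800000000".
pub fn parse_threshold(hex: &str) -> Option<u64> {
    u64::from_str_radix(hex, 16).ok()
}
```

Under this assumption, a difficulty equal to the threshold gives a multiplier of 1, and a lower threshold gives a multiplier below 1.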

Note:

The GPU index is going to be different when switching between OpenCL and Metal. Run once with -g 0:0 to get a list.
The Metal implementation sets the thread group count to the maximum value reported by the device. This is printed when generating work for the first time. The thread argument passed to the work server is the number of threads per thread group. For the 5500M, the sweet spot appears to be around 16384 threads per thread group.

Implementation details

The Metal shader is based on the OpenCL shader. The main differences are that writes to the output buffer are atomic to prevent undefined behavior (this should probably be fixed in the OpenCL implementation as well), plus minor syntax details such as ulong2(expr) vs (ulong2)expr. The actual difficulty is also written to the output buffer, though the Rust work server doesn't use it yet, to keep the code common with the OpenCL implementation.
The gpu_metal.rs implementation is structured a bit differently, as I couldn't find a way to put ctype FFI objects (like Buffers) in the Gpu struct without running into thread safety errors from the compiler. I'm sure it's fixable; this is just an initial implementation. It means the try function does more work than in the OpenCL implementation, including recreating buffers, which will affect performance a bit.
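
To illustrate why the output write is atomic: many GPU threads may find a valid result at the same time, and plain concurrent stores to the same location would race. The sketch below is a CPU-side analogy in Rust, not the shader code from this PR. It assumes one possible policy (keep the highest difficulty seen so far) and uses a compare-and-swap loop to enforce it. The function name and the policy are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical helper: stores `candidate` into `best` only if it is higher
/// than the value currently stored. Returns true if this call updated `best`.
pub fn record_if_better(best: &AtomicU64, candidate: u64) -> bool {
    let mut current = best.load(Ordering::Relaxed);
    while candidate > current {
        // The swap only succeeds if `best` still holds `current`, so a
        // concurrent writer can never be silently overwritten.
        match best.compare_exchange_weak(
            current,
            candidate,
            Ordering::AcqRel,
            Ordering::Relaxed,
        ) {
            Ok(_) => return true,
            // Another writer got in first (or the weak CAS failed spuriously):
            // re-check against the freshly observed value.
            Err(observed) => current = observed,
        }
    }
    false
}
```

The same pattern applies when the slot holds a nonce rather than a difficulty; only the comparison in the loop condition would change.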

src/work.metal

src/gpu_metal.rs

PlasmaPower · 2020-11-21T21:52:48Z

> I couldn't find a way to put ctype FFI objects (like Buffers) in the Gpu struct without running into thread safety errors from the compiler

Assuming these are thread safe, you can do:

unsafe impl Send for Gpu {}
unsafe impl Sync for Gpu {}

cryptocode · 2020-11-21T22:22:37Z

My understanding is that few things in Metal are specified to be thread safe.

Metal shader implementation

900cb28

cryptocode self-assigned this Nov 21, 2020

PlasmaPower reviewed Nov 21, 2020

Implement PlasmaPower's CAS loop suggestion

8a64aaa

PlasmaPower approved these changes Nov 21, 2020
