Metal compute shader implementation #26

Open
wants to merge 2 commits into
base: main
from

Conversation

cryptocode

With OpenCL being deprecated on macOS, having a Metal implementation may eventually be necessary/convenient. Here's an initial implementation, which is opt-in through a cargo feature. I think the shader implementation is in good shape and can be reused in other language bindings. The Rust code can be improved, but it's a start.

Benchmark

Below is sample benchmark output from a 16" MacBook Pro. I get similar results with 1M threads (work items) on OpenCL and 16K threads per thread group on Metal:

$ cargo build --release --features metalsdk
$ target/release/nano-work-server -g 0:0:16384

Metal SDK device 0: AMD Radeon Pro 5500M  (selected)
Metal SDK device 1: Intel(R) UHD Graphics 630
Configured for the live network with threshold fffffff800000000
Ready to receive requests on [::1]:7076
Benchmarking 10 samples at difficulty fffffff800000000 (1x)
Threads per thread group: 1024
Benchmark finished in 6601ms , average 660ms / sample

Note:

  • The GPU index is going to be different when switching between OpenCL and Metal. Run once with -g 0:0 to get a list.

  • The Metal implementation sets the thread group count to the maximum value reported by the device; this value is printed the first time work is generated. The thread argument passed to the work server is the number of threads per thread group. For the 5500M, the sweet spot appears to be around 16384 threads per thread group.
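
To make the shape of the -g argument concrete, here is a minimal sketch of how a value like 0:0:16384 could be split into its parts. This is not the parser in nano-work-server: the GpuArg struct, the parse_gpu_arg function, and the names platform and device for the first two fields are assumptions for illustration. Only the third field's meaning on Metal (threads per thread group) comes from the note above.

```rust
/// Hypothetical parsed form of a `-g` argument such as `0:0:16384`.
#[derive(Debug, PartialEq)]
pub struct GpuArg {
    /// Assumed meaning of the first field.
    pub platform: usize,
    /// Assumed meaning of the second field.
    pub device: usize,
    /// Optional third field; on Metal, the threads per thread group.
    pub threads: Option<usize>,
}

/// Split `A:B` or `A:B:THREADS` into numbers, rejecting anything else.
pub fn parse_gpu_arg(s: &str) -> Result<GpuArg, String> {
    let parts: Vec<&str> = s.split(':').collect();
    if parts.len() < 2 || parts.len() > 3 {
        return Err(format!("expected A:B or A:B:THREADS, got {:?}", s));
    }
    // Shared helper so every field reports parse errors the same way.
    let num = |p: &str| {
        p.parse::<usize>()
            .map_err(|e| format!("bad number {:?}: {}", p, e))
    };
    let threads = match parts.get(2).copied() {
        Some(t) => Some(num(t)?),
        None => None,
    };
    Ok(GpuArg {
        platform: num(parts[0])?,
        device: num(parts[1])?,
        threads,
    })
}
```

With this sketch, parse_gpu_arg("0:0") leaves threads unset, which matches running with -g 0:0 just to list devices.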

Implementation details

  • The Metal shader is based on the OpenCL shader. The main difference is that writes to the output buffer are atomic to prevent undefined behavior (this should probably be fixed in the OpenCL implementation as well). There are also minor syntax differences, such as ulong2(expr) vs (ulong2)expr. The actual difficulty is also written to the output buffer, though the Rust work server doesn't use it yet, to keep code common with the OpenCL implementation.
  • The gpu_metal.rs implementation is structured a bit differently, as I couldn't find a way to put ctype FFI objects (like Buffer's) in the Gpu struct without running into thread safety errors from the compiler. I'm sure it's fixable; this is just an initial implementation. It means the try function does more work than in the OpenCL implementation, including recreating buffers, which will affect performance a bit.
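
The reason for the atomic output write can be illustrated on the CPU. The sketch below is an analogy in Rust, not the shader code from this PR: several threads scan nonces and the first one to find a passing value publishes it with a compare-exchange, so concurrent finders never race on the result slot. The toy_difficulty function is a made-up stand-in for the real hash, and the choice of compare-exchange (rather than a plain atomic store) is an assumption.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the hash check in the shader.
/// This is a toy mixing function, not Blake2b.
pub fn toy_difficulty(nonce: u64) -> u64 {
    nonce.wrapping_mul(0x9E37_79B9_7F4A_7C15).rotate_left(31)
}

/// Spawn `workers` threads that each scan a strided part of the nonce space
/// below `limit` and publish the first passing nonce atomically.
/// Returns 0 if no nonce passes.
pub fn search(workers: u64, limit: u64, threshold: u64) -> u64 {
    // 0 is the sentinel for "nothing found yet".
    let result = Arc::new(AtomicU64::new(0));
    let handles: Vec<_> = (0..workers)
        .map(|id| {
            let result = Arc::clone(&result);
            thread::spawn(move || {
                let mut nonce = id + 1; // skip 0, it is the sentinel
                while nonce < limit && result.load(Ordering::Relaxed) == 0 {
                    if toy_difficulty(nonce) >= threshold {
                        // Only the first writer wins; later finders
                        // leave the slot untouched.
                        let _ = result.compare_exchange(
                            0,
                            nonce,
                            Ordering::Relaxed,
                            Ordering::Relaxed,
                        );
                        break;
                    }
                    nonce += workers;
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    result.load(Ordering::Relaxed)
}
```

Without the atomic, two threads that both find a passing nonce would write the same location concurrently, which is the kind of data race the PR description refers to.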

@cryptocode cryptocode self-assigned this Nov 21, 2020
@PlasmaPower
Contributor

I couldn't find a way to put ctype FFI objects (like Buffer's) in the Gpu struct without running into thread safety errors from the compiler

Assuming these are thread safe, you can do:

unsafe impl Send for Gpu {}
unsafe impl Sync for Gpu {}

@cryptocode
Author

My understanding is that few things in Metal are specified to be thread safe.
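
If the handles cannot be assumed safe for concurrent use, a narrower option is to claim only Send for the handle and guard it with a Mutex, which makes the owning struct shareable without an unsafe impl Sync. The following is a minimal sketch under that assumption. RawHandle is a hypothetical stand-in for an FFI object and is not a type from the metal crate, and whether real Metal objects may be moved between threads would still need to be checked against Apple's documentation.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical stand-in for an FFI handle such as a Metal buffer.
/// The raw pointer makes it neither Send nor Sync by default.
pub struct RawHandle {
    ptr: *mut u64,
}

impl RawHandle {
    pub fn new(value: u64) -> Self {
        RawHandle { ptr: Box::into_raw(Box::new(value)) }
    }

    /// Increment the stored value and return it.
    pub fn bump(&mut self) -> u64 {
        // SAFETY: ptr came from Box::into_raw and is freed only in Drop.
        unsafe {
            *self.ptr += 1;
            *self.ptr
        }
    }
}

impl Drop for RawHandle {
    fn drop(&mut self) {
        // SAFETY: reclaims the allocation made in `new` exactly once.
        unsafe { drop(Box::from_raw(self.ptr)) }
    }
}

// SAFETY (assumption): the handle may move to another thread as long as
// only one thread touches it at a time. Sync is deliberately not claimed.
unsafe impl Send for RawHandle {}

/// Gpu-like struct. Mutex<T> is Sync when T is Send, so this struct can be
/// shared across threads with no further unsafe code.
pub struct Gpu {
    buffer: Mutex<RawHandle>,
}

impl Gpu {
    pub fn new() -> Self {
        Gpu { buffer: Mutex::new(RawHandle::new(0)) }
    }

    /// Reuses the long-lived handle instead of recreating it per call.
    pub fn try_once(&self) -> u64 {
        self.buffer.lock().unwrap().bump()
    }
}

/// Call `try_once` from `threads` threads, then once more and return
/// the final counter value.
pub fn run(threads: usize) -> u64 {
    let gpu = Arc::new(Gpu::new());
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let gpu = Arc::clone(&gpu);
            thread::spawn(move || {
                gpu.try_once();
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    gpu.try_once()
}
```

The trade-off is that the Mutex serializes access to the handle, which is acceptable if only one work request uses a given GPU at a time.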

2 participants