With OpenCL being deprecated on macOS, having a Metal implementation may eventually be necessary/convenient. Here's an initial implementation, which is opt-in through a cargo feature. I think the shader implementation is in good shape and can be reused in other language bindings. The Rust code can be improved, but it's a start.
Benchmark
Below is sample benchmark output from a 16" MacBook Pro. I'm getting similar results when using 1M threads (work items) on OpenCL and 16K threads per thread group on Metal:
Note:
The GPU index is going to be different when switching between OpenCL and Metal. Run once with -g 0:0 to get a list.
The Metal implementation sets the thread group count to the maximum value reported by the device. This is printed when generating work for the first time. The thread argument passed to the work server is the number of threads per thread group. For the 5500M, the sweet spot appears to be around 16384 threads (per thread group).
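To make the relationship between the two numbers concrete, here is a minimal sketch of how a thread group count and a per-group thread count combine into the total threads launched per dispatch. The `Dispatch` struct and its fields are hypothetical names for illustration, not types from this PR, and the group count of 64 used in the usage note is a made-up figure rather than a value reported by any device.

```rust
/// Hypothetical description of one GPU dispatch (illustration only).
struct Dispatch {
    /// Number of thread groups; in the Metal path this is taken from the device.
    thread_groups: u64,
    /// Threads per thread group; this is what the thread argument controls.
    threads_per_group: u64,
}

impl Dispatch {
    /// Total threads launched in one dispatch: groups times threads per group.
    fn total_threads(&self) -> u64 {
        self.thread_groups * self.threads_per_group
    }
}
```

For example, an assumed 64 thread groups with 16384 threads each would launch 1,048,576 threads per dispatch, which is in the same range as the 1M OpenCL work items mentioned in the benchmark.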
Implementation details
The gpu_metal.rs implementation is structured a bit differently, as I couldn't find a way to put ctype FFI objects (like Buffers) in the Gpu struct without running into thread safety errors from the compiler. I'm sure it's fixable; this is just an initial implementation. It means the try function does more work than in the OpenCL implementation, including recreating buffers, which will affect performance a bit.