diff --git a/post/2024-10-02-metal-1.4.md b/post/2024-10-02-metal-1.4.md
new file mode 100644
index 0000000..5813510
--- /dev/null
+++ b/post/2024-10-02-metal-1.4.md
@@ -0,0 +1,57 @@
++++
+title = "Metal.jl 1.4: Metal.rand"
+author = "Christian Guinard"
+abstract = """
+  Metal.jl 1.4 adds higher-quality on-device random number generation from Metal Performance
+  Shaders. Some limitations apply, with fallback to the previously-existing rand
+  implementation in those situations."""
++++
+{{abstract}}
+
+## Metal.rand and friends
+
+Metal.jl now provides improved on-GPU random number generation, built on functionality
+from the Metal Performance Shaders. Uniformly distributed values from `Metal.rand` (and its
+in-place variant `Metal.rand!`) are available for all Metal-supported integer types and Float32.
+However, due to Metal [API](https://developer.apple.com/documentation/metal/mtlblitcommandencoder/1400767-copyfrombuffer?language=objc)
+limitations, arrays of 8-bit and 16-bit integers may fall back to the lower-quality GPUArrays.jl
+random numbers if their size in bytes is not a multiple of 4. Normally distributed Float32 values
+can be generated with `Metal.randn` and `Metal.randn!`. Float16 is not supported by the Metal
+Performance Shaders RNG, and will always fall back to the GPUArrays implementation.
+
+The easiest way to use these is through the Metal convenience functions `Metal.rand[n][!]`,
+called as you would call the usual functions. However, the regular Random.jl methods can also
+be used by providing the appropriate `RNG`, either from `MPS.default_rng()` or `MPS.RNG()`, to
+the standard `Random.rand[n][!]` functions.
+
+
+
+## Other improvements since the last blog post
+
+- Since v0.5: `MtlArray` storage mode has been parameterized, allowing one to create a shared storage `MtlArray`
+  by calling `MtlArray{eltype, ndims, Metal.SharedStorage}(...)`.
+- Since v0.3: MPS-accelerated decompositions were added.
+- Various performance improvements.
+- *Many* bug fixes.
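+
+As a rough illustration of the RNG functions and the storage-mode parameter described above,
+the sketch below assumes a Mac with a Metal-capable GPU and Metal.jl 1.4 installed. The array
+sizes and variable names are arbitrary, and reaching the `MPS` module as `Metal.MPS` is an
+assumption about how it is exposed:
+
+```julia
+using Metal, Random
+
+# Convenience functions: uniform and normal Float32 values generated on the GPU.
+a = Metal.rand(Float32, 1024)
+b = Metal.randn(Float32, 1024)
+
+# In-place variant, refilling an existing MtlArray.
+Metal.rand!(a)
+
+# Standard Random.jl entry points, driven by the MPS-backed RNG.
+rng = Metal.MPS.default_rng()
+Random.rand!(rng, a)
+Random.randn!(rng, b)
+
+# Shared-storage array, using the parameterized storage mode (contents uninitialized).
+s = MtlArray{Float32, 1, Metal.SharedStorage}(undef, 1024)
+```
+
+Float32 arrays always have a byte size that is a multiple of 4, so this sketch should not hit
+the GPUArrays.jl fallback mentioned earlier for small integer types.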
+
+
+## Future work
+
+Although Metal.jl is now in v1, there is still work to be done to make it as fast and
+feature-complete as possible. In particular:
+
+- since the last blog post, Metal.jl has started using native ObjectiveC FFI for wrapping
+  Metal APIs. However, these wrappers have to be written manually for every piece of
+  Objective-C code. We are looking for help improving Clang.jl and ObjectiveC.jl to enable
+  the automatic generation of these wrappers. See tracking [issue](https://github.com/JuliaInterop/ObjectiveC.jl/issues/41);
+- the MPS wrappers are incomplete, and automatic wrapper generation would greatly help with
+  full MPS support;
+- support for atomic operations, which is required to implement a full-featured
+  KernelAbstractions.jl back-end, is missing. See tracking [issue](https://github.com/JuliaGPU/Metal.jl/issues/218);
+- full support for BFloat16 values, which have been supported since Metal 3.1 (macOS 14),
+  is not yet in Metal.jl. See tracking [issue](https://github.com/JuliaGPU/Metal.jl/issues/298);
+- some functionality present in CUDA.jl could be ported to Metal.jl to improve usability.
+  See tracking [issue](https://github.com/JuliaGPU/Metal.jl/issues/443);
+- general performance improvements. In particular, improvements to the ObjectiveC.jl type model
+  could greatly reduce the number of allocations currently necessary for every
+  Objective-C/Metal operation. See tracking [issue](https://github.com/JuliaInterop/ObjectiveC.jl/issues/13).