Powered by torch-magpie: a complete PyTorch reimplementation in Rust, purpose-built as the tensor backend for the Magpie compiler. GPU-accelerated deep learning on Apple Silicon.
Modern deep learning frameworks pay dispatch and allocation overhead on every tensor operation. Torch-magpie takes a different approach: it fuses entire transformer layers into lazy MLX computation graphs submitted as single Metal command buffers, eliminating intermediate tensor allocations.
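The idea behind lazy fusion can be shown with a std-only sketch. `Op` and `eval` are illustrative names, not torch-magpie's API: building the graph allocates only nodes, and all arithmetic is deferred to a single evaluation pass, the way MLX batches deferred ops into one command buffer.

```rust
// Minimal lazy expression graph (illustrative, not torch-magpie's API).
#[derive(Clone)]
enum Op {
    Leaf(Vec<f32>),
    Add(Box<Op>, Box<Op>),
    Mul(Box<Op>, Box<Op>),
}

impl Op {
    // Nothing is computed until eval() is called; the whole expression
    // is then reduced in one pass instead of op-by-op.
    fn eval(&self) -> Vec<f32> {
        match self {
            Op::Leaf(v) => v.clone(),
            Op::Add(a, b) => a.eval().iter().zip(b.eval()).map(|(x, y)| x + y).collect(),
            Op::Mul(a, b) => a.eval().iter().zip(b.eval()).map(|(x, y)| x * y).collect(),
        }
    }
}

fn main() {
    let x = Op::Leaf(vec![1.0, 2.0]);
    let w = Op::Leaf(vec![3.0, 4.0]);
    let b = Op::Leaf(vec![0.5, 0.5]);
    // y = x * w + b, expressed as a graph; no arithmetic has run yet.
    let y = Op::Add(Box::new(Op::Mul(Box::new(x), Box::new(w))), Box::new(b));
    let out = y.eval(); // single evaluation pass
    assert_eq!(out, vec![3.5, 8.5]);
}
```

In the real backend the deferred graph is lowered to Metal rather than walked on the CPU, but the deferral-then-evaluate shape is the same.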
Key/value caches are stored directly on the GPU and live there for the duration of a generation, so autoregressive decoding never round-trips through the CPU between steps.
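The cache shape can be sketched as an append-only buffer per head. `KvCache` is a hypothetical illustration, not torch-magpie's API; in the real backend these buffers are GPU-resident, while plain `Vec`s stand in for them here.

```rust
// Hypothetical append-only KV cache (illustrative names, std-only).
struct KvCache {
    head_dim: usize,
    keys: Vec<f32>,   // flat [seq_len * head_dim], grows one row per step
    values: Vec<f32>,
}

impl KvCache {
    fn new(head_dim: usize) -> Self {
        Self { head_dim, keys: Vec::new(), values: Vec::new() }
    }

    // Each decode step appends one key/value row in place; nothing is
    // copied back to the host between steps.
    fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        self.keys.extend_from_slice(k);
        self.values.extend_from_slice(v);
    }

    fn seq_len(&self) -> usize {
        self.keys.len() / self.head_dim
    }
}

fn main() {
    let mut cache = KvCache::new(4);
    for step in 0..3 {
        let k = vec![step as f32; 4];
        let v = vec![step as f32 + 0.5; 4];
        cache.append(&k, &v);
    }
    assert_eq!(cache.seq_len(), 3);
}
```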
MLX's tiled scaled-dot-product-attention kernel makes 100k-token prefill sequences feasible; a naive attention implementation would materialize the full attention score matrix, which exceeds the memory available under PyTorch MPS.
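A back-of-envelope calculation shows why the score matrix is the bottleneck (my arithmetic, not a benchmark):

```rust
// Size of the seq_len x seq_len score matrix a naive attention kernel
// would materialize for a single head at 100k tokens.
fn main() {
    let seq_len: u64 = 100_000;
    let bytes_per_elem: u64 = 2; // fp16
    let score_matrix_bytes = seq_len * seq_len * bytes_per_elem;
    let gib = score_matrix_bytes as f64 / (1u64 << 30) as f64;
    // ~18.6 GiB per head, before Q/K/V and activations -- more than the
    // 16 GB unified memory on the benchmark machine. A tiled kernel
    // streams over blocks and never materializes this matrix.
    assert!(gib > 18.0 && gib < 19.0);
    println!("{:.1} GiB", gib);
}
```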
Benchmarked on Apple M1 Pro (10-core CPU, 16-core GPU): PyTorch 2.8.0 MPS vs torch-magpie MLX.
; Forward pass
Time per step (6 layers, hidden size 768, batch = 4).
; Generation
Time for 128 tokens.
; Decoding throughput
Tokens per second (6 layers, hidden size 768, greedy decoding).
At a 100k-token prefill, PyTorch MPS runs out of memory; torch-magpie completes the run.
Every tensor operation is routed through a dispatch chain that picks the fastest available backend for that op.
; 1. MLX
Lazy evaluation on Apple Silicon GPU via MLX library. Operations are deferred
and batched into a single Metal command buffer.
; 2. Metal
Custom Metal Shading Language kernels for operations not covered by MLX
(batched matmul, elementwise ops, softmax).
; 3. CPU
Rust implementation as final fallback ensuring full op coverage.
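The fallback logic above can be sketched in std-only Rust. The names and per-backend coverage tables here are illustrative, not torch-magpie's actual op tables: each backend reports whether it supports an op, and dispatch walks MLX → Metal → CPU, taking the first hit.

```rust
// Three-tier dispatch chain sketch (illustrative, not torch-magpie's API).
#[derive(Debug, PartialEq)]
enum Backend { Mlx, Metal, Cpu }

fn supports(backend: &Backend, op: &str) -> bool {
    match backend {
        // Hypothetical coverage tables, for illustration only.
        Backend::Mlx => matches!(op, "matmul" | "softmax" | "add"),
        Backend::Metal => matches!(op, "batched_matmul" | "softmax" | "add"),
        Backend::Cpu => true, // the Rust fallback covers every op
    }
}

fn dispatch(op: &str) -> Backend {
    // Try backends in order of preference; the CPU tier always matches.
    for backend in [Backend::Mlx, Backend::Metal, Backend::Cpu] {
        if supports(&backend, op) {
            return backend;
        }
    }
    unreachable!("CPU backend covers all ops");
}

fn main() {
    assert_eq!(dispatch("matmul"), Backend::Mlx);          // fast path
    assert_eq!(dispatch("batched_matmul"), Backend::Metal); // custom kernel
    assert_eq!(dispatch("exotic_op"), Backend::Cpu);        // fallback
}
```

Because the CPU tier accepts everything, the chain is total: no op can fail to dispatch, which is what "full op coverage" means here.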