Powered by torch-magpie: a complete PyTorch reimplementation in Rust, purpose-built as the tensor backend for the Magpie compiler. GPU-accelerated deep learning on Apple Silicon.
Modern deep learning frameworks pay dispatch and allocation overhead on every tensor operation. Torch-magpie takes a different approach: it fuses entire transformer layers into lazy MLX computation graphs submitted as single Metal command buffers, eliminating intermediate tensor allocations.
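The idea behind lazy fusion can be shown with a std-only sketch. `Op` and `eval` are illustrative names, not torch-magpie's API: building the graph allocates only nodes, and all arithmetic is deferred to a single evaluation pass, the way MLX batches deferred ops into one command buffer.

```rust
// Minimal lazy expression graph (illustrative, not torch-magpie's API).
#[derive(Clone)]
enum Op {
    Leaf(Vec<f32>),
    Add(Box<Op>, Box<Op>),
    Mul(Box<Op>, Box<Op>),
}

impl Op {
    // Nothing is computed until eval() is called; the whole expression
    // is then reduced in one pass instead of op-by-op.
    fn eval(&self) -> Vec<f32> {
        match self {
            Op::Leaf(v) => v.clone(),
            Op::Add(a, b) => a.eval().iter().zip(b.eval()).map(|(x, y)| x + y).collect(),
            Op::Mul(a, b) => a.eval().iter().zip(b.eval()).map(|(x, y)| x * y).collect(),
        }
    }
}

fn main() {
    let x = Op::Leaf(vec![1.0, 2.0]);
    let w = Op::Leaf(vec![3.0, 4.0]);
    let b = Op::Leaf(vec![0.5, 0.5]);
    // y = x * w + b, expressed as a graph; no arithmetic has run yet.
    let y = Op::Add(Box::new(Op::Mul(Box::new(x), Box::new(w))), Box::new(b));
    let out = y.eval(); // single evaluation pass
    assert_eq!(out, vec![3.5, 8.5]);
}
```

In the real backend the deferred graph is lowered to Metal rather than walked on the CPU, but the deferral-then-evaluate shape is the same.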
Key/value caches are stored directly on the GPU and live there for the duration of a generation, so autoregressive decoding never round-trips through the CPU between steps.
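The cache shape can be sketched as an append-only buffer per head. `KvCache` is a hypothetical illustration, not torch-magpie's API; in the real backend these buffers are GPU-resident, while plain `Vec`s stand in for them here.

```rust
// Hypothetical append-only KV cache (illustrative names, std-only).
struct KvCache {
    head_dim: usize,
    keys: Vec<f32>,   // flat [seq_len * head_dim], grows one row per step
    values: Vec<f32>,
}

impl KvCache {
    fn new(head_dim: usize) -> Self {
        Self { head_dim, keys: Vec::new(), values: Vec::new() }
    }

    // Each decode step appends one key/value row in place; nothing is
    // copied back to the host between steps.
    fn append(&mut self, k: &[f32], v: &[f32]) {
        assert_eq!(k.len(), self.head_dim);
        assert_eq!(v.len(), self.head_dim);
        self.keys.extend_from_slice(k);
        self.values.extend_from_slice(v);
    }

    fn seq_len(&self) -> usize {
        self.keys.len() / self.head_dim
    }
}

fn main() {
    let mut cache = KvCache::new(4);
    for step in 0..3 {
        let k = vec![step as f32; 4];
        let v = vec![step as f32 + 0.5; 4];
        cache.append(&k, &v);
    }
    assert_eq!(cache.seq_len(), 3);
}
```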
MLX's tiled scaled-dot-product-attention kernel makes 100k-token prefill sequences feasible; a naive attention implementation would materialize the full attention score matrix, which exceeds the memory available under PyTorch MPS.
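A back-of-envelope calculation shows why the score matrix is the bottleneck (my arithmetic, not a benchmark):

```rust
// Size of the seq_len x seq_len score matrix a naive attention kernel
// would materialize for a single head at 100k tokens.
fn main() {
    let seq_len: u64 = 100_000;
    let bytes_per_elem: u64 = 2; // fp16
    let score_matrix_bytes = seq_len * seq_len * bytes_per_elem;
    let gib = score_matrix_bytes as f64 / (1u64 << 30) as f64;
    // ~18.6 GiB per head, before Q/K/V and activations -- more than the
    // 16 GB unified memory on the benchmark machine. A tiled kernel
    // streams over blocks and never materializes this matrix.
    assert!(gib > 18.0 && gib < 19.0);
    println!("{:.1} GiB", gib);
}
```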
Benchmarked on Apple M1 Pro (10-core CPU, 16-core GPU): PyTorch 2.8.0 MPS vs torch-magpie MLX.
; Forward pass
Time per step (6 layers, hidden size 768, batch = 4).
; Generation
Time for 128 tokens.
; Decoding throughput
Tokens per second (6 layers, hidden size 768, greedy decoding).
At a 100k-token prefill, PyTorch MPS runs out of memory; torch-magpie completes the run.
Every tensor operation is routed through a dispatch chain that picks the fastest available backend for that op.
; 1. MLX
Lazy evaluation on Apple Silicon GPU via MLX library. Operations are deferred
and batched into a single Metal command buffer.
; 2. Metal
Custom Metal Shading Language kernels for operations not covered by MLX
(batched matmul, elementwise ops, softmax).
; 3. CPU
Rust implementation as final fallback ensuring full op coverage.
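The fallback logic above can be sketched in std-only Rust. The names and per-backend coverage tables here are illustrative, not torch-magpie's actual op tables: each backend reports whether it supports an op, and dispatch walks MLX → Metal → CPU, taking the first hit.

```rust
// Three-tier dispatch chain sketch (illustrative, not torch-magpie's API).
#[derive(Debug, PartialEq)]
enum Backend { Mlx, Metal, Cpu }

fn supports(backend: &Backend, op: &str) -> bool {
    match backend {
        // Hypothetical coverage tables, for illustration only.
        Backend::Mlx => matches!(op, "matmul" | "softmax" | "add"),
        Backend::Metal => matches!(op, "batched_matmul" | "softmax" | "add"),
        Backend::Cpu => true, // the Rust fallback covers every op
    }
}

fn dispatch(op: &str) -> Backend {
    // Try backends in order of preference; the CPU tier always matches.
    for backend in [Backend::Mlx, Backend::Metal, Backend::Cpu] {
        if supports(&backend, op) {
            return backend;
        }
    }
    unreachable!("CPU backend covers all ops");
}

fn main() {
    assert_eq!(dispatch("matmul"), Backend::Mlx);          // fast path
    assert_eq!(dispatch("batched_matmul"), Backend::Metal); // custom kernel
    assert_eq!(dispatch("exotic_op"), Backend::Cpu);        // fallback
}
```

Because the CPU tier accepts everything, the chain is total: no op can fail to dispatch, which is what "full op coverage" means here.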