Pushing Swift ML on Apple Silicon

Developers are demonstrating that native Swift can be tuned to deliver high-performance matrix multiplication for LLM training on Apple Silicon. A detailed write-up walks through a stepwise port of Andrej Karpathy’s minimal llm.c into Swift, with hands-on optimizations across the CPU, SIMD, AMX and GPU (including a Metal example) that aim to push kernels from gigaflops to teraflops while measuring the cost of a full forward/backward iteration. The work highlights hardware- and stack-level trade-offs and suggests that plain Swift can approach C-level performance without external libraries. Complementary tooling such as omlx, an Apple Silicon-focused LLM inference server offering continuous batching and SSD caching, points to a growing ecosystem for native Mac ML workloads.

Latest Changes

Detailed write-up shows stepwise Swift port of llm.c with hands-on CPU, SIMD, AMX and GPU optimizations

Benchmarks report moving matrix multiply kernels from gigaflops to teraflops in Swift on Apple Silicon

Emergence of omlx, an Apple Silicon-focused LLM inference server with continuous batching and SSD caching

User report shows MTP did not improve throughput on an M2 Max with 96 GB for two 27B models (about 9 tokens/sec with MTP versus ~12 tokens/sec without)

Timeline

2026-05-11 — Developer publishes first installment on training an LLM in Swift focused on maximizing matrix multiply performance

2026-05-11 — Another report documents hand-optimized Swift kernels for GPT-like LLM training on Apple Silicon

2026-05-11 — omlx announced as an LLM inference server for Apple Silicon with continuous batching and SSD caching

2026-05-19 — User reports MTP provided no throughput improvement on an M2 Max when testing froggeric and unsloth 27B models

Recent News (4)

MTP and Apple Silicon, any benefits ?

A user reports that MTP (multi-token prediction, used here as a form of speculative decoding) did not improve throughput on Apple Silicon: on an M2 Max with 96 GB they tested froggeric and unsloth 27B models and observed about 9 tokens/sec with MTP versus ~12 tokens/sec without it. They tried different spec/draft/n/max settings (2, 3, 6) and note a high acceptance rate (>70%), and ask why MTP underperforms. This matters to developers and researchers optimizing LLM inference on Apple Silicon because it questions whether MTP yields throughput gains on local macOS ARM setups, and it points to possible configuration, model, or implementation issues. Troubleshooting should consider how the model was built and converted, the backend and model format (ggml, GGUF), threading, quantization, and memory bandwidth constraints on the M2 Max.
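One way to see how a high acceptance rate can coexist with lower throughput is a back-of-the-envelope cost model. The sketch below is not taken from the report. It assumes each drafted token is accepted independently with the same probability, and it introduces two hypothetical parameters, `draftCost` and `verifyCost`, expressed relative to the cost of one ordinary decode step. The function names are illustrative only.

```swift
import Foundation

/// Expected tokens emitted per verification step when each of `draftLen`
/// drafted tokens is accepted independently with probability `acceptRate`
/// (a simplifying assumption): 1 + a + a^2 + ... + a^draftLen.
func expectedTokensPerStep(acceptRate: Double, draftLen: Int) -> Double {
    precondition(acceptRate >= 0 && acceptRate < 1 && draftLen >= 0)
    return (1 - pow(acceptRate, Double(draftLen + 1))) / (1 - acceptRate)
}

/// Estimated speedup over plain one-token-at-a-time decoding.
/// `draftCost`: assumed cost of drafting one token, relative to a normal decode step.
/// `verifyCost`: assumed cost of one verification pass, relative to a normal decode step.
func estimatedSpeedup(acceptRate: Double, draftLen: Int,
                      draftCost: Double, verifyCost: Double) -> Double {
    let tokens = expectedTokensPerStep(acceptRate: acceptRate, draftLen: draftLen)
    let cost = Double(draftLen) * draftCost + verifyCost
    return tokens / cost
}

// Hypothetical inputs: 70% acceptance, 3 drafted tokens, cheap drafting,
// and increasingly expensive verification passes.
for verifyCost in [1.0, 2.0, 3.5] {
    let speedup = estimatedSpeedup(acceptRate: 0.7, draftLen: 3,
                                   draftCost: 0.1, verifyCost: verifyCost)
    let formatted = String(format: "%.2f", speedup)
    print("verifyCost \(verifyCost): estimated speedup \(formatted)x")
}
```

Under these assumptions the estimate falls below 1.0 once a verification pass costs a few times more than a single decode step, so acceptance rate alone does not determine whether MTP helps. The actual per-step costs on an M2 Max would have to be measured.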

src_reddit_llm · /u/arkham00 · 1d ago

Training an LLM in Swift, Part 1: Taking matrix mult from Gflop/s to Tflop/s

A Swift developer reimplemented Andrej Karpathy’s llm.c to train a GPT-2-like LLM on Apple Silicon and devoted this first installment to squeezing maximal matrix-multiply performance out of Swift without external libraries. Starting from a very slow naive port, the author explores step-by-step optimizations across Apple Silicon compute units (CPU, SIMD, AMX and GPU), aiming to move kernels from gigaflops to teraflops and measuring the cost of a full forward/backward iteration. The piece promises future articles comparing Apple’s ML frameworks and shows hands-on, low-level techniques (including a Metal example) for making plain Swift code competitive with C for ML workloads. It matters because it demonstrates practical paths for native Swift ML training on Macs and reveals hardware/stack trade-offs for developers.
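For orientation, here is a minimal sketch of the kind of naive starting point such a port typically begins from, together with the usual way of turning a timing into Gflop/s. This is not the author's code. The function names, the row-major layout, and the 64×64 test size are assumptions chosen for illustration.

```swift
import Foundation

/// Naive row-major matrix multiply: C (m x n) = A (m x k) * B (k x n).
/// Illustrative baseline only, with no tiling, SIMD, AMX or GPU use.
func naiveMatmul(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n, "shape mismatch")
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

/// A matmul performs one multiply and one add per (i, j, p) triple,
/// so the conventional operation count is 2 * m * k * n.
func gflops(m: Int, k: Int, n: Int, seconds: Double) -> Double {
    return 2.0 * Double(m) * Double(k) * Double(n) / seconds / 1e9
}

// Small hypothetical problem size so the example finishes quickly.
let (m, k, n) = (64, 64, 64)
let a = [Float](repeating: 1, count: m * k)
let b = [Float](repeating: 2, count: k * n)

let start = DispatchTime.now().uptimeNanoseconds
let c = naiveMatmul(a, b, m: m, k: k, n: n)
let elapsed = Double(DispatchTime.now().uptimeNanoseconds - start) / 1e9

print("c[0] =", c[0])
print("approx Gflop/s:", gflops(m: m, k: k, n: n, seconds: elapsed))
```

A single timing of a small matrix is noisy, so a real benchmark would use an optimized build, larger sizes, and repeated runs. The sketch only shows where the Gflop/s figure comes from.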

18pts

Zelizdw
