Developers are demonstrating that native Swift can be tuned to deliver high-performance matrix multiplication for LLM training on Apple Silicon. A detailed write-up shows a stepwise port of Andrej Karpathy’s minimal llm.c into Swift, with hands-on optimizations across CPU, SIMD, AMX and GPU (including Metal) to push kernels from gigaflops to teraflops and measure full forward/backward iteration costs. The work highlights hardware- and stack-level trade-offs, suggesting Swift can approach C-level performance without external libraries. Complementary tooling like omlx—an Apple Silicon-focused LLM inference server offering continuous batching and SSD caching—underscores a growing ecosystem for native Mac ML workloads.
The work demonstrates that Swift can be tuned to deliver near-C performance for ML workloads on Apple Silicon, enabling native training and inference without relying on external libraries. This affects deployment choices, toolchains, and hardware utilization strategies for macOS-focused ML development.
Dossier last updated: 2026-05-19 14:35:02
A user reports that MTP (multi-token prediction, used here as a form of speculative decoding) did not improve throughput on Apple Silicon: on an M2 Max with 96 GB, 27B models from froggeric and unsloth ran at about 9 tokens/sec with MTP versus ~12 tokens/sec without it. They tried different spec/draft/n/max settings (2, 3, 6) and saw a high acceptance rate (>70%), and ask why MTP underperforms. This matters to developers and researchers optimizing LLM inference on Apple Silicon because it questions whether MTP yields latency or throughput gains on local macOS ARM setups and points to possible configuration, model, or implementation issues. Troubleshooting should consider model compilation, the backend and model format (ggml, GGUF), threading, quantization, and memory bandwidth constraints on the M2 Max.
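One way to reason about the report above is a back-of-the-envelope model of draft-and-verify decoding. The sketch below is not taken from the user's setup or from any particular inference engine: it assumes each drafted token is accepted independently with the same probability, which is an idealization, and the function names and the `passCost` parameter are hypothetical. Under that assumption, a draft of length n with acceptance probability a yields (1 - a^(n+1)) / (1 - a) tokens per verification pass on average.

```swift
import Foundation

// Expected tokens emitted per draft-and-verify pass.
// Assumption: every drafted token is accepted independently with
// probability `acceptance`; real acceptance is usually position-dependent.
func expectedTokensPerPass(acceptance: Double, draftLength: Int) -> Double {
    if acceptance >= 1.0 {
        // Every draft token is accepted, plus the one token the verifier adds.
        return Double(draftLength + 1)
    }
    // Geometric series: 1 + a + a^2 + ... + a^draftLength
    return (1.0 - pow(acceptance, Double(draftLength + 1))) / (1.0 - acceptance)
}

// Throughput relative to plain one-token-at-a-time decoding.
// `passCost` is a hypothetical input: the time for one draft-and-verify
// pass, measured in units of a single ordinary decode step.
func relativeThroughput(acceptance: Double, draftLength: Int, passCost: Double) -> Double {
    return expectedTokensPerPass(acceptance: acceptance, draftLength: draftLength) / passCost
}
```

With an acceptance probability of 0.7 and a draft length of 2, this model gives about 2.19 tokens per pass, so drafting would only pay off if one pass costs less than roughly 2.19 ordinary decode steps. Measuring the actual pass cost on the target machine would show whether that condition holds.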
A Swift developer reimplemented Andrej Karpathy’s llm.c to train a GPT-2-like LLM on Apple Silicon and focused this first installment on squeezing maximal matrix-multiply performance from Swift without libraries. Starting from a very slow naive port, the author explores step-by-step optimizations across Apple Silicon compute units (CPU, SIMD, AMX and GPU), aiming to move kernels from gigaflops to teraflops and measuring full forward/backward iteration costs. The piece promises future articles comparing Apple’s ML frameworks and shows hands-on, low-level techniques (including a Metal example) for getting plain Swift code competitive with C for ML workloads. It matters because it demonstrates practical paths for native Swift ML training on Macs and reveals hardware/stack trade-offs for developers.
A developer documents efforts to hand-optimize matrix multiplication and other kernels in Swift to accelerate training a GPT-like LLM on Apple Silicon, aiming to move performance from gigaflops to teraflops. Using Andrej Karpathy’s minimal llm.c as the reference model, the author rewrote it in Swift and iteratively tuned CPU, SIMD, AMX and GPU paths without third-party libraries, measuring full forward and backward training iteration costs. The piece previews further articles comparing Apple’s ML frameworks and shows a Metal GPU implementation, emphasizing low-level optimization trade-offs and practical performance on Mac hardware. It matters because it exposes how much performance can be gained by language- and platform-specific tuning versus relying on existing ML libraries.
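To make the "naive port, then tune" progression in the summaries above concrete, here is a minimal sketch of a starting point. It is not the author's code: the function names are hypothetical, and the loop reordering shown is just one common first CPU-side step, not necessarily the one the write-up takes. It assumes row-major `Float` matrices stored in flat arrays, and it counts a matrix multiply as 2·m·k·n floating-point operations (one multiply and one add per inner step).

```swift
import Foundation

// Naive row-major matmul: C (m x n) = A (m x k) * B (k x n).
// The inner loop strides through B by n elements, which is cache-unfriendly.
func matmulNaive(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for j in 0..<n {
            var sum: Float = 0
            for p in 0..<k {
                sum += a[i * k + p] * b[p * n + j]
            }
            c[i * n + j] = sum
        }
    }
    return c
}

// Same arithmetic with the loops reordered to (i, p, j), so the inner loop
// walks contiguous rows of B and C. This gives the compiler a better chance
// to vectorize; how much it helps depends on build settings and hardware.
func matmulReordered(_ a: [Float], _ b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    var c = [Float](repeating: 0, count: m * n)
    for i in 0..<m {
        for p in 0..<k {
            let aip = a[i * k + p]
            for j in 0..<n {
                c[i * n + j] += aip * b[p * n + j]
            }
        }
    }
    return c
}

// Converts a measured wall-clock time into GFLOP/s for an m x k by k x n multiply.
func gflops(m: Int, k: Int, n: Int, seconds: Double) -> Double {
    return 2.0 * Double(m) * Double(k) * Double(n) / seconds / 1e9
}
```

Timing each function on the same inputs and passing the elapsed seconds to `gflops` gives a baseline figure to compare against SIMD, AMX, or Metal variants. An optimized build (for example `swiftc -O`) matters here, since unoptimized Swift array code is far slower.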
omlx: LLM inference server with continuous batching & SSD caching for Apple Silicon — managed from the macOS menu bar