The llama.cpp ecosystem is rapidly expanding its model compatibility and performance features, adding support for the step3-vl-10b vision-language checkpoint while MTP (Multi-Token Prediction) support enters beta. Community contributors continue to upstream model-specific PRs (e.g., Mimo v2.5, step3-vl-10b) and share GGUF builds and Docker images for Gemma4 and Qwen variants to ease local deployment. Parallel work on MTP (both native and via unmerged patches) and on quantized GGUF builds is delivering substantial throughput gains on CPUs and modest GPUs, making on-device multimodal and large-context inference faster and more accessible for developers and hobbyists.
A developer reports getting MTP (Multi-Token Prediction) working together with TurboQuant TBQ4_0 (a KV-cache quantization the author describes as lossless at 4.25 bpv) on Qwen3.6-27B, achieving 80–87 tokens/sec and a 262K-token context on a single RTX 4090 after optimization. Initial runs managed ~43 t/s; optimizations and an MTP draft-acceptance rate of ~73% raised throughput significantly. The work shows the practical feasibility of large-context inference for a 27B LLM on consumer GPUs using TBQ4_0 and paging techniques, which matters for lowering hardware requirements and enabling long-context applications. It is relevant to researchers and engineers exploring efficient inference, quantized KV caches, and model runtime engineering.
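The reported ~73% draft acceptance maps onto the standard speculative-decoding throughput model. A hedged sketch follows; the per-token acceptance probability and draft length are illustrative assumptions, not figures from the post:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model pass when k drafted
    tokens are verified, each accepted independently with probability p.
    Each run of i accepted drafts is followed by one token the target
    model produces itself, giving sum_{i=0}^{k} p**i, i.e. the closed
    form (1 - p**(k+1)) / (1 - p)."""
    return sum(p ** i for i in range(k + 1))
```

With one draft token per pass and p = 0.73, the formula gives 1.73 tokens per target pass, which would lift a ~43 t/s baseline into the neighborhood of the reported range; the real gain also depends on the overhead of producing the drafts.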
A developer recounts a failed overnight job on a hosted API and argues for local models as reliable long-running “marathon” engines. After a remote service outage froze a scrape-and-summarize agent, they shifted to running Gemma 4 (31B) locally and found a sweet spot: models that fit on consumer GPUs and run offline without quotas or downtime. The piece contrasts cloud “sprint” uses—high-precision, compute-heavy iterations—with local setups optimized for endurance and continuous work. It highlights recent advances, notably Gemma 4 and Multi-Token Prediction (MTP), which boost local throughput so models can process more tasks overnight without burning out. This matters for developers needing resilient, cost-effective uninterrupted processing.
A developer reports disappointment with Qwen 3.6's coding assistance while migrating from Codex. They are using a midsize stack (Kotlin Android app, Rust backend, Postgres) and have tried feeding well-documented features into a local setup combining llama.cpp, Opencode, and Qwen 3.6 (27B/35B, Q4_K_M, 128K context) with tooling for rules, skills, multi-code projects, and code indexing. The user describes reliability and quality issues: hallucinations, incorrect or non-compilable code, poor handling of medium-complexity tasks, and failure to follow provided constraints. They note occasional useful snippets but overall regression versus expectations, highlighting limits of current large open models for dependable software engineering at scale.
A community implementation of Multi-Token Prediction (MTP) for llama.cpp reportedly speeds up Gemma 4 inference by about 40%. Posted on the LocalLLaMA subreddit, the patch adapts MTP (predicting multiple tokens per forward pass) to the popular C++ runtime, improving throughput without changing model weights. Key players include the llama.cpp project and the Gemma 4 model; contributors are community developers sharing code and benchmarks. This matters because MTP boosts performance on CPU and lightweight deployments, lowering latency and compute costs for local AI inference and enabling better UX on edge devices. If adopted upstream, it could become a practical optimization for many open-source LLM runtimes and apps.
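The verification step that lets such a patch speed up decoding without changing outputs can be sketched in a few lines. This is an illustrative toy, not the patch's actual code: `target_next` stands in for a full greedy forward pass, and real MTP heads draft tokens from the model's own hidden states rather than from a separate model.

```python
def verify_drafts(target_next, context, drafts):
    """Greedy draft verification: keep each drafted token that matches
    what the target model would have produced at that position; on the
    first mismatch, discard the rest and emit the target's own token.
    If every draft is accepted, the target pass still yields one bonus
    token, so a step always makes progress."""
    accepted = []
    for tok in drafts:
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            return accepted + [expected]  # correction token from the target
    return accepted + [target_next(context + accepted)]  # bonus token
```

Because mismatched drafts are replaced by the target model's token, the emitted sequence is identical to plain greedy decoding; only the number of target passes changes.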
A community contributor uploaded a GGUF build of the nvidia/Gemma-4-26B-A4B-NVFP4 large language model and provided a companion Docker image (catlilface/llama.cpp:gemma4_26b_nvfp4) because the main llama.cpp branch does not currently support running it. The author warns of limited testing, having only an NVIDIA RTX 5070 Ti, and invites feedback on performance and compatibility. This matters for developers and researchers wanting to run Gemma-4 variants locally or on GPUs: GGUF is a portable format and the Docker image simplifies setup, while compatibility gaps in llama.cpp could affect adoption and reproducibility.
Developers running large local models (e.g., Qwen 3.6) and agent frameworks (llama.cpp, OpenCode, Pi, and essentially all agent stacks) are struggling with context compaction, cache validation, and token/response consistency when stitching multi-turn history across systems. The article details practical setups (model command lines, server ports) and highlights issues around prompt/template propagation, preserving thinking states, and ensuring cache keys reflect real context to avoid stale outputs. It stresses why accurate cache invalidation and deterministic compaction matter for correctness, latency, and safety when agents share histories or rely on partial context. These are core engineering problems for deploying LLM-based agents reliably at scale in research and production.
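One concrete way to make cache keys "reflect real context" is to hash everything that shapes the rendered prompt, not just the visible message text. A minimal sketch follows; the field names are assumptions for illustration, not any framework's actual API:

```python
import hashlib
import json

def cache_key(template: str, history: list, params: dict) -> str:
    """Derive a cache key from every input that shapes the model's
    output: the chat template, the full message history, and the
    sampling parameters. Any change -- including a template or
    parameter edit that leaves the visible text identical -- produces
    a different key, so stale cached responses cannot be served."""
    payload = json.dumps(
        {"template": template, "history": history, "params": params},
        sort_keys=True,          # deterministic field order
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Hashing the canonical JSON of all three components makes invalidation automatic: compacting history, swapping templates, or retuning samplers each yields a new key rather than a silently reused response.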
A pull request was submitted to the ggml-org/llama.cpp repository to add support for the Mimo v2.5 model, expanding the local LLaMA-compatible model ecosystem. The change, contributed by user AesSedai, integrates Mimo v2.5 model specifics into the llama.cpp codebase, enabling inference and compatibility for users running models locally with GGML optimizations. This matters because llama.cpp is a widely used C++ library for running LLMs efficiently on consumer hardware; adding Mimo v2.5 broadens model options for developers, researchers, and hobbyists seeking performant on-device language models without relying on cloud APIs. The update can influence local deployment workflows and tooling in the open-source LLM community.
A community model release named Qwen3.6-27B-uncensored-heretic-v2 Native MTP Preserved is now available on Hugging Face in safetensors, GGUF and NVFP4 formats. The fork preserves all 15 native MTP (Multi-Token Prediction) layers and reports a KLD of 0.0021 with a 6% refusal rate (6/100), indicating limited safety filtering compared with upstream. The uploader llmfan46 provided download links and multiple format builds for wider compatibility with local inference tools, targeting users who want fewer redactions or want to study behavioral changes from safety fine-tuning. This matters for researchers, developers and ops teams balancing model alignment, reproducibility and deployment risks when using community-modified large language models.
A user reports running Qwen 3.6 27B with a 100k context window on an NVIDIA RTX 3090 using an MTP-enabled quantized GGUF build, achieving roughly 50 tokens/sec on llama.cpp. They link to a Hugging Face GGUF model build (RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF) and reference a local llama-server binary (llama-cpp-am17an) used to serve the model. This is notable for practitioners squeezing large-context inference performance from consumer GPUs via quantized GGUF formats and optimized llama.cpp/server builds. It matters for developers and researchers seeking cost-effective LLM inference with extended context on limited hardware and highlights community model builds and tooling improvements.
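A back-of-the-envelope KV-cache calculation shows why quantized caches matter at 100k context. The model dimensions below are hypothetical stand-ins for a 27B-class architecture, not published Qwen 3.6 figures:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elt: int) -> int:
    """The K and V caches each store n_layers * n_kv_heads * head_dim
    values per token position, so the total is twice that times the
    context length times the per-element width."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Hypothetical 27B-class dims (illustrative only): 48 layers,
# 8 KV heads of dim 128, fp16 cache (2 bytes/element), 100k context.
gib = kv_cache_bytes(48, 8, 128, 100_000, 2) / 2**30
```

Under these assumed dimensions an fp16 cache alone would occupy roughly 18 GiB, most of a 3090's 24 GB, which is why a 4-bit KV-cache quantization (about a 4x reduction) is what makes such long-context setups fit alongside the weights.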
A developer combined Multi-Token Prediction (MTP) with Unsloth’s UD XL quantizations to run Qwen3.6-27B from Hugging Face in GGUF format, reporting about 2.5x throughput gains. The build uses an unmerged llama.cpp pull request enabling MTP for quantized models and demonstrates significant speedups on CPU inference without model-merging. This matters because it shows practical performance improvements for running large open models locally using quantized weights and MTP, lowering latency and compute costs for deployments outside GPU-heavy setups. The work impacts developers and startups optimizing LLM inference, and points to upcoming llama.cpp features that could be integrated into mainstream toolchains.
A user reports success running an MTP-enabled model (an MTP-converted Qwen 3.6 27B with Q4_0 quantization in GGUF) via llama.cpp on an AMD iGPU system with 64GB unified memory, saying latency matches a 9B Qwen 3.5 Q4_K_M setup. They note surprisingly good performance despite using an integrated GPU, indicating MTP/llama.cpp compatibility and quantized GGUF models can deliver practical inference on modest hardware. This matters because it suggests accessible, lower-cost options for running large open models locally, improving experimentation and deployment options for developers and hobbyists without high-end discrete GPUs.
A list of large language models that will support MTP (Multi-Token Prediction) as it is integrated into llama.cpp has circulated, naming DeepSeekv3 OG, DeepSeekv3.2/4, Qwen3.5, GLM4.5+, MiniMax2.5+, Step3.5Flash, and Mimo v2+. The post notes that until native MTP weights are released, users must download Hugging Face weights and convert them to GGUF format for local use. The author plans to test qwen3.5-122b or glm4.5-air first. This matters for developers running models locally with llama.cpp because MTP could improve decoding throughput and broaden model compatibility, while GGUF conversion remains a practical step for immediate experimentation.
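The HF-to-GGUF conversion step mentioned above is typically done with llama.cpp's `convert_hf_to_gguf.py` script. A minimal wrapper sketch, assuming a local llama.cpp checkout as the working directory; the paths and the chosen output type are illustrative:

```python
import subprocess
from pathlib import Path

def build_convert_cmd(hf_dir: str, out_gguf: str, outtype: str = "f16") -> list:
    """Assemble the llama.cpp conversion command. convert_hf_to_gguf.py
    lives in the llama.cpp repo root and reads a downloaded Hugging Face
    model directory (config, tokenizer, safetensors shards)."""
    return [
        "python", "convert_hf_to_gguf.py", str(Path(hf_dir)),
        "--outfile", out_gguf,
        "--outtype", outtype,  # f16 master copy; quantize further afterwards
    ]

# Example invocation (commented out: requires a local model directory):
# subprocess.run(build_convert_cmd("models/qwen3.5-122b", "qwen.gguf"), check=True)
```

The resulting f16 GGUF can then be shrunk with llama.cpp's `llama-quantize` tool to a format such as Q4_K_M before serving.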
Llama.cpp MTP support now in beta!