The llama.cpp ecosystem is rapidly expanding its model compatibility and performance features, adding support for the step3-vl-10b vision-language checkpoint while MTP (Multi-Token Prediction) support enters beta. Community contributors continue to upstream model-specific PRs (e.g., Mimo v2.5, step3-vl-10b) and share GGUF builds and Docker images for Gemma4 and Qwen variants to ease local deployment. Parallel work on MTP (both native and via unmerged patches) and on quantized GGUF builds is delivering substantial throughput gains on CPUs and modest GPUs, making on-device multimodal and large-context inference faster and more accessible for developers and hobbyists.
A developer reports getting MTP (Multi-Token Prediction) working together with TurboQuant TBQ4_0 (a KV-cache quantization the author describes as lossless at 4.25 bpv) on Qwen3.6-27B, achieving 80–87 tokens/sec and a 262K-token context on a single RTX 4090 after optimization. Initial runs managed ~43 t/s; optimizations and an MTP draft-acceptance rate of ~73% raised throughput significantly. The work shows the practical feasibility of large-context inference for a 27B LLM on consumer GPUs using TBQ4_0 and paging techniques, which matters for lowering hardware requirements and enabling long-context applications. It is relevant to researchers and engineers exploring efficient inference, quantized KV caches, and model runtime engineering.
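The reported ~73% draft acceptance maps onto the standard speculative-decoding throughput model. A hedged sketch follows; the per-token acceptance probability and draft length are illustrative assumptions, not figures from the post:

```python
def expected_tokens_per_step(p: float, k: int) -> float:
    """Expected tokens emitted per target-model pass when k drafted
    tokens are verified, each accepted independently with probability p.
    Each run of i accepted drafts is followed by one token the target
    model produces itself, giving sum_{i=0}^{k} p**i, i.e. the closed
    form (1 - p**(k+1)) / (1 - p)."""
    return sum(p ** i for i in range(k + 1))
```

With one draft token per pass and p = 0.73, the formula gives 1.73 tokens per target pass, which would lift a ~43 t/s baseline into the neighborhood of the reported range; the real gain also depends on the overhead of producing the drafts.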
A developer recounts a failed overnight job on a hosted API and argues for local models as reliable long-running “marathon” engines. After a remote service outage froze a scrape-and-summarize agent, they shifted to running Gemma 4 (31B) locally and found a sweet spot: models that fit on consumer GPUs and run offline without quotas or downtime. The piece contrasts cloud “sprint” uses—high-precision, compute-heavy iterations—with local setups optimized for endurance and continuous work. It highlights recent advances, notably Gemma 4 and Multi-Token Prediction (MTP), which boost local throughput so models can process more tasks overnight without burning out. This matters for developers needing resilient, cost-effective uninterrupted processing.
A developer reports disappointment with Qwen 3.6's coding assistance while migrating from Codex. They are using a midsize stack (Kotlin Android app, Rust backend, Postgres) and have tried feeding well-documented features into a local setup combining llama.cpp, Opencode, and Qwen 3.6 (27B/35B, Q4_K_M, 128K context) with tooling for rules, skills, multi-code projects, and code indexing. The user describes reliability and quality issues: hallucinations, incorrect or non-compilable code, poor handling of medium-complexity tasks, and failure to follow provided constraints. They note occasional useful snippets but overall regression versus expectations, highlighting limits of current large open models for dependable software engineering at scale.
A community implementation of Multi-Token Prediction (MTP) for llama.cpp reportedly speeds up Gemma 4 inference by about 40%. Posted on the LocalLLaMA subreddit, the patch adapts MTP (predicting multiple tokens per forward pass) to the popular C++ runtime, improving throughput without changing model weights. Key players include the llama.cpp project and the Gemma 4 model; contributors are community developers sharing code and benchmarks. This matters because MTP boosts performance on CPU and lightweight deployments, lowering latency and compute costs for local AI inference and enabling better UX on edge devices. If adopted upstream, it could become a practical optimization for many open-source LLM runtimes and apps.
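The verification step that lets such a patch speed up decoding without changing outputs can be sketched in a few lines. This is an illustrative toy, not the patch's actual code: `target_next` stands in for a full greedy forward pass, and real MTP heads draft tokens from the model's own hidden states rather than from a separate model.

```python
def verify_drafts(target_next, context, drafts):
    """Greedy draft verification: keep each drafted token that matches
    what the target model would have produced at that position; on the
    first mismatch, discard the rest and emit the target's own token.
    If every draft is accepted, the target pass still yields one bonus
    token, so a step always makes progress."""
    accepted = []
    for tok in drafts:
        expected = target_next(context + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            return accepted + [expected]  # correction token from the target
    return accepted + [target_next(context + accepted)]  # bonus token
```

Because mismatched drafts are replaced by the target model's token, the emitted sequence is identical to plain greedy decoding; only the number of target passes changes.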
A community contributor uploaded a GGUF build of the nvidia/Gemma-4-26B-A4B-NVFP4 large language model and provided a companion Docker image (catlilface/llama.cpp:gemma4_26b_nvfp4) because the main llama.cpp branch does not currently support running it. The author warns of limited testing, having only an NVIDIA RTX 5070 Ti, and invites feedback on performance and compatibility. This matters for developers and researchers wanting to run Gemma-4 variants locally or on GPUs: GGUF is a portable format and the Docker image simplifies setup, while compatibility gaps in llama.cpp could affect adoption and reproducibility.
Developers running large local models (e.g., Qwen 3.6) and agent frameworks (llama.cpp, OpenCode, Pi, and essentially all agent stacks) are struggling with context compaction, cache validation, and token/response consistency when stitching multi-turn history across systems. The article details practical setups (model command lines, server ports) and highlights issues around prompt/template propagation, preserving thinking states, and ensuring cache keys reflect real context to avoid stale outputs. It stresses why accurate cache invalidation and deterministic compaction matter for correctness, latency, and safety when agents share histories or rely on partial context. These are core engineering problems for deploying LLM-based agents reliably at scale in research and production.
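One concrete way to make cache keys "reflect real context" is to hash everything that shapes the rendered prompt, not just the visible message text. A minimal sketch follows; the field names are assumptions for illustration, not any framework's actual API:

```python
import hashlib
import json

def cache_key(template: str, history: list, params: dict) -> str:
    """Derive a cache key from every input that shapes the model's
    output: the chat template, the full message history, and the
    sampling parameters. Any change -- including a template or
    parameter edit that leaves the visible text identical -- produces
    a different key, so stale cached responses cannot be served."""
    payload = json.dumps(
        {"template": template, "history": history, "params": params},
        sort_keys=True,          # deterministic field order
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Hashing the canonical JSON of all three components makes invalidation automatic: compacting history, swapping templates, or retuning samplers each yields a new key rather than a silently reused response.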
A pull request was submitted to the ggml-org/llama.cpp repository to add support for the Mimo v2.5 model, expanding the local LLaMA-compatible model ecosystem. The change, contributed by user AesSedai, integrates Mimo v2.5 model specifics into the llama.cpp codebase, enabling inference and compatibility for users running models locally with GGML optimizations. This matters because llama.cpp is a widely used C++ library for running LLMs efficiently on consumer hardware; adding Mimo v2.5 broadens model options for developers, researchers, and hobbyists seeking performant on-device language models without relying on cloud APIs. The update can influence local deployment workflows and tooling in the open-source LLM community.
A community model release named Qwen3.6-27B-uncensored-heretic-v2 Native MTP Preserved is now available on Hugging Face in safetensors, GGUF and NVFP4 formats. The fork preserves all 15 native MTP (Multi-Token Prediction) layers and reports a KLD of 0.0021 with a 6% refusal rate (6/100), indicating limited safety filtering compared with upstream. The uploader llmfan46 provided download links and multiple format builds for wider compatibility with local inference tools, targeting users who want fewer redactions or want to study behavioral changes from safety fine-tuning. This matters for researchers, developers and ops teams balancing model alignment, reproducibility and deployment risks when using community-modified large language models.
A user reports running Qwen 3.6 27B with a 100k context window on an NVIDIA RTX 3090 using an MTP-enabled quantized GGUF build, achieving roughly 50 tokens/sec on llama.cpp. They link to a Hugging Face GGUF model build (RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF) and reference a local llama-server binary (llama-cpp-am17an) used to serve the model. This is notable for practitioners squeezing large-context inference performance from consumer GPUs via quantized GGUF formats and optimized llama.cpp/server builds. It matters for developers and researchers seeking cost-effective LLM inference with extended context on limited hardware and highlights community model builds and tooling improvements.
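A back-of-the-envelope KV-cache calculation shows why quantized caches matter at 100k context. The model dimensions below are hypothetical stand-ins for a 27B-class architecture, not published Qwen 3.6 figures:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elt: int) -> int:
    """The K and V caches each store n_layers * n_kv_heads * head_dim
    values per token position, so the total is twice that times the
    context length times the per-element width."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

# Hypothetical 27B-class dims (illustrative only): 48 layers,
# 8 KV heads of dim 128, fp16 cache (2 bytes/element), 100k context.
gib = kv_cache_bytes(48, 8, 128, 100_000, 2) / 2**30
```

Under these assumed dimensions an fp16 cache alone would occupy roughly 18 GiB, most of a 3090's 24 GB, which is why a 4-bit KV-cache quantization (about a 4x reduction) is what makes such long-context setups fit alongside the weights.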
A developer combined Multi-Token Prediction (MTP) with Unsloth’s UD XL quantizations to run Qwen3.6-27B from Hugging Face in GGUF format, reporting about 2.5x throughput gains. The build uses an unmerged llama.cpp pull request enabling MTP for quantized models and demonstrates significant speedups on CPU inference without model-merging. This matters because it shows practical performance improvements for running large open models locally using quantized weights and MTP, lowering latency and compute costs for deployments outside GPU-heavy setups. The work impacts developers and startups optimizing LLM inference, and points to upcoming llama.cpp features that could be integrated into mainstream toolchains.
A user reports success running an MTP-enabled model (an MTP-converted Qwen 3.6 27B with Q4_0 quantization in GGUF) via llama.cpp on an AMD iGPU system with 64GB unified memory, saying latency matches a 9B Qwen 3.5 Q4_K_M setup. They note surprisingly good performance despite using an integrated GPU, indicating MTP/llama.cpp compatibility and quantized GGUF models can deliver practical inference on modest hardware. This matters because it suggests accessible, lower-cost options for running large open models locally, improving experimentation and deployment options for developers and hobbyists without high-end discrete GPUs.
A list of large language models that will support MTP (Multi-Token Prediction) as it is integrated into llama.cpp has circulated, naming DeepSeekv3 OG, DeepSeekv3.2/4, Qwen3.5, GLM4.5+, MiniMax2.5+, Step3.5Flash, and Mimo v2+. The post notes that until native MTP weights are released, users must download Hugging Face weights and convert them to GGUF format for local use. The author plans to test qwen3.5-122b or glm4.5-air first. This matters for developers running models locally with llama.cpp because MTP could improve decoding throughput and broaden model compatibility, while GGUF conversion remains a practical step for immediate experimentation.
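The HF-to-GGUF conversion step mentioned above is typically done with llama.cpp's `convert_hf_to_gguf.py` script. A minimal wrapper sketch, assuming a local llama.cpp checkout as the working directory; the paths and the chosen output type are illustrative:

```python
import subprocess
from pathlib import Path

def build_convert_cmd(hf_dir: str, out_gguf: str, outtype: str = "f16") -> list:
    """Assemble the llama.cpp conversion command. convert_hf_to_gguf.py
    lives in the llama.cpp repo root and reads a downloaded Hugging Face
    model directory (config, tokenizer, safetensors shards)."""
    return [
        "python", "convert_hf_to_gguf.py", str(Path(hf_dir)),
        "--outfile", out_gguf,
        "--outtype", outtype,  # f16 master copy; quantize further afterwards
    ]

# Example invocation (commented out: requires a local model directory):
# subprocess.run(build_convert_cmd("models/qwen3.5-122b", "qwen.gguf"), check=True)
```

The resulting f16 GGUF can then be shrunk with llama.cpp's `llama-quantize` tool to a format such as Q4_K_M before serving.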
Llama.cpp MTP support now in beta!