Developers working with Gemma 4 in GGUF format should update their local model files: patched Gemma 4 GGUF builds that fix a chat-template bug are now available on Hugging Face for the 31B and 26B variants. In parallel, tooling for local model grafting has improved: an author released a utility that extracts only the necessary MTP tensors from GGUF donor files, producing compact (~900 MB) faux-GGUF artifacts for use with MTP grafting scripts. Together, these updates reduce friction for local deployment and model-merging workflows, cutting storage and transfer overhead while restoring correct chat-template behavior for Gemma 4 users.
Gemma 4 GGUF fixes and lightweight donor GGUF tooling reduce deployment friction and storage needs for local model merging workflows. Tech professionals managing local inference, grafting, or custom chat templates should adapt pipelines to use patched models and compact donor artifacts.
Dossier last updated: 2026-05-11 21:05:49
A user discovered a tuned version of Nemotron on Hugging Face, Nemotron-3-Super-64B-A12B-Math-REAP-GGUF, that claims to run large-context workloads efficiently on 48 GB of VRAM, achieving about 21 tokens/sec for coding and supporting an extremely long 500k-token context. The model is presented as a math-focused, distilled/tuned variant intended to emulate much of the larger Nemotron Super (64B total, ~12B active parameters, per the name) at far lower resource requirements. This matters because compact, optimized model builds with GGUF packaging can let researchers and developers run near-large-model capabilities on desktop GPUs, lowering the barrier to experimenting with long-context agentic use cases and coding assistance. Key players: Hugging Face as host, Max-and-Omnis as the uploader, and the Nemotron family of models.
A contributor created a lightweight tool to extract MTP tensors from GGUF model files so the grafting script no longer needs a full GGUF donor. The result is two compact "faux GGUF" files (~900 MB) designed to contain only the tensors required for MTP grafting, with a Hugging Face upload provided as an example. This matters because smaller donor files reduce storage and transfer overhead for local model grafting workflows, making experiments with MTP-based model merging more accessible to developers and hobbyists. The post links the extraction script and the reduced GGUF artifacts, enabling easier reuse in local model modification pipelines.
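The core idea of the extraction tool can be sketched without depending on the full donor file: given a donor model's tensor listing, keep only the tensors belonging to the MTP (multi-token prediction) head and write those out as the compact faux-GGUF donor. The name prefixes below (`mtp.`, `nextn.`) are assumptions for illustration; the actual script's selection logic and the real tensor names vary by model family.

```python
def select_mtp_tensors(tensor_names, markers=("mtp.", "nextn.")):
    """Keep only tensor names that appear to belong to the MTP head.

    `markers` is a hypothetical set of name prefixes/substrings; real
    GGUF tensor naming differs per model, so adjust for your donor.
    """
    return [name for name in tensor_names if any(m in name for m in markers)]


# Hypothetical donor tensor listing; real GGUF names differ per model.
donor_tensors = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_down.weight",
    "mtp.embed_tokens.weight",
    "mtp.blk.0.attn_q.weight",
    "output.weight",
]
print(select_mtp_tensors(donor_tensors))
```

Filtering by name is what makes the ~900 MB artifacts possible: the full donor's base-model weights are dropped entirely, and only the small MTP slice survives for the grafting script to consume.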
it's time to update your Gemma 4 GGUFs
A recent note alerts users to update their Gemma 4 GGUF model files after a Chat Template issue was fixed. The post points users to two Hugging Face uploads hosting updated Gemma 4 variants—google_gemma-4-31B-it-GGUF and google_gemma-4-26B-A4B-it-GGUF—so users can download fixed GGUF builds. This matters for developers and practitioners running local Gemma 4 models in GGUF format because outdated or buggy template handling could affect chat-based interfaces, integrations, or inference behavior. The update and direct links help maintainers and deployers keep local LLM deployments stable and compatible with chat templates and tooling.
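After swapping in replacement model files, a quick sanity check that a download is a well-formed GGUF file can be done by reading the fixed-size GGUF header (4-byte magic, uint32 version, uint64 tensor count, uint64 metadata key/value count, all little-endian). This is a minimal sketch, not a full parser, and it says nothing about the chat template itself; the synthetic header bytes are constructed purely for demonstration.

```python
import struct

def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed GGUF header: 4-byte magic 'GGUF', uint32 version,
    uint64 tensor count, uint64 metadata key/value count (little-endian)."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}


# Synthetic header bytes for demonstration: version 3, 2 tensors, 5 kv pairs.
fake_header = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(read_gguf_header(fake_header))
```

In practice you would read the first 24 bytes of the downloaded file and run the same check; inspecting the actual chat template requires walking the metadata key/value section, which tools like llama.cpp's `gguf` utilities already do.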