Community work around Qwen3.6-27B GGUF in llama.cpp is highlighting how tight VRAM margins are for running 27B models locally. One developer traced higher-than-expected VRAM use in Qwen3.6-27B IQ4_XS builds to a recent llama.cpp change; reverting it reduced full-model allocation by about 400MB, enough to affect whether the model fits on common 16GB-class GPUs, especially with long contexts (tested up to ~110k) and large KV caches. In parallel, users with cards like the RX 7900 XT are tuning llama-server/OpenCode flags—flash attention, sampling, and KV-cache quantization—to squeeze memory while preserving speed and quality.
Cyera Research disclosed a critical unauthenticated memory-leak vulnerability in Ollama (CVE-2026-7482, CVSS 9.1) that can expose the entire Ollama process memory on affected servers, potentially impacting ~300,000 instances. The leak can reveal user prompts, system prompts, environment variables, and other sensitive data. Ollama is a popular open-source platform for running LLMs locally and exposes model workflows via /api/pull and /api/create; the vulnerability arises while the server handles GGUF model files uploaded via /api/blobs/sha256:[digest] and creates model instances from them. The issue underscores the risks of local LLM hosting, since secrets and user data held in memory can be extracted remotely without authentication.
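For orientation, this is roughly the benign client-side flow those endpoints implement, i.e. the GGUF-handling path the report says is affected. The model name, file name, and digest below are placeholders, and the /api/create payload shape has changed across Ollama releases, so treat this as a sketch rather than an exact reproduction of the vulnerable request:

```bash
# Placeholder file/model names; payload fields follow recent Ollama API docs.
DIGEST="sha256:$(sha256sum model.gguf | cut -d' ' -f1)"

# 1. Push the raw GGUF bytes to the server as a content-addressed blob.
curl -X POST --data-binary @model.gguf \
  "http://localhost:11434/api/blobs/$DIGEST"

# 2. Register a model that references the uploaded blob; the server parses
#    the GGUF here, which is the handling step flagged in the disclosure.
curl http://localhost:11434/api/create -d "{
  \"model\": \"my-local-model\",
  \"files\": {\"model.gguf\": \"$DIGEST\"}
}"
```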
~95% on SimpleQA (e.g. Qwen3.6-27B on a 3090). Supports all local and cloud LLMs (llama.cpp, Ollama, Google, ...). 10+ search engines - arXiv, PubMed, your private documents. Everything Local & Encrypted. Language: Python | Stars: 45 | Forks: 2 | Contributors: LearningCircuit
A developer found that Qwen3.6-27B IQ4_XS quantized GGUF builds use more VRAM than Qwen3.5-27B due to a llama.cpp commit; reverting that change reduced full-model VRAM from ~15.1GB to ~14.7GB, reclaiming about 400MB, a meaningful margin when total allocation is pushing against the ~16GB of common consumer GPUs. The author ran KV-cache tests and profiling to compare memory usage and token-cache behavior, pinpointing code-path differences that increased peak allocation. This matters for running large models on constrained GPUs and for the efficient quantized distributions the community relies on; small engine-level changes can noticeably affect deployment costs and capability. The post highlights practical tuning of inference runtimes and quantization artifacts.
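The post does not name the commit, but a comparison of this kind can be reproduced with a sketch like the following, where <suspect-commit> is a placeholder and the buffer-size lines llama.cpp prints at load time serve as the VRAM measurement:

```bash
# <suspect-commit> is a placeholder; the post does not identify the exact change.
cd llama.cpp

# Build the current tree, then a second copy with the suspect change reverted.
cmake -B build-current -DGGML_CUDA=ON && cmake --build build-current -j
git revert --no-commit <suspect-commit>
cmake -B build-reverted -DGGML_CUDA=ON && cmake --build build-reverted -j
git reset --hard   # drop the uncommitted revert
# (On AMD/ROCm, recent builds use -DGGML_HIP=ON instead of -DGGML_CUDA=ON.)

# Load the same quantized model with each build; llama.cpp logs its model and
# KV-cache buffer sizes on stderr at startup, which is enough to compare peaks.
for b in build-current build-reverted; do
  echo "== $b =="
  ./$b/bin/llama-cli -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99 -c 32768 -n 8 -p "hi" \
    2>&1 | grep -i "buffer size"
done
```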
A user asks for advice on running Qwen 3.6 27B in GGUF format with limited VRAM using llama-server/OpenCode settings. They show a command line that starts llama-server with Qwen3.6-27B-IQ4_XS.gguf, custom sampling (top-p 0.95, top-k 20, temperature 0.6), flash attention enabled, and a q8_0-quantized KV cache; the snippet cuts off mid-parameter. The underlying problem is VRAM scarcity: choosing cache quantization and inference flags so the 27B model fits locally. This matters to developers and practitioners running large open-weight LLMs on constrained GPUs or CPUs, as choices like quantization type, flash attention, and server parameters directly affect memory footprint, latency, and model fidelity.
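The truncated command cannot be recovered, but a representative llama-server invocation using the flags the post mentions would look roughly like this; the context size, GPU layer count, and port are illustrative values, not the poster's:

```bash
# Illustrative values only: -c, -ngl, and --port are not from the post.
# Quantizing the V cache to q8_0 requires flash attention to be enabled,
# and recent llama.cpp builds may expect --flash-attn to take on/off/auto.
llama-server \
  -m Qwen3.6-27B-IQ4_XS.gguf \
  -ngl 99 \
  -c 32768 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --port 8080
```

On a 16GB-class card, the context size (-c) is usually the biggest single lever, since KV-cache memory grows roughly linearly with context; q8_0 cache types cut that cost to about half of f16.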