Community work around Qwen3.6-27B GGUF in llama.cpp is highlighting how tight VRAM margins are for running 27B models locally. One developer traced higher-than-expected VRAM use in Qwen3.6-27B IQ4_XS builds to a recent llama.cpp change; reverting it reduced full-model allocation by about 400MB, enough to affect whether the model fits on common 16GB-class GPUs, especially with long contexts (tested up to ~110k) and large KV caches. In parallel, users with cards like the RX 7900 XT are tuning llama-server/OpenCode flags—flash attention, sampling, and KV-cache quantization—to squeeze memory while preserving speed and quality.
Cyera Research disclosed a critical unauthenticated memory-leak vulnerability in Ollama (CVE-2026-7482, CVSS 9.1) that can expose the entire Ollama process memory on affected servers, potentially impacting ~300,000 instances. The leak can reveal user prompts, system prompts, environment variables, and other sensitive data. Ollama is a popular open-source platform for running LLMs locally and exposes model workflows via /api/pull and /api/create; the vulnerability arises while the server handles GGUF model files uploaded via /api/blobs/sha256:[digest] and creates model instances from them. The issue underscores the risks of local LLM hosting, since secrets and user data held in memory can be extracted remotely without authentication.
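For orientation, this is roughly the benign client-side flow those endpoints implement, i.e. the GGUF-handling path the report says is affected. The model name, file name, and digest below are placeholders, and the /api/create payload shape has changed across Ollama releases, so treat this as a sketch rather than an exact reproduction of the vulnerable request:

```bash
# Placeholder file/model names; payload fields follow recent Ollama API docs.
DIGEST="sha256:$(sha256sum model.gguf | cut -d' ' -f1)"

# 1. Push the raw GGUF bytes to the server as a content-addressed blob.
curl -X POST --data-binary @model.gguf \
  "http://localhost:11434/api/blobs/$DIGEST"

# 2. Register a model that references the uploaded blob; the server parses
#    the GGUF here, which is the handling step flagged in the disclosure.
curl http://localhost:11434/api/create -d "{
  \"model\": \"my-local-model\",
  \"files\": {\"model.gguf\": \"$DIGEST\"}
}"
```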
~95% on SimpleQA (e.g. Qwen3.6-27B on a 3090). Supports all local and cloud LLMs (llama.cpp, Ollama, Google, ...). 10+ search engines - arXiv, PubMed, your private documents. Everything Local & Encrypted. Language: Python | Stars: 45 | Forks: 2 | Contributors: LearningCircuit
A developer found that Qwen3.6-27B IQ4_XS quantized GGUF builds use more VRAM than Qwen3.5-27B due to a llama.cpp commit; reverting that change reduced full-model VRAM from ~15.1GB to ~14.7GB, reclaiming about 400MB, a meaningful margin when total allocation is pushing against the ~16GB of common consumer GPUs. The author ran KV-cache tests and profiling to compare memory usage and token-cache behavior, pinpointing code-path differences that increased peak allocation. This matters for running large models on constrained GPUs and for the efficient quantized distributions the community relies on; small engine-level changes can noticeably affect deployment costs and capability. The post highlights practical tuning of inference runtimes and quantization artifacts.
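The post does not name the commit, but a comparison of this kind can be reproduced with a sketch like the following, where <suspect-commit> is a placeholder and the buffer-size lines llama.cpp prints at load time serve as the VRAM measurement:

```bash
# <suspect-commit> is a placeholder; the post does not identify the exact change.
cd llama.cpp

# Build the current tree, then a second copy with the suspect change reverted.
cmake -B build-current -DGGML_CUDA=ON && cmake --build build-current -j
git revert --no-commit <suspect-commit>
cmake -B build-reverted -DGGML_CUDA=ON && cmake --build build-reverted -j
git reset --hard   # drop the uncommitted revert
# (On AMD/ROCm, recent builds use -DGGML_HIP=ON instead of -DGGML_CUDA=ON.)

# Load the same quantized model with each build; llama.cpp logs its model and
# KV-cache buffer sizes on stderr at startup, which is enough to compare peaks.
for b in build-current build-reverted; do
  echo "== $b =="
  ./$b/bin/llama-cli -m Qwen3.6-27B-IQ4_XS.gguf -ngl 99 -c 32768 -n 8 -p "hi" \
    2>&1 | grep -i "buffer size"
done
```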
A user asks for advice on running Qwen 3.6 27B in GGUF format with limited VRAM using llama-server/OpenCode settings. They show a command line that starts llama-server with Qwen3.6-27B-IQ4_XS.gguf, custom sampling (top-p 0.95, top-k 20, temperature 0.6), flash attention enabled, and a q8_0-quantized KV cache; the snippet cuts off mid-parameter. The underlying problem is VRAM scarcity: choosing cache quantization and inference flags so the 27B model fits locally. This matters to developers and practitioners running large open-weight LLMs on constrained GPUs or CPUs, as choices like quantization type, flash attention, and server parameters directly affect memory footprint, latency, and model fidelity.
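The truncated command cannot be recovered, but a representative llama-server invocation using the flags the post mentions would look roughly like this; the context size, GPU layer count, and port are illustrative values, not the poster's:

```bash
# Illustrative values only: -c, -ngl, and --port are not from the post.
# Quantizing the V cache to q8_0 requires flash attention to be enabled,
# and recent llama.cpp builds may expect --flash-attn to take on/off/auto.
llama-server \
  -m Qwen3.6-27B-IQ4_XS.gguf \
  -ngl 99 \
  -c 32768 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0 \
  --temp 0.6 --top-p 0.95 --top-k 20 \
  --port 8080
```

On a 16GB-class card, the context size (-c) is usually the biggest single lever, since KV-cache memory grows roughly linearly with context; q8_0 cache types cut that cost to about half of f16.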