Optimizing GPU inference for quantized LLMs affects latency, throughput, and cost for deployments on consumer AMD cards. Tech professionals need to know which runtimes, kernels, and flags yield real speedups for qwen-3.6-27B on recent Radeon hardware.
Dossier last updated: 2026-05-14 09:09:41
A developer ported the TurboQuant (TBQ4) KV cache and multi-turn persisting (MTP) to AMD ROCm for RDNA3 (RX 7900 XTX / gfx1100) in llama.cpp, enabling a 64k-token context to fit and run within 24 GB of VRAM. The work lives in the tbq4-rdna3-experiment branch on GitHub and adapts the quantization and memory strategies to ROCm-specific constraints, improving large-context inference on consumer AMD GPUs. This matters because it brings high-context LLM capability to widely available AMD hardware, lowering hardware barriers for long-context applications and broadening cross-platform support in open-source inference stacks. The change targets performance and memory efficiency in open-source model runtimes and could benefit developers building local LLM deployments.
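For orientation, a ROCm build of llama.cpp targeting gfx1100 typically looks like the sketch below. This is a hedged sketch, not the branch's documented procedure: the branch name is taken from the post, the fork URL is not given there and stays a placeholder, and GGML_HIP / AMDGPU_TARGETS are the current upstream HIP build options, which an experimental branch may change.

    # Clone the fork hosting the experimental branch; the post names the
    # branch but not the repository, so the URL stays a placeholder.
    git clone <fork-url> llama.cpp
    cd llama.cpp
    git checkout tbq4-rdna3-experiment

    # Standard llama.cpp HIP build; gfx1100 covers the RX 7900 XTX.
    cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
    cmake --build build --config Release -j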
Benchmarking from the LocalLLaMA community shows that Luce’s DFlash and PFlash optimizations significantly speed up inference for Qwen 3.6-27B on an AMD Strix Halo GPU, reaching 2.23× faster decode and 3.05× faster prefill than llama.cpp with the HIP backend. The post demonstrates memory- and I/O-focused tweaks (DFlash/PFlash) that reduce latency and improve throughput for large LLMs on AMD hardware, delivering practical gains without changing model weights. This matters because it narrows the performance gap between AMD and NVIDIA inference stacks, lowers the cost of local inference, and improves the feasibility of running large open models on consumer/prosumer GPUs. The work is relevant to developers, open-source ML toolers, and hardware-focused ML engineers.
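For readers who want to reproduce this kind of comparison, llama.cpp ships a llama-bench tool that reports prefill and decode throughput separately; a minimal sketch, with the model path as a placeholder:

    # llama-bench times a prompt-processing pass (-p, prefill) and a
    # token-generation pass (-n, decode) and reports tokens/s for each.
    # -ngl 999 offloads all layers to the GPU.
    ./build/bin/llama-bench -m /path/to/qwen-3.6-27b.gguf -ngl 999 -p 512 -n 128

Running the same command on both stacks and taking the ratio of the reported tokens/s yields speedup figures of the kind quoted above (e.g. decode tokens/s with DFlash divided by decode tokens/s under plain HIP).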
A hobbyist recounts a troubleshooting saga after building a new PC and installing Ubuntu 26.04, ROCm, and llama.cpp: performance was extremely slow because they had reused run scripts from an older AM4 setup and temporarily installed a mismatched 8 GB DDR5 stick from a Windows machine. The post highlights that configuration mismatches, spanning OS/kernel compatibility, ROCm drivers, memory and BIOS settings, and model runtime parameters, can cripple ML inference performance even when the hardware appears functional. It matters because developers deploying local LLM runtimes need to pay attention to OS/driver versions, correct memory and firmware configuration, and per-platform runtime flags to avoid frustrating slowdowns. The anecdote underscores how small setup details affect ML tooling and developer experience.
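Before blaming the runtime, it is worth checking the platform basics this anecdote implicates; a minimal sketch using standard Linux and ROCm tools, assuming a Debian/Ubuntu-style install:

    uname -r              # kernel version: must be one the installed ROCm release supports
    dpkg -l | grep rocm   # installed ROCm packages and versions
    rocminfo | grep gfx   # is the GPU visible to ROCm, and which gfx target is it?
    rocm-smi              # VRAM usage, clocks, and utilization at a glance
    sudo dmidecode -t memory | grep -iE 'size|speed'   # DIMM sizes/speeds; catches mismatched sticks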
A user reports getting about 12 tokens/s running qwen-3.6-27B (GGUF, Q3 quant) under llama.cpp on an AMD Radeon 9070XT GPU and asks how to improve inference speed. They show their llama-server command: a large context (c=65536), ngl=999, np=1, batch and buffer flags, q4_0 KV caches, threads=6, and other performance flags. This matters because GPU inference on quantized LLMs is sensitive to memory layout, kernel support, and driver/runtime choices; low throughput can stem from a suboptimal quant format, missing optimized ROCm/driver kernels for the card, limited VRAM, PCIe bottlenecks, or a mismatch between batch/ubatch sizes and GPU threading. Remedies include using vendor-optimized runtimes (ROCm with MIOpen-backed libraries), trying different quant formats (q4_K_M or q8 variants), shrinking the context window if feasible, adjusting ubatch/batch sizes, increasing threads, trying ggml builds with the ROCm or Vulkan backends, and monitoring GPU/CPU/PCIe utilization to find the bottleneck.
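A hedged reconstruction of that command, with the remedies above annotated as tuning knobs: the model filename and the -b/-ub values are assumptions, while the remaining values match the post's description.

    # Reported setup: -c context size, -ngl GPU layers, -np parallel slots,
    # -ctk/-ctv KV-cache quantization, -t CPU threads. The filename and the
    # -b/-ub values are assumed, not given in the post.
    ./build/bin/llama-server -m ./qwen-3.6-27b-q3.gguf \
      -c 65536 -ngl 999 -np 1 \
      -b 2048 -ub 512 \
      -ctk q4_0 -ctv q4_0 -t 6

    # Tuning knobs from the remedies: shrink -c (e.g. 16384) if the full 64k
    # context is not needed, sweep -b/-ub pairs, raise -t toward the physical
    # core count, and watch rocm-smi plus CPU/PCIe utilization while serving
    # to see which resource saturates first.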