Across many community benchmarks and user reports, users consistently find that Qwen-3.6 outperforms Gemma4 for local deployments, citing better runtime compatibility, quantization flexibility, and practical performance on consumer GPUs. Users running Qwen-3.6 with optimized backends (llama.cpp, ik_llama.cpp, BeeLlama, vLLM) and quantized weights or KV caches (q8_0 KV cache, IQ-series weight quants) report higher token throughput, larger context windows, and stronger prompt adherence. Multi-GPU sharding, MTP (Multi-Token Prediction) support, and efficient KV cache handling can further boost Qwen's real-world speed, though the reports collected here show that MTP gains depend heavily on hardware and configuration. By contrast, Gemma4 often lags in throughput or requires different toolchains, making Qwen-3.6 the more pragmatic choice for local, privacy-sensitive, and cost-conscious developers and hobbyists.
Local developers and infra engineers prioritize models that run efficiently on consumer GPUs and integrate with existing backends; Qwen-3.6's practical runtime and quantization advantages lower cost and complexity for privacy-sensitive deployments. Understanding these differences guides tooling, hardware choices, and deployment strategies for on-device LLM applications.
Dossier last updated: 2026-05-20 16:29:59
A user reports performance tuning headaches running large GGUF models like Qwen3.5-35B-A3B with the latest llama.cpp on macOS, seeing roughly 1,500 tokens/sec for prompt encoding but only 35–50 tokens/sec for generation. They’re spending more time tweaking llama.cpp settings for a 100k-context goal than on actual inference, seeking the ideal configuration for throughput and memory use. This matters because optimizing CPU/GPU inference settings, quantization, thread affinity, and memory-mapped loading can drastically affect real-world latency and feasibility of very long-context local LLM deployments. The post highlights the tooling gap for accessible, reliable presets and benchmarking guidance for large GGUF models on macOS.
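To make the gap between prompt encoding and generation speed concrete, the following is a minimal sketch that turns the reported rates into rough per-request latencies. The function name and the 1,000-token reply length are assumptions for illustration; only the throughput figures and the 100k-token context goal come from the report.

```python
def estimate_latency_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Return (prefill_seconds, decode_seconds) for a single request.

    Assumes throughput stays constant over the whole request, which is
    optimistic for very long contexts.
    """
    if prefill_tps <= 0 or decode_tps <= 0:
        raise ValueError("throughput values must be positive")
    return prompt_tokens / prefill_tps, output_tokens / decode_tps


# Reported figures: ~1,500 tok/s prompt encoding, 35-50 tok/s generation.
# The 100k prompt matches the stated context goal; the 1,000-token reply
# is an assumed value, not something the report specifies.
for decode_tps in (35, 50):
    prefill_s, decode_s = estimate_latency_s(100_000, 1_000, 1_500, decode_tps)
    print(f"decode at {decode_tps} tok/s: "
          f"prefill {prefill_s:.1f}s + decode {decode_s:.1f}s")
```

Under these assumptions a full 100k-token prompt takes a little over a minute to encode before the first output token appears, which is why prompt caching and context reuse tend to matter as much as raw generation speed for long-context setups.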
Benchmark results compare NTP (standard next-token prediction) and MTP (Multi-Token Prediction) GGUF builds of Qwen 3.6 35B and show performance and compatibility differences across GPUs and CPUs. Note that MTP is a decoding feature, not a quantization scheme. The Reddit-sourced table reports token throughput and memory behavior for both variants, highlighting platform-specific trade-offs: NTP may offer better raw speed on certain GPUs, while MTP is reported to behave better on memory and CPU inference in some cases. This matters for developers deploying large language models in constrained environments or on diverse hardware, since the choice affects latency, memory footprint, and accuracy. The findings help practitioners pick GGUF variants and settings when running Qwen 3.6 35B locally or in production on mixed accelerators.
A benchmark tested MTP (Multi-Token Prediction) in mainline llama.cpp with Qwen 3.6 35B MoE on an RTX 5080 16GB GPU, using long coding-agent contexts up to 128k. The author ran three configurations and found the best performance came from a 35B Q4_K_XL quantized model without MTP, using --fit-target 1536, achieving about 56 tokens/sec. MTP did not improve throughput at realistic long-context settings and introduced trade-offs in memory fit and speed. This matters for developers optimizing local inference and agent workloads: quantization and fit-target tuning can outperform MTP for large MoE models on consumer GPUs, affecting deployment choices in cost-sensitive, low-memory environments.
Users running Qwen 3.6 27B on 16GB VRAM are sharing practical quantization and performance tips for on-device use. The thread's author reports targeting more than 50 tokens/sec for token generation (tg) and more than 800 tokens/sec for prompt processing (pp) in a home-assistant voice setup, offloading the vision model to CPU to save GPU memory, and experimenting with a Qwen3.6-27B-Q3_K quantization and MTP (Multi-Token Prediction) on an RTX 5080. They contrast Qwen 3.5 9B's speed with the larger 27B's higher capability and discuss trade-offs between quant levels, speed, and model intelligence on limited VRAM. This matters to developers and hobbyists optimizing large LLMs for edge/desktop inference and low-latency voice applications.
A user reports that Qwen 3.6 with MTP speculative decoding fails on an NVIDIA Tesla P40 when the K (key) attention cache is quantized. They achieved 20 tokens/sec running a 27B Q5 quantized Qwen 3.6 model on the P40 only after disabling quantization for the K cache (using float16). A Turbo3 K cache runs fine without MTP on a turboquant fork of llama.cpp, but using an atomic fork to enable MTP produced invalid outputs unless the K cache remained unquantized. This matters for practitioners trying to run MTP-based decoding and quantized large models on older GPUs: it suggests a compatibility or implementation bug around quantized K caches and MTP in certain forks, which affects throughput and memory trade-offs.
A user reports strong agentic coding performance from the Qwen-35B-a3b model running locally. They run the model quantized to Q8_0, with the key-value cache also in q8_0, across an NVIDIA RTX 4090 and a second GPU (the report is unclear whether this is a GTX 1650 or an RTX 5060 Ti), using the llama.cpp backend and Claude-compatible client code pointing to localhost. The setup is used for demos and data analytics and appears to outperform prior local models for the user, though they haven't tested it on very large codebases. This matters because efficient quantized deployment of large open models on consumer GPUs lowers the barrier for local development, privacy-preserving inference, and cost-effective experimentation for developers and startups.
Benchmarking shows Qwen 3.6 27B can run effectively on a 24GB RTX 3090 when using ik_llama.cpp with the Qwen3.6-27B-MTP-IQ4_KS.gguf quantized model. The tester achieved a 156k context window, used q8_0 KV quantization, enabled MTP, and ran vision on CPU, delivering about 1,261 tokens/sec for a ~5.9k-token prompt with a 1k-token output. The article compares backends (llama.cpp, ik_llama.cpp, ik_llama.cpp forks like BeeLlama, and vLLM) and discusses quantization trade-offs, memory strategies, and configuration tweaks to fit the model in 24GB VRAM while preserving latency and throughput. This is useful for developers and hobbyists optimizing large LLM inference on consumer GPUs.
A user on LocalLLaMA reports running Qwen 3.6 27B quantized to Q8 across four Nvidia RTX A4000 GPUs (16GB each) using llama.cpp with MTP enabled. The post details model setup, memory footprint, and performance trade-offs when sharding the quantized model across consumer workstation GPUs instead of larger datacenter cards. This matters because it shows practical pathways for running large open-weight models on modest multi-GPU rigs, lowering the hardware bar for local inference and experimentation. Key players include the Qwen model family, llama.cpp runtime, and Nvidia A4000 hardware; implications touch on democratizing access to large LLMs and the role of quantization and model-parallel sharding in cost-effective deployment.
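A rough sense of why a Q8 build of a 27B model needs several 16GB cards can be had from a back-of-the-envelope weight estimate. The sketch below is illustrative only: it assumes exactly 27 billion parameters, that every tensor is stored as Q8_0 (8.5 bits per weight, since each block of 32 int8 values carries one fp16 scale), and that the weights split evenly across the four GPUs. Real GGUF files mix tensor types, and the KV cache and compute buffers come on top.

```python
def gguf_weight_gib(n_params, bits_per_weight):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


# Q8_0 stores 32 int8 weights plus one fp16 scale per block:
# 34 bytes for 32 weights, i.e. 8.5 bits per weight.
Q8_0_BITS_PER_WEIGHT = 34 * 8 / 32

N_PARAMS = 27e9   # assumed round figure for a "27B" model
N_GPUS = 4        # four RTX A4000 cards, as in the post

total_gib = gguf_weight_gib(N_PARAMS, Q8_0_BITS_PER_WEIGHT)
per_gpu_gib = total_gib / N_GPUS  # assumes a perfectly even split

print(f"weights: ~{total_gib:.1f} GiB total, ~{per_gpu_gib:.1f} GiB per GPU")
```

Under these assumptions the weights come to roughly 26.7 GiB, which does not fit on one or even comfortably on two 16GB cards once the cache is added, but leaves substantial headroom per card when spread over four.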
A Reddit user tested llama.cpp's MTP (Multi-Token Prediction) support with the Qwen 3.6 model on an NVIDIA RTX 5090 GPU and reports that it runs successfully. The post highlights compatibility work between the open-source inference engine llama.cpp and a large Qwen model, showing benchmark outputs, memory usage, and token generation characteristics. This matters because it showcases community-driven efforts to run large open-weight models efficiently on high-end consumer GPUs using optimized libraries, lowering barriers for local inference and experimentation. The test indicates progress in hardware utilization and model support, relevant for developers, researchers, and hobbyists aiming to run advanced LLMs locally.
Users report performance drops running Qwen 3.6 35B on dual NVIDIA 3090 GPUs after an MTP merge changed layer handling. Previously some achieved ~1500 tokens/sec for prompt processing (p/s) and ~120 tokens/sec for token generation (t/g) with split layers; after the MTP merge one tester saw generation throughput fall to ~80 t/g. The poster currently uses a CPU overflow fallback achieving ~3500 p/s and 80 t/g and asks the community for optimized settings or configs (e.g., split-layer tricks) to regain speed similar to club 3090's 27B results. This matters to practitioners balancing model size, latency, and GPU memory/compute limits when running large local LLMs.
A user asks for recommended settings to run Qwen 27B (Qwen 3.6-27B) on a single NVIDIA RTX 3090 using llama.cpp/llama-server and shares a working invocation that, by their account, compacts frequently and uses aggressive quantization. The provided command runs a GGUF model (Qwen3.6-27B-Q5_K_S.gguf) with a 64K context (-c 65536), a GPU layer offload setting of -ngl -1, 8 CPU threads (-t 8), and q8_0 quantization for the K and V caches (-ctk q8_0 -ctv q8_0), plus chat-template kwargs. The author is concerned about accuracy and reliability trade-offs from lower-precision quant formats and compacting behavior on the 3090. This matters for practitioners balancing model size, VRAM, latency, and fidelity when running large open-weight LLMs locally.
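For readers who script their launches, the flags quoted above can be assembled programmatically. This is a minimal sketch, not the poster's actual script: the helper name is hypothetical, the `-m` model flag is the standard llama.cpp option rather than something quoted in the summary, and the chat-template kwargs are left out because their values are not given.

```python
import shlex


def build_llama_server_cmd(model_path, ctx=65536, n_gpu_layers=-1,
                           threads=8, cache_type_k="q8_0",
                           cache_type_v="q8_0"):
    """Build an argv list for llama-server from the settings in the post."""
    return [
        "llama-server",
        "-m", model_path,            # GGUF model file
        "-c", str(ctx),              # context size in tokens
        "-ngl", str(n_gpu_layers),   # GPU layer offload setting
        "-t", str(threads),          # CPU threads
        "-ctk", cache_type_k,        # K cache data type
        "-ctv", cache_type_v,        # V cache data type
    ]


cmd = build_llama_server_cmd("Qwen3.6-27B-Q5_K_S.gguf")
print(shlex.join(cmd))
# To launch for real, pass `cmd` to subprocess.run(cmd, check=True).
```

Keeping the arguments in a list makes it easy to vary one setting at a time, for example switching the K cache to `f16` while leaving the V cache quantized, and to log the exact command alongside benchmark results.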
A user with a 24 GB GPU asks whether to run Qwen 3.6 27B with IQ3_XXS weights and a Q8 KV cache or with Q4_K_XL weights and a Q4 KV cache to support a 262K-token context for a Hermes agent. Both setups use UD (Unsloth Dynamic) quants and reportedly fit in VRAM. The user notes that LM Studio requires the K and V caches to use the same quant format to avoid high CPU usage, and has heard that Qwen 3.6 27B performs well even with a Q4 KV cache. This matters because choosing the right quantization affects model quality, latency, memory use, and CPU offload when running very long contexts on limited GPU hardware.
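The memory side of this trade-off can be estimated with a simple KV cache formula: elements per cache equal context length times attention layers times KV heads times head dimension, multiplied by the bytes per element of the chosen cache type. The sketch below is an illustration under loud assumptions: the layer count, head count, and head dimension are hypothetical placeholders, not the real Qwen 3.6 27B configuration, "262K" is taken to mean 262,144 tokens, and "Q4" is taken to mean the `q4_0` cache type. The bytes-per-element values follow from the ggml block layouts.

```python
# Bytes per element for common llama.cpp cache types.
BYTES_PER_ELEM = {
    "f16": 2.0,
    "q8_0": 34 / 32,  # 32 int8 values + one fp16 scale per block
    "q4_0": 18 / 32,  # 32 4-bit values + one fp16 scale per block
}


def kv_cache_gib(n_ctx, n_attn_layers, n_kv_heads, head_dim, type_k, type_v):
    """Approximate KV cache size in GiB for a full context."""
    elems = n_ctx * n_attn_layers * n_kv_heads * head_dim  # per K or per V
    total_bytes = elems * (BYTES_PER_ELEM[type_k] + BYTES_PER_ELEM[type_v])
    return total_bytes / 2**30


# HYPOTHETICAL shape for illustration only; read the real values from the
# model's GGUF metadata before relying on any of these numbers.
HYPOTHETICAL_SHAPE = dict(n_attn_layers=16, n_kv_heads=4, head_dim=256)
N_CTX = 262_144  # assumed meaning of "262K"

for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_gib(N_CTX, type_k=cache_type, type_v=cache_type,
                        **HYPOTHETICAL_SHAPE)
    print(f"{cache_type}: {size:.2f} GiB")
```

With this placeholder shape the cache shrinks from 16 GiB at f16 to 8.5 GiB at q8_0 and 4.5 GiB at q4_0, which shows how the cache type, as much as the weight quant, decides whether a very long context fits next to the model on a 24 GB card.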
A Reddit thread titled “club-5060ti: practical RTX 5060 Ti local LLM notes and configs” collects hands-on tips for running local large language models on NVIDIA RTX 5060 Ti GPUs. Contributors share model choices, quantization settings, memory/VRAM tricks, inference runtimes, and configuration files to fit common LLMs within the card's VRAM limits (the RTX 5060 Ti ships in 8 GB and 16 GB variants). The post matters because it documents practical, community-driven techniques that enable affordable consumer GPUs to host private or offline LLMs, lowering barriers for developers and hobbyists working on local AI deployments. It highlights trade-offs between model size, speed, and accuracy, and points to tooling (quantizers, runtimes) and workflows used to squeeze larger models onto midrange hardware.
A user compared Qwen-3.6 35B-a3B to Gemma4 26B-a4B and reports that running Qwen-3.6 through llama.cpp produced much faster performance and roughly equivalent general intelligence, with better prompt adherence and no slowdown on long contexts. The poster had previously tried Qwen-3.6 via Ollama on their PC and felt that Ollama underperformed, suggesting runtime choice affected perceived quality. This matters for developers and hobbyists choosing local LLM runtimes: model performance can be tightly coupled to the toolchain (llama.cpp vs Ollama), and Qwen-3.6 appears competitive with leading open models when run with an optimized local backend. It highlights trade-offs in local inference speed and prompt fidelity.
A user asks which LLM setup is most stable for running locally on a MacBook Pro M2 Max with 32 GB of RAM and a 256k context. They've experimented with Gemma4 and Qwen 3.6 and want recommendations on inference software (e.g., oMLX, llama.cpp), model and quantization choices, and optimal settings for agentic workflows. The question centers on balancing model size, quant formats (4-bit/8-bit), and runtime tools that support long contexts and Apple Silicon optimizations. This matters because developers and power users need practical guidance to run large-context models locally without exceeding memory, while preserving responsiveness and maintaining accuracy for multi-step agent tasks.
A user compared local LLMs for coding and image data extraction, reporting strong results with Qwen 3.6 but finding Google's Gemma 4 underwhelming. They run quantized models (Q5 31B, Q8 27B) at reasonable speed with KV cache, while Gemma4 felt worse in throughput or quality. The discussion centers on practical local deployment trade-offs: model size, quantization format, latency, and task fit for coding and multimodal extraction. This matters to developers and teams choosing local models for productivity, cost, and privacy, highlighting that cutting-edge flagship models may not always deliver better real-world results than lighter, optimized alternatives.