Developers running local LLMs face linked issues around token-level streaming, server-side parsing, and new community checkpoints. Users ask whether OpenWebUI honors the preserve_thinking flag or whether front-ends depend on back-end servers to expose unpruned “thinking” tokens for live streaming and debugging. A llama-server bug, in which extra whitespace inside the chat-template-kwargs JSON string silently disables preserve_thinking for Qwen 3.6, underscores the need to validate config strings. Meanwhile, z-lab’s new gemma-4-26B-A4B-it-DFlash release has prompted compatibility and performance queries as hobbyists weigh quantization and inference trade-offs. Together these stories show how UI support, server robustness, and model releases interact in local deployment workflows.
Local LLM workflows depend on coherent behavior across front-ends, servers, and model checkpoints; failures in any layer break streaming, debugging, and performance tuning. Tech professionals managing local deployments need to verify client-server interactions and validate configs when integrating new community checkpoints and inference tricks.
Dossier last updated: 2026-05-13 08:33:58
Researchers and practitioners have combined techniques like DFlash/PFlash (multi-model pipelines that use smaller models for prefill or distillation) to speed up generation. The open question is whether Heretic-style “smart ablation” tools, which can decensor models or remove safety filters, would interoperate with those multi-model speedups. The key players mentioned are Z-Lab (work on output speedups), Luce (using smaller family models to accelerate prefill), and model families like Qwen 3.6 and Gemma 4 that have smaller variants suited to PFlash. Why it matters: mixing model acceleration methods with tools that alter model behavior raises compatibility, safety, and ethical concerns while promising large (5–10x) latency improvements for inference. The post urges broader adoption of PFlash given the smaller models now available.
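The post doesn't include implementation details, but the pattern it describes (a smaller family model accelerating drafting or prefill for a larger one) is broadly similar to speculative decoding. Below is a minimal, hypothetical sketch of such a draft-and-verify loop; the function names and acceptance interface are illustrative assumptions, not code from DFlash, PFlash, or Luce.

```python
# Hypothetical sketch of a draft-and-verify loop in the spirit of the
# multi-model speedups discussed above: a small model proposes tokens and the
# large model verifies them in a single pass. The two callables are
# placeholders; nothing here is taken from DFlash/PFlash code.
from typing import Callable, List, Tuple

Token = int
DraftFn = Callable[[List[Token], int], List[Token]]                 # (context, k) -> k proposed tokens
VerifyFn = Callable[[List[Token], List[Token]], Tuple[int, Token]]  # -> (n_accepted, fallback token)

def draft_and_verify(prompt: List[Token], draft: DraftFn, verify: VerifyFn,
                     max_new_tokens: int = 128, k: int = 4) -> List[Token]:
    out = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        proposal = draft(out, k)                      # cheap pass on the small model
        n_accepted, fallback = verify(out, proposal)  # one verification pass on the large model
        if n_accepted > 0:
            out.extend(proposal[:n_accepted])         # keep the prefix the large model agrees with
            produced += n_accepted
        else:
            out.append(fallback)                      # large model supplies the next token itself
            produced += 1
    return out
```

One plausible reading of the compatibility concern raised in the post: if an ablation tool like Heretic modifies only one of the two checkpoints, the draft and target models may diverge, lowering the acceptance rate and eroding the speedup.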
Users on the LocalLLaMA subreddit asked whether the preserve_thinking setting works with OpenWebUI when running local LLMs. The discussion centers on a client-side option that keeps the model’s “thinking” tokens visible in the UI instead of pruning them, which affects response streaming and editing behavior. Participants name OpenWebUI, preserve_thinking, and LocalLLaMA as the key elements, with troubleshooting focused on whether the front-end honors the flag or whether back-end server implementations (such as text-generation-webui or LLM server endpoints) must support it. This matters for developers and hobbyists running local inference who want accurate token-level streaming, debugging visibility, and consistent UI behavior across interfaces.
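One way to isolate where the flag is dropped, independent of what OpenWebUI renders, is to query the local server's OpenAI-compatible endpoint directly and inspect the raw stream. The sketch below assumes such an endpoint exists at the usual path; the chat_template_kwargs/preserve_thinking field names and the model id are taken from the discussion and may need adjusting for a given server build.

```python
# Hedged sketch: hit a local OpenAI-compatible chat endpoint directly and dump
# the raw streamed chunks, so you can see whether "thinking" tokens arrive
# unpruned before blaming the front-end. Field names (chat_template_kwargs,
# preserve_thinking) mirror the discussion and may differ per server.
import json
import requests

URL = "http://localhost:8080/v1/chat/completions"  # assumed local server address

payload = {
    "model": "qwen3.6",                                   # placeholder model id
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": True,
    "chat_template_kwargs": {"preserve_thinking": True},  # flag under test
}

with requests.post(URL, json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line or not line.startswith(b"data: "):
            continue
        chunk = line[len(b"data: "):]
        if chunk == b"[DONE]":
            break
        delta = json.loads(chunk)["choices"][0].get("delta", {})
        # If the server honors the flag, thinking tokens should show up here,
        # either inline in "content" or in a separate reasoning field.
        print(delta)
```

If the thinking tokens never appear in this raw output, the issue is on the server side; if they do appear but OpenWebUI hides them, the front-end is pruning them.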
A user discovered that Qwen 3.6 served via llama-server can ignore preserve_thinking when extra spaces appear inside the chat-template-kwargs JSON string in models.ini. The bug arises from the server-side parser treating whitespace inside the JSON value as invalid, preventing the parameter from being recognized. The post explains how to reproduce the issue, shows the problematic and corrected models.ini snippets, and advises stripping unintended spaces or validating the JSON to restore expected behavior. This matters for developers and operators using llama-server with Qwen models because it can silently change model interaction behavior and disrupt streaming/thinking indicators in deployed chat systems.
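Since the post attributes the failure to stray whitespace inside the JSON value, a simple precaution is to round-trip the string through a JSON parser before writing it into models.ini. The snippet below is a minimal sketch of that check; the exact models.ini layout and the reported failing value are not reproduced here, only the normalization step.

```python
# Hedged sketch: validate and normalize the chat-template-kwargs JSON value
# before it goes into models.ini, so interior whitespace can't trip the
# server-side parser. The example value is illustrative, not the exact
# snippet from the post.
import json

raw_value = '{ "preserve_thinking" : true }'   # whitespace-laden value of the kind reported to fail

try:
    parsed = json.loads(raw_value)             # confirm the string is valid JSON at all
except json.JSONDecodeError as exc:
    raise SystemExit(f"chat-template-kwargs is not valid JSON: {exc}")

# Re-serialize with compact separators to strip interior whitespace.
normalized = json.dumps(parsed, separators=(",", ":"))
print(normalized)   # -> {"preserve_thinking":true}
```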
z-lab has released a new Gemma-family checkpoint named gemma-4-26B-A4B-it-DFlash, shared in a LocalLLaMA Reddit thread where users ask if anyone has tried it. The post links to the release and a preview image but includes little technical detail; interested developers and researchers are seeking feedback on performance, compatibility, and quantization for local inference. This matters because community checkpoints like Gemma variants influence on-device and self-hosted large-model experimentation, shaping deployment strategies for startups, open-source projects, and privacy-focused AI setups. Early user reports and benchmarks will determine whether the model offers meaningful improvements for multilingual, instruction-following, or efficient quantized inference workflows.