Qwen 3.5 is emerging as a catalyst for the “run it yourself” LLM wave, as communities push larger models onto cheaper hardware through better runtimes, quantization, and tooling. llama.cpp is rapidly expanding—adding new dependencies for Vulkan builds, stabilizing Gemma 4 (including audio in llama-server), and landing work on backend-agnostic tensor parallelism. Meanwhile TurboQuant and related KV-cache techniques are shrinking memory needs enough to run 27B–30B models on 8–16GB GPUs, while AMD/ROCm and Vulkan gains broaden non-CUDA options. New GUIs, GGUF quant tools, and Apple Silicon fine-tuning further lower friction for local, multimodal workflows.
A developer asks whether to buy a Mac or build a custom ‘5090’ workstation for a workflow split between fine-tuning pretrained models and training some models from scratch, with heavy image/video ML and occasional LLM work. They note many projects rely on large pretrained models where VRAM and GPU compatibility matter, and weigh macOS convenience, M-series efficiency, and Apple GPU limitations against the flexibility, driver support, CUDA ecosystem, and raw GPU memory of a custom PC with an NVIDIA 5090-class card. The decision hinges on priorities: native macOS apps and power efficiency versus CUDA-dependent toolchains, larger VRAM for big models, and upgradeability for long-running research. Cost, software support, and model scale determine the better choice.
wuwangzhang1216/abliterix: Automated alignment adjustment for LLMs — direct steering, LoRA, and MoE expert-granular abliteration, optimized via mul
A user reports successfully running the Qwen3.5-35B model (unsloth Qwen3.5-35B-A3B-UD-Q4_K_L) with llama.cpp on a Windows 11 workstation (i7-13700F, 64GB RAM) and an NVIDIA RTX 4060 Ti 16GB, achieving about 60 tokens/sec with a stable 64k context. They share configuration details and model.ini tweaks that enable memory-efficient quantized inference on the GPU, illustrating practical optimizations for hosting large open models locally on consumer hardware. This matters because it shows that mid-range GPUs can run 35B-class models with careful tuning and quantization, lowering the barrier for developers and researchers to experiment with powerful LLMs without cloud costs.
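As a rough sketch of what such a setup involves (this is not the poster's configuration; it assumes the llama-cpp-python bindings and a placeholder model path rather than their llama.cpp model.ini tweaks):

```python
# Hedged sketch: load a Q4-quantized GGUF with GPU offload and a long context
# window via llama-cpp-python. The model path is a hypothetical placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3.5-35B-A3B-UD-Q4_K_L.gguf",  # placeholder path
    n_gpu_layers=-1,   # -1 offloads all layers; lower it if 16GB VRAM runs out
    n_ctx=65536,       # roughly the 64k context reported in the post
    flash_attn=True,   # reduces attention memory at long context
    n_threads=8,       # leave headroom on the i7-13700F
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain what a KV cache does."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```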
Qwen 3.5 35B is highlighted as a top-performing local LLM, praised for delivering capabilities comparable to larger models despite its 35-billion-parameter size. Community users on Reddit’s LocalLLaMA forum report strong inference quality and responsiveness when running the model locally, noting efficient resource use that makes it practical for desktops and small servers. The model’s balance of performance and footprint matters for developers, researchers, and privacy-conscious users who need powerful on-device AI without cloud dependence. This makes Qwen 3.5 35B relevant for local deployment, edge AI use cases, and teams evaluating alternatives to much larger, cloud-hosted models.
Turbo Quant briefly trended about two weeks ago after community contributors submitted pull requests to llama.cpp to add or improve quantization support; since then, discussion has largely quieted and the original Reddit poster is asking for an update. Key players include the open-source llama.cpp project and contributors exploring lower-precision quantization formats to speed local inference. Why it matters: efficient quantization like Turbo Quant can reduce model size and improve CPU/GPU inference speed, enabling broader local deployment of LLMs without specialized hardware. The current status appears to be that work is ongoing in community forks and PRs, but no major upstream release or universal adoption has been announced.
A deep-dive by oobabooga examines GGUF conversions of Google’s Gemma 4 and Qwen 3.5 models, detailing file sizes, tokenization, performance trade-offs, and compatibility with local LLM runtimes. The analysis compares precision formats, memory footprints, and loading times, highlighting practical implications for running these large models on consumer hardware, including latency and VRAM requirements. Key players include Gemma 4, Qwen 3.5, and the oobabooga tooling/community; the write-up matters because GGUF is becoming a standard container for offline deployment, influencing accessibility of advanced models outside cloud services. The piece helps developers and hobbyists optimize local setups, choose quantization strategies, and understand interoperability across open-source inference stacks.
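As a back-of-the-envelope illustration of the VRAM arithmetic behind such comparisons (the bits-per-weight and model-shape numbers below are rough, assumed values, not figures from oobabooga’s write-up):

```python
# Rough GGUF sizing sketch: weight memory is about parameters * bits-per-weight,
# plus a KV cache that grows linearly with context length.
# All constants here are illustrative assumptions.
def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    # factor of 2 for keys and values; fp16 cache by default
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

# Hypothetical 27B dense model: 46 layers, 8 KV heads of dim 128, 32k context
for name, bpw in [("Q4_K_M (~4.8 bpw)", 4.8), ("Q8_0 (~8.5 bpw)", 8.5)]:
    total = weight_gb(27, bpw) + kv_cache_gb(46, 8, 128, 32768)
    print(f"{name}: ~{total:.1f} GB for weights plus a 32k-token KV cache")
```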
The llama.cpp project updated its Vulkan backend to require the SPIR-V headers; anyone building Vulkan-accelerated llama.cpp from source must now install them (e.g., the spirv-headers package) for compilation to succeed. Community reports on Reddit prompted discussion after builds failed until the dependency was added to install instructions and packaging. This change affects developers and hobbyists who build llama.cpp with GPU acceleration on Vulkan-capable systems, making setups slightly more complex but aligning build requirements with Vulkan shader compilation needs. It matters because clearer dependency management reduces build breakage and improves reproducibility for local AI inference tooling.
A Reddit user reports running llama.cpp on a 4 GB RAM Chromebook, demonstrating that lightweight on-device inference for Llama-family models is possible on low-end hardware. The post includes a screenshot and links to the LocalLLaMA subreddit, suggesting community interest and practical tips for setup. This matters because it highlights accessibility of local, privacy-preserving AI inference on inexpensive devices, lowering barriers for hobbyists and developers experimenting with LLMs without cloud costs. It also underscores trade-offs such as model size, performance, and potential need for swap or model quantization to fit memory-constrained systems. The example signals growing demand for optimized runtimes and tools for running LLMs on edge devices.
Gemma 4 + llama.cpp: audio processing landed in llama-server
Taking on CUDA with ROCm: 'One Step After Another'
Simon Willison demonstrated transcribing audio on macOS using the 10.28 GB Gemma 4 E2B model with MLX and the mlx-vlm tool via an uv run command. He shared a concise recipe that installs mlx_vlm, torchvision, and gradio, then calls mlx_vlm.generate with a prompt to transcribe a WAV file. Testing on a 14-second clip, Gemma 4 produced a near-accurate transcription with minor mishearings (“front” for “right” and “how that works” instead of “how well that works”). The post highlights practical local usage of an LLM-based VLM for speech-to-text, showing Gemma 4’s capability and current limitations for developers experimenting with on-device or lightweight multimodal transcription. Key players: Gemma 4, MLX, mlx-vlm, uv.
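A minimal sketch of driving that kind of recipe from Python (this is not Willison’s exact command; the model id and the --audio flag are assumptions inferred from the post’s description):

```python
# Assembles and runs a uv command similar in shape to the recipe described
# above. The model id and the --audio flag are assumptions, not the post's
# verbatim invocation.
import subprocess

cmd = [
    "uv", "run",
    "--with", "mlx-vlm", "--with", "torchvision", "--with", "gradio",
    "python", "-m", "mlx_vlm.generate",
    "--model", "<gemma-4-e2b-mlx-model-id>",  # placeholder model id
    "--max-tokens", "256",
    "--prompt", "Transcribe this audio clip.",
    "--audio", "clip.wav",                    # assumed flag name for audio input
]
subprocess.run(cmd, check=True)
```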
Unsloth Studio (Beta) launches as an open-source, no-code local web UI to run, train, and export GGUF and safetensors models across Windows, macOS, Linux, and WSL. It supports running models locally (via llama.cpp and Hugging Face), multi-GPU inference, and CPU/Mac chat-only inference, while offering no-code training kernels optimized for LoRA, FP8, and other techniques to fine-tune 500+ text, vision, TTS, and embedding models (including Qwen3.5 and NVIDIA Nemotron 3). Features include Data Recipes to auto-create datasets from PDFs/CSV/JSON, observability dashboards for training metrics, a model-comparison Arena, export to safetensors/GGUF, and privacy-focused offline usage with token-based auth. The beta notes installation limitations (llama.cpp compilation) and upcoming improvements such as precompiled binaries, broader hardware support, and a Docker image.
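For a sense of what the no-code LoRA training wraps, a minimal sketch using the underlying unsloth Python library (the model id and hyperparameters here are illustrative assumptions, not Studio defaults):

```python
# Sketch of a LoRA fine-tune with the unsloth library that Unsloth Studio
# builds on; the model id and hyperparameters are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-35B-A3B",  # hypothetical hub id
    max_seq_length=4096,
    load_in_4bit=True,                      # QLoRA-style 4-bit base weights
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,                                   # LoRA rank
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
# From here the model plugs into a standard TRL SFTTrainer loop, then exports
# to safetensors or GGUF for local serving.
```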