Recent developments show an industry push to make large language models more efficient and deployable on limited GPU memory. A 2026 GPU memory guide updates practical calculations for FP16/INT8 quantization, activation checkpointing, sharding, and offloading to fit modern LLMs on consumer and datacenter GPUs, highlighting FlashAttention and quantized kernels. In parallel, Cohere’s open-weights Command A+ ships in a 4-bit W4A4 quantized format that keeps attention at full precision and that Cohere says avoids accuracy loss, alongside BF16 and FP8 versions; the company says the 218B-parameter MoE can run on one or two GPUs. Together these stories underscore a trend toward aggressive, practical quantization and memory engineering to enable high-performance, on-premises LLMs.
Shrinking model memory while preserving performance lets teams deploy large, capable LLMs on fewer GPUs or on on-premises hardware. Engineering teams will need to adapt their infrastructure, tooling, and quantization workflows to take advantage of these efficiency gains.
Dossier last updated: 2026-05-21 09:16:35
Cohere releases Command A+, a sparse MoE open model built for agentic tasks, with 218B total and 25B active parameters, its first under the Apache 2.0 license (Carl Franzen/VentureBeat)
Carl Franzen / VentureBeat: Canadian AI lab Cohere made waves recently by announcing a merger with German AI startup Aleph Alpha, but now it has even more in store …
A 2026 refresher post outlines GPU memory calculations for running large language models (LLMs) locally, updating model parameter, optimizer, activation, and attention memory estimates for modern GPUs. It walks through memory footprints for popular model sizes and architectures, covering FP16/INT8 quantization, activation checkpointing, tensor parallelism, and batch-size trade-offs to determine which GPUs can host models like LLaMA variants. The guide highlights practical steps — model sharding, offloading to CPU/NVMe, and using FlashAttention or quantized kernels — to reduce VRAM needs, and notes performance vs. memory trade-offs. This matters for developers, researchers, and startups optimizing inference/training costs and enabling local/private LLM deployments.
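The kind of estimate the guide describes can be sketched in a few lines. This is a minimal back-of-the-envelope sketch, not the guide's own method: the function names, the 7B model shape (32 layers, 32 KV heads, head dimension 128), and the 16-bytes-per-parameter rule of thumb for mixed-precision Adam training are all assumptions chosen for illustration.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_bytes(n_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone at a given precision."""
    return n_params * bits_per_param / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: one K and one V tensor per layer, per token, per sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def adam_training_bytes(n_params: float) -> float:
    """Rule of thumb (assumption): mixed-precision Adam needs about 16 bytes/param
    (2 fp16 weights + 2 fp16 grads + 12 for fp32 master weights and two moments).
    Activations are NOT included."""
    return 16 * n_params


# Hypothetical 7B-parameter model; the shape below is an illustrative assumption.
params = 7e9
fp16 = weight_bytes(params, 16)
int8 = weight_bytes(params, 8)
kv = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128,
                    seq_len=4096, batch_size=1)

print(f"FP16 weights:     {fp16 / GIB:.1f} GiB")
print(f"INT8 weights:     {int8 / GIB:.1f} GiB")
print(f"KV cache (4k):    {kv / GIB:.1f} GiB")
print(f"Adam train state: {adam_training_bytes(params) / GIB:.1f} GiB")
```

Halving the bits per weight halves the weight footprint, while the KV cache grows linearly with both sequence length and batch size, which is why batch size is one of the trade-offs the guide covers. Real usage will be higher than these figures because of activations, framework overhead, and memory fragmentation.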
Cohere released Command A+, a 218-billion-parameter sparse Mixture-of-Experts (MoE) decoder model aimed at complex reasoning, multimodal document processing, and agentic workflows, and published the weights under an Apache 2.0 license on Hugging Face. Only about 25B of those parameters are active at a time, and the model ships in BF16, FP8, and a novel W4A4 4-bit quantized format that keeps attention pathways at full precision and uses Quantization-Aware Distillation to avoid accuracy loss. Cohere says Command A+ can run on a single NVIDIA Blackwell B200 or two H100 GPUs, delivers up to 63% higher throughput and 17% lower latency than its predecessor, and includes a revamped tokenizer with native support for 48 languages. This open, highly optimized release targets enterprise sovereignty and on-premises deployment at frontier-model performance.
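A rough weight-only calculation shows why the 4-bit format matters for the hardware claim. This is an illustrative sketch, not Cohere's published sizing: the GPU capacities (192 GB for one B200, 2 x 80 GB for two H100s) are assumed nominal figures, and the estimate treats every parameter as stored at the stated width, ignoring the full-precision attention layers, quantization scales, KV cache, and activations.

```python
TOTAL_PARAMS = 218e9  # total parameters, from the announcement

# Bits per weight for each released format.
FORMAT_BITS = {"BF16": 16, "FP8": 8, "W4A4": 4}

# ASSUMPTION: nominal memory capacities in GB; check your actual hardware.
GPU_CAPACITY_GB = {"1x B200": 192, "2x H100": 2 * 80}


def weight_gb(n_params: float, bits: int) -> float:
    """Weight-only footprint in decimal gigabytes."""
    return n_params * bits / 8 / 1e9


footprints = {fmt: weight_gb(TOTAL_PARAMS, bits) for fmt, bits in FORMAT_BITS.items()}

# For each format, list the GPU setups whose capacity exceeds the weight footprint.
fits = {
    fmt: [gpu for gpu, cap in GPU_CAPACITY_GB.items() if gb < cap]
    for fmt, gb in footprints.items()
}

for fmt, gb in footprints.items():
    where = ", ".join(fits[fmt]) or "neither setup"
    print(f"{fmt:5s} ~{gb:.0f} GB of weights -> fits on: {where}")
```

Under these assumptions only the 4-bit weights (about 109 GB) leave headroom on either setup; the BF16 (about 436 GB) and FP8 (about 218 GB) weights exceed both. All 218B parameters must be resident even though only about 25B are active at a time, because any expert may be selected.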