Cohere’s open model and GPU memory strategies meet quantization trend

Recent developments show an industry push to make large language models more efficient and deployable on limited GPU memory. A 2026 GPU memory guide updates practical calculations for FP16/INT8 quantization, activation checkpointing, sharding, and offloading to fit modern LLMs on consumer and datacenter GPUs, highlighting FlashAttention and quantized kernels. In parallel, Cohere’s open-source Command A+ demonstrates lossless 4-bit W4A4 quantization with attention preserved at full precision, enabling a 218B-parameter MoE to run on one or two GPUs with BF16/FP8 fallbacks. Together these stories underscore a trend toward aggressive, practical quantization and memory engineering to enable high-performance, on-premises LLMs.

Latest Changes

Cohere open-sources Command A+ (218B total, 25B active) under Apache 2.0 for agentic tasks

Command A+ demonstrates lossless 4-bit W4A4 quantization with full-precision attention preserved

2026 GPU memory guide updates practical FP16/INT8 math, activation checkpointing, sharding, and offloading

Guide highlights FlashAttention and quantized kernels as key enablers for tighter memory footprints

Timeline

2026-05-20 — GPU Memory Math 2026 edition published updating memory estimates and strategies for modern LLMs

2026-05-20 — Announcement that Cohere achieved lossless quantization and native citations for Command A+

2026-05-21 — Cohere releases Command A+, a 218B-parameter sparse MoE model with 25B active params under Apache 2.0

Recent News (4)

Cohere releases Command A+, a sparse MoE open model built for agentic tasks, with 218B total and 25B active parameters, its first under the Apache 2.0 license (Carl Franzen/VentureBeat)

Carl Franzen / VentureBeat: Canadian AI lab Cohere made waves recently by announcing a merger with German AI startup Aleph Alpha, but now it has even more in store …

GPU Memory Math for LLMs (2026 Edition)

A 2026 refresher post outlines GPU memory calculations for running large language models (LLMs) locally, with updated estimates for the memory taken by model parameters, optimizer state, activations, and attention on modern GPUs. It walks through memory footprints for popular model sizes and architectures, covering FP16/INT8 quantization, activation checkpointing, tensor parallelism, and batch-size trade-offs to determine which GPUs can host models like LLaMA variants. The guide highlights practical steps to reduce VRAM needs (model sharding, offloading to CPU/NVMe, and using FlashAttention or quantized kernels) and notes the trade-offs between performance and memory. This matters for developers, researchers, and startups that want to cut inference and training costs or run LLMs locally and privately.
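
As a rough illustration of the kind of arithmetic such a guide covers, the Python sketch below estimates weight memory at a given precision and KV-cache size for a given context length. The function names and the KV-cache configuration (32 layers, 8 KV heads, head dimension 128) are hypothetical placeholders, not figures from the guide or from Cohere; only the 218B parameter count comes from the coverage above. The results are lower bounds that ignore activations, quantization scales, and runtime overhead.

```python
GB = 1_000_000_000  # decimal gigabytes, used only for display


def weight_bytes(n_params: int, bits_per_param: int) -> float:
    """Bytes needed to store the weights alone at a given precision."""
    return n_params * bits_per_param / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes for the KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    n_params = 218_000_000_000  # total parameter count reported for Command A+

    # Weights-only footprint at 16-bit, 8-bit, and 4-bit precision.
    for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
        print(f"{label} weights: {weight_bytes(n_params, bits) / GB:.0f} GB")

    # Hypothetical attention configuration (assumption, not a real model's).
    cache = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128,
                           seq_len=8192, batch_size=1)
    print(f"KV cache (assumed config): {cache / 2**30:.1f} GiB")
```

Under these assumptions, the weights of a 218B-parameter model take about 436 GB at 16 bits, 218 GB at 8 bits, and 109 GB at 4 bits, before any cache or activation memory is counted.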
