How to get 3k tokens/sec single‑request LLM decoding on commodity GPUs — and why it matters now

By yrzhe | May 29, 2026 | 7 min read

Yes—~3,000 tokens/sec per request is achievable on standard datacenter GPUs today, but only if you optimize for single-request decoding (batch=1) and treat autoregressive generation as a memory‑bandwidth problem. In a public tech preview, Kog AI reports ~3,000 output tokens/sec on 8× AMD MI300X and ~2,100 tokens/sec on 8× NVIDIA H200 for a 2B dense FP16 model, with no speculative decoding. The catch is that these numbers come from a tightly co-designed model/runtime/kernel stack aimed at saturating memory bandwidth and minimizing KV‑cache traffic—not from “turning a knob” in a general-purpose batching-first inference server.

The mechanism: why decoding becomes bandwidth-bound

The practical reason single-request decoding is hard to accelerate is that each next token forces you to read and write a growing amount of state—especially the key/value cache used by attention. For batch=1, you can’t “hide” this cost behind other requests the way throughput-optimized stacks do. Multiple studies and recent systems analyses point to the same set of bottlenecks in transformer decoding—memory bandwidth, memory capacity, compute capacity, and synchronization overhead—where KV-cache movement is often dominant for latency-first generation, especially as context grows.

Builder consequence: if you’re trying to improve tokens/sec for one live user, you usually get farther by reducing memory movement (cache size, cache precision, copies, syncs) than by chasing more FLOPS.
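
To make the bandwidth argument concrete, here is a rough back-of-envelope sketch. It assumes that each decoded token has to stream every weight from HBM once, that the model is perfectly sharded across all eight GPUs, and that the published peak HBM bandwidth figures (about 5.3 TB/s per MI300X and 4.8 TB/s per H200) are actually reachable. None of this is Kog's own math, and the function name is made up for illustration.

```python
# Back-of-envelope decode ceiling: at batch=1, every generated token must
# stream the full set of weights (plus any KV cache) from HBM at least once.

def decode_ceiling_tokens_per_sec(bandwidth_bytes_per_sec, weight_bytes, kv_bytes=0):
    """Upper bound on tokens/sec if memory bandwidth were the only limit."""
    bytes_per_token = weight_bytes + kv_bytes
    return bandwidth_bytes_per_sec / bytes_per_token

# 2B parameters at FP16 (2 bytes each), as in the Kog preview.
WEIGHT_BYTES = 2e9 * 2

# Assumption: published peak HBM bandwidth per GPU, summed across 8 GPUs.
MI300X_X8 = 8 * 5.3e12  # 5.3 TB/s per MI300X
H200_X8 = 8 * 4.8e12    # 4.8 TB/s per H200

for name, bandwidth in [("8x MI300X", MI300X_X8), ("8x H200", H200_X8)]:
    ceiling = decode_ceiling_tokens_per_sec(bandwidth, WEIGHT_BYTES)
    print(f"{name}: ideal ceiling ~{ceiling:,.0f} tokens/sec")
```

Under these idealized assumptions the ceilings come out near 10,600 and 9,600 tokens/sec, so the reported ~3,000 and ~2,100 sit at roughly a quarter of the weight-streaming bound. The rest is what synchronization, KV-cache traffic, and cross-device communication cost in practice, which this sketch deliberately ignores.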

The two metrics that actually describe UX

For interactive agents, coding assistants, and streaming chat, “throughput” is often the wrong headline metric. Two metrics matter more:

Single-request decoding speed (tokens/sec at batch=1): how fast you can keep streaming once generation starts.
TTFT (time to first token): how long users wait before they see anything.

NVIDIA’s LLM benchmarking guidance explicitly treats TTFT separately from steady-state token generation, because systems can look “fast” in tokens/sec while still feeling sluggish if prompt processing, scheduling, or runtime overhead delay first output.

Builder consequence: if you only optimize tokens/sec, you can still lose the product war on perceived latency. Instrument TTFT and tokens/sec separately, and don’t accept “kernel speedups” without end-to-end timing.
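
One way to instrument the two metrics separately is to wrap whatever token stream your client exposes. The sketch below is a minimal illustration: `measure_stream` and `fake_stream` are hypothetical names, and the fake stream stands in for a real streaming model client, which is assumed to be an iterable that yields one token at a time.

```python
import time

def measure_stream(token_iter, clock=time.perf_counter):
    """Consume a token stream; return (ttft_seconds, decode_tokens_per_sec, n_tokens)."""
    start = clock()
    first = None
    last = None
    count = 0
    for _ in token_iter:
        now = clock()
        if first is None:
            first = now  # arrival of the first token defines TTFT
        last = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    ttft = first - start
    # Steady-state rate excludes the wait for the first token.
    if count > 1 and last > first:
        decode_tps = (count - 1) / (last - first)
    else:
        decode_tps = float("nan")
    return ttft, decode_tps, count

def fake_stream(n_tokens, prefill_s, per_token_s):
    """Hypothetical stand-in for a streaming model client."""
    time.sleep(prefill_s)  # simulated prompt processing and scheduling
    for i in range(n_tokens):
        if i:
            time.sleep(per_token_s)  # simulated per-token decode time
        yield f"tok{i}"

ttft, tps, n = measure_stream(fake_stream(50, prefill_s=0.05, per_token_s=0.002))
print(f"TTFT {ttft * 1000:.1f} ms, decode {tps:.0f} tokens/sec over {n} tokens")
```

Measured this way, a stack with a fast decode loop but a slow first token shows up as exactly that, instead of being averaged into one flattering number.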

What Kog’s preview implies: co-design beats batching-first stacks

Kog’s core claim is not that MI300X or H200 suddenly became magical; it’s that typical inference stacks are tuned for aggregate throughput via batching, leaving substantial batch=1 performance unused. Their preview positions single-request decoding as primarily limited by memory traffic and synchronization, and argues for co-design across model architecture, runtime scheduling, and low-level kernels to approach hardware ceilings.

They also report (via Kog/AI in Use writeups) up to ~3.5× faster token generation versus common stacks like vLLM and TensorRT in some evaluations, emphasizing latency-sensitive workloads.

Builder consequence: “faster model serving” isn’t one layer. If your runtime adds copies, synchronization points, or batching-oriented scheduling, it can erase gains from better kernels.

Practical levers you can actually pull (and what each buys you)

If you’re trying to reproduce the shape of these gains—even if you can’t reproduce Kog’s exact stack—there are a few levers that map directly to the bandwidth thesis:

Kernel/runtime alignment: reduce data copies, fuse operations where possible, and avoid unnecessary synchronization—especially across devices. The goal is to keep the GPU fed while minimizing memory round-trips. This is where synchronization overhead can quietly dominate at batch=1.
Reduce KV-cache pressure: smaller caches move less data. Recent techniques discussed in the literature include KV-cache quantization in the 2–4 bit range to shrink bandwidth and memory footprint, and methods like BitDecoding that aim to exploit tensor-core-friendly paths for long contexts.
Precision strategy: Kog’s headline numbers use FP16 on a 2B dense model. Their broader argument (including community back-and-forth) is that moving to FP8 or mixed quantization changes the effective memory footprint and bandwidth needs, which is what you must win to raise single-request decode speed.

Builder consequence: the “best” lever depends on whether your bottleneck is bandwidth, capacity, or cross-device overhead. You need profiling that tells you which wall you’re hitting.
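
To see how much the KV-cache lever can buy, it helps to size the cache directly. The sketch below uses the standard keys-plus-values count for full attention; the model shape is a hypothetical example (not Kog's model), and the byte count ignores the scales and zero-points that real quantization schemes store alongside the values.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    """Bytes for the keys and values of one sequence (ignores quantization metadata)."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len  # 2 = keys and values
    return values * bits_per_value / 8

# Hypothetical model shape, for illustration only.
shape = dict(n_layers=24, n_kv_heads=8, head_dim=128)

for seq_len in (4_096, 32_768):
    for bits in (16, 8, 4, 2):
        mib = kv_cache_bytes(seq_len=seq_len, bits_per_value=bits, **shape) / 2**20
        print(f"context {seq_len:>6}, {bits:>2}-bit KV: {mib:8.1f} MiB per sequence")
```

With full attention, each decode step reads the whole cache, so this figure is also roughly the KV traffic per generated token. For this assumed shape, a 4,096-token context at 16 bits is 384 MiB, and dropping to 4 bits cuts that to a quarter.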

Tradeoffs and reproducibility: what breaks when you scale the model

The most important caveat in the Kog preview is the benchmark scope: a 2B dense FP16 model is much easier to optimize and easier to make look good than frontier-scale dense models. Community commentary (including on Hacker News) pushed on fairness and comparability: small-model demos don’t automatically translate to 70B–120B deployments, and missing end-to-end metrics (TTFT, prompt overhead, networking) can hide real costs.

Kog’s response (per their blog and discussion) is essentially: model choice was for implementation clarity, and large models—especially MoE variants—may have a smaller set of active parameters per token. They provide example math where a 120B-class MoE variant might have only ~5.1B active parameters, suggesting that with FP8 and other compression, the active footprint could land in the same ballpark as the tested 2B FP16 setup, making >1k tokens/sec a plausible target on MI300X/H200 with more work.

Builder consequence: you can’t extrapolate from “2B dense FP16” to “frontier dense” without accounting for active parameters, KV-cache size, precision, and multi-GPU communication costs.
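
The active-parameter argument is simple arithmetic, sketched below. The only inputs taken from the preview are 2B dense FP16, ~5.1B active parameters, and ~3,000 tokens/sec; the final scaling step is a naive bandwidth-only assumption of mine, not a number Kog published, and it ignores expert routing, KV-cache growth, and multi-GPU communication.

```python
def active_weight_gb(active_params_billion, bytes_per_param):
    """Weight bytes streamed per token, in GB (1B params x 1 byte = 1 GB)."""
    return active_params_billion * bytes_per_param

tested_gb = active_weight_gb(2.0, 2)    # 2B dense model at FP16
moe_fp8_gb = active_weight_gb(5.1, 1)   # ~5.1B active parameters at FP8
moe_fp16_gb = active_weight_gb(5.1, 2)  # same model without FP8, for contrast

print(f"tested 2B FP16:     {tested_gb:.1f} GB per token")
print(f"MoE active at FP8:  {moe_fp8_gb:.1f} GB per token")
print(f"MoE active at FP16: {moe_fp16_gb:.1f} GB per token")

# Naive assumption: tokens/sec scales inversely with bytes streamed per token.
REPORTED_TPS = 3000.0
naive_tps = REPORTED_TPS * tested_gb / moe_fp8_gb
print(f"naive bandwidth-only estimate: ~{naive_tps:.0f} tokens/sec")
```

The FP8 active footprint is about 1.3× the tested setup, which is why a target above 1k tokens/sec is arguable on paper. The naive estimate is an optimistic bound: an MoE still has to hold every expert in memory and route across devices, so real numbers would need the end-to-end measurements the community asked for.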

Why It Matters Now

A public tech preview that shows ~3,000 tokens/sec per request on standard GPUs shifts the builder conversation from “real-time decoding needs exotic hardware” to “real-time is a systems problem.” The immediate impact is practical: it gives teams a concrete performance reference point for batch=1 decoding, and it reframes optimization work around memory-bandwidth ceilings and end-to-end latency measurement, not just model choice.

It also lands amid active community scrutiny: the debate isn’t whether the kernels are fast, but whether the numbers hold under realistic agent workloads—long prompts, tool calls, and networked orchestration. If you’re building agentic systems (especially multi-step ones; see “Claude Code goes dynamic — practical wins for agent builders”), faster single-request decoding directly reduces loop time per step, which is often what users experience as “the agent is thinking.”

How this changes agent design and cost models

When per-request decoding gets materially faster, the limiting factor in agent UX often shifts to everything around decoding: prompt assembly, tool latency, scheduling, and communication overhead. That pushes two design moves:

Use latency budgets explicitly: measure TTFT and tokens/sec for each step of your agent loop, not just the model call. If your orchestration adds synchronous round trips, you may squander the gain.
Reframe cost: instead of optimizing for “tokens/sec per GPU at high batch,” track cost per interactive step (time and GPU occupancy per agent turn at batch=1), because that’s the unit users feel.
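
Both moves can be captured in a small per-step record. The sketch below is one possible shape, not a standard: the `AgentStep` fields, the timings, and the $2.00 per GPU-hour price are all hypothetical, and treating the whole TTFT as GPU-busy time is a simplification (in practice it also contains queueing and network time).

```python
from dataclasses import dataclass

@dataclass
class AgentStep:
    """Timings for one agent turn; all field names are illustrative."""
    ttft_s: float            # measured time to first token
    output_tokens: int       # tokens generated in this step
    decode_tps: float        # measured batch=1 tokens/sec
    tool_s: float = 0.0      # time spent waiting on tools
    orchestration_s: float = 0.0  # prompt assembly, round trips, scheduling

    @property
    def decode_s(self) -> float:
        return self.output_tokens / self.decode_tps

    @property
    def gpu_busy_s(self) -> float:
        # Simplification: counts all of TTFT as GPU occupancy.
        return self.ttft_s + self.decode_s

    @property
    def wall_s(self) -> float:
        return self.gpu_busy_s + self.tool_s + self.orchestration_s

def step_cost_usd(step: AgentStep, gpus: int, usd_per_gpu_hour: float) -> float:
    """GPU cost attributed to one interactive step at batch=1."""
    return step.gpu_busy_s * gpus * usd_per_gpu_hour / 3600.0

for tps in (100.0, 3000.0):
    step = AgentStep(ttft_s=0.2, output_tokens=300, decode_tps=tps,
                     tool_s=0.8, orchestration_s=0.1)
    share = step.decode_s / step.wall_s
    cost = step_cost_usd(step, gpus=8, usd_per_gpu_hour=2.0)
    print(f"{tps:>6.0f} tok/s: step {step.wall_s:.2f} s, "
          f"decode share {share:.0%}, cost ${cost:.4f}")
```

With these made-up numbers, decoding falls from roughly three quarters of the step to under a tenth once it runs at 3,000 tokens/sec, and tool latency becomes the largest remaining term. That is the shift in bottleneck described above.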

If you’re also experimenting with hooks or automation inside developer tooling, beware that shaving model decode time can expose new bottlenecks or failure modes in the orchestration layer; sandboxing and safe integration become more important as loops speed up (related: “What Claude Code’s undocumented hooks really do — and how to sandbox them safely”).

A checklist to try this yourself this month

Start by measuring: record TTFT and batch=1 tokens/sec on your current stack for representative prompts (including your longest contexts).
Validate the bottleneck: profile memory bandwidth utilization and look for stalls tied to KV-cache reads/writes and cross-device synchronization.
Reduce KV-cache footprint: try KV-cache quantization (2–4 bit) where supported, and evaluate quality impact on your workload.
Simplify the serving path: avoid batching-first scheduling and excess framework overhead for your latency-critical endpoints.
Re-benchmark end-to-end: include network, prompt processing, and streaming. Kernel peaks don’t equal user-perceived speed.
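
Before turning on KV-cache quantization in a real stack, it can help to get a feel for the error it introduces. The sketch below is a generic group-wise uniform quantizer in plain Python, run on random numbers as a stand-in for real key/value activations. It is not BitDecoding or any particular library's scheme, and a real evaluation should measure task quality on your own workload, not reconstruction error on synthetic data.

```python
import random

def quantize_group(values, bits):
    """Asymmetric uniform quantization of one group to `bits` bits."""
    levels = (1 << bits) - 1
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo

def dequantize_group(codes, scale, lo):
    """Map integer codes back to approximate float values."""
    return [lo + c * scale for c in codes]

def mean_abs_error(values, bits, group_size=32):
    """Average reconstruction error after a quantize/dequantize round trip."""
    total = 0.0
    for i in range(0, len(values), group_size):
        group = values[i:i + group_size]
        codes, scale, lo = quantize_group(group, bits)
        restored = dequantize_group(codes, scale, lo)
        total += sum(abs(a - b) for a, b in zip(group, restored))
    return total / len(values)

random.seed(0)
# Synthetic stand-in for real K/V activations.
fake_keys = [random.gauss(0.0, 1.0) for _ in range(4096)]
for bits in (8, 4, 2):
    print(f"{bits}-bit: mean abs error {mean_abs_error(fake_keys, bits):.4f}")
```

The error grows as the bit width shrinks, which is the tradeoff to weigh against the bandwidth saved. Group size is the other knob: smaller groups track local ranges better but add more scale and offset metadata per value.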

What to Watch

Whether Kog publishes more reproducible, end-to-end benchmarks (including TTFT and prompt overhead) across more model classes beyond 2B dense FP16.
How quickly KV-cache quantization and long-context decoding techniques (including BitDecoding-style approaches) become robust and widely usable in production stacks.
Whether the next wave of “fast decoding” work is constrained more by multi-GPU communication and scheduling overhead than by raw HBM bandwidth.

Sources: blog.kog.ai, news.ycombinator.com, aiinuse.org, docs.nvidia.com, arxiv.org, arxiv.org
