# How to Pick the Best Local LLM for Your Hardware (Using WhichLLM)
Pick the best local LLM by matching benchmark-backed model rankings to your exact hardware constraints—not by grabbing the biggest parameter count that “seems like it should fit.” WhichLLM does this by auto-detecting your CPU/RAM/GPU (and especially VRAM), filtering to models that should actually run locally in the right quantized formats, then ranking the remainder by recency-aware, normalized benchmark scores for your specific use case (general chat, coding, math, multimodal).
## The core idea: “fits your machine” + “scores well for your job”
Local LLM selection fails in two predictable ways: you pick a model that doesn’t fit (wrong VRAM/format/backend) or one that fits but underperforms for what you’re doing. WhichLLM is designed to avoid both. It’s a tool and benchmarking service that ranks 3,000+ models and recommends what’s best for your system, emphasizing runnable formats, unified benchmark scores, and practical speed/latency signals like tokens/sec.
That last point matters because “best” is rarely a single axis. A top unified scorer might be too slow for interactive use, while a slightly lower scorer in a tighter quantization can feel dramatically better day-to-day.
## How WhichLLM makes its recommendation (and why it’s different)
WhichLLM’s recommendations come from two actively maintained sources, updated twice daily:
- Chatbot Arena: human, blind-comparison votes (a proxy for user preference or “vibes”), updated daily.
- ZeroEval: automated evaluation across reasoning, math, coding, knowledge, and multimodal.
Instead of trusting one leaderboard, WhichLLM aggregates and then normalizes results so very different tests can be compared and combined. Per benchmark, raw scores are min–max normalized onto a 0–100 scale:
> normalized = ((raw − min) / (max − min)) × 100
Then it produces per-use-case “unified scores” as weighted averages across benchmark groups. A “general” profile, for example, puts substantial weight on Arena plus key ZeroEval groups (reasoning/knowledge/math/multimodal). If a model lacks a benchmark group, that group is excluded and the weights are renormalized, so models aren’t unfairly punished for missing evaluations.
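To make the mechanics concrete, here is a minimal Python sketch of that pipeline. The group names and weights are illustrative placeholders, not WhichLLM’s actual values:

```python
def normalize(raw: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw benchmark score onto a 0-100 scale."""
    return (raw - lo) / (hi - lo) * 100


def unified_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average over benchmark groups (scores already on 0-100).

    Groups missing from `scores` are dropped and the remaining weights
    renormalized, so a model isn't punished for unevaluated groups.
    """
    present = {g: w for g, w in weights.items() if g in scores}
    total = sum(present.values())
    if total == 0:
        raise ValueError("no overlap between model scores and profile weights")
    return sum(scores[g] * w / total for g, w in present.items())


# Hypothetical "general" profile weights (illustrative, not WhichLLM's values).
general = {"arena": 0.4, "reasoning": 0.2, "knowledge": 0.2,
           "math": 0.1, "multimodal": 0.1}

# This model was never evaluated on multimodal benchmarks, so that group is
# excluded and the remaining weights are scaled back up to sum to 1.
scores = {"arena": 82.0, "reasoning": 75.0, "knowledge": 70.0, "math": 68.0}
print(round(unified_score(scores, general), 1))  # 76.2
```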
Two design choices are especially practical for local runners:
- Benchmark-aware beats size-aware. WhichLLM explicitly positions itself as “ranked by real, recency-aware benchmarks, not parameter count.” Its own CLI demo even calls out the common trap: a larger model may fit your GPU, yet still rank below a smaller one because the smaller one performs better on the benchmarks that matter.
- Recency-aware inputs. WhichLLM also excludes stale or unmaintained leaderboards, aiming to avoid systematically favoring older models simply because they were evaluated everywhere first.
If you’ve ever wondered why community consensus flips quickly as new quantized ports arrive, this is the reason: what’s “best” is a moving target, and tools that don’t account for that tend to freeze in time.
## Step-by-step: using WhichLLM to pick a model you can actually run
WhichLLM is distributed as a CLI tool (MIT-licensed, open source) with a PyPI package and a GitHub repo, and it requires Python 3.11+.
A practical workflow looks like this:
- Install and run the CLI.
WhichLLM is meant to be run on the machine where you’ll do inference, because it can auto-detect hardware resources.
- Let it auto-detect (or simulate) hardware.
The tool can detect key constraints—especially VRAM—and then filter to models and quantized variants that fit. It can also simulate a target GPU if you’re deciding what hardware to buy.
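The fit check itself is conceptually simple. The heuristic below is an illustrative sketch, not WhichLLM’s implementation: estimate the quantized weight footprint from parameter count and bits per weight, add headroom for KV cache and runtime overhead, and keep only variants within detected (or simulated) VRAM:

```python
# Approximate bits per weight for common quant levels (GGUF-style quants
# carry scale metadata, so effective bits exceed the nominal 4 or 5).
BITS_PER_WEIGHT = {"Q4": 4.5, "Q5": 5.5, "Q8": 8.5, "FP16": 16.0}

def est_vram_gb(params_b: float, quant: str, overhead_gb: float = 1.5) -> float:
    """Estimate VRAM: weight footprint plus a flat allowance for KV cache
    and runtime overhead (a crude stand-in; real overhead scales with
    context length and backend)."""
    return params_b * BITS_PER_WEIGHT[quant] / 8 + overhead_gb

def fits(params_b: float, quant: str, vram_gb: float) -> bool:
    return est_vram_gb(params_b, quant) <= vram_gb

# What fits a detected (or simulated) 16 GB GPU?
for params_b, quant in [(13, "Q4"), (13, "Q5"), (32, "Q4"), (7, "FP16")]:
    verdict = "fits" if fits(params_b, quant, 16) else "too big"
    print(f"{params_b}B {quant}: {verdict}")
```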
- Choose a use case filter.
Pick the profile that matches your workload: general, coding, math, or multimodal. Under the hood, WhichLLM applies different weights (for example, coding leans heavily on coding-centric evaluations such as SWE-Bench Verified and SciCode).
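Conceptually, switching profiles just swaps the weight table fed into a unified score like the one sketched earlier. The numbers below are placeholders; only the benchmark names come from the article:

```python
# Illustrative profile weight tables; WhichLLM's real values differ.
PROFILE_WEIGHTS = {
    "general": {"arena": 0.4, "reasoning": 0.2, "knowledge": 0.2,
                "math": 0.1, "multimodal": 0.1},
    "coding": {"swe_bench_verified": 0.4, "scicode": 0.3,
               "reasoning": 0.2, "arena": 0.1},
}
```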
- Read the ranked output like a buyer, not a fan.
WhichLLM reports the essentials: model name, parameter count, quantization (e.g., Q4/Q5 variants), unified score, and throughput (tokens/sec). Use it to answer three questions:
- Does it fit my VRAM in this quantization?
- Is it top-ranked for my actual use case?
- Is it fast enough to be usable?
- Pick the top score that meets your latency needs.
If you need responsiveness, you may prefer a smaller or more aggressively quantized model with a modest score drop. If quality is the priority, choose the highest unified scorer that still fits and runs at acceptable throughput.
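In code, that final decision reduces to “highest unified score among candidates that clear your throughput floor.” A small sketch over hypothetical ranked output:

```python
# Hypothetical rows from the ranked output: (name, unified_score, tokens/sec).
# Assume WhichLLM has already filtered these to variants that fit your VRAM.
ranked = [
    ("model-a-32b-q4", 78.4, 9.0),
    ("model-b-13b-q5", 74.1, 22.0),
    ("model-c-13b-q4", 72.8, 31.0),
]

def pick(ranked: list[tuple[str, float, float]], min_tps: float):
    """Highest unified score among models meeting your tokens/sec floor."""
    viable = [row for row in ranked if row[2] >= min_tps]
    return max(viable, key=lambda row: row[1], default=None)

# With a 15 tok/s floor for interactive use, the top overall scorer is too
# slow and a smaller, faster quant wins: ('model-b-13b-q5', 74.1, 22.0)
print(pick(ranked, min_tps=15))
```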
## The practical trade-offs WhichLLM surfaces
WhichLLM is built around the reality that local LLM performance is a three-way negotiation:
- Quality vs. resources: A 32B model might “fit,” but still lose to a 27B or 13B on unified benchmarks for your use case.
- Quantization vs. accuracy: Formats like Q4/Q5 reduce memory use but can affect accuracy. A Q4 quant stores roughly 4.5 bits per weight, so a 13B model shrinks from about 26 GB at FP16 to roughly 7.3 GB, often the difference between “runs” and “doesn’t.” WhichLLM doesn’t pretend quantization is free—it makes it visible.
- Backend and format compatibility: Your stack matters (GGML-style formats, vLLM, transformers + bitsandbytes). WhichLLM’s filtering is focused on what’s runnable given constraints like VRAM and available formats, but you still need to confirm it aligns with the inference backend you plan to use.
The point isn’t that there’s one “best” model. It’s that there’s a best choice within your constraints, and benchmarks plus hardware awareness are the fastest way to find it.
## Why It Matters Now
In 2026, the local model ecosystem is both larger and faster-moving: there are many quantized variants, frequent ports, and multiple major families in circulation (for example, Llama 3.3, Mistral, and Qwen variants). That pace makes manual selection increasingly error-prone: you can waste hours downloading, converting, and testing models that either don’t run well on your hardware or aren’t actually strong on the tasks you care about.
WhichLLM is responding to that moment with a simple promise—“find the best local LLM for your hardware and use case”—and by grounding it in two complementary signals: human preference data (Arena) and multi-domain automated evaluation (ZeroEval), refreshed twice daily. For readers thinking about reliability and safety in real-world deployments, it’s the same lesson we’ve seen elsewhere: tools need to connect system constraints to outcomes (a theme that also shows up in autonomy failures like Why Waymo’s Robotaxis Drove Into Flooded Streets — and How to Stop It).
## A quick checklist before you deploy
- Confirm detected VRAM/CPU/RAM are correct, and note the recommended quant formats.
- Ensure the use-case profile matches your workload (coding vs general vs multimodal).
- Compare the top candidates on unified score + tokens/sec, not score alone.
- Test the top 2–3 models on your real prompts/workflows before standardizing.
## What to Watch
- Leaderboard shifts: WhichLLM updates from Arena and ZeroEval twice daily, so rankings can move when new models or evaluations land.
- Quantization and runtime improvements: Changes in quant formats and inference backends can reshuffle what’s optimal on the same hardware.
- New model releases and ports: As major families and community ports arrive, recency-aware rankings can flip quickly—making periodic re-checks worth it. For daily engineering signals that track fast-moving tooling, see Claude Code, Token-Burn Risks, and UK Sovereign LLMs: Engineering Signals for Devs.
Sources: whichllm.app, github.com, geeksforgeeks.org, aitooldiscovery.com, letsdatascience.com, dasroot.net
## About the Author
yrzhe
AI Product Thinker & Builder. Curating and analyzing tech news at TechScan AI. Follow @yrzhe_top on X for daily tech insights and commentary.