Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

Click a column header to sort. Hover the dotted-underlined labels for definitions. When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
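The range computation behind that column can be sketched in a few lines. This is a hypothetical sketch: the field names (model, output_tok_per_s) are assumptions about the run-YAML schema in content/benchmarks/runs/, not its actual layout.

```python
# Hypothetical sketch of the "gen tok/s" range computation. Field names
# (model, output_tok_per_s) are assumed, not the real YAML schema.
from collections import defaultdict

def tok_s_range(runs):
    """Group per-workload runs by model; return (min, max) gen tok/s."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run["output_tok_per_s"])
    return {m: (min(v), max(v)) for m, v in by_model.items()}

runs = [
    {"model": "Gemma-4 E2B-it", "workload": "rag", "output_tok_per_s": 206.5},
    {"model": "Gemma-4 E2B-it", "workload": "codegen", "output_tok_per_s": 216.7},
]
print(tok_s_range(runs))  # {'Gemma-4 E2B-it': (206.5, 216.7)}
```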

Gemma-4 (8) · released 2026-04

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
E2B-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 206.5–216.7
E4B-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 120.5–126.8

granite-4.1 (4) · released 2026-04

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
8b | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 91.5–98.5

LFM2.5-350M (4) · released 2025-11

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
350M | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 761.9–861.4

LFM2 (16) · released 2025-07

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
1.2B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 485.5–529.6
1.2B-Tool | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 475.8–522.1
8B-A1B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 319.2–364.9
2.6B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 249.9–268.6

Gemma-3 (4) · released 2025-03

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
4b-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 136.4–157.9

Qwen2.5-Coder (4) · released 2024-11

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
7B-Instruct | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 110.5–119.4

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits (prefill is compute-bound, decode is memory-bandwidth-bound), so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
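A toy two-phase model makes the shape dependence concrete. All rates below are illustrative, not values from the tables on this page, and the function computes end-to-end output rate (output tokens over total wall time), which falls off with long prompts more sharply than a decode-only gen tok/s metric would:

```python
# Toy two-phase model: prefill then decode, each at a constant rate.
# All rates here are illustrative, not measurements from this page.
def effective_tok_s(prompt_tok, output_tok, prefill_tok_s, decode_tok_s):
    """End-to-end output rate: output tokens over total wall time."""
    total_s = prompt_tok / prefill_tok_s + output_tok / decode_tok_s
    return output_tok / total_s

# chat shape: short prompt, short answer -> close to the decode ceiling
print(round(effective_tok_s(100, 200, 4000, 250), 1))   # 242.4
# rag shape: long stuffed prompt, same answer -> prefill drags it down
print(round(effective_tok_s(8000, 200, 4000, 250), 1))  # 71.4
```

The same decode rate (250 tok/s assumed) yields very different end-to-end numbers once the prompt grows, which is exactly the spread the four workload lists below exhibit.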

chat

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 761.9 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 508.7 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 499.9 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 336.1 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 259.5 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 211.9 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 156.6 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 124.3 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 117.2 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 97.6 tok/s

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 792.4 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 485.5 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 475.8 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 319.2 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 249.9 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 206.5 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 136.4 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 120.5 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 110.5 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 91.5 tok/s

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because each decode step now attends over a much larger KV cache.
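For time-to-first-token specifically, the arithmetic is a single division. The prefill rate below is an assumed illustrative figure, not a measured value from these runs:

```python
# Back-of-envelope time-to-first-token for the rag shape. The prefill
# rate is an assumed illustrative figure, not a measured value.
prompt_tokens = 8000
prefill_tok_s = 4000  # assumed prefill throughput
ttft_s = prompt_tokens / prefill_tok_s
print(ttft_s)  # 2.0 s before the first answer token
```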

codegen

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 861.4 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 529.6 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 522.1 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 364.9 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 268.6 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 216.7 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 157.9 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 126.8 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 119.4 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 98.5 tok/s

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 824.4 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 513.2 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 504.5 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 355.0 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 262.7 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 209.5 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 151.3 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 123.2 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 114.5 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 94.6 tok/s

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
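The concurrency drop is easy to reason about: the server batches the streams, aggregate throughput rises sublinearly, and each loop sees only a slice of it. Both numbers below are hypothetical, chosen purely to illustrate the split, not taken from the detail pages:

```python
# Hypothetical numbers illustrating aggregate vs. per-stream throughput
# at concurrency 4; neither value is a measurement from this page.
def split_throughput(aggregate_tok_s, concurrency):
    """Per-stream decode rate when the server batches N concurrent requests."""
    return aggregate_tok_s / concurrency

aggregate_at_4 = 260.0  # assumed aggregate tok/s with four agent loops batched
print(split_throughput(aggregate_at_4, 4))  # 65.0 tok/s seen by each loop
```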

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
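That planned split could look like the sketch below. The delta shape and field names (reasoning_tokens, content_tokens) are assumptions for illustration, not the harness's actual schema:

```python
# Hypothetical sketch of the planned reasoning/content split. The field
# names below are illustrative, not the harness's real schema.
def split_counts(deltas):
    """Sum hidden reasoning tokens and user-visible content tokens separately."""
    reasoning = sum(d.get("reasoning_tokens", 0) for d in deltas)
    content = sum(d.get("content_tokens", 0) for d in deltas)
    return reasoning, content

deltas = [
    {"reasoning_tokens": 350},  # hidden chain-of-thought stream
    {"content_tokens": 120},    # user-visible answer stream
]
reasoning, content = split_counts(deltas)
# Fraction of the token budget that was useful answer text:
print(round(content / (reasoning + content), 2))  # 0.26
```

Reporting that fraction next to output_tok_per_s would make the reasoning-model caveat above visible in the tables themselves.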

Hardware tested

The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.

cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
backends: llama.cpp b9174 (vulkan)

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

  • Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
  • Strix quant sweep mirroring the 3090 Q2_K–Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
  • Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
  • Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
  • RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
  • Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.