Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

Click a column header to sort. Hover the dotted-underlined labels for definitions. When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
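The range computation behind that column can be sketched in a few lines. This is a hypothetical sketch: the field names (model, output_tok_per_s) are assumptions about the run-YAML schema in content/benchmarks/runs/, not its actual layout.

```python
# Hypothetical sketch of the "gen tok/s" range computation. Field names
# (model, output_tok_per_s) are assumed, not the real YAML schema.
from collections import defaultdict

def tok_s_range(runs):
    """Group per-workload runs by model; return (min, max) gen tok/s."""
    by_model = defaultdict(list)
    for run in runs:
        by_model[run["model"]].append(run["output_tok_per_s"])
    return {m: (min(v), max(v)) for m, v in by_model.items()}

runs = [
    {"model": "Gemma-4 E2B-it", "workload": "rag", "output_tok_per_s": 206.5},
    {"model": "Gemma-4 E2B-it", "workload": "codegen", "output_tok_per_s": 216.7},
]
print(tok_s_range(runs))  # {'Gemma-4 E2B-it': (206.5, 216.7)}
```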

Gemma-4 (8) · released 2026-04

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
E2B-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 206.5–216.7
E4B-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 120.5–126.8

granite-4.1 (4) · released 2026-04

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
8b | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 91.5–98.5

LFM2.5-350M (4) · released 2025-11

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
350M | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 761.9–861.4

LFM2 (16) · released 2025-07

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
1.2B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 485.5–529.6
1.2B-Tool | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 475.8–522.1
8B-A1B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 319.2–364.9
2.6B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 249.9–268.6

Gemma-3 (4) · released 2025-03

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
4b-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 136.4–157.9

Qwen2.5-Coder (4) · released 2024-11

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
7B-Instruct | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (vulkan) | 1 | 110.5–119.4

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits (prefill is compute-bound, decode is memory-bandwidth-bound), so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
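A toy two-phase model makes the shape dependence concrete. All rates below are illustrative, not values from the tables on this page, and the function computes end-to-end output rate (output tokens over total wall time), which falls off with long prompts more sharply than a decode-only gen tok/s metric would:

```python
# Toy two-phase model: prefill then decode, each at a constant rate.
# All rates here are illustrative, not measurements from this page.
def effective_tok_s(prompt_tok, output_tok, prefill_tok_s, decode_tok_s):
    """End-to-end output rate: output tokens over total wall time."""
    total_s = prompt_tok / prefill_tok_s + output_tok / decode_tok_s
    return output_tok / total_s

# chat shape: short prompt, short answer -> close to the decode ceiling
print(round(effective_tok_s(100, 200, 4000, 250), 1))   # 242.4
# rag shape: long stuffed prompt, same answer -> prefill drags it down
print(round(effective_tok_s(8000, 200, 4000, 250), 1))  # 71.4
```

The same decode rate (250 tok/s assumed) yields very different end-to-end numbers once the prompt grows, which is exactly the spread the four workload lists below exhibit.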

chat

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 761.9 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 508.7 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 499.9 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 336.1 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 259.5 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 211.9 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 156.6 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 124.3 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 117.2 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 97.6 tok/s

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 792.4 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 485.5 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 475.8 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 319.2 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 249.9 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 206.5 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 136.4 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 120.5 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 110.5 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 91.5 tok/s

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because each decode step now attends over a much larger KV cache.
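For time-to-first-token specifically, the arithmetic is a single division. The prefill rate below is an assumed illustrative figure, not a measured value from these runs:

```python
# Back-of-envelope time-to-first-token for the rag shape. The prefill
# rate is an assumed illustrative figure, not a measured value.
prompt_tokens = 8000
prefill_tok_s = 4000  # assumed prefill throughput
ttft_s = prompt_tokens / prefill_tok_s
print(ttft_s)  # 2.0 s before the first answer token
```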

codegen

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 861.4 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 529.6 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 522.1 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 364.9 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 268.6 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 216.7 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 157.9 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 126.8 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 119.4 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 98.5 tok/s

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent

LFM2.5-350M (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 824.4 tok/s
LFM2 1.2B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 513.2 tok/s
LFM2 1.2B-Tool (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 504.5 tok/s
LFM2 8B-A1B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 355.0 tok/s
LFM2 2.6B (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 262.7 tok/s
Gemma-4 E2B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 209.5 tok/s
Gemma-3 4b-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 151.3 tok/s
Gemma-4 E4B-it (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 123.2 tok/s
Qwen2.5-Coder 7B-Instruct (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 114.5 tok/s
granite-4.1 8b (Q4_K_M · GeForce RTX 5070 · 11.94 GiB): 94.6 tok/s

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
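The concurrency drop is easy to reason about: the server batches the streams, aggregate throughput rises sublinearly, and each loop sees only a slice of it. Both numbers below are hypothetical, chosen purely to illustrate the split, not taken from the detail pages:

```python
# Hypothetical numbers illustrating aggregate vs. per-stream throughput
# at concurrency 4; neither value is a measurement from this page.
def split_throughput(aggregate_tok_s, concurrency):
    """Per-stream decode rate when the server batches N concurrent requests."""
    return aggregate_tok_s / concurrency

aggregate_at_4 = 260.0  # assumed aggregate tok/s with four agent loops batched
print(split_throughput(aggregate_at_4, 4))  # 65.0 tok/s seen by each loop
```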

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
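That planned split could look like the sketch below. The delta shape and field names (reasoning_tokens, content_tokens) are assumptions for illustration, not the harness's actual schema:

```python
# Hypothetical sketch of the planned reasoning/content split. The field
# names below are illustrative, not the harness's real schema.
def split_counts(deltas):
    """Sum hidden reasoning tokens and user-visible content tokens separately."""
    reasoning = sum(d.get("reasoning_tokens", 0) for d in deltas)
    content = sum(d.get("content_tokens", 0) for d in deltas)
    return reasoning, content

deltas = [
    {"reasoning_tokens": 350},  # hidden chain-of-thought stream
    {"content_tokens": 120},    # user-visible answer stream
]
reasoning, content = split_counts(deltas)
# Fraction of the token budget that was useful answer text:
print(round(content / (reasoning + content), 2))  # 0.26
```

Reporting that fraction next to output_tok_per_s would make the reasoning-model caveat above visible in the tables themselves.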

Hardware tested

The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.

cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
backends: llama.cpp b9174 (vulkan)

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

  • Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
  • Strix quant sweep mirroring the 3090 Q2_K–Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
  • Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
  • Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
  • RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
  • Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.