Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

Hardware tested (1 rig)
Gaming desktop · Custom build
  • cpu: AMD Ryzen 9 7900 12-Core Processor
  • gpu: NVIDIA GeForce RTX 5070
  • arch: NVIDIA
  • vram: 11.94 GiB (system 30.5 GiB)
  • power: 250 W / 300 W max (83% cap)
  • os: CachyOS
  • kernel: 7.0.8-1-cachyos
  • driver: 595.71.05
  • backends: llama.cpp cuda-1a68ec9 (cuda), llama.cpp vulkan-1a68ec9 (vulkan), llama.cpp b9174 (vulkan)

A daily-driver gaming PC on CachyOS with an RTX 5070, pressed into service as a benchmark host between gaming sessions. By default on this rig the card runs at a 250 W power cap against a 300 W maximum (83%); that limit is captured in the YAML and surfaced on each run.

Inference uses the prebuilt llama.cpp Vulkan binary (no CUDA toolkit or sudo on this host), so the RTX 5070 numbers here are Vulkan-backed rather than CUDA unless a row's Backend column says otherwise (the Gemma-3 4b-it row lists a CUDA build). The Vulkan numbers are directly comparable to the Strix Halo Vulkan numbers (same backend, different silicon) but understate what the card can do with CUDA. A full CUDA pass will land later.

  • GPU: NVIDIA GeForce RTX 5070, 12 GiB GDDR7, 250 W cap (300 W max)
  • CPU: AMD Ryzen 9 7900 (12-core); the integrated Radeon iGPU is also visible to Vulkan as a second device but explicitly excluded from every bench via --device Vulkan0 --split-mode none --main-gpu 0
  • Driver: 595.58.03
  • OS: CachyOS rolling, Linux 7.0
  • VRAM-fit verification: every run snapshots GPU memory before and after the server starts and aborts if the delta is smaller than the model file size, which guards against silent CPU spill
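
The VRAM-fit check in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, and reading memory through nvidia-smi is an assumption (the page only says GPU memory is snapshotted before and after the server starts).

```python
import subprocess

MIB = 1024 * 1024


def parse_used_mib(smi_output: str, gpu_index: int = 0) -> int:
    """Parse the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`,
    which prints one integer MiB value per GPU, one per line."""
    lines = [ln.strip() for ln in smi_output.splitlines() if ln.strip()]
    return int(lines[gpu_index])


def read_used_mib(gpu_index: int = 0) -> int:
    """Snapshot current GPU memory use (assumes nvidia-smi is on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_used_mib(out, gpu_index)


def vram_fit_ok(before_mib: int, after_mib: int, model_file_bytes: int) -> bool:
    """True if the memory delta covers the whole model file.

    A delta smaller than the file size suggests part of the model stayed
    in system RAM (silent CPU spill), so the run should be aborted.
    """
    delta_bytes = (after_mib - before_mib) * MIB
    return delta_bytes >= model_file_bytes
```

A harness along these lines would call read_used_mib() once before launching the server and once after the model has loaded, then abort the run when vram_fit_ok(...) returns False.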

When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
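
The range in the gen tok/s column appears to be the minimum and maximum over the per-workload values. A small sketch of that reduction, using the Gemma-4 E2B-it figures from the per-workload lists further down this page (the function names are illustrative, not taken from the site's code):

```python
def gen_tok_range(per_shape: dict[str, float]) -> tuple[float, float]:
    """Return (min, max) generation tok/s across workload shapes."""
    if not per_shape:
        raise ValueError("need at least one workload shape")
    values = list(per_shape.values())
    return min(values), max(values)


def format_range(per_shape: dict[str, float]) -> str:
    """Render the range as 'lo-hi', or a single value if they coincide."""
    lo, hi = gen_tok_range(per_shape)
    return f"{lo:.1f}" if lo == hi else f"{lo:.1f}-{hi:.1f}"


# Gemma-4 E2B-it, Q4_K_M, concurrency 1 (values as listed on this page)
e2b = {"chat": 211.9, "rag": 206.5, "codegen": 216.7, "agent": 209.5}
print(format_range(e2b))  # 206.5-216.7
```

Those endpoints match the 206.5 and 216.7 shown for that model in the Gemma-4 table.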

Gemma-4 (8) · released 2026-04

| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s |
|---|---|---|---|---|---|---|
| E2B-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 206.5-216.7 |
| E4B-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 120.5-126.8 |

granite-4.1 (4) · released 2026-04

| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s |
|---|---|---|---|---|---|---|
| 8b | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 91.5-98.5 |

LFM2.5-350M (4) · released 2025-11

| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s |
|---|---|---|---|---|---|---|
| 350M | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 761.9-861.4 |

LFM2 (16) · released 2025-07

| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s |
|---|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 485.5-529.6 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 475.8-522.1 |
| 8B-A1B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 319.2-364.9 |
| 2.6B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 249.9-268.6 |

Gemma-3 (12) · released 2025-03

| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s |
|---|---|---|---|---|---|---|
| 4b-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | 1 | 143.0-168.7 |

Qwen2.5-Coder (4) · released 2024-11

| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s |
|---|---|---|---|---|---|---|
| 7B-Instruct | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 110.5-119.4 |

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
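
One way to see why prompt and answer length matter is a two-phase timing model: prefill consumes the prompt at one rate, decode emits tokens at another. The sketch below is illustrative only. The rates and token counts are made-up placeholders, not measurements from this rig, and the harness may define its metrics differently.

```python
def request_seconds(prompt_tokens: int, output_tokens: int,
                    prefill_tok_s: float, decode_tok_s: float) -> float:
    """Wall time for one request under a simple two-phase model."""
    return prompt_tokens / prefill_tok_s + output_tokens / decode_tok_s


def end_to_end_tok_s(prompt_tokens: int, output_tokens: int,
                     prefill_tok_s: float, decode_tok_s: float) -> float:
    """Output tokens divided by total wall time, prefill included."""
    total = request_seconds(prompt_tokens, output_tokens,
                            prefill_tok_s, decode_tok_s)
    return output_tokens / total


# Hypothetical rates: prefill is typically much faster per token than decode.
PREFILL, DECODE = 2000.0, 100.0

# Short prompt, long answer: throughput stays close to the decode rate.
short_prompt = end_to_end_tok_s(100, 1000, PREFILL, DECODE)

# Long prompt, short answer: prefill time dominates the request.
long_prompt = end_to_end_tok_s(8000, 100, PREFILL, DECODE)

print(f"{short_prompt:.1f} vs {long_prompt:.1f} tok/s end to end")
```

This models end-to-end throughput, which is likely more sensitive to prompt length than the generation-only figures reported below; it is meant to show the direction of the effect, not its size.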

chat
  • LFM2.5-350M · Q4_K_M · GeForce RTX 5070 · 12 GiB: 761.9 tok/s
  • LFM2 1.2B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 508.7 tok/s
  • LFM2 1.2B-Tool · Q4_K_M · GeForce RTX 5070 · 12 GiB: 499.9 tok/s
  • LFM2 8B-A1B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 336.1 tok/s
  • LFM2 2.6B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 259.5 tok/s
  • Gemma-4 E2B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 211.9 tok/s
  • Gemma-3 4b-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 166.9 tok/s
  • Gemma-4 E4B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 124.3 tok/s
  • Qwen2.5-Coder 7B-Instruct · Q4_K_M · GeForce RTX 5070 · 12 GiB: 117.2 tok/s
  • granite-4.1 8b · Q4_K_M · GeForce RTX 5070 · 12 GiB: 97.6 tok/s

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag
  • LFM2.5-350M · Q4_K_M · GeForce RTX 5070 · 12 GiB: 792.4 tok/s
  • LFM2 1.2B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 485.5 tok/s
  • LFM2 1.2B-Tool · Q4_K_M · GeForce RTX 5070 · 12 GiB: 475.8 tok/s
  • LFM2 8B-A1B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 319.2 tok/s
  • LFM2 2.6B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 249.9 tok/s
  • Gemma-4 E2B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 206.5 tok/s
  • Gemma-3 4b-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 143.0 tok/s
  • Gemma-4 E4B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 120.5 tok/s
  • Qwen2.5-Coder 7B-Instruct · Q4_K_M · GeForce RTX 5070 · 12 GiB: 110.5 tok/s
  • granite-4.1 8b · Q4_K_M · GeForce RTX 5070 · 12 GiB: 91.5 tok/s

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because every decode step has to attend over a larger KV cache.

codegen
  • LFM2.5-350M · Q4_K_M · GeForce RTX 5070 · 12 GiB: 861.4 tok/s
  • LFM2 1.2B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 529.6 tok/s
  • LFM2 1.2B-Tool · Q4_K_M · GeForce RTX 5070 · 12 GiB: 522.1 tok/s
  • LFM2 8B-A1B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 364.9 tok/s
  • LFM2 2.6B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 268.6 tok/s
  • Gemma-4 E2B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 216.7 tok/s
  • Gemma-3 4b-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 168.7 tok/s
  • Gemma-4 E4B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 126.8 tok/s
  • Qwen2.5-Coder 7B-Instruct · Q4_K_M · GeForce RTX 5070 · 12 GiB: 119.4 tok/s
  • granite-4.1 8b · Q4_K_M · GeForce RTX 5070 · 12 GiB: 98.5 tok/s

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent
  • LFM2.5-350M · Q4_K_M · GeForce RTX 5070 · 12 GiB: 824.4 tok/s
  • LFM2 1.2B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 513.2 tok/s
  • LFM2 1.2B-Tool · Q4_K_M · GeForce RTX 5070 · 12 GiB: 504.5 tok/s
  • LFM2 8B-A1B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 355.0 tok/s
  • LFM2 2.6B · Q4_K_M · GeForce RTX 5070 · 12 GiB: 262.7 tok/s
  • Gemma-4 E2B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 209.5 tok/s
  • Gemma-3 4b-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 163.7 tok/s
  • Gemma-4 E4B-it · Q4_K_M · GeForce RTX 5070 · 12 GiB: 123.2 tok/s
  • Qwen2.5-Coder 7B-Instruct · Q4_K_M · GeForce RTX 5070 · 12 GiB: 114.5 tok/s
  • granite-4.1 8b · Q4_K_M · GeForce RTX 5070 · 12 GiB: 94.6 tok/s

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
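
A rough sketch of what that split could look like is below. It assumes the harness can count reasoning and content tokens separately per response; the function names and sample shape are hypothetical, and only the reasoning_tokens_median and content_tokens_median keys come from the note above.

```python
from statistics import median


def decode_and_answer_rate(reasoning_tokens: int, content_tokens: int,
                           decode_seconds: float) -> tuple[float, float]:
    """Return (decode tok/s, useful-answer tok/s) for one response.

    Decode tok/s counts every emitted token; answer tok/s counts only
    the user-visible content tokens over the same decode time.
    """
    if decode_seconds <= 0:
        raise ValueError("decode_seconds must be positive")
    total = reasoning_tokens + content_tokens
    return total / decode_seconds, content_tokens / decode_seconds


def split_medians(samples: list[tuple[int, int]]) -> dict[str, float]:
    """Median reasoning and content token counts over
    (reasoning_tokens, content_tokens) samples."""
    return {
        "reasoning_tokens_median": median(r for r, _ in samples),
        "content_tokens_median": median(c for _, c in samples),
    }


# Made-up example: 600 hidden reasoning tokens, 400 answer tokens, 10 s decode.
decode_rate, answer_rate = decode_and_answer_rate(600, 400, 10.0)
print(decode_rate, answer_rate)  # 100.0 40.0
```

In this made-up case the decode rate reads 100 tok/s while the answer text arrives at 40 tok/s, which is the gap the caveat describes.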

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

  • Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
  • Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
  • Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
  • RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
  • Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.