Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

Hardware tested (1 rig)

GeForce RTX 5070 · 12 GiB · 60 runs

Gaming desktop · Custom build

cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.5 GiB)
power: 250 W / 300 W max (83% cap)
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: 595.71.05
backends: llama.cpp cuda-1a68ec9 (cuda), llama.cpp vulkan-1a68ec9 (vulkan), llama.cpp b9174 (vulkan)

A daily-driver gaming PC on CachyOS with an RTX 5070, pressed into service as a benchmark host between gaming sessions. The card runs at a 250 W power cap by default on this rig, 83% of its 300 W maximum; that limit is captured in the YAML and surfaced on each run.

Inference uses the prebuilt llama.cpp Vulkan binary (no CUDA toolkit or sudo on this host), so all RTX 5070 numbers here are Vulkan-backed rather than CUDA. That makes them directly comparable to the Strix Halo Vulkan numbers (same backend, different silicon) but understates what the card can do with CUDA. A CUDA pass will land later.

GPU: NVIDIA GeForce RTX 5070, 12 GiB GDDR7, 250 W cap (300 W max)
CPU: AMD Ryzen 9 7900 (12-core); the integrated Radeon iGPU is also visible to Vulkan as a second device but explicitly excluded from every bench via --device Vulkan0 --split-mode none --main-gpu 0
Driver: 595.58.03
OS: CachyOS rolling, Linux 7.0
VRAM-fit verification: every run snapshots GPU memory before and after the server starts and aborts if the delta is smaller than the model file size. This guards against silent CPU spill.
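The VRAM-fit check described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, and reading memory through `nvidia-smi` is an assumption, since the page does not say how the snapshots are taken.

```python
import subprocess


def gpu_used_mib(gpu_index: int = 0) -> float:
    """Read used GPU memory in MiB via nvidia-smi (assumed snapshot source)."""
    out = subprocess.run(
        [
            "nvidia-smi",
            "-i", str(gpu_index),
            "--query-gpu=memory.used",
            "--format=csv,noheader,nounits",
        ],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return float(out.strip().splitlines()[0])


def check_vram_fit(used_before_mib: float, used_after_mib: float,
                   model_size_bytes: int) -> float:
    """Return the VRAM delta in MiB; raise if it is smaller than the model file.

    A delta below the file size suggests some weights did not land on the GPU.
    """
    delta_mib = used_after_mib - used_before_mib
    model_mib = model_size_bytes / (1024 * 1024)
    if delta_mib < model_mib:
        raise RuntimeError(
            f"VRAM delta {delta_mib:.0f} MiB < model size {model_mib:.0f} MiB; "
            "possible CPU spill, aborting run"
        )
    return delta_mib
```

In use, the harness would call `gpu_used_mib()` once before launching the server and once after it reports ready, then pass both readings and `os.path.getsize(model_path)` to `check_vram_fit`.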

Gen tok/s is shown as a range across all workload shapes at the chosen concurrency.
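As an illustration of how such a range comes about, the sketch below collapses per-shape values into a min–max string. The function name is hypothetical; the input values are the four concurrency-1 figures for Gemma-4 E2B-it listed in the per-workload section of this page.

```python
def format_gen_range(per_shape_tok_s: dict[str, float]) -> str:
    """Collapse per-workload gen tok/s values into a 'min–max' range string."""
    lo = min(per_shape_tok_s.values())
    hi = max(per_shape_tok_s.values())
    if lo == hi:
        return f"{lo:.1f}"
    return f"{lo:.1f}–{hi:.1f}"


# Gemma-4 E2B-it, Q4_K_M, concurrency 1 (values from the workload section)
e2b = {"chat": 211.9, "rag": 206.5, "codegen": 216.7, "agent": 209.5}
e2b_range = format_gen_range(e2b)
print(e2b_range)
```

For these inputs the lowest value is the rag shape and the highest is codegen, which matches the range shown for that row in the table.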

▸ Gemma-4 (8) · released 2026-04

Variant	Quant	Hardware	Backend	Mode	Conc.	Gen tok/s ↓
E2B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	206.5–216.7
E4B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	120.5–126.8

▸ granite-4.1 (4) · released 2026-04

Variant	Quant	Hardware	Backend	Mode	Conc.	Gen tok/s ↓
8b	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	91.5–98.5

▸ LFM2.5-350M (4) · released 2025-11

Variant	Quant	Hardware	Backend	Mode	Conc.	Gen tok/s ↓
350M	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	761.9–861.4

▸ LFM2 (16) · released 2025-07

Variant	Quant	Hardware	Backend	Mode	Conc.	Gen tok/s ↓
1.2B	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	485.5–529.6
1.2B-Tool	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	475.8–522.1
8B-A1B	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	319.2–364.9
2.6B	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	249.9–268.6

▸ Gemma-3 (12) · released 2025-03

Variant	Quant	Hardware	Backend	Mode	Conc.	Gen tok/s ↓
4b-it	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp cuda-1a68ec9 (cuda)	baseline	1	143.0–168.7

▸ Qwen2.5-Coder (4) · released 2024-11

Variant	Quant	Hardware	Backend	Mode	Conc.	Gen tok/s ↓
7B-Instruct	Q4_K_M	GeForce RTX 5070 · 12 GiB · 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	1	110.5–119.4

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
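To make the prefill/decode split concrete, here is a rough timing model, offered as an illustration only. It is not how the harness computes gen tok/s. The function name, the prompt and answer lengths, and both rates are hypothetical placeholders, not measurements from this page.

```python
def request_timing(prompt_tokens: int, output_tokens: int,
                   prefill_tok_s: float, decode_tok_s: float) -> dict[str, float]:
    """Split one request into prefill time and decode time.

    Simplified model: prefill processes the whole prompt at one rate, decode
    emits output tokens at another. Real runs also vary with KV-cache size.
    """
    prefill_s = prompt_tokens / prefill_tok_s   # roughly time-to-first-token
    decode_s = output_tokens / decode_tok_s
    total_s = prefill_s + decode_s
    return {
        "prefill_s": prefill_s,
        "decode_s": decode_s,
        "end_to_end_tok_s": output_tokens / total_s,
    }


# Hypothetical shapes and rates, chosen only to show the effect.
chat_like = request_timing(64, 128, prefill_tok_s=2000.0, decode_tok_s=100.0)
rag_like = request_timing(4096, 128, prefill_tok_s=2000.0, decode_tok_s=100.0)
print(chat_like)
print(rag_like)
```

With identical decode rates, the long-prompt request spends most of its wall-clock time in prefill, so its end-to-end rate falls well below the decode rate while the short-prompt request stays close to it.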

chat

Model	Quant	Hardware	Tok/s
LFM2.5-350M	Q4_K_M	GeForce RTX 5070 · 12 GiB	761.9
LFM2 1.2B	Q4_K_M	GeForce RTX 5070 · 12 GiB	508.7
LFM2 1.2B-Tool	Q4_K_M	GeForce RTX 5070 · 12 GiB	499.9
LFM2 8B-A1B	Q4_K_M	GeForce RTX 5070 · 12 GiB	336.1
LFM2 2.6B	Q4_K_M	GeForce RTX 5070 · 12 GiB	259.5
Gemma-4 E2B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	211.9
Gemma-3 4b-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	166.9
Gemma-4 E4B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	124.3
Qwen2.5-Coder 7B-Instruct	Q4_K_M	GeForce RTX 5070 · 12 GiB	117.2
granite-4.1 8b	Q4_K_M	GeForce RTX 5070 · 12 GiB	97.6

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag

Model	Quant	Hardware	Tok/s
LFM2.5-350M	Q4_K_M	GeForce RTX 5070 · 12 GiB	792.4
LFM2 1.2B	Q4_K_M	GeForce RTX 5070 · 12 GiB	485.5
LFM2 1.2B-Tool	Q4_K_M	GeForce RTX 5070 · 12 GiB	475.8
LFM2 8B-A1B	Q4_K_M	GeForce RTX 5070 · 12 GiB	319.2
LFM2 2.6B	Q4_K_M	GeForce RTX 5070 · 12 GiB	249.9
Gemma-4 E2B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	206.5
Gemma-3 4b-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	143.0
Gemma-4 E4B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	120.5
Qwen2.5-Coder 7B-Instruct	Q4_K_M	GeForce RTX 5070 · 12 GiB	110.5
granite-4.1 8b	Q4_K_M	GeForce RTX 5070 · 12 GiB	91.5

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.

codegen

Model	Quant	Hardware	Tok/s
LFM2.5-350M	Q4_K_M	GeForce RTX 5070 · 12 GiB	861.4
LFM2 1.2B	Q4_K_M	GeForce RTX 5070 · 12 GiB	529.6
LFM2 1.2B-Tool	Q4_K_M	GeForce RTX 5070 · 12 GiB	522.1
LFM2 8B-A1B	Q4_K_M	GeForce RTX 5070 · 12 GiB	364.9
LFM2 2.6B	Q4_K_M	GeForce RTX 5070 · 12 GiB	268.6
Gemma-4 E2B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	216.7
Gemma-3 4b-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	168.7
Gemma-4 E4B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	126.8
Qwen2.5-Coder 7B-Instruct	Q4_K_M	GeForce RTX 5070 · 12 GiB	119.4
granite-4.1 8b	Q4_K_M	GeForce RTX 5070 · 12 GiB	98.5

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent

Model	Quant	Hardware	Tok/s
LFM2.5-350M	Q4_K_M	GeForce RTX 5070 · 12 GiB	824.4
LFM2 1.2B	Q4_K_M	GeForce RTX 5070 · 12 GiB	513.2
LFM2 1.2B-Tool	Q4_K_M	GeForce RTX 5070 · 12 GiB	504.5
LFM2 8B-A1B	Q4_K_M	GeForce RTX 5070 · 12 GiB	355.0
LFM2 2.6B	Q4_K_M	GeForce RTX 5070 · 12 GiB	262.7
Gemma-4 E2B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	209.5
Gemma-3 4b-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	163.7
Gemma-4 E4B-it	Q4_K_M	GeForce RTX 5070 · 12 GiB	123.2
Qwen2.5-Coder 7B-Instruct	Q4_K_M	GeForce RTX 5070 · 12 GiB	114.5
granite-4.1 8b	Q4_K_M	GeForce RTX 5070 · 12 GiB	94.6

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The concurrency-4 figure on the per-model detail page, which drops sharply from the number shown here, is the more useful one for agent workloads.

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark these models explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
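A possible shape for that split, sketched under assumptions: the per-run field names (reasoning_tokens, content_tokens, decode_seconds) and the content_tok_per_s_median key are hypothetical, while reasoning_tokens_median, content_tokens_median, and output_tok_per_s follow the names used above.

```python
from statistics import median


def summarize_runs(runs: list[dict[str, float]]) -> dict[str, float]:
    """Report reasoning and content tokens separately for a set of runs.

    Each run dict is assumed to hold reasoning_tokens, content_tokens,
    and decode_seconds (hypothetical field names).
    """
    return {
        "reasoning_tokens_median": median(r["reasoning_tokens"] for r in runs),
        "content_tokens_median": median(r["content_tokens"] for r in runs),
        # All emitted tokens per second: what output_tok_per_s counts today.
        "output_tok_per_s_median": median(
            (r["reasoning_tokens"] + r["content_tokens"]) / r["decode_seconds"]
            for r in runs
        ),
        # Visible answer tokens only, over the same decode time.
        "content_tok_per_s_median": median(
            r["content_tokens"] / r["decode_seconds"] for r in runs
        ),
    }


# Made-up runs, for illustration only.
example_runs = [
    {"reasoning_tokens": 300, "content_tokens": 100, "decode_seconds": 4.0},
    {"reasoning_tokens": 200, "content_tokens": 200, "decode_seconds": 4.0},
    {"reasoning_tokens": 100, "content_tokens": 300, "decode_seconds": 4.0},
]
summary = summarize_runs(example_runs)
print(summary)
```

The gap between the two rate medians is the share of decode throughput that goes to hidden reasoning rather than to answer text.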

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.