Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Hardware tested (1 rig)
A daily-driver gaming PC on CachyOS with an RTX 5070, pressed into service as a benchmark host between gaming sessions. By default the card on this rig runs at a 250 W power cap, 83% of its 300 W maximum; that limit is captured in the YAML and surfaced on each run.
Inference uses the prebuilt llama.cpp Vulkan binary (no CUDA toolkit or sudo on this host), so all RTX 5070 numbers here are Vulkan-backed rather than CUDA. That makes them directly comparable to the Strix Halo Vulkan numbers (same backend, different silicon) but understates what the card can do with CUDA. A CUDA pass will land later.
- GPU: NVIDIA GeForce RTX 5070, 12 GiB GDDR7, 250 W cap (300 W max)
- CPU: AMD Ryzen 9 7900 (12-core); the integrated Radeon iGPU is also visible to Vulkan as a second device but is explicitly excluded from every bench via --device Vulkan0 --split-mode none --main-gpu 0
- Driver: 595.58.03
- OS: CachyOS rolling, Linux 7.0
- VRAM-fit verification: every run snapshots GPU memory before and after the server starts and aborts if the delta is smaller than the model file size, which guards against silent CPU spill
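The VRAM-fit check described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names read_used_mib and check_vram_fit are hypothetical, and it assumes nvidia-smi is on PATH and that used memory is compared in MiB.

```python
import subprocess

MIB = 1024 * 1024


def read_used_mib(gpu_index: int = 0) -> float:
    """Read used GPU memory in MiB via nvidia-smi (assumed to be on PATH)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits", "-i", str(gpu_index)],
        check=True, capture_output=True, text=True,
    ).stdout
    return float(out.strip().splitlines()[0])


def check_vram_fit(before_mib: float, after_mib: float, model_bytes: int) -> float:
    """Return the VRAM delta in MiB; raise if it is smaller than the model file.

    A delta below the model size suggests some weights were not placed on
    the GPU (hypothetical abort condition mirroring the description above).
    """
    delta_mib = after_mib - before_mib
    model_mib = model_bytes / MIB
    if delta_mib < model_mib:
        raise RuntimeError(
            f"VRAM grew by {delta_mib:.0f} MiB but the model file is "
            f"{model_mib:.0f} MiB; possible CPU spill"
        )
    return delta_mib
```

In a real run the caller would take read_used_mib() once before launching the server and once after it reports ready, then pass both snapshots plus os.path.getsize() of the GGUF file to check_vram_fit.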
When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
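As a rough sketch of how such a range could be derived from per-run records: the field names (shape, concurrency, gen_tok_per_s), the shape labels, and the sample values below are illustrative assumptions, not the actual YAML schema or measured data.

```python
def gen_tok_range(runs, concurrency, shape=None):
    """Format gen tok/s as a single value or a low-high range.

    With shape=None the range spans every workload shape at the given
    concurrency; with a shape selected only that shape's runs count.
    """
    rates = [
        r["gen_tok_per_s"]
        for r in runs
        if r["concurrency"] == concurrency
        and (shape is None or r["shape"] == shape)
    ]
    if not rates:
        return None
    lo, hi = min(rates), max(rates)
    return f"{lo:.1f}" if lo == hi else f"{lo:.1f}–{hi:.1f}"


# Illustrative records only (made-up values, hypothetical shape names).
runs = [
    {"shape": "short_short", "concurrency": 1, "gen_tok_per_s": 98.5},
    {"shape": "long_short", "concurrency": 1, "gen_tok_per_s": 91.5},
    {"shape": "short_long", "concurrency": 1, "gen_tok_per_s": 96.0},
    {"shape": "tool_call", "concurrency": 1, "gen_tok_per_s": 93.2},
]
print(gen_tok_range(runs, concurrency=1))
print(gen_tok_range(runs, concurrency=1, shape="short_short"))
```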
▸ Gemma-4 (8) released 2026-04
▸ granite-4.1 (4) released 2026-04
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 8b | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 91.5–98.5 |
▸ LFM2.5-350M (4) released 2025-11
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 350M | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 761.9–861.4 |
▸ LFM2 (16) released 2025-07
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 485.5–529.6 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 475.8–522.1 |
| 8B-A1B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 319.2–364.9 |
| 2.6B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 249.9–268.6 |
▸ Gemma-3 (12) released 2025-03
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 4b-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | 1 | 143.0–168.7 |
▸ Qwen2.5-Coder (4) released 2024-11
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 7B-Instruct | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 110.5–119.4 |
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.
Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.
Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
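To see why prompt and answer length shift the numbers, a simplified two-phase timing model can help. The sketch below assumes constant prefill and decode rates, which ignores the KV-cache slowdown mentioned above; the function name and all input values are illustrative, not measurements from this rig.

```python
def request_timing(prompt_tokens, output_tokens,
                   prefill_tok_per_s, decode_tok_per_s):
    """Split one request into a prefill phase and a decode phase.

    Assumes both rates are constant, so time-to-first-token is just the
    prefill time and the rest of the request is pure decode.
    """
    ttft_s = prompt_tokens / prefill_tok_per_s
    decode_s = output_tokens / decode_tok_per_s
    total_s = ttft_s + decode_s
    return {
        "ttft_s": ttft_s,
        "decode_s": decode_s,
        # Output tokens divided by wall-clock time for the whole request.
        "end_to_end_tok_per_s": output_tokens / total_s,
    }


# Illustrative inputs: a long prompt with a short answer versus
# a short prompt with a long answer, at the same hypothetical rates.
long_prompt = request_timing(8000, 100, 2000.0, 100.0)
long_answer = request_timing(50, 1000, 2000.0, 100.0)
print(long_prompt["ttft_s"], long_prompt["end_to_end_tok_per_s"])
print(long_answer["ttft_s"], long_answer["end_to_end_tok_per_s"])
```

Under these assumptions the long-prompt request spends most of its wall-clock time in prefill, so its end-to-end rate falls well below the decode rate, while the long-answer request stays close to it.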
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower because some of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
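One possible shape for that split, sketched under assumptions: each sample is taken to carry reasoning token count, content token count, and decode seconds, and content_tok_per_s is a hypothetical field name for the rate of user-visible text. The sample values are made up for illustration.

```python
from statistics import median


def split_rates(samples):
    """Summarize runs where hidden reasoning precedes the visible answer.

    Each sample is (reasoning_tokens, content_tokens, decode_seconds).
    output_tok_per_s counts every decoded token; content_tok_per_s
    (hypothetical name) counts only the user-visible answer tokens.
    """
    reasoning = [s[0] for s in samples]
    content = [s[1] for s in samples]
    seconds = sum(s[2] for s in samples)
    return {
        "reasoning_tokens_median": median(reasoning),
        "content_tokens_median": median(content),
        "output_tok_per_s": (sum(reasoning) + sum(content)) / seconds,
        "content_tok_per_s": sum(content) / seconds,
    }


# Illustrative samples only, not measured data.
samples = [(300, 100, 4.0), (100, 100, 2.0), (200, 200, 4.0)]
print(split_rates(samples))
```

For these made-up samples the overall decode rate is 100 tok/s while the visible-answer rate is 40 tok/s, which is the gap the paragraph above describes.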
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.