Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Click a column header to sort. Hover the dotted-underlined labels for definitions. When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
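The min–max gen tok/s ranges shown in the tables below can be derived from the per-run records with a short aggregation. A minimal Python sketch, assuming hypothetical field names (`variant`, `quant`, `output_tok_per_s`, and so on) — the real schema of the YAMLs in content/benchmarks/runs/ may differ:

```python
# Hypothetical sketch: collapse individual run records into one
# (min, max) gen tok/s range per configuration, as in the tables.
# All field names here are assumptions about the run-YAML schema.
from collections import defaultdict

def tok_s_ranges(runs):
    """Group runs by configuration and return a (min, max) range each."""
    by_config = defaultdict(list)
    for run in runs:
        key = (run["variant"], run["quant"], run["hardware"],
               run["backend"], run["concurrency"])
        by_config[key].append(run["output_tok_per_s"])
    return {k: (min(v), max(v)) for k, v in by_config.items()}

# Two runs of the same config collapse into one range.
runs = [
    {"variant": "E2B-it", "quant": "Q4_K_M", "hardware": "RTX 3090",
     "backend": "llama.cpp (cuda)", "concurrency": 1, "output_tok_per_s": 181.1},
    {"variant": "E2B-it", "quant": "Q4_K_M", "hardware": "RTX 3090",
     "backend": "llama.cpp (cuda)", "concurrency": 1, "output_tok_per_s": 195.1},
]
print(tok_s_ranges(runs))
```

Each table row is one such range; selecting a workload shape in the live view narrows the grouping further.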
▸ Gemma-4 (56) · released 2026-04
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| E2B-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 206.5–216.7 |
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 181.1–195.1 |
| E4B-it | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 120.5–126.8 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 101.9–118.4 |
| 26B-A4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 74.7–101.4 |
| E2B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 76.3–87.6 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 50.4–53.8 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 48.1–52.4 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 43.2–47.9 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 40.7–46.0 |
| 31B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 15.2–19.4 |
| 31B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 9.1–10.2 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 8.6–10.2 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 6.9–8.5 |
▸ granite-4.1 (12) · released 2026-04
▸ NVIDIA-Nemotron-3-Nano-Omni (4) · released 2026-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 30B-A3B-Reasoning | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 109.6–134.2 |
▸ Qwen3.6 (32) · released 2026-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 27B-GGUF-Q2_K | Q2_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 22.1–24.3 |
| 27B-GGUF-Q4_K_M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 20.0–21.6 |
| 27B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.7–21.2 |
| 27B-GGUF-Q3_K_M | Q3_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.3–20.9 |
| 27B-GGUF-Q5_K_M | Q5_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 17.3–18.9 |
| 27B-GGUF-Q6_K | Q6_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 14.4–15.5 |
| 27B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 10.8–11.9 |
| 27B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 10.5–11.5 |
▸ LFM2.5-350M (8) · released 2025-11
▸ Qwen3.5 (16) · released 2025-10
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 35B-A3B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 94.2–119.3 |
| 35B-A3B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 42.4–48.3 |
| 27B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.8–21.4 |
| 27B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 10.9–11.9 |
▸ GLM-4.7-Flash (4) · released 2025-09
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| Flash | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 105.4–117.5 |
▸ LFM2 (40) · released 2025-07
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 485.5–529.6 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 475.8–522.1 |
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 426.4–471.0 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 423.6–465.3 |
| 8B-A1B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 319.2–364.9 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 278.6–332.9 |
| 2.6B | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 249.9–268.6 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 221.1–238.9 |
| 1.2B | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 194.6–208.8 |
| 8B-A1B | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 142.6–151.4 |
▸ Qwen3-Coder (4) · released 2025-06
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 30B-A3B-Instruct | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 117.1–152.7 |
▸ Gemma-3 (12) · released 2025-03
▸ Qwen2.5-Coder (8) · released 2024-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | Q4_K_M | GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | 1 | 110.5–119.4 |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 77.5–88.6 |
▸ Qwen/Qwen2.5-Coder (12) · released 2024-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 76.9–85.8 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 39.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.9–19.5 |
▸ Qwen/Qwen2.5 (12) · released 2024-09
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 77.0–85.2 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 38.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.8–19.3 |
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
- Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.
- Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.
- Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
- Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
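The shape effect can be sketched with a toy throughput model: end-to-end output tok/s folds prefill time into the denominator, so prompt-heavy shapes look slower even at an identical decode rate. The rates below are invented for illustration, not measured:

```python
# Toy model (not measured data): why the same model posts different
# tok/s across workload shapes. End-to-end output tok/s includes the
# prefill time spent before the first generated token.
def end_to_end_tok_s(prompt_tokens, output_tokens,
                     prefill_tok_s, decode_tok_s):
    prefill_time = prompt_tokens / prefill_tok_s   # time to first token
    decode_time = output_tokens / decode_tok_s
    return output_tokens / (prefill_time + decode_time)

# Assumed rates for illustration: 2000 tok/s prefill, 100 tok/s decode.
short_short = end_to_end_tok_s(100, 100, 2000, 100)    # ≈ 95.2 tok/s
long_short = end_to_end_tok_s(8000, 100, 2000, 100)    # = 20.0 tok/s
short_long = end_to_end_tok_s(100, 1000, 2000, 100)    # ≈ 99.5 tok/s
```

The long-prompt/short-answer shape pays the prefill cost over very few output tokens, which is exactly the dip the second bullet above describes; the pure-decode shape amortizes it away.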
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark these models explicitly.
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
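As a sketch of that harness split, counting the two channels separately while consuming an OpenAI-compatible stream might look like this; the reasoning_content delta field is how some servers expose the hidden channel, and treating one streamed chunk as roughly one token is an approximation, not a guarantee:

```python
# Sketch of the planned harness change: tally hidden reasoning tokens
# and visible answer tokens separately from OpenAI-compatible stream
# deltas. Assumes ~one token per streamed chunk, and that the server
# exposes the hidden channel as "reasoning_content".
def split_token_counts(deltas):
    counts = {"reasoning_tokens": 0, "content_tokens": 0}
    for delta in deltas:
        if delta.get("reasoning_content"):
            counts["reasoning_tokens"] += 1
        if delta.get("content"):
            counts["content_tokens"] += 1
    return counts

# Two hidden chunks, one visible chunk.
deltas = [{"reasoning_content": "Hmm"},
          {"reasoning_content": ", so"},
          {"content": "Yes."}]
print(split_token_counts(deltas))
```

With the two tallies in hand, a content-only tok/s falls out of the same timing data the harness already records.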
Hardware tested
The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.
A self-built quad-3090 box that lives in the homelab as a general-purpose ML/inference node. Unless a run explicitly labels itself as multi-GPU, every RTX 3090 result on this page uses exactly one card via LXC GPU passthrough on Proxmox (/dev/nvidia0 for the vLLM container, /dev/nvidia1 for the llama.cpp container). Tensor-parallel and multi-card numbers will land separately and be tagged.
- GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), each power-capped to 200 W (450 W stock) for thermals and PSU headroom
- CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
- Motherboard: ASRock Rack ROMED6U-2L2T
- Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
- Storage: 2 TB Samsung 980 Pro NVMe
- Chassis: MLACOM Quad Station Pro Lite v3
- Risers: 1× LINKUP AVA5 PCIe 5.0 straight 25 cm, 2× Okinos PCIe 4.0 150 mm, 1× Okinos PCIe 4.0 200 mm
- PSUs: 2× Corsair RM1200x Shift (renewed), bridged with a dual-PSU ATX adapter
Framework Desktop with the AMD Ryzen AI Max+ 395 (Strix Halo) APU. 128 GiB of unified LPDDR5X system memory; the GPU side sees 96 GiB through the unified-memory pool. Integrated Radeon 8060S handles the inference workload via ROCm. No discrete GPU, no separate VRAM pool — the 27B-class models in this benchmark set all run on a single APU.
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
- Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50–75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.