Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Hardware tested (3 rigs)
A self-built quad-3090 box that lives in the homelab as a general-purpose ML/inference node. Each card lives behind LXC GPU passthrough on Proxmox; the inference containers see 1 or 2 of the 4 cards depending on which model is running. Rows in the table that read "2× RTX 3090" use llama.cpp's --split-mode layer across two cards (the only way to fit Q8_0-class quants of 27B-class models on 24 GiB cards). Every other RTX 3090 row uses exactly one card.
- GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), running at the full 450 W cap. Earlier benchmarks at a 200 W rack-noise cap are noted in the per-run YAML and discussed in the power-limits post.
- CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
- Motherboard: ASRock Rack ROMED6U-2L2T
- Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
- Storage: 2 TB Samsung 980 Pro NVMe
- Chassis: MLACOM Quad Station Pro Lite v3
- Risers: 1× LINKUP AVA5 PCIe 5.0 straight 25 cm, 2× Okinos PCIe 4.0 150 mm, 1× Okinos PCIe 4.0 200 mm
- PSUs: 2× Corsair RM1200x Shift (renewed), bridged with a dual-PSU ATX adapter
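The claim that Q8_0-class quants of 27B-class models need two 24 GiB cards can be checked with back-of-the-envelope arithmetic. The sketch below is a rough estimate, not the harness's own logic: it assumes a nominal 27e9 parameters and counts weights only, ignoring KV cache and runtime overhead, so real usage is higher. The function names are illustrative.

```python
# Rough weight-only footprint estimate. It ignores KV cache, activations,
# and per-backend overhead, so actual VRAM use is higher than this.
GIB = 1024 ** 3

# Q8_0 stores 32 int8 weights plus one fp16 scale per block:
# 34 bytes per 32 weights = 8.5 bits per weight.
Q8_0_BITS_PER_WEIGHT = 8.5


def weight_footprint_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def fits(n_params: float, bits_per_weight: float, vram_gib: float) -> bool:
    """True if the weights alone fit in the given VRAM (no headroom counted)."""
    return weight_footprint_gib(n_params, bits_per_weight) <= vram_gib


# Assumption: a nominal 27e9 parameters for a "27B-class" model.
N_PARAMS = 27e9
size_gib = weight_footprint_gib(N_PARAMS, Q8_0_BITS_PER_WEIGHT)
print(f"Q8_0 weights: {size_gib:.1f} GiB")
print("one 24 GiB card:", fits(N_PARAMS, Q8_0_BITS_PER_WEIGHT, 24))
print("two 24 GiB cards:", fits(N_PARAMS, Q8_0_BITS_PER_WEIGHT, 48))
```

Under these assumptions the weights alone exceed a single card before any context is allocated, which is consistent with the 2× RTX 3090 rows using a layer split.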
A daily-driver gaming PC on CachyOS with an RTX 5070, pressed into service as a benchmark host between gaming sessions. The card runs at a 250 W power cap by default on this rig (83% of its 300 W maximum); that limit is captured in the YAML and surfaced on each run.
Inference uses the prebuilt llama.cpp Vulkan binary (no CUDA toolkit or sudo on this host), so all RTX 5070 numbers here are Vulkan-backed rather than CUDA. That makes them directly comparable to the Strix Halo Vulkan numbers (same backend, different silicon) but understates what the card can do with CUDA. A CUDA pass will land later.
- GPU: NVIDIA GeForce RTX 5070, 12 GiB GDDR7, 250 W cap (300 W max)
- CPU: AMD Ryzen 9 7900 (12-core); the integrated Radeon iGPU is also visible to Vulkan as a second device but explicitly excluded from every bench via --device Vulkan0 --split-mode none --main-gpu 0
- Driver: 595.58.03
- OS: CachyOS rolling, Linux 7.0
- VRAM-fit verification: every run snapshots GPU memory before and after the server starts and aborts if the delta is smaller than the model file size — guards against silent CPU spill
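The VRAM-fit check in the last bullet can be sketched as follows. This is a minimal illustration, not the harness's actual code: the function names are hypothetical, and reading memory through `nvidia-smi` is an assumption about how the snapshot could be taken on an NVIDIA host.

```python
import subprocess


def gpu_memory_used_mib(gpu_index: int = 0) -> int:
    """Read used GPU memory in MiB via nvidia-smi (ships with the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--id={gpu_index}",
         "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def check_vram_fit(before_mib: int, after_mib: int, model_bytes: int) -> int:
    """Return the VRAM delta in MiB, or raise if it is smaller than the model file.

    A delta below the file size suggests part of the model stayed in host RAM.
    """
    delta_mib = after_mib - before_mib
    model_mib = model_bytes / 2**20
    if delta_mib < model_mib:
        raise RuntimeError(
            f"VRAM grew by {delta_mib} MiB but the model file is "
            f"{model_mib:.0f} MiB: possible silent CPU spill"
        )
    return delta_mib


# Intended call order (hypothetical; the server start step is elided):
#   before = gpu_memory_used_mib()
#   ... start the inference server and wait for the model to load ...
#   after = gpu_memory_used_mib()
#   check_vram_fit(before, after, os.path.getsize(model_path))
```

Keeping the comparison in a pure function makes the abort rule testable without a GPU attached.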
Framework Desktop with the AMD Ryzen AI Max+ 395 (Strix Halo) APU. 128 GiB of unified LPDDR5X system memory; the GPU side sees 96 GiB through the unified-memory pool. The integrated Radeon 8060S handles the inference workload via ROCm or Vulkan, depending on the run. There is no discrete GPU and no separate VRAM pool: the 27B-class models in this benchmark set all run on a single APU.
Gen tok/s shows the range across all workload shapes at the listed concurrency.
▸ Gemma-4 (72) · released 2026-04
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| E2B-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 206.5–216.7 |
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 196.0–211.7 |
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 192.3–207.5 |
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 181.1–195.1 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 122.5–137.3 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 118.9–136.6 |
| E4B-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 120.5–126.8 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 101.9–118.4 |
| 26B-A4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | 1 | 84.9–109.6 |
| E2B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | 1 | 76.3–87.6 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | baseline | 1 | 50.4–53.8 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | baseline | 1 | 43.2–47.9 |
| 31B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | 1 | 24.5–34.9 |
| 31B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | 1 | 9.1–10.2 |
▸ granite-4.1 (28) · released 2026-04
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 8b | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 89.3–117.7 |
| 8b | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 92.1–115.5 |
| 8b | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 91.5–98.5 |
| 8b | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 63.8–74.4 |
| 30b | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 36.5–40.9 |
| 30b | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 36.5–40.2 |
| 30b | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 19.5–21.0 |
▸ NVIDIA-Nemotron-3-Nano-Omni (12) · released 2026-03
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 136.8–168.5 |
| 30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 132.4–166.2 |
| 30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 109.6–134.2 |
▸ Qwen3.6 (196) · released 2026-03
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 35B-A3B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | 1 | 122.1–169.0 |
| 35B-A3B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | 1 | 136.6–161.4 |
| 35B-A3B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | 1 | 110.6–135.6 |
| 35B-A3B-MTP (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | 1 | 63.4–70.6 |
| 35B-A3B-MTP (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | 1 | 60.8–70.0 |
| 27B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | 1 | 55.9–63.7 |
| 27B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | 1 | 54.6–59.2 |
| 27B-MTP (think) | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each · 450 W × 2 · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | 1 | 53.8–57.1 |
| 27B-MTP (think) | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each · 450 W × 2 · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | 1 | 47.8–53.2 |
| 35B-A3B-MTP (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | baseline | 1 | 46.2–52.3 |
| 27B-MTP (think) | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | 1 | 45.1–50.6 |
| 27B-MTP (think) | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | 1 | 42.5–48.7 |
| 27B (think) | Q2_K | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 38.2–43.6 |
| 27B (think) | Q2_K | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 36.1–42.1 |
| 27B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | 1 | 35.6–40.4 |
| 27B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 35.8–40.3 |
| 27B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 33.8–39.3 |
| 27B (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 34.6–38.8 |
| 27B (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 34.0–38.3 |
| 27B (think) | Q3_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 33.3–37.2 |
| 27B (think) | Q5_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 32.7–36.1 |
| 27B (think) | Q3_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 31.9–35.7 |
| 27B (think) | Q5_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 31.1–35.7 |
| 27B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | 1 | 31.1–34.2 |
| 27B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | 1 | 29.8–32.0 |
| 27B (think) | Q6_K | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 29.0–32.0 |
| 27B (think) | Q6_K | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 28.5–31.5 |
| 27B-MTP (think) | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each · 450 W × 2 · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | 1 | 24.6–27.0 |
| 27B-MTP (think) | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | 1 | 23.5–26.1 |
| 27B (think) | Q2_K | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 22.1–24.3 |
| 27B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 20.0–21.6 |
| 27B-MTP (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | 1 | 20.0–21.5 |
| 27B-MTP (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | 1 | 19.9–21.2 |
| 27B (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 19.7–21.2 |
| 27B (think) | Q3_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 19.3–20.9 |
| 27B-MTP (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | 1 | 18.8–20.0 |
| 27B (think) | Q5_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 17.3–18.9 |
| 27B-MTP (think) | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | 1 | 17.1–18.7 |
| 27B-MTP (think) | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | 1 | 15.4–16.1 |
| 27B (think) | Q6_K | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 14.4–15.5 |
| 27B (think) | Q3_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · drv 7 | llama.cpp rocm-4f13cb7 (rocm) | baseline | 1 | 12.9–14.2 |
| 27B (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · drv 7 | llama.cpp rocm-4f13cb7 (rocm) | baseline | 1 | 11.1–12.0 |
| 27B-MTP (think) | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | baseline | 1 | 11.1–12.0 |
| 27B (think) | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | baseline | 1 | 10.8–11.9 |
| 27B (think) | Q5_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · drv 7 | llama.cpp rocm-4f13cb7 (rocm) | baseline | 1 | 9.9–10.6 |
| 27B (think) | Q6_K | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · drv 7 | llama.cpp rocm-4f13cb7 (rocm) | baseline | 1 | 8.7–9.3 |
| 27B-MTP (think) | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp 4f13cb7-mtp (rocm) | baseline | 1 | 7.3–7.7 |
| 27B (think) | Q8_0 | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · drv 7 | llama.cpp rocm-4f13cb7 (rocm) | baseline | 1 | 7.3–7.7 |
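To read MTP gains out of the Qwen3.6 table, one option is to divide each end of an MTP range by the matching end of the baseline range on the same hardware. The sketch below does that for the 27B-MTP Q4_K_M rows on a single RTX 3090 at 450 W. Pairing low with low and high with high assumes the same workload shape sets each end of both ranges, which the table does not confirm, so treat the ratios as approximate.

```python
Range = tuple[float, float]  # (low, high) gen tok/s as shown in the table


def speedup(baseline: Range, variant: Range) -> Range:
    """Ratio of variant to baseline at the low and high ends of the range."""
    return (variant[0] / baseline[0], variant[1] / baseline[1])


# Values copied from the Qwen3.6 table: 27B-MTP, Q4_K_M, one RTX 3090 at 450 W.
baseline = (35.6, 40.4)
mtp_runs = {
    "MTP n=2": (55.9, 63.7),
    "MTP n=3": (54.6, 59.2),
}

for mode, rng in mtp_runs.items():
    lo, hi = speedup(baseline, rng)
    print(f"{mode}: {lo:.2f}x to {hi:.2f}x over baseline")
```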
▸ LFM2.5 (8) · released 2025-11
▸ LFM2.5-350M (8) · released 2025-11
▸ Qwen3.5 (32) · released 2025-10
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 35B-A3B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 108.3–137.1 |
| 35B-A3B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 110.4–136.1 |
| 35B-A3B (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 94.2–119.3 |
| 35B-A3B (think) | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | 1 | 42.4–48.3 |
| 27B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 34.1–40.5 |
| 27B (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 35.2–39.8 |
| 27B (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 19.8–21.4 |
| 27B (think) | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | 1 | 10.9–11.9 |
▸ GLM-4.7-Flash (4) · released 2025-09
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| Flash | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 105.4–117.5 |
▸ LFM2 (72) · released 2025-07
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 507.6–579.5 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 513.1–578.9 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 499.2–578.7 |
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 454.9–565.0 |
| 1.2B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 485.5–529.6 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 475.8–522.1 |
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 426.4–471.0 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 423.6–465.3 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 299.8–423.4 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 341.7–406.4 |
| 8B-A1B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 319.2–364.9 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 278.6–332.9 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 262.9–314.4 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 253.0–306.7 |
| 2.6B | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 249.9–268.6 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 221.1–238.9 |
| 1.2B | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | 1 | 194.6–208.8 |
| 8B-A1B | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | 1 | 142.6–151.4 |
▸ Qwen3-Coder (12) · released 2025-06
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 146.4–183.6 |
| 30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 127.2–181.3 |
| 30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 117.1–152.7 |
▸ Gemma-3 (32) · released 2025-03
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 95.5–169.4 |
| 4b-it | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | 1 | 143.0–168.7 |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 101.6–166.7 |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | 1 | 97.1–160.9 |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | 1 | 84.0–128.5 |
| 4b-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | 1 | 55.5–64.6 |
▸ Qwen2.5-Coder (52) · released 2024-11
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | 1 | 138.3–148.6 |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | 1 | 136.8–148.4 |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | 1 | 125.4–146.4 |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | 1 | 104.9–139.2 |
| 7B-Instruct | Q4_K_M | GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | 1 | 110.5–119.4 |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | 1 | 77.5–88.6 |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | vLLM 0.21.0 (cuda) | baseline | 1 | 76.9–85.8 |
| 14B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | 1 | 77.3–81.2 |
| 14B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | 1 | 77.2–81.1 |
| 14B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | vLLM 0.21.0 (cuda) | baseline | 1 | 39.9–42.6 |
| 32B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | 1 | 40.1–41.3 |
| 32B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | 1 | 40.0–41.3 |
| 32B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | vLLM 0.21.0 (cuda) | baseline | 1 | 18.9–19.5 |
▸ Qwen2.5 (36) · released 2024-09
| Variant | Quant | Hardware | Backend | Mode | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|---|
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | 1 | 140.1–148.8 |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | 1 | 141.6–148.3 |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | vLLM 0.21.0 (cuda) | baseline | 1 | 77.0–85.2 |
| 14B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | 1 | 76.0–81.0 |
| 14B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | 1 | 76.3–81.0 |
| 14B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | vLLM 0.21.0 (cuda) | baseline | 1 | 38.9–42.6 |
| 32B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | 1 | 39.9–41.3 |
| 32B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 420 W · drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | 1 | 39.9–41.3 |
| 32B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | vLLM 0.21.0 (cuda) | baseline | 1 | 18.8–19.3 |
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
- Short prompt, short answer. Generation-bound, so output tok/s cleanly reflects the model's peak decode rate on this hardware.
- Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache, although already populated, is larger.
- Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
- Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
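The headline ranges in the tables are just the minimum and maximum gen tok/s across these four patterns. A minimal sketch of that reduction follows; the shape names and the numbers are made up for illustration and are not measurements or identifiers from the harness.

```python
def gen_tok_s_range(per_shape: dict[str, float]) -> str:
    """Collapse per-shape gen tok/s values into a 'low–high' range string."""
    low = min(per_shape.values())
    high = max(per_shape.values())
    return f"{low:.1f}–{high:.1f}"


# Hypothetical shape names and illustrative values (not real benchmark data).
example = {
    "short_prompt_short_answer": 40.0,
    "long_context_short_answer": 36.0,
    "short_prompt_long_answer": 39.5,
    "tool_call_mid_answer": 37.0,
}
print(gen_tok_s_range(example))
```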
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
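One way the planned split could look, as a sketch only: the `Run` record and the `content_tok_per_s` field are hypothetical, while `reasoning_tokens_median`, `content_tokens_median`, and `output_tok_per_s` follow the names used above. It assumes the harness can count reasoning and answer tokens separately for each run.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Run:
    reasoning_tokens: int   # tokens streamed on the hidden reasoning channel
    content_tokens: int     # tokens in the user-visible answer
    decode_seconds: float   # wall-clock time spent generating both


def summarize(runs: list[Run]) -> dict[str, float]:
    """Report token medians plus combined and answer-only decode rates."""
    return {
        "reasoning_tokens_median": median(r.reasoning_tokens for r in runs),
        "content_tokens_median": median(r.content_tokens for r in runs),
        # Combined rate: what output_tok_per_s reports today.
        "output_tok_per_s": median(
            (r.reasoning_tokens + r.content_tokens) / r.decode_seconds for r in runs
        ),
        # Answer-only rate: the "useful text" rate a user actually sees.
        "content_tok_per_s": median(
            r.content_tokens / r.decode_seconds for r in runs
        ),
    }


# Illustrative numbers only, not benchmark results.
runs = [Run(600, 400, 10.0), Run(500, 500, 10.0), Run(700, 300, 10.0)]
summary = summarize(runs)
print(summary)
```

With these made-up runs the combined rate stays constant while the answer-only rate is well under half of it, which is the gap the paragraph above describes.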
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.