Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

Hardware tested(3 rigs · click for power caps, drivers, clocks, PCIe)
Custom 4× RTX 3090 build· Open-frame mining-style chassis
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power200 W / 450 W max(44% cap)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driver590.48.01
backendsllama.cpp 59778f0 (cuda), llama.cpp cuda-4f13cb7 (cuda), llama.cpp 4f13cb7-mtp (cuda), vLLM 0.21.0 (cuda)

A self-built quad-3090 box that lives in the homelab as a general-purpose ML/inference node. Each card lives behind LXC GPU passthrough on Proxmox; the inference containers see 1 or 2 of the 4 cards depending on which model is running. Rows in the table that read "2× RTX 3090" use llama.cpp's --split-mode layer across two cards (the only way to fit Q8_0-class quants of 27B-class models on 24 GiB cards). Every other RTX 3090 row uses exactly one card.

  • GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), running at the full 450 W cap. Earlier benchmarks at a 200 W rack-noise cap are noted in the per-run YAML and discussed in the power-limits post.
  • CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
  • Motherboard: ASRock Rack ROMED6U-2L2T
  • Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
  • Storage: 2 TB Samsung 980 Pro NVMe
  • Chassis: MLACOM Quad Station Pro Lite v3
  • Risers: 1× LINKUP AVA5 PCIe 5.0 straight 25 cm, 2× Okinos PCIe 4.0 150 mm, 1× Okinos PCIe 4.0 200 mm
  • PSUs: 2× Corsair RM1200x Shift (renewed), bridged with a dual-PSU ATX adapter
Gaming desktop· Custom build
cpuAMD Ryzen 9 7900 12-Core Processor
gpuNVIDIA GeForce RTX 5070
archNVIDIA
vram11.94 GiB (system 30.5 GiB)
power250 W / 300 W max(83% cap)
osCachyOS
kernel7.0.8-1-cachyos
driver595.71.05
backendsllama.cpp cuda-1a68ec9 (cuda), llama.cpp vulkan-1a68ec9 (vulkan), llama.cpp b9174 (vulkan)

A daily-driver gaming PC on CachyOS with an RTX 5070, pressed into service as a benchmark host between gaming sessions. The card sits at a 250 W of 300 W stock power cap (83%) by default on this rig; that limit is captured in the YAML and surfaced on each run.

Inference uses the prebuilt llama.cpp Vulkan binary (no CUDA toolkit or sudo on this host), so all RTX 5070 numbers here are Vulkan-backed rather than CUDA. That makes them directly comparable to the Strix Halo Vulkan numbers (same backend, different silicon) but understates what the card can do with CUDA. A CUDA pass will land later.

  • GPU: NVIDIA GeForce RTX 5070, 12 GiB GDDR7, 250 W cap (300 W max)
  • CPU: AMD Ryzen 9 7900 (12-core); the integrated Radeon iGPU is also visible to Vulkan as a second device but explicitly excluded from every bench via --device Vulkan0 --split-mode none --main-gpu 0
  • Driver: 595.58.03
  • OS: CachyOS rolling, Linux 7.0
  • VRAM-fit verification: every run snapshots GPU memory before and after the server starts and aborts if the delta is smaller than the model file size — guards against silent CPU spill
Framework Desktop· Mini-ITX
cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpuAMD Radeon 8060S
archStrix Halo (gfx1151)
vram96 GiB (system 31.1 GiB, unified)
osUbuntu 24.04.4 LTS
kernel7.0.2-2-pve
backendsllama.cpp b1203 (rocm), llama.cpp b8940 (cpu), llama.cpp b8940 (vulkan), llama.cpp b8940 (rocm), llama.cpp b1203 (rocm), llama.cpp rocm-4f13cb7 (rocm), llama.cpp 4f13cb7-mtp (rocm)

Framework Desktop with the AMD Ryzen AI Max+ 395 (Strix Halo) APU. 128 GiB of unified LPDDR5X system memory; the GPU side sees 96 GiB through the unified-memory pool. Integrated Radeon 8060S handles the inference workload via ROCm. No discrete GPU, no separate VRAM pool — the 27B-class models in this benchmark set all run on a single APU.

Click a column header to sort. Hover the dotted-underlined labels for definitions. When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.

Gemma-4(72)released 2026-04

VariantQuantHardwareBackendModeConc.Gen tok/s
E2B-itQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
206.5216.7
E2B-itQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
196.0211.7
E2B-itQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
192.3207.5
E2B-itQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
181.1195.1
E4B-itQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
122.5137.3
E4B-itQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
118.9136.6
E4B-itQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
120.5126.8
E4B-itQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
101.9118.4
26B-A4B-itQ4_K_M
GeForce RTX 3090 · 24 GiB300 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline1
84.9109.6
E2B-itQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b8940 (rocm)baseline1
76.387.6
E4B-itQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b8940 (vulkan)baseline1
50.453.8
26B-A4B-itQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b8940 (vulkan)baseline1
43.247.9
31B-itQ4_K_M
GeForce RTX 3090 · 24 GiB300 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline1
24.534.9
31B-itQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b1203 (rocm)baseline1
9.110.2

granite-4.1(28)released 2026-04

VariantQuantHardwareBackendModeConc.Gen tok/s
8bQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
89.3117.7
8bQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
92.1115.5
8bQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
91.598.5
8bQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
63.874.4
30bQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
36.540.9
30bQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
36.540.2
30bQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
19.521.0

NVIDIA-Nemotron-3-Nano-Omni(12)released 2026-03

VariantQuantHardwareBackendModeConc.Gen tok/s
30B-A3B-ReasoningthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
136.8168.5
30B-A3B-ReasoningthinkQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
132.4166.2
30B-A3B-ReasoningthinkQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
109.6134.2

Qwen3.6(196)released 2026-03

VariantQuantHardwareBackendModeConc.Gen tok/s
35B-A3B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)MTP n=21
122.1169.0
35B-A3B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)MTP n=31
136.6161.4
35B-A3B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline1
110.6135.6
35B-A3B-MTPthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)MTP n=31
63.470.6
35B-A3B-MTPthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)MTP n=21
60.870.0
27B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)MTP n=21
55.963.7
27B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)MTP n=31
54.659.2
27B-MTPthinkQ8_0
2× GeForce RTX 3090 · 24 GiB each450 W × 2drv 590
llama.cpp cuda-4f13cb7 (cuda)MTP n=31
53.857.1
27B-MTPthinkQ8_0
2× GeForce RTX 3090 · 24 GiB each450 W × 2drv 590
llama.cpp cuda-4f13cb7 (cuda)MTP n=21
47.853.2
35B-A3B-MTPthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)baseline1
46.252.3
27B-MTPthinkQ8_0
2× GeForce RTX 3090 · 24 GiB each200 W × 2drv 590
llama.cpp 4f13cb7-mtp (cuda)mtp-3-pl-200w1
45.150.6
27B-MTPthinkQ8_0
2× GeForce RTX 3090 · 24 GiB each200 W × 2drv 590
llama.cpp 4f13cb7-mtp (cuda)mtp-2-pl-200w1
42.548.7
27BthinkQ2_K
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
38.243.6
27BthinkQ2_K
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
36.142.1
27B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline1
35.640.4
27BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
35.840.3
27BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
33.839.3
27BthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
34.638.8
27BthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
34.038.3
27BthinkQ3_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
33.337.2
27BthinkQ5_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
32.736.1
27BthinkQ3_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
31.935.7
27BthinkQ5_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
31.135.7
27B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 4f13cb7-mtp (cuda)mtp-3-pl-200w1
31.134.2
27B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 4f13cb7-mtp (cuda)mtp-2-pl-200w1
29.832.0
27BthinkQ6_K
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
29.032.0
27BthinkQ6_K
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
28.531.5
27B-MTPthinkQ8_0
2× GeForce RTX 3090 · 24 GiB each450 W × 2drv 590
llama.cpp cuda-4f13cb7 (cuda)baseline1
24.627.0
27B-MTPthinkQ8_0
2× GeForce RTX 3090 · 24 GiB each200 W × 2drv 590
llama.cpp 4f13cb7-mtp (cuda)baseline-pl-200w1
23.526.1
27BthinkQ2_K
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
22.124.3
27BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
20.021.6
27B-MTPthinkQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 4f13cb7-mtp (cuda)baseline-pl-200w1
20.021.5
27B-MTPthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)MTP n=31
19.921.2
27BthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
19.721.2
27BthinkQ3_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
19.320.9
27B-MTPthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)MTP n=21
18.820.0
27BthinkQ5_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
17.318.9
27B-MTPthinkQ8_0
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)MTP n=31
17.118.7
27B-MTPthinkQ8_0
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)MTP n=21
15.416.1
27BthinkQ6_K
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
14.415.5
27BthinkQ3_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unifieddrv 7
llama.cpp rocm-4f13cb7 (rocm)baseline1
12.914.2
27BthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unifieddrv 7
llama.cpp rocm-4f13cb7 (rocm)baseline1
11.112.0
27B-MTPthinkQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)baseline1
11.112.0
27BthinkQ4_K_XL
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b8940 (vulkan)baseline1
10.811.9
27BthinkQ5_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unifieddrv 7
llama.cpp rocm-4f13cb7 (rocm)baseline1
9.910.6
27BthinkQ6_K
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unifieddrv 7
llama.cpp rocm-4f13cb7 (rocm)baseline1
8.79.3
27B-MTPthinkQ8_0
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp 4f13cb7-mtp (rocm)baseline1
7.37.7
27BthinkQ8_0
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unifieddrv 7
llama.cpp rocm-4f13cb7 (rocm)baseline1
7.37.7

LFM2.5(8)released 2025-11

VariantQuantHardwareBackendModeConc.Gen tok/s
350MQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
681.0903.7
350MQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
676.8841.5

LFM2.5-350M(8)released 2025-11

VariantQuantHardwareBackendModeConc.Gen tok/s
350MQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
761.9861.4
350MQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
632.0813.8

Qwen3.5(32)released 2025-10

VariantQuantHardwareBackendModeConc.Gen tok/s
35B-A3BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
108.3137.1
35B-A3BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
110.4136.1
35B-A3BthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
94.2119.3
35B-A3BthinkQ4_K_XL
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b1203 (rocm)baseline1
42.448.3
27BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
34.140.5
27BthinkQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
35.239.8
27BthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
19.821.4
27BthinkQ4_K_XL
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b1203 (rocm)baseline1
10.911.9

GLM-4.7-Flash(4)released 2025-09

VariantQuantHardwareBackendModeConc.Gen tok/s
FlashQ4_K_XL
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
105.4117.5

LFM2(72)released 2025-07

VariantQuantHardwareBackendModeConc.Gen tok/s
1.2BQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
507.6579.5
1.2B-ToolQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
513.1578.9
1.2B-ToolQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
499.2578.7
1.2BQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
454.9565.0
1.2BQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
485.5529.6
1.2B-ToolQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
475.8522.1
1.2BQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
426.4471.0
1.2B-ToolQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
423.6465.3
8B-A1BQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
299.8423.4
8B-A1BQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
341.7406.4
8B-A1BQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
319.2364.9
8B-A1BQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
278.6332.9
2.6BQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
262.9314.4
2.6BQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
253.0306.7
2.6BQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
249.9268.6
2.6BQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
221.1238.9
1.2BQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b8940 (rocm)baseline1
194.6208.8
8B-A1BQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b8940 (rocm)baseline1
142.6151.4

Qwen3-Coder(12)released 2025-06

VariantQuantHardwareBackendModeConc.Gen tok/s
30B-A3B-InstructthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
146.4183.6
30B-A3B-InstructthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
127.2181.3
30B-A3B-InstructthinkQ4_K_XL
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
117.1152.7

Gemma-3(32)released 2025-03

VariantQuantHardwareBackendModeConc.Gen tok/s
4b-itQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
95.5169.4
4b-itQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp cuda-1a68ec9 (cuda)baseline1
143.0168.7
4b-itQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
101.6166.7
4b-itQ4_K_M
GeForce RTX 3090 · 24 GiB300 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline1
97.1160.9
4b-itQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-200w1
84.0128.5
4b-itQ4_K_M
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified
llama.cpp b1203 (rocm)baseline1
55.564.6

Qwen2.5-Coder(52)released 2024-11

VariantQuantHardwareBackendModeConc.Gen tok/s
7B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-350w1
138.3148.6
7B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-450w1
136.8148.4
7B-InstructQ4_K_M
GeForce RTX 3090 · 24 GiB450 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-450w1
125.4146.4
7B-InstructQ4_K_M
GeForce RTX 3090 · 24 GiB350 Wdrv 590
llama.cpp cuda-4f13cb7 (cuda)baseline-pl-350w1
104.9139.2
7B-InstructQ4_K_M
GeForce RTX 5070 · 12 GiB250 Wdrv 595
llama.cpp b9174 (vulkan)baseline1
110.5119.4
7B-InstructQ4_K_M
GeForce RTX 3090 · 24 GiB200 Wdrv 590
llama.cpp 59778f0 (cuda)baseline1
77.588.6
7B-InstructAWQ
GeForce RTX 3090 · 24 GiB200 Wdrv 590
vLLM 0.21.0 (cuda)baseline1
76.985.8
14B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-350w1
77.381.2
14B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-450w1
77.281.1
14B-InstructAWQ
GeForce RTX 3090 · 24 GiB200 Wdrv 590
vLLM 0.21.0 (cuda)baseline1
39.942.6
32B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-350w1
40.141.3
32B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-450w1
40.041.3
32B-InstructAWQ
GeForce RTX 3090 · 24 GiB200 Wdrv 590
vLLM 0.21.0 (cuda)baseline1
18.919.5

Qwen2.5(36)released 2024-09

VariantQuantHardwareBackendModeConc.Gen tok/s
7B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-350w1
140.1148.8
7B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-450w1
141.6148.3
7B-InstructAWQ
GeForce RTX 3090 · 24 GiB200 Wdrv 590
vLLM 0.21.0 (cuda)baseline1
77.085.2
14B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-350w1
76.081.0
14B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-450w1
76.381.0
14B-InstructAWQ
GeForce RTX 3090 · 24 GiB200 Wdrv 590
vLLM 0.21.0 (cuda)baseline1
38.942.6
32B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-350w1
39.941.3
32B-InstructAWQ
GeForce RTX 3090 · 24 GiB420 Wdrv 590
vLLM 0.21.0 (cuda)baseline-pl-450w1
39.941.3
32B-InstructAWQ
GeForce RTX 3090 · 24 GiB200 Wdrv 590
vLLM 0.21.0 (cuda)baseline1
18.819.3

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoningtokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.

chat
LFM2.5-350MQ4_K_M · GeForce RTX 5070 · 12 GiB
761.9 tok/s
LFM2 1.2B-ToolQ4_K_M · GeForce RTX 3090 · 24 GiB
547.1 tok/s
LFM2 1.2BQ4_K_M · GeForce RTX 3090 · 24 GiB
536.2 tok/s
LFM2 8B-A1BQ4_K_M · GeForce RTX 3090 · 24 GiB
385.1 tok/s
LFM2 2.6BQ4_K_M · GeForce RTX 3090 · 24 GiB
309.0 tok/s
Gemma-4 E2B-itQ4_K_M · GeForce RTX 5070 · 12 GiB
211.9 tok/s
Qwen3-Coder 30B-A3B-InstructQ4_K_XL · GeForce RTX 3090 · 24 GiB
169.7 tok/s
Gemma-3 4b-itQ4_K_M · GeForce RTX 3090 · 24 GiB
167.9 tok/s
Qwen3.6 35B-A3B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
148.3 tok/s
Qwen2.5 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
148.0 tok/s
Qwen2.5-Coder 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
147.5 tok/s
Qwen2.5-Coder 7B-InstructQ4_K_M · GeForce RTX 3090 · 24 GiB
146.4 tok/s
NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-ReasoningQ4_K_M · GeForce RTX 3090 · 24 GiB
141.0 tok/s
Gemma-4 E4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
137.3 tok/s
Qwen3.5 35B-A3BQ4_K_M · GeForce RTX 3090 · 24 GiB
124.5 tok/s
granite-4.1 8bQ4_K_M · GeForce RTX 3090 · 24 GiB
117.7 tok/s
GLM-4.7-FlashQ4_K_XL · GeForce RTX 3090 · 24 GiB
117.5 tok/s
Gemma-4 26B-A4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
104.1 tok/s
Qwen2.5-Coder 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
80.7 tok/s
Qwen2.5 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
80.7 tok/s
Qwen3.6 27B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
59.5 tok/s
Qwen3.6 27B-MTPQ8_0 · 2× GeForce RTX 3090 · 24 GiB each
55.9 tok/s
Qwen3.6 27BQ2_K · GeForce RTX 3090 · 24 GiB
42.3 tok/s
Qwen2.5 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
41.0 tok/s
Qwen2.5-Coder 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
41.0 tok/s
granite-4.1 30bQ4_K_M · GeForce RTX 3090 · 24 GiB
40.4 tok/s
Qwen3.5 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
37.9 tok/s
Qwen3.6 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
37.4 tok/s
Qwen3.6 27BQ4_K_XL · GeForce RTX 3090 · 24 GiB
36.7 tok/s
Qwen3.6 27BQ3_K_M · GeForce RTX 3090 · 24 GiB
35.8 tok/s
Gemma-4 31B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
34.3 tok/s
Qwen3.6 27BQ5_K_M · GeForce RTX 3090 · 24 GiB
34.2 tok/s
Qwen3.6 27BQ6_K · GeForce RTX 3090 · 24 GiB
30.4 tok/s
Qwen3.6 27BQ8_0 · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
7.5 tok/s

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag
LFM2.5-350MQ4_K_M · GeForce RTX 5070 · 12 GiB
792.4 tok/s
LFM2 1.2B-ToolQ4_K_M · GeForce RTX 3090 · 24 GiB
522.9 tok/s
LFM2 1.2BQ4_K_M · GeForce RTX 3090 · 24 GiB
516.1 tok/s
LFM2 8B-A1BQ4_K_M · GeForce RTX 3090 · 24 GiB
341.7 tok/s
LFM2 2.6BQ4_K_M · GeForce RTX 3090 · 24 GiB
262.9 tok/s
Gemma-4 E2B-itQ4_K_M · GeForce RTX 5070 · 12 GiB
206.5 tok/s
Qwen3-Coder 30B-A3B-InstructQ4_K_XL · GeForce RTX 3090 · 24 GiB
146.4 tok/s
Gemma-3 4b-itQ4_K_M · GeForce RTX 5070 · 12 GiB
143.0 tok/s
Qwen2.5 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
141.6 tok/s
Qwen2.5-Coder 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
138.3 tok/s
NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-ReasoningQ4_K_M · GeForce RTX 3090 · 24 GiB
136.8 tok/s
Qwen3.6 35B-A3B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
136.6 tok/s
Qwen2.5-Coder 7B-InstructQ4_K_M · GeForce RTX 3090 · 24 GiB
125.4 tok/s
Gemma-4 E4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
122.5 tok/s
Qwen3.5 35B-A3BQ4_K_M · GeForce RTX 3090 · 24 GiB
110.4 tok/s
GLM-4.7-FlashQ4_K_XL · GeForce RTX 3090 · 24 GiB
105.4 tok/s
granite-4.1 8bQ4_K_M · GeForce RTX 3090 · 24 GiB
92.1 tok/s
Gemma-4 26B-A4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
84.9 tok/s
Qwen2.5-Coder 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
77.3 tok/s
Qwen2.5 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
76.3 tok/s
Qwen3.6 27B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
55.9 tok/s
Qwen3.6 27B-MTPQ8_0 · 2× GeForce RTX 3090 · 24 GiB each
53.8 tok/s
Qwen2.5-Coder 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
40.1 tok/s
Qwen2.5 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
39.9 tok/s
Qwen3.6 27BQ2_K · GeForce RTX 3090 · 24 GiB
38.2 tok/s
granite-4.1 30bQ4_K_M · GeForce RTX 3090 · 24 GiB
36.5 tok/s
Qwen3.6 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
35.8 tok/s
Qwen3.5 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
35.2 tok/s
Qwen3.6 27BQ4_K_XL · GeForce RTX 3090 · 24 GiB
34.6 tok/s
Qwen3.6 27BQ3_K_M · GeForce RTX 3090 · 24 GiB
33.3 tok/s
Qwen3.6 27BQ5_K_M · GeForce RTX 3090 · 24 GiB
32.7 tok/s
Qwen3.6 27BQ6_K · GeForce RTX 3090 · 24 GiB
29.0 tok/s
Gemma-4 31B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
24.5 tok/s
Qwen3.6 27BQ8_0 · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
7.3 tok/s

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.

codegen
LFM2.5 350MQ4_K_M · GeForce RTX 3090 · 24 GiB
903.7 tok/s
LFM2 1.2BQ4_K_M · GeForce RTX 3090 · 24 GiB
579.5 tok/s
LFM2 1.2B-ToolQ4_K_M · GeForce RTX 3090 · 24 GiB
578.9 tok/s
LFM2 8B-A1BQ4_K_M · GeForce RTX 3090 · 24 GiB
423.4 tok/s
LFM2 2.6BQ4_K_M · GeForce RTX 3090 · 24 GiB
314.4 tok/s
Gemma-4 E2B-itQ4_K_M · GeForce RTX 5070 · 12 GiB
216.7 tok/s
Qwen3-Coder 30B-A3B-InstructQ4_K_XL · GeForce RTX 3090 · 24 GiB
183.6 tok/s
Gemma-3 4b-itQ4_K_M · GeForce RTX 3090 · 24 GiB
169.4 tok/s
Qwen3.6 35B-A3B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
169.0 tok/s
NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-ReasoningQ4_K_M · GeForce RTX 3090 · 24 GiB
168.5 tok/s
Qwen2.5 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
148.8 tok/s
Qwen2.5-Coder 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
148.6 tok/s
Qwen2.5-Coder 7B-InstructQ4_K_M · GeForce RTX 3090 · 24 GiB
140.5 tok/s
Gemma-4 E4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
137.3 tok/s
Qwen3.5 35B-A3BQ4_K_M · GeForce RTX 3090 · 24 GiB
137.1 tok/s
GLM-4.7-FlashQ4_K_XL · GeForce RTX 3090 · 24 GiB
117.4 tok/s
granite-4.1 8bQ4_K_M · GeForce RTX 3090 · 24 GiB
115.5 tok/s
Gemma-4 26B-A4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
109.6 tok/s
Qwen2.5-Coder 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
81.2 tok/s
Qwen2.5 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
81.0 tok/s
Qwen3.6 27B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
63.7 tok/s
Qwen3.6 27B-MTPQ8_0 · 2× GeForce RTX 3090 · 24 GiB each
57.1 tok/s
Qwen3.6 27BQ2_K · GeForce RTX 3090 · 24 GiB
43.6 tok/s
Qwen2.5 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
41.3 tok/s
Qwen2.5-Coder 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
41.3 tok/s
granite-4.1 30bQ4_K_M · GeForce RTX 3090 · 24 GiB
40.9 tok/s
Qwen3.5 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
40.5 tok/s
Qwen3.6 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
40.3 tok/s
Qwen3.6 27BQ4_K_XL · GeForce RTX 3090 · 24 GiB
38.8 tok/s
Qwen3.6 27BQ3_K_M · GeForce RTX 3090 · 24 GiB
37.2 tok/s
Qwen3.6 27BQ5_K_M · GeForce RTX 3090 · 24 GiB
36.1 tok/s
Gemma-4 31B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
34.9 tok/s
Qwen3.6 27BQ6_K · GeForce RTX 3090 · 24 GiB
32.0 tok/s
Qwen3.6 27BQ8_0 · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
7.6 tok/s

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent
LFM2.5-350MQ4_K_M · GeForce RTX 5070 · 12 GiB
824.4 tok/s
LFM2 1.2B-ToolQ4_K_M · GeForce RTX 3090 · 24 GiB
573.3 tok/s
LFM2 1.2BQ4_K_M · GeForce RTX 3090 · 24 GiB
548.7 tok/s
LFM2 8B-A1BQ4_K_M · GeForce RTX 3090 · 24 GiB
397.4 tok/s
LFM2 2.6BQ4_K_M · GeForce RTX 3090 · 24 GiB
299.8 tok/s
Gemma-4 E2B-itQ4_K_M · GeForce RTX 5070 · 12 GiB
209.5 tok/s
Qwen3-Coder 30B-A3B-InstructQ4_K_XL · GeForce RTX 3090 · 24 GiB
170.4 tok/s
Gemma-3 4b-itQ4_K_M · GeForce RTX 5070 · 12 GiB
163.7 tok/s
Qwen3.6 35B-A3B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
162.7 tok/s
NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-ReasoningQ4_K_M · GeForce RTX 3090 · 24 GiB
159.6 tok/s
Qwen2.5 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
147.1 tok/s
Qwen2.5-Coder 7B-InstructAWQ · GeForce RTX 3090 · 24 GiB
146.6 tok/s
Qwen2.5-Coder 7B-InstructQ4_K_M · GeForce RTX 3090 · 24 GiB
136.6 tok/s
Gemma-4 E4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
133.4 tok/s
Qwen3.5 35B-A3BQ4_K_M · GeForce RTX 3090 · 24 GiB
129.3 tok/s
granite-4.1 8bQ4_K_M · GeForce RTX 3090 · 24 GiB
112.6 tok/s
GLM-4.7-FlashQ4_K_XL · GeForce RTX 3090 · 24 GiB
111.2 tok/s
Gemma-4 26B-A4B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
100.1 tok/s
Qwen2.5-Coder 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
80.2 tok/s
Qwen2.5 14B-InstructAWQ · GeForce RTX 3090 · 24 GiB
80.1 tok/s
Qwen3.6 27B-MTPQ4_K_M · GeForce RTX 3090 · 24 GiB
61.7 tok/s
Qwen3.6 27B-MTPQ8_0 · 2× GeForce RTX 3090 · 24 GiB each
55.0 tok/s
Qwen3.6 27BQ2_K · GeForce RTX 3090 · 24 GiB
42.2 tok/s
Qwen2.5-Coder 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
41.2 tok/s
Qwen2.5 32B-InstructAWQ · GeForce RTX 3090 · 24 GiB
41.0 tok/s
granite-4.1 30bQ4_K_M · GeForce RTX 3090 · 24 GiB
39.9 tok/s
Qwen3.6 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
39.2 tok/s
Qwen3.5 27BQ4_K_M · GeForce RTX 3090 · 24 GiB
39.1 tok/s
Qwen3.6 27BQ4_K_XL · GeForce RTX 3090 · 24 GiB
37.5 tok/s
Qwen3.6 27BQ3_K_M · GeForce RTX 3090 · 24 GiB
35.9 tok/s
Qwen3.6 27BQ5_K_M · GeForce RTX 3090 · 24 GiB
35.1 tok/s
Gemma-4 31B-itQ4_K_M · GeForce RTX 3090 · 24 GiB
31.4 tok/s
Qwen3.6 27BQ6_K · GeForce RTX 3090 · 24 GiB
31.3 tok/s
Qwen3.6 27BQ8_0 · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
7.7 tok/s

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower because some of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoningis not yet reliable across providers, so the per-model detail pages don't mark them explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

  • Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
  • Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
  • Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
  • RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
  • Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.