Introducing open-weight model speed benchmarks
I wanted a place to put real inference numbers for the open-weight models I run locally. Not vibes, not marketing slides. Actual tokens per second on actual hardware, with the quantization, backend, and workload all spelled out. So I built one.
What's in v1
The site is at /benchmarks. v1 ships 55 YAML files across three rigs and five backends, covering models from 0.35B to 35B params.
The hardware:
- Strix Halo Framework Desktop (AMD Ryzen AI MAX+ 395, Radeon 8060S, 128 GiB unified memory with 96 GiB mapped to VRAM). Runs llama.cpp on ROCm, Vulkan, and CPU.
- Custom quad-RTX 3090 build (AMD EPYC 7302P, 200 W power cap per card). Two LXC containers split the work: one for llama.cpp + CUDA, one for vLLM + CUDA. Every 3090 result uses one card unless explicitly tagged otherwise.
- A gaming desktop with an RTX 5070 (12 GiB GDDR7, AMD Ryzen 9 7900). CachyOS, prebuilt llama.cpp Vulkan binary. Gives a cross-vendor Vulkan comparison against Strix.
The models, grouped by size:
- Sub-1B: LFM2.5-350M
- 1B-3B: LFM2-1.2B (plus the Tool fine-tune), LFM2-2.6B, Gemma-4-E2B
- 4B-9B: Gemma-3-4b, Gemma-4-E4B, LFM2-8B-A1B (MoE), granite-4.1-8b, Qwen2.5-Coder-7B
- 26B-35B GGUF: Gemma-4-26B-A4B (MoE), Gemma-4-31B, granite-4.1-30b, Qwen3.5-27B, Qwen3.6-27B, Qwen3-Coder-30B-A3B, GLM-4.7-Flash (30B-A3B), NVIDIA Nemotron-3-Nano-Omni-30B-A3B-Reasoning, Qwen3.5-35B-A3B
- vLLM AWQ on the 3090: Qwen2.5-Instruct and Qwen2.5-Coder-Instruct, each at 7B, 14B, and 32B. Six runs total, all CUDA.
One model, six quants
The site also carries a 6-quant sweep of Qwen3.6-27B: Q2_K, Q3_K_M, Q4_K_M, Q4_K_XL, Q5_K_M, Q6_K on the RTX 3090 (CUDA), paired with the Q4_K_XL baseline on Strix Halo (ROCm). Same model, same prompts, different bit budgets and different silicon. The chart that comes out of it is the cleanest way I've seen to read "how much does the quant actually cost you" alongside "how does CUDA compare to ROCm on memory-bandwidth-bound decode."
Workload shapes matter more than I expected
Every run is benched across four shapes: chat, rag, codegen, and agent. The agent shape runs at concurrency 1 and concurrency 4. The drop from c=1 to c=4 is the realistic agentic number. Single-stream tok/s reads way better than the model actually performs under real load.
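For anyone who wants to reproduce the c=1 vs c=4 comparison, here's a minimal sketch against an OpenAI-compatible endpoint (llama.cpp's server and vLLM both speak it). The base URL, model name, and prompt are placeholders for whatever your rig is serving, and counting streamed chunks is only a rough stand-in for real token counts.

```python
# Minimal sketch: per-stream decode tok/s at concurrency 1 vs 4 against an
# OpenAI-compatible endpoint (llama.cpp server, vLLM, etc.).
# base_url, model name, and prompt are placeholders; adjust for your rig.
import asyncio
import time
from statistics import mean

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="none")
MODEL = "qwen3-coder-30b-a3b"   # whatever the server advertises
PROMPT = "Write a function that parses an ISO-8601 timestamp."

async def one_stream() -> float:
    """Return tok/s for one streamed request (content chunks ~ tokens)."""
    start = time.perf_counter()
    tokens = 0
    stream = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,
        max_tokens=512,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1
    return tokens / (time.perf_counter() - start)

async def main() -> None:
    for concurrency in (1, 4):
        rates = await asyncio.gather(*(one_stream() for _ in range(concurrency)))
        # Aggregate throughput is roughly the sum of the overlapping streams.
        print(f"c={concurrency}: per-stream {mean(rates):.1f} tok/s, "
              f"aggregate ~{sum(rates):.1f} tok/s")

asyncio.run(main())
```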
The site shows tok/s as a range across shapes by default. A single number is almost always misleading.
How runs are measured
Three measured iterations after one warmup. Temperature 0. Streamed responses. Medians reported. Every run verifies the model actually fits in VRAM before it starts, so a model that silently spills to system RAM gets caught instead of producing misleading numbers. The YAML files are in the repo if you want to audit or reproduce.
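A sketch of that protocol, with hypothetical names: run_once stands in for whatever actually drives the requests and returns a tok/s figure, and the nvidia-smi query is one way to do the fits-in-VRAM check on the NVIDIA boxes (the Strix box would use rocm-smi or the unified-memory mapping instead).

```python
# Sketch of the measurement protocol described above: check the model fits,
# run one warmup, then report the median of three measured iterations.
# run_once is any callable returning tok/s; the GGUF path is a placeholder.
import os
import subprocess
from statistics import median
from typing import Callable

def free_vram_mib() -> int:
    """Free VRAM on GPU 0 via nvidia-smi (swap in rocm-smi on AMD)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.free",
         "--format=csv,noheader,nounits", "-i", "0"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

def bench(run_once: Callable[[], float], gguf_path: str,
          overhead_mib: int = 1500) -> float:
    # overhead_mib is a rough allowance for KV cache and runtime buffers.
    weights_mib = os.path.getsize(gguf_path) // (1024 * 1024)
    if weights_mib + overhead_mib > free_vram_mib():
        raise RuntimeError(f"{gguf_path} won't fit in VRAM; refusing to bench")
    run_once()                              # warmup, discarded
    return median(run_once() for _ in range(3))
```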
A few headline numbers
A few patterns jump out.
Memory bandwidth runs the show for decode. The RTX 5070 beats the 200 W-capped 3090 on every model that fits in 12 GiB, even though the 5070 runs Vulkan and the 3090 runs CUDA. On paper the 3090's GDDR6X has more raw bandwidth than the 5070's GDDR7 (~936 vs ~672 GB/s), but with the power cap that edge doesn't survive into measured decode, which is bandwidth-bound. Gemma-3-4b on chat: 5070 hits 156.6 tok/s versus 142.0 tok/s on the 3090.
The 3090 wins when the model fits in 24 GiB but not 12 GiB, which is the 14-31B band. Gemma-4-26B-A4B chat is 100.5 tok/s on the 3090 versus 43.7 on Strix ROCm and 47.7 on Strix Vulkan, roughly 2x. That's the bandwidth gap doing its job in the other direction.
Strix Vulkan is consistently a hair faster than Strix ROCm on these models, by ~1-10%. Not the result I expected. Lemonade's bundled llamacpp:rocm may be one optimization rev behind its llamacpp:vulkan build, or the gfx1151 ROCm kernels for these op shapes simply aren't fully tuned yet. Worth a deeper look in a future post.
Quant cost on the 3090 for Qwen3.6-27B chat: Q2_K is 24.0 tok/s, Q3_K_M 20.5, Q4_K_M 21.1, Q4_K_XL 21.1, Q5_K_M 18.6, Q6_K 15.3. Q2 to Q6 is a 1.6x range. Q4 is the sweet spot; going below saves a little memory but the speedup is marginal, going above pays a real tax.
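The same numbers, normalized against the Q4_K_M baseline, just to make the relative cost explicit:

```python
# The Qwen3.6-27B chat numbers above, relative to the Q4_K_M baseline.
chat_tok_s = {"Q2_K": 24.0, "Q3_K_M": 20.5, "Q4_K_M": 21.1,
              "Q4_K_XL": 21.1, "Q5_K_M": 18.6, "Q6_K": 15.3}
baseline = chat_tok_s["Q4_K_M"]
for quant, rate in chat_tok_s.items():
    print(f"{quant:8s} {rate:5.1f} tok/s  {rate / baseline - 1:+.1%} vs Q4_K_M")
print(f"Q2_K vs Q6_K spread: {chat_tok_s['Q2_K'] / chat_tok_s['Q6_K']:.2f}x")
```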
Qwen reasoning models (Qwen3.5-27B, Qwen3.6-27B) look ~5x slower than similarly sized Gemmas because most of their output goes to a hidden reasoning_content channel: those tokens still burn decode time that counts against the output rate, but they never appear in the visible answer. The raw decode tok/s is honest; the useful-answer tok/s is much lower.
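If you want to report both rates, backends that expose the reasoning channel over the OpenAI-compatible API put it in a separate delta field. A hedged variant of the earlier streaming sketch, with the reasoning_content access guarded because not every backend emits it:

```python
# Split visible answer tokens from hidden reasoning tokens in one streamed
# request, so both decode tok/s and useful-answer tok/s can be reported.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

def answer_vs_reasoning(model: str, prompt: str) -> tuple[float, float]:
    start = time.perf_counter()
    answer = reasoning = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=2048,
        stream=True,
    )
    for chunk in stream:
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta
        if getattr(delta, "reasoning_content", None):  # absent on many backends
            reasoning += 1
        if delta.content:
            answer += 1
    elapsed = time.perf_counter() - start
    return (answer + reasoning) / elapsed, answer / elapsed
```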
What about cards we don't have
I don't own a 4090, 5080, 5090, or RTX 6000 Pro Blackwell to bench against. The data here gives you two reasonable NVIDIA reference points to extrapolate from (the 3090 and the 5070), and decode performance lines up with memory bandwidth almost linearly for the bandwidth-bound shapes (chat / codegen / agent c=1).
Approximate bandwidth ladder, useful as a multiplier on what the dataset shows:
- RTX 5070 (this dataset): ~672 GB/s GDDR7, 12 GiB. Baseline for "newer consumer NVIDIA, low VRAM."
- RTX 4090: ~1008 GB/s GDDR6X, 24 GiB. Roughly 1.5x the 5070's decode rate for models that fit, plus the 24 GiB headroom that the 3090 in this dataset already covers.
- RTX 5080: ~960 GB/s GDDR7, 16 GiB. Similar bandwidth to a 4090 with a smaller VRAM pool.
- RTX 5090: ~1792 GB/s GDDR7, 32 GiB. ~2.6x the 5070's decode rate, holds 27B-class at higher quants without splitting.
- RTX 6000 Pro (Blackwell): ~1.8 TB/s GDDR7, 96 GiB. Same architecture and bandwidth tier as the 5090 but with the 96 GiB to run the heavyweights at full speed.
Two caveats. First, prefill (rag shape's TTFT) is compute-bound, not bandwidth-bound, so the scaling there leans more on shader count than the simple bandwidth ratios suggest. Second, FP8 / FP4 paths on Ada and Blackwell unlock more than memory bandwidth alone predicts on the right models; vLLM FP8 in particular gets non-trivial speedups on cards that have the kernels.
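With those caveats stated, the ladder really is just a multiplier. A back-of-envelope sketch, anchored on the measured Gemma-3-4b chat number from the 5070 quoted above:

```python
# Back-of-envelope only: scale a measured decode rate by the memory-bandwidth
# ratio from the ladder above. Ignores prefill scaling, FP8/FP4 paths, and
# whether the model even fits in the target card's VRAM.
BANDWIDTH_GB_S = {
    "RTX 5070": 672,        # measured in this dataset
    "RTX 4090": 1008,
    "RTX 5080": 960,
    "RTX 5090": 1792,
    "RTX 6000 Pro": 1800,
}

def estimate_decode(measured_tok_s: float, measured_on: str, target: str) -> float:
    return measured_tok_s * BANDWIDTH_GB_S[target] / BANDWIDTH_GB_S[measured_on]

# Gemma-3-4b chat measured at 156.6 tok/s on the 5070:
for card in ("RTX 4090", "RTX 5090"):
    print(f"{card}: ~{estimate_decode(156.6, 'RTX 5070', card):.0f} tok/s (rough)")
```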
I'd love to add the 4090, 5080, 5090, and RTX 6000 Pro Blackwell to a v2 or v3 of the dataset. I just don't have any of those cards yet. If a card shows up in the homelab, it'll get the same harness pointed at it.
Other hardware I want to compare
A few systems I'd specifically like to bench against Strix Halo and the RTX lineup:
- NVIDIA DGX Spark / GB10 (Grace ARM + Blackwell GPU, ~128 GiB unified memory). The most direct competitor to Strix Halo in the unified-memory desktop-class category, and the obvious "is the Strix Halo trade-off worth it" comparison.
- Intel Arc B70 (Battlemage). The Vulkan numbers from the 5070 already make me suspicious that Intel's Vulkan path might be more competitive than people give it credit for, especially for the bandwidth-bound decode loop.
- AMD Radeon Pro W7900 (RDNA3, 48 GiB). A real "more VRAM than a consumer card, less budget than a DGX" data point on the AMD side, plus would isolate whether the surprising Strix Vulkan results are gfx1151-specific.
RTX cards win on raw speed in this dataset, but they don't win on dollars-per-tok/s, watts-per-tok/s, or tok/s-per-GiB-VRAM. Those are real axes the dataset doesn't yet capture, and they're where Strix Halo and the unified-memory class punch above their weight.
What's deferred to v2
A few phases didn't make v1 and are queued for the next pass:
- Strix vLLM FP8 + MTP-1 + draft-spec. Blocked on lemonade's ~10-min hardcoded backend-readiness timeout. Fix plan documented in docs/benchmark-campaign.md.
- Strix quant sweep (the cross-vendor pair for the 3090's Q2_K..Q6_K data). Lemonade's async pull semantics caused all benches to 404. Switching to the bundled ROCm llama-server directly.
- Heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Strix-only by VRAM. Each is ~50-75 GB download plus 30-60 min bench.
- RTX 5070 CUDA pass. Currently Vulkan-only. After the recent CachyOS update the CUDA toolkit should install cleanly, so a llama.cpp CUDA build can land for a direct CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
I'll keep testing new models on the hardware I already have as they ship, even before the wishlist cards arrive. The site rebuilds on every YAML change in content/benchmarks/runs/, so anything new shows up automatically.
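If you want to slice the run files yourself, something like the sketch below works. Note the field names (model, backend, decode_tok_s) are my guesses for illustration, not the real schema, so check an actual file in content/benchmarks/runs/ first and adjust.

```python
# Hypothetical sketch of auditing the run files yourself. Field names are
# guesses, not the real schema; inspect an actual YAML file and adjust.
from collections import defaultdict
from pathlib import Path

import yaml  # pip install pyyaml

ranges: dict[tuple[str, str], list[float]] = defaultdict(list)
for path in Path("content/benchmarks/runs").glob("*.yaml"):
    run = yaml.safe_load(path.read_text())
    ranges[(run["model"], run["backend"])].append(run["decode_tok_s"])

for (model, backend), rates in sorted(ranges.items()):
    print(f"{model} on {backend}: {min(rates):.1f}-{max(rates):.1f} tok/s across shapes")
```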
What I want to write next
The dataset is meant to be the substrate for actual posts, not the end goal. A few I'm planning, in roughly the order I'll get to them:
- Should you go RTX, MLX, or Strix Halo for home AI? With the data above plus the gaps in the "what about cards we don't have" section, this is the post the dataset was built to support.
- How much do quants actually cost you? The 6-quant sweep of Qwen3.6-27B is one model on one rig. The general answer needs more models and probably a quality dimension on top of the speed numbers.
- Choosing a quant at a fixed budget. Higher-param model more aggressively quantized vs lower-param model less quantized at the same VRAM ceiling. The 30B-A3B MoE peer group is most of what you'd need.
- Why you probably shouldn't be paying $200/month for an AI plan if you have any meaningful local hardware. The dataset answers "what can my hardware actually do" before that conversation can happen.
- Self-hosting use cases worth the bother: Home Assistant local voice, an overnight local-model coding agent, supplementing a small business's hosted-API spend with local capacity for the right workloads.
- Setup posts: how I host the inference servers in Proxmox LXC containers and why I chose LXC over VMs; a starter glossary for running your own open-weight models locally.