Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Click a column header to sort. Hover the dotted-underlined labels for definitions. When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
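To make the headline range concrete, here is a minimal sketch of how a per-model range can be derived from per-shape measurements at one concurrency. The flat record layout and the shape/concurrency field names are assumptions for illustration; only output_tok_per_s is a name used elsewhere on this page, and the per-shape numbers are invented (the resulting range happens to match one table row).

```python
# Sketch: derive the headline "gen tok/s" range from per-shape measurements
# at a fixed concurrency. Record layout and shape names are illustrative,
# not the actual schema of the YAMLs in content/benchmarks/runs/.

runs = [
    # one hypothetical model/quant/backend combination, four workload shapes
    {"shape": "short_short", "concurrency": 1, "output_tok_per_s": 53.8},
    {"shape": "long_short",  "concurrency": 1, "output_tok_per_s": 50.4},
    {"shape": "short_long",  "concurrency": 1, "output_tok_per_s": 53.1},
    {"shape": "tool_call",   "concurrency": 1, "output_tok_per_s": 51.7},
]

def gen_tok_range(runs: list[dict], concurrency: int) -> tuple[float, float]:
    """Min-max decode throughput across workload shapes at one concurrency."""
    rates = [r["output_tok_per_s"] for r in runs if r["concurrency"] == concurrency]
    return min(rates), max(rates)

low, high = gen_tok_range(runs, concurrency=1)
print(f"{low:.1f}-{high:.1f}")  # -> 50.4-53.8, displayed as the range in the table
```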
▸ Gemma-4 (32) · released 2026-04
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| E2B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 76.3–87.6 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 50.4–53.8 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 48.1–52.4 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 43.2–47.9 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 40.7–46.0 |
| 31B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 9.1–10.2 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 8.6–10.2 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 6.9–8.5 |
▸ Qwen3.6 (8) · released 2026-03
▸ Qwen3.5 (8) · released 2025-10
▸ LFM2 (8) · released 2025-07
▸ Gemma-3 (4) · released 2025-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 4b-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 55.5–64.6 |
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist. The four shapes are listed below, with a rough token-budget sketch after the list.
- Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.
- Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.
- Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
- Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
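As promised above, a rough sketch of the four shapes as prompt/answer token budgets. The budgets are assumptions (only the ~1k-token long answer is stated in the list); the shape names are illustrative, not the harness's identifiers.

```python
# Rough token budgets for the four workload shapes described above.
# Only the ~1k-token long answer comes from the text; every other number
# is an assumed ballpark, not the harness's actual configuration.

WORKLOAD_SHAPES = {
    # short prompt, short answer: generation-bound, near peak decode rate
    "short_prompt_short_answer": {"prompt_tokens": 64,    "max_output_tokens": 128},
    # long stuffed-context prompt, short answer: prefill dominates TTFT
    "long_prompt_short_answer":  {"prompt_tokens": 8_000, "max_output_tokens": 128},
    # short prompt, long answer (~1k tokens): pure decode loop
    "short_prompt_long_answer":  {"prompt_tokens": 64,    "max_output_tokens": 1_024},
    # mid-length tool-call prompt, mid-length answer: agentic-loop shape
    "tool_call":                 {"prompt_tokens": 2_000, "max_output_tokens": 512},
}

def kv_cache_tokens(shape: dict) -> int:
    """Upper bound on KV-cache entries one request of this shape can occupy."""
    return shape["prompt_tokens"] + shape["max_output_tokens"]

for name, shape in WORKLOAD_SHAPES.items():
    print(f"{name}: up to {kv_cache_tokens(shape)} KV-cache tokens per request")
```

The prefill-heavy shapes are the ones where KV-cache growth and prompt processing, not raw decode speed, set the pace; that is why their gen tok/s dips relative to the short-prompt shapes.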
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
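Once reasoning_tokens_median and content_tokens_median are split out, correcting the headline rate is a small adjustment. A minimal sketch, assuming those two fields plus the existing output_tok_per_s are available; the numbers in the usage example are made up.

```python
# Sketch of the planned correction: given the raw decode rate and the median
# split between hidden reasoning tokens and user-visible content tokens,
# estimate the rate at which useful answer text actually arrives.
# Field names follow the plan above; nothing here is the current harness.

def answer_tok_per_s(output_tok_per_s: float,
                     reasoning_tokens_median: float,
                     content_tokens_median: float) -> float:
    total = reasoning_tokens_median + content_tokens_median
    if total == 0:
        return 0.0
    # decode rate scaled by the fraction of tokens that are visible answer text
    return output_tok_per_s * content_tokens_median / total

# A model decoding at 50 tok/s that spends 600 hidden reasoning tokens
# before a 400-token answer delivers useful text at only ~20 tok/s.
print(round(answer_tok_per_s(50.0, 600, 400), 1))  # -> 20.0
```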
Hardware tested
The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.
Framework Desktop with the AMD Ryzen AI Max+ 395 (Strix Halo) APU. 128 GiB of unified LPDDR5X system memory; the GPU side sees 96 GiB through the unified-memory pool. Integrated Radeon 8060S handles the inference workload via ROCm. No discrete GPU, no separate VRAM pool — the 27B-class models in this benchmark set all run on a single APU.
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
- Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because the CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.