Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Click a column header to sort. Hover the dotted-underlined labels for definitions. When no shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
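To make the headline range concrete, here is a minimal sketch of how a per-model range can be derived from per-shape measurements at one concurrency. The flat record layout and the shape/concurrency field names are assumptions for illustration; only output_tok_per_s is a name used elsewhere on this page, and the per-shape numbers are invented (the resulting range happens to match one table row).

```python
# Sketch: derive the headline "gen tok/s" range from per-shape measurements
# at a fixed concurrency. Record layout and shape names are illustrative,
# not the actual schema of the YAMLs in content/benchmarks/runs/.

runs = [
    # one hypothetical model/quant/backend combination, four workload shapes
    {"shape": "short_short", "concurrency": 1, "output_tok_per_s": 53.8},
    {"shape": "long_short",  "concurrency": 1, "output_tok_per_s": 50.4},
    {"shape": "short_long",  "concurrency": 1, "output_tok_per_s": 53.1},
    {"shape": "tool_call",   "concurrency": 1, "output_tok_per_s": 51.7},
]

def gen_tok_range(runs: list[dict], concurrency: int) -> tuple[float, float]:
    """Min-max decode throughput across workload shapes at one concurrency."""
    rates = [r["output_tok_per_s"] for r in runs if r["concurrency"] == concurrency]
    return min(rates), max(rates)

low, high = gen_tok_range(runs, concurrency=1)
print(f"{low:.1f}-{high:.1f}")  # -> 50.4-53.8, displayed as the range in the table
```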
▸ Gemma-4 (32) · released 2026-04
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| E2B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 76.3–87.6 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 50.4–53.8 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 48.1–52.4 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 43.2–47.9 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 40.7–46.0 |
| 31B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 9.1–10.2 |
| E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 8.6–10.2 |
| 26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 6.9–8.5 |
▸ Qwen3.6 (8) · released 2026-03
▸ Qwen3.5 (8) · released 2025-10
▸ LFM2 (8) · released 2025-07
▸ Gemma-3 (4) · released 2025-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 4b-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 55.5–64.6 |
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist. The four shapes are listed below, with a rough token-budget sketch after the list.
- Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.
- Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.
- Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
- Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
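As promised above, a rough sketch of the four shapes as prompt/answer token budgets. The budgets are assumptions (only the ~1k-token long answer is stated in the list); the shape names are illustrative, not the harness's identifiers.

```python
# Rough token budgets for the four workload shapes described above.
# Only the ~1k-token long answer comes from the text; every other number
# is an assumed ballpark, not the harness's actual configuration.

WORKLOAD_SHAPES = {
    # short prompt, short answer: generation-bound, near peak decode rate
    "short_prompt_short_answer": {"prompt_tokens": 64,    "max_output_tokens": 128},
    # long stuffed-context prompt, short answer: prefill dominates TTFT
    "long_prompt_short_answer":  {"prompt_tokens": 8_000, "max_output_tokens": 128},
    # short prompt, long answer (~1k tokens): pure decode loop
    "short_prompt_long_answer":  {"prompt_tokens": 64,    "max_output_tokens": 1_024},
    # mid-length tool-call prompt, mid-length answer: agentic-loop shape
    "tool_call":                 {"prompt_tokens": 2_000, "max_output_tokens": 512},
}

def kv_cache_tokens(shape: dict) -> int:
    """Upper bound on KV-cache entries one request of this shape can occupy."""
    return shape["prompt_tokens"] + shape["max_output_tokens"]

for name, shape in WORKLOAD_SHAPES.items():
    print(f"{name}: up to {kv_cache_tokens(shape)} KV-cache tokens per request")
```

The prefill-heavy shapes are the ones where KV-cache growth and prompt processing, not raw decode speed, set the pace; that is why their gen tok/s dips relative to the short-prompt shapes.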
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
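Once reasoning_tokens_median and content_tokens_median are split out, correcting the headline rate is a small adjustment. A minimal sketch, assuming those two fields plus the existing output_tok_per_s are available; the numbers in the usage example are made up.

```python
# Sketch of the planned correction: given the raw decode rate and the median
# split between hidden reasoning tokens and user-visible content tokens,
# estimate the rate at which useful answer text actually arrives.
# Field names follow the plan above; nothing here is the current harness.

def answer_tok_per_s(output_tok_per_s: float,
                     reasoning_tokens_median: float,
                     content_tokens_median: float) -> float:
    total = reasoning_tokens_median + content_tokens_median
    if total == 0:
        return 0.0
    # decode rate scaled by the fraction of tokens that are visible answer text
    return output_tok_per_s * content_tokens_median / total

# A model decoding at 50 tok/s that spends 600 hidden reasoning tokens
# before a 400-token answer delivers useful text at only ~20 tok/s.
print(round(answer_tok_per_s(50.0, 600, 400), 1))  # -> 20.0
```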
Hardware tested
The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.
Framework Desktop with the AMD Ryzen AI Max+ 395 (Strix Halo) APU. 128 GiB of unified LPDDR5X system memory; the GPU side sees 96 GiB through the unified-memory pool. Integrated Radeon 8060S handles the inference workload via ROCm. No discrete GPU, no separate VRAM pool — the 27B-class models in this benchmark set all run on a single APU.
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
- Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because the CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.