Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

When no workload shape is selected, gen tok/s shows a range across all workload shapes at the chosen concurrency.
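The range is simply the minimum and maximum of the per-workload decode rates. A minimal sketch, using the Gemma-4 E2B-it numbers from the workload section further down:

```python
# Sketch: how a "gen tok/s" range is derived from per-workload decode rates.
# The values are the Gemma-4 E2B-it measurements from the workload section.
per_workload = {"chat": 86.2, "rag": 76.3, "codegen": 87.6, "agent": 83.7}

lo, hi = min(per_workload.values()), max(per_workload.values())
print(f"{lo}-{hi} tok/s")  # the range shown in the Gemma-4 table
```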

Gemma-4 (32) · released 2026-04

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
E2B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 76.3–87.6
E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 50.4–53.8
E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 48.1–52.4
26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 43.2–47.9
26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 40.7–46.0
31B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 9.1–10.2
E4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 8.6–10.2
26B-A4B-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | 1 | 6.9–8.5

Qwen3.6 (8) · released 2026-03

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
27B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | 1 | 10.8–11.9
27B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 10.5–11.5

Qwen3.5 (8) · released 2025-10

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
35B-A3B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 42.4–48.3
27B | Q4_K_XL | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 10.9–11.9

LFM2 (8) · released 2025-07

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
1.2B | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 194.6–208.8
8B-A1B | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | 1 | 142.6–151.4

Gemma-3 (4) · released 2025-03

Variant | Quant | Hardware | Backend | Conc. | Gen tok/s
4b-it | Q4_K_M | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | 1 | 55.5–64.6

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the ranges in "Model speed range" above as the headline; this section explains why those ranges exist.
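A toy two-phase latency model makes the workload effect concrete. The rates below are illustrative placeholders, not measurements from this page: prefill processes the whole prompt in parallel, decode emits one token at a time.

```python
# Toy two-phase model (illustrative rates, not measurements from this page).
PREFILL_TOK_S = 900.0  # hypothetical prompt-processing rate
DECODE_TOK_S = 50.0    # hypothetical generation rate

def end_to_end_tok_s(prompt_tokens: int, output_tokens: int) -> float:
    """Output tokens per second of wall clock, including prefill time."""
    total_s = prompt_tokens / PREFILL_TOK_S + output_tokens / DECODE_TOK_S
    return output_tokens / total_s

chat = end_to_end_tok_s(prompt_tokens=100, output_tokens=200)      # short/short
rag = end_to_end_tok_s(prompt_tokens=8000, output_tokens=200)      # long prompt
codegen = end_to_end_tok_s(prompt_tokens=100, output_tokens=1000)  # long answer
# The decode rate never changes, but the long RAG prompt drags the
# end-to-end number far below the 50 tok/s decode ceiling, while the
# long codegen answer amortizes prefill and lands closest to it.
```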

chat · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

  • LFM2 1.2B (Q4_K_M): 204.5 tok/s
  • LFM2 8B-A1B (Q4_K_M): 144.6 tok/s
  • Gemma-4 E2B-it (Q4_K_M): 86.2 tok/s
  • Gemma-3 4b-it (Q4_K_M): 64.2 tok/s
  • Gemma-4 E4B-it (Q4_K_M): 52.9 tok/s
  • Gemma-4 26B-A4B-it (Q4_K_M): 47.7 tok/s
  • Qwen3.5 35B-A3B (Q4_K_XL): 46.0 tok/s
  • Qwen3.5 27B (Q4_K_XL): 11.6 tok/s
  • Qwen3.6 27B (Q4_K_XL): 11.6 tok/s
  • Gemma-4 31B-it (Q4_K_M): 9.9 tok/s

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

  • LFM2 1.2B (Q4_K_M): 194.7 tok/s
  • LFM2 8B-A1B (Q4_K_M): 142.6 tok/s
  • Gemma-4 E2B-it (Q4_K_M): 76.3 tok/s
  • Gemma-3 4b-it (Q4_K_M): 55.5 tok/s
  • Gemma-4 E4B-it (Q4_K_M): 50.4 tok/s
  • Gemma-4 26B-A4B-it (Q4_K_M): 43.2 tok/s
  • Qwen3.5 35B-A3B (Q4_K_XL): 42.4 tok/s
  • Qwen3.5 27B (Q4_K_XL): 10.9 tok/s
  • Qwen3.6 27B (Q4_K_XL): 10.8 tok/s
  • Gemma-4 31B-it (Q4_K_M): 9.1 tok/s

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; decode tok/s usually dips slightly because every generated token attends over a larger KV cache.

codegen · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

  • LFM2 1.2B (Q4_K_M): 208.8 tok/s
  • LFM2 8B-A1B (Q4_K_M): 151.4 tok/s
  • Gemma-4 E2B-it (Q4_K_M): 87.6 tok/s
  • Gemma-3 4b-it (Q4_K_M): 64.6 tok/s
  • Gemma-4 E4B-it (Q4_K_M): 53.8 tok/s
  • Qwen3.5 35B-A3B (Q4_K_XL): 48.3 tok/s
  • Gemma-4 26B-A4B-it (Q4_K_M): 47.9 tok/s
  • Qwen3.5 27B (Q4_K_XL): 11.9 tok/s
  • Qwen3.6 27B (Q4_K_XL): 11.9 tok/s
  • Gemma-4 31B-it (Q4_K_M): 10.2 tok/s

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent · Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

  • LFM2 1.2B (Q4_K_M): 194.6 tok/s
  • LFM2 8B-A1B (Q4_K_M): 144.7 tok/s
  • Gemma-4 E2B-it (Q4_K_M): 83.7 tok/s
  • Gemma-3 4b-it (Q4_K_M): 61.3 tok/s
  • Gemma-4 E4B-it (Q4_K_M): 51.6 tok/s
  • Qwen3.5 35B-A3B (Q4_K_XL): 46.0 tok/s
  • Gemma-4 26B-A4B-it (Q4_K_M): 44.8 tok/s
  • Qwen3.5 27B (Q4_K_XL): 11.5 tok/s
  • Qwen3.6 27B (Q4_K_XL): 11.4 tok/s
  • Gemma-4 31B-it (Q4_K_M): 9.6 tok/s

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
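Why per-stream tok/s falls at higher concurrency can be sketched with a toy scaling model (the exponent and base rate are illustrative assumptions, not fits to this data): batched decode amortizes weight reads, so aggregate throughput grows sublinearly while each request's share shrinks.

```python
# Toy batched-decode scaling model (illustrative, not fitted to this data).
def aggregate_tok_s(base_tok_s: float, concurrency: int, eff: float = 0.7) -> float:
    # Sublinear scaling: aggregate throughput grows as concurrency**eff.
    return base_tok_s * concurrency**eff

BASE = 46.0  # e.g. a single-stream agent-shape rate from the list above
for c in (1, 2, 4):
    agg = aggregate_tok_s(BASE, c)
    per_stream = agg / c
    print(f"conc {c}: aggregate {agg:.1f} tok/s, per-stream {per_stream:.1f} tok/s")
```

Under these assumptions, concurrency 4 serves more total tokens per second than concurrency 1, but each individual agent loop runs noticeably slower, which is why the per-model detail number matters.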

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
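Once the harness records the two counts separately, useful-answer tok/s falls out directly. A minimal sketch, where the field names mirror the ones mentioned above and the values are hypothetical:

```python
# Sketch of the planned split; field names follow the harness fields
# mentioned above, values are hypothetical.
run = {
    "output_tok_per_s": 11.6,        # raw decode rate, all channels
    "reasoning_tokens_median": 420,  # hidden chain-of-thought tokens
    "content_tokens_median": 280,    # user-visible answer tokens
}

total = run["reasoning_tokens_median"] + run["content_tokens_median"]
content_frac = run["content_tokens_median"] / total
useful_tok_s = run["output_tok_per_s"] * content_frac
# With 60% of tokens spent on hidden reasoning, an honest 11.6 tok/s
# decode rate delivers user-visible answer text at only ~4.6 tok/s.
```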

Hardware tested

The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.

Framework Desktop · Mini-ITX
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
backends: llama.cpp b1203 (rocm), llama.cpp b8940 (cpu), llama.cpp b8940 (vulkan), llama.cpp b8940 (rocm)

Framework Desktop with the AMD Ryzen AI Max+ 395 (Strix Halo) APU. 128 GiB of unified LPDDR5X system memory; the GPU side sees 96 GiB through the unified-memory pool. Integrated Radeon 8060S handles the inference workload via ROCm. No discrete GPU, no separate VRAM pool — the 27B-class models in this benchmark set all run on a single APU.
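A back-of-envelope check shows why that works. The sketch below assumes Q4_K_M averages roughly 4.8 bits per weight, a common approximation for llama.cpp K-quants, not an exact figure for any specific model:

```python
# Back-of-envelope VRAM estimate (rule of thumb, not a measurement).
BITS_PER_WEIGHT_Q4_K_M = 4.8  # approximate average for Q4_K_M; varies by tensor mix

def approx_weight_gib(params_billion: float) -> float:
    """Approximate on-disk/in-memory size of the quantized weights in GiB."""
    return params_billion * 1e9 * BITS_PER_WEIGHT_Q4_K_M / 8 / 2**30

print(round(approx_weight_gib(27), 1))   # a 27B model: roughly 15 GiB of weights
print(round(approx_weight_gib(128), 1))  # a 128B model: roughly 72 GiB, still fits
# KV cache and runtime buffers come on top, but 96 GiB of unified memory
# leaves ample headroom for the 27B-class models benchmarked here.
```

The 128B estimate is also consistent with the 50-75 GB Q4_K_M downloads noted under "Coming soon".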

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

  • Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
  • Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
  • Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
  • Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
  • RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
  • Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.