Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Tables are sorted by gen tok/s, descending. Gen tok/s is shown as a range across all workload shapes at the listed concurrency; the "Tok/s by workload" section below explains the spread, and the sketch after the tables shows how the ranges can be recomputed from the run YAMLs.
▸ Gemma-4 (16 runs) · released 2026-04
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 181.1–195.1 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 101.9–118.4 |
| 26B-A4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 74.7–101.4 |
| 31B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 15.2–19.4 |
▸ granite-4.1 (8 runs) · released 2026-04
▸ NVIDIA-Nemotron-3-Nano-Omni (4 runs) · released 2026-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 30B-A3B-Reasoning | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 109.6–134.2 |
▸ Qwen3.6 (24 runs) · released 2026-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 27B-GGUF-Q2_K | Q2_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 22.1–24.3 |
| 27B-GGUF-Q4_K_M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 20.0–21.6 |
| 27B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.7–21.2 |
| 27B-GGUF-Q3_K_M | Q3_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.3–20.9 |
| 27B-GGUF-Q5_K_M | Q5_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 17.3–18.9 |
| 27B-GGUF-Q6_K | Q6_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 14.4–15.5 |
▸ LFM2.5-350M (4 runs) · released 2025-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 632.0–813.8 |
▸ Qwen3.5 (8 runs) · released 2025-10
▸ GLM-4.7-Flash (4 runs) · released 2025-09
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| Flash | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 105.4–117.5 |
▸ LFM2 (16 runs) · released 2025-07
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 426.4–471.0 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 423.6–465.3 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 278.6–332.9 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 221.1–238.9 |
▸ Qwen3-Coder (4 runs) · released 2025-06
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 30B-A3B-Instruct | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 117.1–152.7 |
▸ Gemma-3 (4 runs) · released 2025-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 86.1–142.0 |
▸ Qwen2.5-Coder (4 runs) · released 2024-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 77.5–88.6 |
▸ Qwen/Qwen2.5-Coder (12 runs) · released 2024-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 76.9–85.8 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 39.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.9–19.5 |
▸ Qwen/Qwen2.5 (12 runs) · released 2024-09
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 77.0–85.2 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 38.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.8–19.3 |
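For anyone recomputing these ranges from the raw data, here is a minimal sketch of the idea. It assumes a flat field layout in the run YAMLs (model, concurrency, output_tok_per_s, one file or entry per workload shape) and an illustrative model id; the actual schema in content/benchmarks/runs/ may differ.

```python
# Sketch: recompute a table's gen tok/s range from the run files.
# Field names (model, concurrency, output_tok_per_s) are assumptions
# about the schema, not its documented form.
from pathlib import Path

import yaml  # pip install pyyaml


def gen_tok_s_range(runs_dir: str, model: str, concurrency: int = 1):
    """Min-max output_tok_per_s across workload shapes for one model/concurrency."""
    rates = []
    for path in sorted(Path(runs_dir).glob("*.yaml")):
        run = yaml.safe_load(path.read_text())
        if run.get("model") == model and run.get("concurrency") == concurrency:
            rates.append(run["output_tok_per_s"])  # one rate per workload shape
    return (min(rates), max(rates)) if rates else None


print(gen_tok_s_range("content/benchmarks/runs", "LFM2-1.2B-Q4_K_M"))  # illustrative id
```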
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the per-model ranges in the tables above as the headline; this section explains why those ranges exist.
Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.
Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.
Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (on the per-model detail pages) is the more useful agent number.
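To make the four shapes concrete, here is a rough sketch of how they could be expressed as requests against an OpenAI-compatible endpoint (both llama.cpp's server and vLLM expose one). The base URL, token budgets, and filler prompt are invented for this sketch, not the harness's real parameters.

```python
# Illustrative workload shapes for an OpenAI-compatible endpoint.
# Base URL, token budgets, and the filler prompt are invented for this sketch.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

SHAPES = {                                  # (approx. prompt tokens, answer budget)
    "short_prompt_short_answer": (64, 64),
    "long_prompt_short_answer":  (8192, 64),
    "short_prompt_long_answer":  (64, 1024),
    "tool_call_mid_mid":         (1024, 256),
}


def run_shape(model: str, shape: str) -> float:
    """Fire one request for a shape and return end-to-end generated tok/s."""
    prompt_tokens, max_tokens = SHAPES[shape]
    prompt = "lorem ipsum dolor sit amet " * (prompt_tokens // 6)  # crude filler
    t0 = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    # A real harness would stream and subtract time-to-first-token so the
    # long-prompt shape's prefill cost doesn't leak into the decode rate.
    return resp.usage.completion_tokens / (time.monotonic() - t0)
```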
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.
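A toy calculation with made-up numbers shows the gap between the two rates:

```python
# Made-up numbers, not measurements: one reply with a hidden chain-of-thought.
reasoning_tokens = 700          # hidden reasoning_content channel
content_tokens   = 300          # user-visible answer
decode_seconds   = 10.0

output_tok_per_s  = (reasoning_tokens + content_tokens) / decode_seconds  # 100.0, what the tables report
content_tok_per_s = content_tokens / decode_seconds                       # 30.0, rate of useful answer text
```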
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
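For the Qwen half of that plan, the usual route on backends that pass chat_template_kwargs through to the chat template (vLLM and SGLang do for Qwen-style thinking templates) looks roughly like this; support varies by backend and template, so treat it as a sketch rather than the harness's actual run mode.

```python
# Sketch of a reasoning-disabled request for a Qwen-style thinking model.
# Model id and base URL are illustrative; requires a backend that forwards
# chat_template_kwargs to the template.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-27B",  # illustrative id
    messages=[{"role": "user", "content": "Summarize the benchmark setup."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```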
Hardware tested
The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.
A self-built quad-3090 box that lives in the homelab as a general-purpose ML/inference node. Unless a run explicitly labels itself as multi-GPU, every RTX 3090 result on this page uses exactly one card via LXC GPU passthrough on Proxmox (/dev/nvidia0 for the vLLM container, /dev/nvidia1 for the llama.cpp container). Tensor-parallel and multi-card numbers will land separately and be tagged.
- GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), each power-limited to 200 W (stock 450 W) for thermals and PSU headroom
- CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
- Motherboard: ASRock Rack ROMED6U-2L2T
- Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
- Storage: 2 TB Samsung 980 Pro NVMe
- Chassis: MLACOM Quad Station Pro Lite v3
- Risers: 1× LINKUP AVA5 PCIe 5.0 straight 25 cm, 2× Okinos PCIe 4.0 150 mm, 1× Okinos PCIe 4.0 200 mm
- PSUs: 2× Corsair RM1200x Shift (renewed), bridged with a dual-PSU ATX adapter
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Workaround: warm the cache by launching the bundled vLLM binary directly, then hand back to lemonade.
- Strix quant sweep mirroring the 3090 Q2_K–Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, and ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50–75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because the CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.