Benchmarks
Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.
Tables are sorted by gen tok/s, descending. Gen tok/s is shown as a range across all workload shapes at the listed concurrency; the "Tok/s by workload" section below explains the spread, and the sketch after the tables shows how the ranges can be recomputed from the run YAMLs.
▸ Gemma-4 (16 runs) · released 2026-04
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 181.1–195.1 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 101.9–118.4 |
| 26B-A4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 74.7–101.4 |
| 31B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 15.2–19.4 |
▸ granite-4.1 (8 runs) · released 2026-04
▸ NVIDIA-Nemotron-3-Nano-Omni (4 runs) · released 2026-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 30B-A3B-Reasoning | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 109.6–134.2 |
▸ Qwen3.6 (24 runs) · released 2026-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 27B-GGUF-Q2_K | Q2_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 22.1–24.3 |
| 27B-GGUF-Q4_K_M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 20.0–21.6 |
| 27B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.7–21.2 |
| 27B-GGUF-Q3_K_M | Q3_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.3–20.9 |
| 27B-GGUF-Q5_K_M | Q5_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 17.3–18.9 |
| 27B-GGUF-Q6_K | Q6_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 14.4–15.5 |
▸ LFM2.5-350M (4 runs) · released 2025-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 632.0–813.8 |
▸ Qwen3.5 (8 runs) · released 2025-10
▸ GLM-4.7-Flash (4 runs) · released 2025-09
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| Flash | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 105.4–117.5 |
▸ LFM2 (16 runs) · released 2025-07
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 426.4–471.0 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 423.6–465.3 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 278.6–332.9 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 221.1–238.9 |
▸ Qwen3-Coder (4 runs) · released 2025-06
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 30B-A3B-Instruct | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 117.1–152.7 |
▸ Gemma-3 (4 runs) · released 2025-03
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 86.1–142.0 |
▸ Qwen2.5-Coder (4 runs) · released 2024-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 77.5–88.6 |
▸ Qwen/Qwen2.5-Coder (12 runs) · released 2024-11
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 76.9–85.8 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 39.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.9–19.5 |
▸ Qwen/Qwen2.5 (12 runs) · released 2024-09
| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s ↓ |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 77.0–85.2 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 38.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.8–19.3 |
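For anyone recomputing these ranges from the raw data, here is a minimal sketch of the idea. It assumes a flat field layout in the run YAMLs (model, concurrency, output_tok_per_s, one file or entry per workload shape) and an illustrative model id; the actual schema in content/benchmarks/runs/ may differ.

```python
# Sketch: recompute a table's gen tok/s range from the run files.
# Field names (model, concurrency, output_tok_per_s) are assumptions
# about the schema, not its documented form.
from pathlib import Path

import yaml  # pip install pyyaml


def gen_tok_s_range(runs_dir: str, model: str, concurrency: int = 1):
    """Min-max output_tok_per_s across workload shapes for one model/concurrency."""
    rates = []
    for path in sorted(Path(runs_dir).glob("*.yaml")):
        run = yaml.safe_load(path.read_text())
        if run.get("model") == model and run.get("concurrency") == concurrency:
            rates.append(run["output_tok_per_s"])  # one rate per workload shape
    return (min(rates), max(rates)) if rates else None


print(gen_tok_s_range("content/benchmarks/runs", "LFM2-1.2B-Q4_K_M"))  # illustrative id
```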
Tok/s by workload (concurrency 1)
Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the per-model ranges in the tables above as the headline; this section explains why those ranges exist.
Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.
Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token; gen tok/s usually dips slightly because the KV cache is hot but bigger.
Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.
Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (on the per-model detail pages) is the more useful agent number.
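To make the four shapes concrete, here is a rough sketch of how they could be expressed as requests against an OpenAI-compatible endpoint (both llama.cpp's server and vLLM expose one). The base URL, token budgets, and filler prompt are invented for this sketch, not the harness's real parameters.

```python
# Illustrative workload shapes for an OpenAI-compatible endpoint.
# Base URL, token budgets, and the filler prompt are invented for this sketch.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")

SHAPES = {                                  # (approx. prompt tokens, answer budget)
    "short_prompt_short_answer": (64, 64),
    "long_prompt_short_answer":  (8192, 64),
    "short_prompt_long_answer":  (64, 1024),
    "tool_call_mid_mid":         (1024, 256),
}


def run_shape(model: str, shape: str) -> float:
    """Fire one request for a shape and return end-to-end generated tok/s."""
    prompt_tokens, max_tokens = SHAPES[shape]
    prompt = "lorem ipsum dolor sit amet " * (prompt_tokens // 6)  # crude filler
    t0 = time.monotonic()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    # A real harness would stream and subtract time-to-first-token so the
    # long-prompt shape's prefill cost doesn't leak into the decode rate.
    return resp.usage.completion_tokens / (time.monotonic() - t0)
```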
Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark them explicitly.
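A toy calculation with made-up numbers shows the gap between the two rates:

```python
# Made-up numbers, not measurements: one reply with a hidden chain-of-thought.
reasoning_tokens = 700          # hidden reasoning_content channel
content_tokens   = 300          # user-visible answer
decode_seconds   = 10.0

output_tok_per_s  = (reasoning_tokens + content_tokens) / decode_seconds  # 100.0, what the tables report
content_tok_per_s = content_tokens / decode_seconds                       # 30.0, rate of useful answer text
```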
Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
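For the Qwen half of that plan, the usual route on backends that pass chat_template_kwargs through to the chat template (vLLM and SGLang do for Qwen-style thinking templates) looks roughly like this; support varies by backend and template, so treat it as a sketch rather than the harness's actual run mode.

```python
# Sketch of a reasoning-disabled request for a Qwen-style thinking model.
# Model id and base URL are illustrative; requires a backend that forwards
# chat_template_kwargs to the template.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

resp = client.chat.completions.create(
    model="Qwen3.6-27B",  # illustrative id
    messages=[{"role": "user", "content": "Summarize the benchmark setup."}],
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(resp.choices[0].message.content)
```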
Hardware tested
The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.
A self-built quad-3090 box that lives in the homelab as a general-purpose ML/inference node. Unless a run explicitly labels itself as multi-GPU, every RTX 3090 result on this page uses exactly one card via LXC GPU passthrough on Proxmox (/dev/nvidia0 for the vLLM container, /dev/nvidia1 for the llama.cpp container). Tensor-parallel and multi-card numbers will land separately and be tagged.
- GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), each power-limited to 200 W (stock 450 W) for thermals and PSU headroom
- CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
- Motherboard: ASRock Rack ROMED6U-2L2T
- Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
- Storage: 2 TB Samsung 980 Pro NVMe
- Chassis: MLACOM Quad Station Pro Lite v3
- Risers: 1× LINKUP AVA5 PCIe 5.0 straight 25 cm, 2× Okinos PCIe 4.0 150 mm, 1× Okinos PCIe 4.0 200 mm
- PSUs: 2× Corsair RM1200x Shift (renewed), bridged with a dual-PSU ATX adapter
Coming soon
Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.
- Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Workaround: warm the cache by launching the bundled vLLM binary directly, then hand back to lemonade.
- Strix quant sweep mirroring the 3090 Q2_K–Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
- Strix quant-creator comparison for one model from unsloth, bartowski, and ggml-org where they all ship the same nominal quant.
- Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50–75 GB Q4_K_M download.
- RTX 5070 CUDA pass. Currently Vulkan-only because the CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
- Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.