Benchmarks

Inference speed measurements for open-weight models across quantizations, backends, and hardware. Source YAMLs live in content/benchmarks/runs/.

In the tables below, Gen tok/s is a min–max range across all workload shapes at the listed concurrency; the per-workload breakdown follows further down the page.
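The range column can be reproduced from the run records with a few lines. A minimal sketch, assuming each run carries `model`, `concurrency`, `workload`, and `output_tok_per_s` fields (these field names are assumptions, not the actual YAML schema):

```python
# Sketch: derive the "Gen tok/s" min-max range shown in the tables
# from per-workload run records. Field names are illustrative.
def gen_toks_range(runs, model, concurrency=1):
    """Min/max generation tok/s across all workload shapes."""
    rates = [r["output_tok_per_s"] for r in runs
             if r["model"] == model and r["concurrency"] == concurrency]
    return (min(rates), max(rates)) if rates else None

runs = [
    {"model": "gemma-4-E2B-it", "concurrency": 1, "workload": "rag",     "output_tok_per_s": 181.1},
    {"model": "gemma-4-E2B-it", "concurrency": 1, "workload": "chat",    "output_tok_per_s": 193.1},
    {"model": "gemma-4-E2B-it", "concurrency": 1, "workload": "agent",   "output_tok_per_s": 189.3},
    {"model": "gemma-4-E2B-it", "concurrency": 1, "workload": "codegen", "output_tok_per_s": 195.1},
]
print(gen_toks_range(runs, "gemma-4-E2B-it"))  # (181.1, 195.1)
```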

Gemma-4 (16) · released 2026-04

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 181.1–195.1 |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 101.9–118.4 |
| 26B-A4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 74.7–101.4 |
| 31B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 15.2–19.4 |

granite-4.1 (8) · released 2026-04

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 8b | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 63.8–74.4 |
| 30b | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.5–21.0 |

NVIDIA-Nemotron-3-Nano-Omni (4) · released 2026-03

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 30B-A3B-Reasoning | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 109.6–134.2 |

Qwen3.6 (24) · released 2026-03

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 27B-GGUF-Q2_K | Q2_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 22.1–24.3 |
| 27B-GGUF-Q4_K_M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 20.0–21.6 |
| 27B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.7–21.2 |
| 27B-GGUF-Q3_K_M | Q3_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.3–20.9 |
| 27B-GGUF-Q5_K_M | Q5_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 17.3–18.9 |
| 27B-GGUF-Q6_K | Q6_K | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 14.4–15.5 |

LFM2.5-350M (4) · released 2025-11

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 632.0–813.8 |

Qwen3.5 (8) · released 2025-10

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 35B-A3B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 94.2–119.3 |
| 27B | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 19.8–21.4 |

GLM-4.7-Flash (4) · released 2025-09

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| Flash | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 105.4–117.5 |

LFM2 (16) · released 2025-07

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 426.4–471.0 |
| 1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 423.6–465.3 |
| 8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 278.6–332.9 |
| 2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 221.1–238.9 |

Qwen3-Coder (4) · released 2025-06

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 30B-A3B-Instruct | Q4_K_XL | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 117.1–152.7 |

Gemma-3 (4) · released 2025-03

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 86.1–142.0 |

Qwen2.5-Coder (4) · released 2024-11

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | 1 | 77.5–88.6 |

Qwen/Qwen2.5-Coder (12) · released 2024-11

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 76.9–85.8 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 39.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.9–19.5 |

Qwen/Qwen2.5 (12) · released 2024-09

| Variant | Quant | Hardware | Backend | Conc. | Gen tok/s |
|---|---|---|---|---|---|
| 7B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 77.0–85.2 |
| 14B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 38.9–42.6 |
| 32B-Instruct | unknown | GeForce RTX 3090 · 24 GiB | vLLM 0.21.0 (cuda) | 1 | 18.8–19.3 |

Tok/s by workload (concurrency 1)

Same models, four different usage patterns. Prefill and decode are bound by different limits, so the same model produces noticeably different tok/s depending on prompt and answer length. KV-cache size, batch size, and any hidden reasoning tokens the model emits also move the number. Use the per-model ranges above as the headline; this section explains why those ranges exist.
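The context-length effect can be captured in a toy model: each decode step streams the full quantized weights plus the entire KV cache from VRAM, so per-token time grows roughly linearly with cached context. A sketch with made-up constants (illustrative, not measured):

```python
def decode_toks_per_s(kv_len, weights_s=0.045, kv_s_per_tok=1e-6):
    """Toy decode-rate model: per-token step time = time to stream the
    weights plus time to stream kv_len cached entries.
    Constants are assumptions for illustration, not measurements."""
    return 1.0 / (weights_s + kv_len * kv_s_per_tok)

short_ctx = decode_toks_per_s(500)    # chat-like context
long_ctx = decode_toks_per_s(8000)    # rag-like stuffed context
print(round(short_ctx, 1), round(long_ctx, 1))  # 22.0 18.9
```

Two workload shapes on the same model land a few tok/s apart purely from KV-cache length, which is the spread the tables above report as a range.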

chat
All rows: GeForce RTX 3090 · 24 GiB.

| Model / variant | Quant | tok/s |
|---|---|---|
| LFM2.5-350M | Q4_K_M | 722.7 |
| LFM2 1.2B-Tool | Q4_K_M | 459.9 |
| LFM2 1.2B | Q4_K_M | 458.1 |
| LFM2 8B-A1B | Q4_K_M | 318.7 |
| LFM2 2.6B | Q4_K_M | 238.9 |
| Gemma-4 E2B-it | Q4_K_M | 193.1 |
| Qwen3-Coder 30B-A3B-Instruct | Q4_K_XL | 146.2 |
| Gemma-3 4b-it | Q4_K_M | 142.0 |
| NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-Reasoning | Q4_K_M | 123.5 |
| Gemma-4 E4B-it | Q4_K_M | 118.4 |
| GLM-4.7-Flash | Q4_K_XL | 117.5 |
| Qwen3.5 35B-A3B | Q4_K_XL | 109.8 |
| Gemma-4 26B-A4B-it | Q4_K_M | 100.5 |
| Qwen2.5-Coder 7B-Instruct | Q4_K_M | 88.6 |
| Qwen/Qwen2.5-Coder 7B-Instruct | unknown | 85.8 |
| Qwen/Qwen2.5 7B-Instruct | unknown | 85.2 |
| granite-4.1 8b | Q4_K_M | 74.4 |
| Qwen/Qwen2.5-Coder 14B-Instruct | unknown | 42.6 |
| Qwen/Qwen2.5 14B-Instruct | unknown | 42.6 |
| Qwen3.6 27B-GGUF-Q2_K | Q2_K | 24.0 |
| Qwen3.5 27B | Q4_K_XL | 21.2 |
| Qwen3.6 27B-GGUF-Q4_K_M | Q4_K_M | 21.1 |
| Qwen3.6 27B | Q4_K_XL | 21.1 |
| granite-4.1 30b | Q4_K_M | 21.0 |
| Qwen3.6 27B-GGUF-Q3_K_M | Q3_K_M | 20.5 |
| Qwen/Qwen2.5-Coder 32B-Instruct | unknown | 19.5 |
| Gemma-4 31B-it | Q4_K_M | 19.4 |
| Qwen/Qwen2.5 32B-Instruct | unknown | 19.2 |
| Qwen3.6 27B-GGUF-Q5_K_M | Q5_K_M | 18.6 |
| Qwen3.6 27B-GGUF-Q6_K | Q6_K | 15.3 |

Short prompt, short answer. Generation-bound, so output tok/s is a clean reflection of the model's peak decode rate on this hardware.

rag
All rows: GeForce RTX 3090 · 24 GiB.

| Model / variant | Quant | tok/s |
|---|---|---|
| LFM2.5-350M | Q4_K_M | 632.0 |
| LFM2 1.2B | Q4_K_M | 426.4 |
| LFM2 1.2B-Tool | Q4_K_M | 423.6 |
| LFM2 8B-A1B | Q4_K_M | 278.6 |
| LFM2 2.6B | Q4_K_M | 222.6 |
| Gemma-4 E2B-it | Q4_K_M | 181.1 |
| Qwen3-Coder 30B-A3B-Instruct | Q4_K_XL | 117.1 |
| NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-Reasoning | Q4_K_M | 109.6 |
| GLM-4.7-Flash | Q4_K_XL | 105.4 |
| Gemma-4 E4B-it | Q4_K_M | 101.9 |
| Qwen3.5 35B-A3B | Q4_K_XL | 94.2 |
| Gemma-3 4b-it | Q4_K_M | 86.1 |
| Qwen/Qwen2.5-Coder 7B-Instruct | unknown | 78.7 |
| Qwen/Qwen2.5 7B-Instruct | unknown | 77.9 |
| Qwen2.5-Coder 7B-Instruct | Q4_K_M | 77.5 |
| Gemma-4 26B-A4B-it | Q4_K_M | 74.7 |
| granite-4.1 8b | Q4_K_M | 63.8 |
| Qwen/Qwen2.5-Coder 14B-Instruct | unknown | 39.9 |
| Qwen/Qwen2.5 14B-Instruct | unknown | 38.9 |
| Qwen3.6 27B-GGUF-Q2_K | Q2_K | 22.1 |
| Qwen3.6 27B-GGUF-Q4_K_M | Q4_K_M | 20.0 |
| Qwen3.5 27B | Q4_K_XL | 19.8 |
| Qwen3.6 27B | Q4_K_XL | 19.7 |
| granite-4.1 30b | Q4_K_M | 19.5 |
| Qwen3.6 27B-GGUF-Q3_K_M | Q3_K_M | 19.3 |
| Qwen/Qwen2.5-Coder 32B-Instruct | unknown | 18.9 |
| Qwen/Qwen2.5 32B-Instruct | unknown | 18.8 |
| Qwen3.6 27B-GGUF-Q5_K_M | Q5_K_M | 17.3 |
| Gemma-4 31B-it | Q4_K_M | 15.2 |
| Qwen3.6 27B-GGUF-Q6_K | Q6_K | 14.4 |

Long stuffed-context prompt, short answer. Prefill dominates time-to-first-token, and gen tok/s usually dips slightly because each decode step now attends over a much larger KV cache.

codegen
All rows: GeForce RTX 3090 · 24 GiB.

| Model / variant | Quant | tok/s |
|---|---|---|
| LFM2.5-350M | Q4_K_M | 813.8 |
| LFM2 1.2B | Q4_K_M | 471.0 |
| LFM2 1.2B-Tool | Q4_K_M | 465.3 |
| LFM2 8B-A1B | Q4_K_M | 332.9 |
| LFM2 2.6B | Q4_K_M | 232.4 |
| Gemma-4 E2B-it | Q4_K_M | 195.1 |
| Qwen3-Coder 30B-A3B-Instruct | Q4_K_XL | 152.7 |
| Gemma-3 4b-it | Q4_K_M | 137.0 |
| NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-Reasoning | Q4_K_M | 134.2 |
| Qwen3.5 35B-A3B | Q4_K_XL | 119.3 |
| GLM-4.7-Flash | Q4_K_XL | 117.4 |
| Gemma-4 E4B-it | Q4_K_M | 117.2 |
| Gemma-4 26B-A4B-it | Q4_K_M | 101.4 |
| Qwen2.5-Coder 7B-Instruct | Q4_K_M | 81.4 |
| Qwen/Qwen2.5-Coder 7B-Instruct | unknown | 77.4 |
| Qwen/Qwen2.5 7B-Instruct | unknown | 77.0 |
| granite-4.1 8b | Q4_K_M | 70.7 |
| Qwen/Qwen2.5-Coder 14B-Instruct | unknown | 41.1 |
| Qwen/Qwen2.5 14B-Instruct | unknown | 40.0 |
| Qwen3.6 27B-GGUF-Q2_K | Q2_K | 24.3 |
| Qwen3.6 27B-GGUF-Q4_K_M | Q4_K_M | 21.6 |
| Qwen3.5 27B | Q4_K_XL | 21.4 |
| Qwen3.6 27B | Q4_K_XL | 21.2 |
| Qwen3.6 27B-GGUF-Q3_K_M | Q3_K_M | 20.9 |
| granite-4.1 30b | Q4_K_M | 20.9 |
| Qwen/Qwen2.5-Coder 32B-Instruct | unknown | 19.3 |
| Qwen/Qwen2.5 32B-Instruct | unknown | 19.3 |
| Gemma-4 31B-it | Q4_K_M | 19.3 |
| Qwen3.6 27B-GGUF-Q5_K_M | Q5_K_M | 18.9 |
| Qwen3.6 27B-GGUF-Q6_K | Q6_K | 15.5 |

Short prompt, long answer (~1k tokens). Pure decode loop. Numbers here tend to be the closest to the model's sustained ceiling.

agent
All rows: GeForce RTX 3090 · 24 GiB.

| Model / variant | Quant | tok/s |
|---|---|---|
| LFM2.5-350M | Q4_K_M | 715.9 |
| LFM2 1.2B | Q4_K_M | 446.9 |
| LFM2 1.2B-Tool | Q4_K_M | 444.0 |
| LFM2 8B-A1B | Q4_K_M | 315.9 |
| LFM2 2.6B | Q4_K_M | 221.1 |
| Gemma-4 E2B-it | Q4_K_M | 189.3 |
| Qwen3-Coder 30B-A3B-Instruct | Q4_K_XL | 141.1 |
| Gemma-3 4b-it | Q4_K_M | 125.2 |
| NVIDIA-Nemotron-3-Nano-Omni 30B-A3B-Reasoning | Q4_K_M | 121.5 |
| Gemma-4 E4B-it | Q4_K_M | 111.7 |
| GLM-4.7-Flash | Q4_K_XL | 111.2 |
| Qwen3.5 35B-A3B | Q4_K_XL | 109.1 |
| Gemma-4 26B-A4B-it | Q4_K_M | 94.3 |
| Qwen2.5-Coder 7B-Instruct | Q4_K_M | 79.9 |
| Qwen/Qwen2.5 7B-Instruct | unknown | 77.0 |
| Qwen/Qwen2.5-Coder 7B-Instruct | unknown | 76.9 |
| granite-4.1 8b | Q4_K_M | 66.6 |
| Qwen/Qwen2.5-Coder 14B-Instruct | unknown | 40.6 |
| Qwen/Qwen2.5 14B-Instruct | unknown | 40.6 |
| Qwen3.6 27B-GGUF-Q2_K | Q2_K | 23.7 |
| Qwen3.6 27B-GGUF-Q4_K_M | Q4_K_M | 21.1 |
| Qwen3.5 27B | Q4_K_XL | 20.9 |
| Qwen3.6 27B | Q4_K_XL | 20.7 |
| Qwen3.6 27B-GGUF-Q3_K_M | Q3_K_M | 20.5 |
| granite-4.1 30b | Q4_K_M | 20.3 |
| Qwen/Qwen2.5 32B-Instruct | unknown | 19.2 |
| Qwen/Qwen2.5-Coder 32B-Instruct | unknown | 19.2 |
| Qwen3.6 27B-GGUF-Q5_K_M | Q5_K_M | 18.2 |
| Gemma-4 31B-it | Q4_K_M | 18.2 |
| Qwen3.6 27B-GGUF-Q6_K | Q6_K | 15.0 |

Mid-length prompt with tool-call shape, mid-length answer. Realistic for agentic loops. The big drop you'll see at concurrency 4 (in the per-model detail page) is the more useful agent number.
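The concurrency drop can be framed as aggregate versus per-stream throughput: batching amortizes weight reads across requests, so aggregate tok/s rises, but rarely by the full factor of the concurrency, so each individual stream slows down. A sketch with illustrative numbers (the 2.5× batching gain is an assumption, not a measurement from these runs):

```python
def per_stream_toks(single_stream, concurrency, batching_speedup):
    """Per-request tok/s at a given concurrency: aggregate throughput
    (single-stream rate times the batching speedup) divided evenly
    across the concurrent streams. Inputs are illustrative."""
    return single_stream * batching_speedup / concurrency

# 4 concurrent agents, batching buys ~2.5x aggregate throughput:
print(per_stream_toks(141.1, 4, 2.5))  # each agent sees ~88.2 tok/s
```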

Caveat: reasoning models

Models that stream a hidden reasoning_content channel before the user-visible answer (Qwen3.5/3.6, DeepSeek-R1, GPT-OSS reasoning variants) currently count those tokens in output_tok_per_s. The decode rate is honest, but the rate of useful answer text is lower, because part of every token budget is spent on the hidden chain-of-thought. The schema flag model.reasoning is not yet reliable across providers, so the per-model detail pages don't mark these models explicitly.

Next: separate reasoning_tokens_median from content_tokens_median in the harness, and add a reasoning-disabled run mode (per-model: Qwen enable_thinking: false, DeepSeek /no_think, etc.).
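The planned split could be prototyped directly from the streamed deltas. A sketch assuming OpenAI-style chunks where hidden tokens arrive under a reasoning_content key (whitespace-split word counts stand in for real tokenizer counts):

```python
def split_counts(deltas):
    """Count hidden-reasoning vs user-visible words from stream deltas.
    Assumes each delta dict may carry `reasoning_content` and/or `content`."""
    reasoning = sum(len(d.get("reasoning_content", "").split()) for d in deltas)
    content = sum(len(d.get("content", "").split()) for d in deltas)
    return reasoning, content

deltas = [
    {"reasoning_content": "the user wants the list sorted by length"},
    {"content": "Use sorted(items, key=len)."},
]
r, c = split_counts(deltas)
print(r, c, f"content share {c / (r + c):.0%}")
```

Dividing only the content count by wall-clock time would give the "useful answer tok/s" that the caveat above says output_tok_per_s currently overstates.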

Hardware tested

The rigs producing the numbers above. Use the hardware filter at the top of the page to scope results to a specific machine.

Custom 4× RTX 3090 build · Open-frame mining-style chassis

  • CPU: AMD EPYC 7302P 16-Core Processor
  • GPU: NVIDIA GeForce RTX 3090
  • Arch: NVIDIA
  • VRAM: 24 GiB (system 64.0 GiB)
  • Power: 200 W / 450 W max (44% cap)
  • OS: Ubuntu 24.04 LTS
  • Kernel: 6.17.13-7-pve
  • Driver: 590.48.01
  • Backends: llama.cpp 59778f0 (cuda), vLLM 0.21.0 (cuda)

A self-built quad-3090 box that lives in the homelab as a general-purpose ML/inference node. Unless a run explicitly labels itself as multi-GPU, every RTX 3090 result on this page uses exactly one card via LXC GPU passthrough on Proxmox (/dev/nvidia0 for the vLLM container, /dev/nvidia1 for the llama.cpp container). Tensor-parallel and multi-card numbers will land separately and be tagged.

  • GPUs: 4× EVGA RTX 3090 FTW3 Ultra (24G-P5-3987-KR), each capped at 200 W of 450 W stock for thermals and PSU headroom
  • CPU: AMD EPYC 7302P (16C/32T, Zen 2, SP3)
  • Motherboard: ASRock Rack ROMED6U-2L2T
  • Memory: 96 GiB DDR4-2933 (6× 16 GiB ECC RDIMM)
  • Storage: 2 TB Samsung 980 Pro NVMe
  • Chassis: MLACOM Quad Station Pro Lite v3
  • Risers: 1× LINKUP AVA5 PCIe 5.0 straight 25 cm, 2× Okinos PCIe 4.0 150 mm, 1× Okinos PCIe 4.0 200 mm
  • PSUs: 2× Corsair RM1200x Shift (renewed), bridged with a dual-PSU ATX adapter

Coming soon

Queued for the next benchmark pass. Tracking notes live in docs/benchmark-campaign.md in the repo.

  • Strix vLLM FP8 + MTP-1 + draft-spec on Qwen3.6-27B. Blocked on lemonade's hardcoded backend-readiness timeout cutting off the first-load FP8 kernel autotune. Bypass via the bundled vLLM binary to warm the cache, then hand back to lemonade.
  • Strix quant sweep mirroring the 3090 Q2_K..Q6_K data on ROCm. Lemonade's async pull semantics broke the first attempt; switching to a raw llama-server invocation against the bundled ROCm binary.
  • Strix quant-creator comparison for one model from unsloth, bartowski, ggml-org where they all ship the same nominal quant.
  • Strix-only heavyweights: Mistral-Medium-3.5 128B, Mistral-Small-4 119B, Qwen3-Coder-Next 80B. Each is a 50-75 GB Q4_K_M download.
  • RTX 5070 CUDA pass. Currently Vulkan-only because CUDA toolkit install hit a packaging blocker on CachyOS; that's resolved now, so a llama.cpp CUDA build can land for a CUDA-vs-Vulkan comparison on the same NVIDIA silicon.
  • Driver and power-cap sweeps on the RTX 3090 once the rest of the matrix settles.