
Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 405 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/

Power rows are isolated here so normal model rankings are not swamped by intermediate cap sweeps and driver reruns.

| Model | Quant | GPU | Power | Driver | Backend | Profile | Workload | Decode tok/s | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 410 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-410w | rag | 180.1 | 333ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 180.1 | 223ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 320 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-320w | chat | 180.0 | 42ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 390 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-390w | codegen | 180.0 | 131ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 430 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-430w | rag | 180.0 | 326ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 440 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-440w | agent | 179.9 | 226ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-350w-595-r2 | codegen | 179.8 | 29ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 380 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-380w | rag | 179.7 | 348ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 440 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-440w | rag | 179.7 | 354ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 430 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-430w | agent | 179.7 | 223ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 360 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-360w | codegen | 179.5 | 119ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-420w | agent | 179.5 | 223ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 390 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-390w | rag | 179.4 | 323ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 400 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-400w | agent | 179.4 | 227ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 310 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-310w | chat | 179.3 | 38ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 410 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-410w | agent | 179.2 | 251ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 390 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-390w | agent | 179.1 | 246ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 380 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-380w | agent | 179.0 | 221ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-350w-595-r2 | rag | 178.9 | 235ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 400 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-400w | rag | 178.6 | 341ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 178.6 | 127ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 360 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-360w | rag | 178.4 | 345ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 340 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-340w | codegen | 178.3 | 125ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 370 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-370w | agent | 178.2 | 234ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 370 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-370w | rag | 178.2 | 382ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 330 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-330w | codegen | 177.8 | 132ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-300w | chat | 177.8 | 36ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 177.7 | 387ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 177.7 | 228ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 360 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-360w | agent | 177.7 | 240ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-350w-595-r2 | agent | 177.4 | 218ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 340 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-340w | rag | 177.3 | 338ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 320 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-320w | codegen | 176.7 | 131ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 290 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-290w | chat | 176.7 | 37ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 340 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-340w | agent | 176.7 | 262ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 330 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-330w | agent | 176.1 | 261ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 330 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-330w | rag | 176.0 | 339ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 310 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-310w | codegen | 175.6 | 119ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 320 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-320w | rag | 175.5 | 330ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 320 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-320w | agent | 175.3 | 227ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 280 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-280w | chat | 174.9 | 38ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-300w | codegen | 174.4 | 120ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 310 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-310w | rag | 174.3 | 351ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 310 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-310w | agent | 173.3 | 251ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-300w | rag | 173.2 | 341ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 270 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-270w | chat | 173.1 | 38ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 290 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-290w | codegen | 172.9 | 127ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-300w | agent | 172.9 | 231ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 290 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-290w | rag | 172.2 | 343ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 280 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-280w | codegen | 171.4 | 128ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 260 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-260w | chat | 171.3 | 37ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 290 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-290w | agent | 170.9 | 291ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 280 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-280w | rag | 170.6 | 367ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 270 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-270w | codegen | 169.6 | 125ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 280 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-280w | agent | 169.6 | 250ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 250 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-250w | chat | 169.2 | 37ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 270 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-270w | agent | 168.0 | 229ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 270 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-270w | rag | 168.0 | 361ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 260 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-260w | codegen | 166.9 | 124ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 260 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-260w | rag | 166.7 | 340ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 260 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-260w | agent | 165.1 | 250ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 250 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-250w | codegen | 164.9 | 131ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 240 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-240w | chat | 164.7 | 40ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 250 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-250w | rag | 164.6 | 328ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 250 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-250w | agent | 163.4 | 238ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 240 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-240w | codegen | 162.1 | 117ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 230 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-230w | chat | 161.8 | 38ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 240 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-240w | rag | 161.3 | 389ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 240 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-240w | agent | 161.0 | 230ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 220 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-220w | chat | 161.0 | 39ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 230 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-230w | rag | 158.8 | 351ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 230 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-230w | codegen | 158.4 | 120ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 230 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-230w | agent | 156.3 | 292ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 210 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-210w | chat | 153.6 | 38ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 220 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-220w | rag | 153.3 | 331ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 220 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-220w | codegen | 152.1 | 131ms |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 151.9 | 26ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | chat | 151.3 | 20ms |
| 4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 220 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-220w | agent | 151.0 | 231ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | chat | 150.9 | 21ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | chat | 150.8 | 22ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | chat | 150.7 | 22ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | rag | 150.3 | 28ms |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 150.1 | 47ms |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 149.8 | 104ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | codegen | 149.7 | 27ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | rag | 149.7 | 24ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | rag | 149.6 | 25ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | rag | 149.6 | 27ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | codegen | 149.5 | 27ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | codegen | 149.3 | 27ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | codegen | 149.2 | 27ms |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 148.9 | 75ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | agent | 148.9 | 22ms |
| 35B-A3B think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 148.5 | 125ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | agent | 148.4 | 23ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | agent | 148.3 | 21ms |
| E4B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 148.2 | 59ms |
| 7B-Instruct | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | agent | 148.2 | 23ms |
| 7B-Instruct | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 147.9 | 36ms |
Decode tok/s - Headline speed metric
TTFT / TPOT - Latency context
Raw vs workload - Separate comparison contracts
Notes badge key
hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy - Older workload harness row.
350 W cap - Recorded GPU power limit.
drv 590 - GPU driver branch.
reasoning - Reasoning-token model.
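
The "hardware comparable" rule in the badge key can be read as grouping rows by everything except the GPU. The sketch below is one possible way to express that, not the site's actual logic: the `Row` fields, the `comparison_key` and `hardware_comparable` helpers, and the sample values (including the `GPU-A` / `GPU-B` names and their tok/s figures) are all hypothetical placeholders. It also uses strict equality, which is stricter than the "match closely" wording above.

```python
from collections import defaultdict
from typing import NamedTuple


class Row(NamedTuple):
    # Hypothetical row shape; field names are assumptions, not the real schema.
    gpu: str
    model: str
    quant: str
    backend: str
    driver: str
    power: str
    workload: str
    decode_tps: float


def comparison_key(row):
    """Everything that must match before two GPUs are compared."""
    return (row.model, row.quant, row.backend, row.driver, row.power, row.workload)


def hardware_comparable(rows):
    """Group rows by comparison key and keep groups that span more than one GPU.

    If a GPU appears twice under the same key, the later row wins.
    """
    groups = defaultdict(dict)
    for row in rows:
        groups[comparison_key(row)][row.gpu] = row.decode_tps
    return {key: by_gpu for key, by_gpu in groups.items() if len(by_gpu) > 1}


# Made-up placeholder rows, not measurements from this page.
rows = [
    Row("GPU-A", "4b-it", "Q4_K_M", "llama.cpp", "590", "cap 350 W", "chat", 100.0),
    Row("GPU-B", "4b-it", "Q4_K_M", "llama.cpp", "590", "cap 350 W", "chat", 120.0),
    Row("GPU-B", "4b-it", "Q4_K_M", "llama.cpp", "595", "cap 350 W", "chat", 125.0),
]

for key, by_gpu in hardware_comparable(rows).items():
    print(key, by_gpu)
```

The third placeholder row differs only in driver branch, so it falls into its own group and is left out of the GPU-to-GPU comparison.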
Metric guide
Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.
TTFT - Time to first token. This includes prompt processing and server/API overhead.
TPOT / ITL - Time per output token after the first token. Lower is better.
Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.
Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.
Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
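
As a rough illustration of the metric guide, the sketch below derives TTFT, TPOT, and decode tok/s from per-token arrival timestamps, and expresses a power cap as a fraction of the recorded max. It assumes timestamps in seconds from a single streamed request; the function names `latency_metrics` and `cap_fraction` are hypothetical, and the harness behind these rows may compute its numbers differently.

```python
def latency_metrics(request_start, token_times):
    """Return (ttft, tpot, decode_tps) from token arrival times in seconds.

    ttft: time from request start to the first token.
    tpot: mean interval between tokens after the first one.
    decode_tps: tokens per second during decode, i.e. 1 / tpot.
    tpot and decode_tps are None when fewer than two tokens arrived.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    ttft = token_times[0] - request_start
    intervals = len(token_times) - 1
    if intervals == 0:
        return ttft, None, None
    tpot = (token_times[-1] - token_times[0]) / intervals
    decode_tps = 1.0 / tpot if tpot > 0 else None
    return ttft, tpot, decode_tps


def cap_fraction(cap_watts, max_watts):
    """Power cap relative to the recorded max, e.g. cap 410 W against 450 W max."""
    if max_watts <= 0:
        raise ValueError("max_watts must be positive")
    return cap_watts / max_watts


# Illustrative timestamps only, not taken from any run on this page.
ttft, tpot, tps = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.0f} ms, decode {tps:.1f} tok/s")
print(f"cap fraction {cap_fraction(410, 450):.3f}")
```

Keeping the first token out of the TPOT average is what separates prompt processing and server overhead (TTFT) from the steady-state generation rate.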