Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 405 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/

Power rows are isolated here so normal model rankings are not swamped by intermediate cap sweeps and driver reruns.
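
As a rough illustration of what the power rows can be used for, the sketch below compares a capped row with its uncapped counterpart. The four tuples are copied from the listing (RTX 3090, llama.cpp cuda-4f13cb7, driver 590); the function name `cap_effect` and the tuple layout are assumptions made for this example, not part of the site's tooling. Each figure is a single recorded run, so small differences may sit within run-to-run noise.

```python
# Rows copied from the listing: (model, workload, power limit in W, decode tok/s).
# The tuple layout is hypothetical; the site's real row schema is not shown here.
ROWS = [
    ("4b-it", "chat", 450, 184.7),
    ("4b-it", "chat", 350, 182.2),
    ("2.6B", "chat", 450, 332.6),
    ("2.6B", "chat", 350, 324.9),
]


def cap_effect(rows, model, workload, cap_w, max_w=450):
    """Return (cap as a fraction of max power, fraction of decode rate retained)."""
    # Map power limit -> decode rate for the selected model and workload.
    rate = {w: r for m, wl, w, r in rows if m == model and wl == workload}
    return cap_w / max_w, rate[cap_w] / rate[max_w]


if __name__ == "__main__":
    for model in ("4b-it", "2.6B"):
        power_frac, rate_frac = cap_effect(ROWS, model, "chat", 350)
        print(f"{model}: cap at {power_frac:.1%} of max power, "
              f"{rate_frac:.1%} of decode rate retained")
```

For these rows the 350 W cap is about 78% of the 450 W maximum, while the decode rate stays at roughly 98.6% (4b-it) and 97.7% (2.6B) of the uncapped figure.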

Model | Quant | GPU | Power | Driver | Backend | Profile | Workload | Decode tok/s | Latency
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 940.7 | 9ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 936.3 | 14ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 935.5 | 15ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 934.6 | 14ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 927.6 | 17ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 925.1 | 8ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 924.2 | 22ms
350M | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 922.5 | 18ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 617.7 | 13ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 617.3 | 11ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 613.1 | 19ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 613.1 | 17ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 611.6 | 34ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 609.8 | 33ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 609.0 | 12ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 609.0 | 20ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 608.6 | 12ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 607.9 | 20ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 605.7 | 17ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 605.0 | 16ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 602.4 | 47ms
1.2B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 601.7 | 23ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 601.0 | 34ms
1.2B-Tool | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 600.6 | 19ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 439.9 | 21ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 438.0 | 34ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 436.7 | 52ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 435.2 | 22ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 433.7 | 21ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 430.3 | 33ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 428.3 | 23ms
8B-A1B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 425.0 | 53ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 332.6 | 17ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 329.2 | 55ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 328.2 | 32ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 326.6 | 28ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 324.9 | 18ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 322.4 | 57ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 321.9 | 33ms
2.6B | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 320.5 | 42ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 226.4 | 41ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 226.2 | 40ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 224.5 | 83ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 224.3 | 56ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 223.7 | 54ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 223.3 | 51ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 222.5 | 49ms
E2B-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 222.2 | 82ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 203.0 | 86ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 201.7 | 89ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 199.8 | 125ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 198.9 | 205ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 198.6 | 157ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 197.8 | 120ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 196.9 | 236ms
30B-A3B-Instruct (think) | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 196.0 | 164ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r3 | chat | 186.7 | 36ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595 | chat | 186.2 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 430 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-430w | chat | 185.2 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 410 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-410w | chat | 185.1 | 38ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r2 | chat | 185.0 | 26ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-420w | chat | 184.9 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 184.7 | 38ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 400 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-400w | chat | 184.6 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 440 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-440w | chat | 184.6 | 39ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 380 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-380w | chat | 184.0 | 37ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 183.9 | 159ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-350w-595-r2 | chat | 183.4 | 29ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 183.4 | 291ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 370 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-370w | chat | 183.3 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 390 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-390w | chat | 183.3 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 360 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-360w | chat | 183.2 | 39ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595 | codegen | 183.1 | 132ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 183.1 | 191ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r3 | codegen | 182.6 | 119ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 182.5 | 208ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 182.2 | 38ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 340 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-340w | chat | 181.9 | 37ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r3 | rag | 181.9 | 339ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 181.8 | 145ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595 | rag | 181.6 | 392ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r2 | codegen | 181.5 | 29ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 181.3 | 129ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595 | agent | 181.3 | 227ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 430 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-430w | codegen | 181.2 | 135ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-420w | codegen | 181.2 | 143ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 440 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-440w | codegen | 181.1 | 121ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 181.0 | 328ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 410 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-410w | codegen | 181.0 | 120ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 330 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-330w | chat | 180.8 | 39ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 180.8 | 204ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 400 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-400w | codegen | 180.8 | 125ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r3 | agent | 180.8 | 228ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 180.6 | 339ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 380 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-380w | codegen | 180.4 | 130ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-420w | rag | 180.3 | 424ms
30B-A3B-Reasoning (think) | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 180.2 | 189ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r2 | rag | 180.2 | 231ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 370 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-370w | codegen | 180.2 | 129ms
4b-it | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-450w-595-r2 | agent | 180.1 | 203ms

Decode tok/s - Headline speed metric.
TTFT / TPOT - Latency context.
Raw vs workload - Separate comparison contracts.
Notes badge key
hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy - Older workload harness row.
350 W cap - Recorded GPU power limit.
drv 590 - GPU driver branch.
reasoning - Reasoning-token model.
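
The "hardware comparable" rule above can be sketched as a grouping step: rows are bucketed by everything except the GPU, and only buckets that contain more than one GPU are kept. This is a minimal sketch under stated assumptions, not the site's implementation. The `Row` fields and the function names are hypothetical, and exact equality stands in for the looser "match closely" wording in the badge description.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical row shape; the real source schema is not shown on this page.
    gpu: str
    model: str
    quant: str
    backend: str
    driver: str
    power: str
    workload: str
    decode_tok_s: float


def match_key(row):
    """Everything that must match before two GPUs are compared."""
    return (row.model, row.quant, row.backend, row.driver, row.power, row.workload)


def hardware_comparable(rows):
    """Group rows by match_key and keep groups that span more than one GPU."""
    groups = defaultdict(list)
    for row in rows:
        groups[match_key(row)].append(row)
    return {
        key: members
        for key, members in groups.items()
        if len({r.gpu for r in members}) > 1
    }
```

Rows that differ only in driver branch or power limit land in different groups, which is the reason the cap sweeps and driver reruns do not mix with GPU-to-GPU comparisons in this sketch.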
Metric guide
Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.
TTFT - Time to first token. This includes prompt processing and server/API overhead.
TPOT / ITL - Time per output token after the first token. Lower is better.
Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.
Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.
Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
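
To make the metric definitions above concrete, here is a minimal sketch of how TTFT, TPOT, and a decode rate could be derived from per-token arrival times on an API row. The function name and argument layout are assumptions for illustration, and treating decode tok/s as the reciprocal of mean TPOT is one plausible reading of "token intervals", not a confirmed description of the harness.

```python
def latency_metrics(request_sent_s, token_times_s):
    """Derive latency metrics from timestamps in seconds.

    request_sent_s: time the request was sent.
    token_times_s:  arrival time of each output token, in ascending order.
    """
    if len(token_times_s) < 2:
        raise ValueError("need at least two output tokens to measure intervals")

    # TTFT: request sent -> first token, so it includes prompt processing
    # and any server/API overhead.
    ttft_s = token_times_s[0] - request_sent_s

    # TPOT: mean interval between consecutive tokens after the first one.
    decode_span_s = token_times_s[-1] - token_times_s[0]
    tpot_s = decode_span_s / (len(token_times_s) - 1)

    return {
        "ttft_ms": ttft_s * 1000.0,
        "tpot_ms": tpot_s * 1000.0,
        # Assumption: decode rate taken as the reciprocal of mean TPOT.
        "decode_tok_s": 1.0 / tpot_s,
    }


if __name__ == "__main__":
    # Synthetic timestamps: first token after 50 ms, then one token every 10 ms.
    print(latency_metrics(0.0, [0.05, 0.06, 0.07, 0.08]))
```

Because the first token is excluded from the interval average, a long prompt raises TTFT without lowering the decode rate, which matches how the listing shows rag rows with higher latency but similar tok/s.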