
Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 414 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/

Power rows are isolated here so normal model rankings are not swamped by intermediate cap sweeps and driver reruns.

| Model | Tag | Quant | GPU | Power | Driver | Backend | Profile | Workload | Decode tok/s | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 147.6 | 124 ms |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 147.4 | 159 ms |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 147.3 | 398 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 147.3 | 69 ms |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 147.0 | 234 ms |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 146.9 | 356 ms |
| 7B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 146.8 | 47 ms |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 146.7 | 174 ms |
| 7B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 146.4 | 108 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 210 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-210w | rag | 146.3 | 344 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 146.2 | 98 ms |
| 35B-A3B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 146.0 | 222 ms |
| 7B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 145.6 | 71 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 145.3 | 180 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 145.1 | 96 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 144.6 | 196 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 144.4 | 148 ms |
| E4B-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 143.9 | 138 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-200w-595-r2 | chat | 142.9 | 41 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | chat | 142.9 | 45 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 210 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-210w | codegen | 142.4 | 129 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 210 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-210w | agent | 141.7 | 237 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-200w-595-r2 | rag | 138.1 | 372 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | rag | 133.5 | 340 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | agent | 128.7 | 249 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-200w-595-r2 | agent | 128.4 | 246 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | codegen | 127.9 | 127 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 127.0 | 42 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 190 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-190w | chat | 127.0 | 43 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp cuda-3e12fbd (cuda) | baseline-pl-200w-595-r2 | codegen | 126.7 | 125 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 124.8 | 121 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 124.4 | 223 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 124.1 | 45 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 123.1 | 189 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 122.5 | 110 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 122.1 | 244 ms |
| 8b | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 120.7 | 173 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 190 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-190w | rag | 119.6 | 383 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 190 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-190w | agent | 113.9 | 248 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 190 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-190w | codegen | 112.9 | 122 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 180 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-180w | chat | 110.9 | 44 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 180 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-180w | rag | 104.2 | 370 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 180 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-180w | codegen | 100.1 | 124 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 180 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-180w | agent | 100.0 | 300 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 170 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-170w | chat | 96.8 | 45 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 170 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-170w | rag | 89.8 | 370 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 170 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-170w | agent | 87.5 | 250 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 170 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-170w | codegen | 85.2 | 127 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | chat | 82.5 | 32 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | chat | 82.4 | 33 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | chat | 82.3 | 34 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | chat | 82.3 | 33 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | codegen | 81.6 | 44 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | rag | 81.6 | 34 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | codegen | 81.6 | 44 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | rag | 81.5 | 36 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | rag | 81.5 | 35 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | rag | 81.5 | 36 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | codegen | 81.4 | 44 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | codegen | 81.4 | 44 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | agent | 81.0 | 31 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | agent | 81.0 | 30 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | agent | 80.8 | 29 ms |
| 14B-Instruct | - | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | agent | 80.8 | 29 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 160 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-160w | chat | 79.8 | 48 ms |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp opt-build (cuda) | pl-450w | mixed_2048_768 | 78.9 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp opt-build (cuda) | pl-450w | mixed_64_1024 | 78.4 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 595 | llama.cpp opt-build (cuda) | pl-450w | mixed_1024_1024 | 78.0 | - |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 160 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-160w | rag | 75.4 | 374 ms |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 595 | llama.cpp opt-build (cuda) | pl-300w | mixed_64_1024 | 74.2 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 595 | llama.cpp opt-build (cuda) | pl-300w | mixed_2048_768 | 74.0 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 300 W | drv 595 | llama.cpp opt-build (cuda) | pl-300w | mixed_1024_1024 | 73.7 | - |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 160 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-160w | agent | 71.6 | 293 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 160 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-160w | codegen | 70.4 | 123 ms |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | 300 W max | drv 595 | llama.cpp b9174 (cuda) | pl-300w | mixed_2048_768 | 64.6 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | cap 250 W | drv 595 | llama.cpp b9174 (cuda) | pl-250w | mixed_2048_768 | 64.6 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | cap 200 W | drv 595 | llama.cpp b9174 (cuda) | pl-200w | mixed_2048_768 | 64.4 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | cap 250 W | drv 595 | llama.cpp b9174 (cuda) | pl-250w | mixed_64_1024 | 64.4 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | 300 W max | drv 595 | llama.cpp b9174 (cuda) | pl-300w | mixed_64_1024 | 64.4 | - |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 150 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-150w | chat | 64.4 | 55 ms |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | 300 W max | drv 595 | llama.cpp b9174 (cuda) | pl-300w | mixed_1024_1024 | 64.4 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | cap 250 W | drv 595 | llama.cpp b9174 (cuda) | pl-250w | mixed_1024_1024 | 64.4 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | cap 200 W | drv 595 | llama.cpp b9174 (cuda) | pl-200w | mixed_64_1024 | 64.3 | - |
| 14B-Instruct | - | Q4_K_M | GeForce RTX 5070 · 12 GiB | cap 200 W | drv 595 | llama.cpp b9174 (cuda) | pl-200w | mixed_1024_1024 | 64.2 | - |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 150 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-150w | rag | 58.2 | 448 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 150 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-150w | agent | 58.2 | 287 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 150 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-150w | codegen | 58.1 | 131 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 140 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-140w | rag | 52.6 | 429 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 140 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-140w | chat | 50.9 | 68 ms |
| 27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | codegen | 50.6 | 380 ms |
| 27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | chat | 50.5 | 265 ms |
| 27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | agent | 49.9 | 623 ms |
| 27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | codegen | 48.7 | 390 ms |
| 27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | chat | 47.4 | 275 ms |
| 27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | agent | 47.2 | 646 ms |
| 27B | think | Q2_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 47.2 | 240 ms |
| 27B | think | Q2_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 46.5 | 881 ms |
| 4b-it | - | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 140 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-140w | codegen | 46.3 | 131 ms |
| 27B | think | Q2_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 45.9 | 321 ms |
| 27B | think | Q2_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 45.6 | 509 ms |

Decode tok/s - Headline speed metric
TTFT / TPOT - Latency context
Raw vs workload - Separate comparison contracts
Notes badge key
hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.
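The matching rule above can be sketched as a key comparison. This is only an illustration: the `Row` type, its field names, and the choice of exact string equality for "match closely" are assumptions, not the site's actual schema or logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    # Hypothetical row shape; field names are illustrative only.
    model: str
    quant: str
    backend: str
    driver: str
    power: str
    workload: str
    gpu: str
    decode_tok_s: float


def comparison_key(row: Row) -> tuple:
    # Everything that must match before two GPUs are compared.
    return (row.model, row.quant, row.backend,
            row.driver, row.power, row.workload)


def hardware_comparable(a: Row, b: Row) -> bool:
    # Different GPUs, identical everything else (strict equality assumed).
    return a.gpu != b.gpu and comparison_key(a) == comparison_key(b)


a = Row("14B-Instruct", "Q4_K_M", "llama.cpp opt-build (cuda)",
        "595", "450 W max", "mixed_2048_768", "GeForce RTX 3090", 78.9)
b = Row("14B-Instruct", "Q4_K_M", "llama.cpp b9174 (cuda)",
        "595", "300 W max", "mixed_2048_768", "GeForce RTX 5070", 64.6)

# The backend build and the power string differ, so strict matching rejects the pair.
print(hardware_comparable(a, b))
```

A looser rule (for example, comparing only the backend family or treating "max" as one policy) may be closer to what "match closely" means on this page; the strict version simply makes the mismatches visible.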

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy - Older workload harness row.
350 W cap - Recorded GPU power limit.
drv 590 - GPU driver branch.
reasoning - Reasoning-token model.
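As a rough illustration of reading the power badges, the sketch below parses badge strings such as `cap 350 W`, `450 W max`, and `cap 200 W × 2`, and expresses a cap as a fraction of a recorded maximum. The function names and the accepted badge grammar are assumptions inferred from the badges shown on this page, not a documented format.

```python
import re

# Assumed badge grammar: optional "cap " prefix, watts, optional "max"/"cap"
# suffix, optional "× N" GPU multiplier.
_BADGE = re.compile(
    r"^(?P<pre>cap )?(?P<watts>\d+) W(?: (?P<post>max|cap))?(?: × (?P<n>\d+))?$"
)


def parse_power_badge(badge: str) -> tuple[int, bool, int]:
    """Return (watts per GPU, is_capped, gpu_count) for a badge string."""
    m = _BADGE.match(badge.strip())
    if m is None:
        raise ValueError(f"unrecognized power badge: {badge!r}")
    capped = m["pre"] is not None or m["post"] == "cap"
    return int(m["watts"]), capped, int(m["n"] or 1)


def cap_fraction(cap_watts: int, max_watts: int) -> float:
    """Cap expressed relative to the recorded maximum power limit."""
    return cap_watts / max_watts


watts, capped, gpus = parse_power_badge("cap 350 W")
max_watts, _, _ = parse_power_badge("450 W max")
print(watts, capped, gpus, round(cap_fraction(watts, max_watts), 3))
```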
Metric guide
Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.
TTFT - Time to first token. This includes prompt processing and server/API overhead.
TPOT / ITL - Time per output token after the first token. Lower is better.
Raw Engine - llama-bench-style cases intended for hardware-normalized comparison across rigs.
Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.
Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
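The relationship between TTFT, TPOT, and decode tok/s in the guide above can be sketched from per-token arrival times. This is a minimal sketch under assumptions: the function name and the timestamp inputs are hypothetical, and the site's harness may compute these differently (for example, from engine-reported timings on raw rows).

```python
def latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT, TPOT, and decode tok/s from token arrival times (seconds)."""
    if not token_times:
        raise ValueError("need at least one output token")
    # TTFT covers prompt processing plus any server/API overhead.
    ttft = token_times[0] - request_start
    intervals = len(token_times) - 1
    if intervals == 0:
        return {"ttft_s": ttft, "tpot_s": None, "decode_tok_s": None}
    # TPOT is the mean gap between tokens after the first one.
    tpot = (token_times[-1] - token_times[0]) / intervals
    decode = 1.0 / tpot if tpot > 0 else None
    return {"ttft_s": ttft, "tpot_s": tpot, "decode_tok_s": decode}


# Five tokens: the first arrives after 0.5 s, then one every 0.125 s.
m = latency_metrics(0.0, [0.5, 0.625, 0.75, 0.875, 1.0])
print(m)  # {'ttft_s': 0.5, 'tpot_s': 0.125, 'decode_tok_s': 8.0}
```

Measured this way, decode tok/s excludes the first-token wait, which is why a row can show a high decode rate alongside a long latency figure.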