
Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 405 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/

Power rows are isolated here so normal model rankings are not swamped by intermediate cap sweeps and driver reruns.

Model | Tag | Quant | GPU | Power | Driver | Backend | Profile | Workload | Decode tok/s | Latency
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 42.6 | 214 ms
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 42.5 | 361 ms
27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | rag | 42.5 | 1.13 s
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 42.4 | 76 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 42.4 | 762 ms
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 42.3 | 280 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 42.2 | 358 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 42.2 | 759 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 42.1 | 312 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 42.1 | 507 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 42.0 | 519 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | chat | 42.0 | 61 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | chat | 41.9 | 61 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | chat | 41.9 | 61 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | chat | 41.9 | 58 ms
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 41.9 | 205 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 41.7 | 236 ms
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 41.7 | 339 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 41.7 | 237 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | rag | 41.7 | 53 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | rag | 41.6 | 53 ms
30b |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 41.6 | 281 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | rag | 41.6 | 53 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | rag | 41.6 | 53 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | codegen | 41.5 | 95 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | codegen | 41.5 | 95 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | codegen | 41.5 | 96 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | codegen | 41.5 | 96 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 41.4 | 785 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 41.4 | 305 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 41.4 | 857 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | agent | 41.4 | 54 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-350w | agent | 41.4 | 53 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | agent | 41.4 | 54 ms
32B-Instruct |  | AWQ | GeForce RTX 3090 · 24 GiB | cap 420 W | drv 590 | vLLM 0.21.0 (cuda) | baseline-pl-450w | agent | 41.3 | 53 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 41.3 | 302 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 41.3 | 506 ms
27B | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 41.2 | 500 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 41.1 | 255 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 40.8 | 760 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 40.7 | 360 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 40.6 | 514 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 40.3 | 235 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 40.0 | 784 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 40.0 | 311 ms
27B | think | Q4_K_XL | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 39.9 | 504 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 39.9 | 244 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 39.2 | 778 ms
14B-Instruct |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp opt-build (cuda) | pl-200w | mixed_64_1024 | 38.7 |
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 38.7 | 296 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 38.5 | 488 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 38.1 | 256 ms
14B-Instruct |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp opt-build (cuda) | pl-200w | mixed_1024_1024 | 37.9 |
14B-Instruct |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 595 | llama.cpp opt-build (cuda) | pl-200w | mixed_2048_768 | 37.9 |
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 37.8 | 751 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 37.7 | 348 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 37.7 | 235 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 37.6 | 533 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 37.4 | 251 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 37.2 | 802 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 37.1 | 895 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 37.1 | 323 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 37.1 | 344 ms
27B | think | Q5_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 37.0 | 516 ms
27B | think | Q3_K_M | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 37.0 | 505 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 130 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-130w | chat | 35.6 | 106 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 100 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-100w | chat | 35.6 | 116 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 120 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-120w | chat | 35.6 | 117 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 110 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-110w | chat | 35.6 | 116 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 130 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-130w | codegen | 35.3 | 137 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 120 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-120w | codegen | 35.3 | 215 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 100 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-100w | codegen | 35.3 | 210 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 110 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-110w | codegen | 35.2 | 216 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 130 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-130w | agent | 35.2 | 299 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 110 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-110w | agent | 35.1 | 352 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 110 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-110w | rag | 35.1 | 563 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 120 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-120w | rag | 35.1 | 501 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 130 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-130w | rag | 35.1 | 470 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 100 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-100w | rag | 35.1 | 503 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 120 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-120w | agent | 35.0 | 359 ms
4b-it |  | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 100 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-100w | agent | 34.9 | 348 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | chat | 34.2 | 283 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 33.6 | 238 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 33.4 | 789 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 33.3 | 371 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | 450 W max | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 33.2 | 533 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 32.9 | 251 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 32.6 | 804 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 32.5 | 326 ms
27B | think | Q6_K | GeForce RTX 3090 · 24 GiB | cap 350 W | drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 32.5 | 511 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | chat | 32.0 | 271 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | codegen | 31.8 | 377 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | rag | 31.2 | 1.05 s
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | agent | 31.2 | 621 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | codegen | 31.1 | 384 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | agent | 30.4 | 616 ms
27B-MTP | think | Q4_K_M | GeForce RTX 3090 · 24 GiB | cap 200 W | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | rag | 29.8 | 1.05 s
27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | chat | 27.1 | 238 ms
27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | codegen | 26.9 | 337 ms
27B-MTP | think | Q8_0 | 2× GeForce RTX 3090 · 24 GiB each | cap 200 W × 2 | drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | rag | 26.9 | 911 ms

Decode tok/s - Headline speed metric
TTFT / TPOT - Latency context
Raw vs workload - Separate comparison contracts
Notes badge key
hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy - Older workload harness row.
350 W cap - Recorded GPU power limit.
drv 590 - GPU driver branch.
reasoning - Reasoning-token model.
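
The "hardware comparable" rule above can be sketched as a field-by-field check. This is only an illustration: the BenchRow shape and its field names are hypothetical and not the site's schema, and the sketch simplifies "match closely" to exact equality on each field.

```python
from dataclasses import dataclass, replace


# Hypothetical row shape; field names are illustrative, not the site's schema.
@dataclass(frozen=True)
class BenchRow:
    gpu: str
    model: str
    quant: str
    backend: str
    driver: str
    power_profile: str
    workload: str
    decode_tok_s: float


# Fields that must agree before two rows are compared GPU-to-GPU.
# Assumption: exact equality stands in for "match closely".
MATCH_FIELDS = ("model", "quant", "backend", "driver", "power_profile", "workload")


def hardware_comparable(a: BenchRow, b: BenchRow) -> bool:
    """Return True when every non-GPU field listed in MATCH_FIELDS is identical."""
    return all(getattr(a, name) == getattr(b, name) for name in MATCH_FIELDS)


# Values taken from the 30b Q4_K_M codegen rows on this page.
full_power = BenchRow(
    gpu="GeForce RTX 3090",
    model="30b",
    quant="Q4_K_M",
    backend="llama.cpp cuda-4f13cb7",
    driver="590",
    power_profile="baseline-pl-450w",
    workload="codegen",
    decode_tok_s=42.6,
)
capped = replace(full_power, power_profile="baseline-pl-350w", decode_tok_s=41.9)

# Made-up second GPU; the rate is a placeholder, not a measured value.
other_gpu = replace(full_power, gpu="hypothetical second GPU", decode_tok_s=0.0)

print(hardware_comparable(full_power, capped))     # power profile differs
print(hardware_comparable(full_power, other_gpu))  # only the GPU differs
```

Under this check the 450 W and 350 W rows are not hardware comparable with each other, because the power profile differs even though the GPU is the same.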
Metric guide
Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.
TTFT - Time to first token. This includes prompt processing and server/API overhead.
TPOT / ITL - Time per output token after the first token. Lower is better.
Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.
Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.
Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
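
As a rough sketch of how the three latency figures in the guide relate, the following derives TTFT, TPOT, and decode tok/s from per-token arrival timestamps. The function name, the argument names, and the choice to average over token intervals are assumptions made for illustration; the page does not specify how its harness computes these values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class LatencyMetrics:
    ttft_s: float        # time to first token, in seconds
    tpot_s: float        # mean time per output token after the first, in seconds
    decode_tok_s: float  # generation rate over the decode phase


def latency_metrics(request_start: float, token_times: Sequence[float]) -> LatencyMetrics:
    """Derive TTFT, TPOT, and decode tok/s from token arrival timestamps.

    Assumption: token_times holds one monotonically increasing timestamp
    (in seconds) per output token, on the same clock as request_start.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode interval")
    ttft = token_times[0] - request_start
    decode_span = token_times[-1] - token_times[0]
    if decode_span <= 0:
        raise ValueError("token timestamps must be increasing")
    intervals = len(token_times) - 1  # gaps between consecutive tokens
    tpot = decode_span / intervals
    return LatencyMetrics(ttft_s=ttft, tpot_s=tpot, decode_tok_s=intervals / decode_span)


# Illustrative timestamps, not measured data: first token at 250 ms,
# then one token every 25 ms.
m = latency_metrics(0.0, [0.250, 0.275, 0.300, 0.325, 0.350])
print(f"TTFT {m.ttft_s * 1000:.0f} ms, TPOT {m.tpot_s * 1000:.1f} ms, {m.decode_tok_s:.1f} tok/s")
```

In this sketch decode tok/s is the reciprocal of TPOT, and TTFT is excluded from the decode rate, so a slow prompt phase does not lower the headline number.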