Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 405 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/

What the tabs show

Leaderboard: Curated model rankings using workload-style decode speed at the selected concurrency.

Hardware: Rig details, drivers, power limits, and hardware microbenchmarks separated from model rankings.

Raw engine: llama-bench style prompt/decode cases for the closest hardware-normalized comparison.

Power: Power-limit sweep rows showing how caps change decode speed and latency.

Explorer: Full row-level dataset with every suite, shape, mode, rerun, and technical metric.

Power rows are isolated here so normal model rankings are not swamped by intermediate cap sweeps and driver reruns.


Model	Quant	Hardware	Backend	Power profile	Workload	Decode tok/s	Latency
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	42.6	214ms
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	42.5	361ms
27B-MTPthink	Q8_0	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	rag	42.5	1.13s
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	42.4	76ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	42.4	762ms
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	42.3	280ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	42.2	358ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	42.2	759ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	42.1	312ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	42.1	507ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	42.0	519ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	chat	42.0	61ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	chat	41.9	61ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	chat	41.9	61ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	chat	41.9	58ms
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	41.9	205ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	41.7	236ms
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	41.7	339ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	41.7	237ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	rag	41.7	53ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	rag	41.6	53ms
30b	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	41.6	281ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	rag	41.6	53ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	rag	41.6	53ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	codegen	41.5	95ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	codegen	41.5	95ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	codegen	41.5	96ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	codegen	41.5	96ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	41.4	785ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	41.4	305ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	41.4	857ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	agent	41.4	54ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-350w	agent	41.4	53ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	agent	41.4	54ms
32B-Instruct	AWQ	GeForce RTX 3090 · 24 GiB · cap 420 W · drv 590	vLLM 0.21.0 (cuda)	baseline-pl-450w	agent	41.3	53ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	41.3	302ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	41.3	506ms
27Bthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	41.2	500ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	41.1	255ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	40.8	760ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	40.7	360ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	40.6	514ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	40.3	235ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	40.0	784ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	40.0	311ms
27Bthink	Q4_K_XL	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	39.9	504ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	39.9	244ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	39.2	778ms
14B-Instruct	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595	llama.cpp opt-build (cuda)	pl-200w	mixed_64_1024	38.7	—
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	38.7	296ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	38.5	488ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	38.1	256ms
14B-Instruct	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595	llama.cpp opt-build (cuda)	pl-200w	mixed_1024_1024	37.9	—
14B-Instruct	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 595	llama.cpp opt-build (cuda)	pl-200w	mixed_2048_768	37.9	—
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	37.8	751ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	37.7	348ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	37.7	235ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	37.6	533ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	37.4	251ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	37.2	802ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	37.1	895ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	37.1	323ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	37.1	344ms
27Bthink	Q5_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	37.0	516ms
27Bthink	Q3_K_M	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	37.0	505ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 130 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	chat	35.6	106ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 100 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	chat	35.6	116ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 120 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	chat	35.6	117ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 110 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	chat	35.6	116ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 130 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	codegen	35.3	137ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 120 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	codegen	35.3	215ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 100 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	codegen	35.3	210ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 110 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	codegen	35.2	216ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 130 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	agent	35.2	299ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 110 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	agent	35.1	352ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 110 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	rag	35.1	563ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 120 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	rag	35.1	501ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 130 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	rag	35.1	470ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 100 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	rag	35.1	503ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 120 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	agent	35.0	359ms
4b-it	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 100 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	agent	34.9	348ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	chat	34.2	283ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	33.6	238ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	33.4	789ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	33.3	371ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	33.2	533ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	32.9	251ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	32.6	804ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	32.5	326ms
27Bthink	Q6_K	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	32.5	511ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	chat	32.0	271ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	codegen	31.8	377ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	rag	31.2	1.05s
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	agent	31.2	621ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	codegen	31.1	384ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	agent	30.4	616ms
27B-MTPthink	Q4_K_M	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	rag	29.8	1.05s
27B-MTPthink	Q8_0	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	chat	27.1	238ms
27B-MTPthink	Q8_0	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	codegen	26.9	337ms
27B-MTPthink	Q8_0	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	rag	26.9	911ms
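
One way to read a power sweep is to ask how much decode speed survives a lower cap. The sketch below does that for the 27Bthink Q6_K rows above, with the tok/s values copied by hand. The helper names (`cap_watts`, `retention`) are illustrative, and reading the wattage out of the profile name is an assumption: on some rows the profile suffix and the recorded cap badge differ, so the badge should win when they disagree.

```python
# Decode tok/s for 27Bthink Q6_K on a single RTX 3090, copied from the rows above.
ROWS = [
    # (power profile, workload, decode tok/s)
    ("baseline-pl-450w", "chat", 33.6),
    ("baseline-pl-450w", "rag", 33.4),
    ("baseline-pl-450w", "codegen", 33.3),
    ("baseline-pl-450w", "agent", 33.2),
    ("baseline-pl-350w", "chat", 32.9),
    ("baseline-pl-350w", "rag", 32.6),
    ("baseline-pl-350w", "codegen", 32.5),
    ("baseline-pl-350w", "agent", 32.5),
]


def cap_watts(profile: str) -> int:
    """Read the wattage from a profile name such as 'baseline-pl-350w' (assumed naming)."""
    return int(profile.rsplit("-", 1)[-1].rstrip("w"))


def retention(rows, base_profile: str, capped_profile: str) -> dict:
    """Per-workload ratio of capped decode speed to baseline decode speed."""
    by_key = {(profile, workload): speed for profile, workload, speed in rows}
    workloads = sorted({w for p, w, _ in rows if p == base_profile})
    return {w: by_key[(capped_profile, w)] / by_key[(base_profile, w)] for w in workloads}


ratios = retention(ROWS, "baseline-pl-450w", "baseline-pl-350w")
power_fraction = cap_watts("baseline-pl-350w") / cap_watts("baseline-pl-450w")

for workload, ratio in ratios.items():
    print(f"{workload}: {ratio:.1%} of baseline speed at {power_fraction:.1%} of the power limit")
```

For these eight rows the ratios all land between 97% and 98% of baseline speed at roughly 78% of the power limit. That is a statement about this one model and quant, not a general rule for other rows.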

Decode tok/s: Headline speed metric.

TTFT / TPOT: Latency context.

Raw vs workload: Separate comparison contracts.

Notes badge key

hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy: Older workload harness row.

350 W cap: Recorded GPU power limit.

drv 590: GPU driver branch.

reasoning: Reasoning-token model.
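
The "hardware comparable" rule can be read as a field-by-field match on everything except the GPU. The sketch below is a strict version of that idea. The `Row` fields and the `hardware_comparable` helper are hypothetical names, not the dataset's schema, and exact equality is stricter than the "match closely" wording in the badge key.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Row:
    # Illustrative field names; the real row schema is not shown on this page.
    model: str
    quant: str
    backend: str
    driver: str
    profile: str
    workload: str
    gpu: str
    decode_tok_s: float


# Fields that must agree before a speed gap is read as a hardware difference.
MATCH_FIELDS = ("model", "quant", "backend", "driver", "profile", "workload")


def hardware_comparable(a: Row, b: Row) -> bool:
    """True when every non-GPU field in MATCH_FIELDS is identical."""
    return all(getattr(a, name) == getattr(b, name) for name in MATCH_FIELDS)


q4_k_m = Row("27Bthink", "Q4_K_M", "llama.cpp cuda-4f13cb7", "590",
             "baseline-pl-450w", "rag", "GeForce RTX 3090", 42.4)
q4_k_xl = replace(q4_k_m, quant="Q4_K_XL", decode_tok_s=40.8)
other_gpu = replace(q4_k_m, gpu="hypothetical second GPU")  # placeholder, not a measured row

print(hardware_comparable(q4_k_m, q4_k_xl))    # quant differs
print(hardware_comparable(q4_k_m, other_gpu))  # only the GPU differs
```

A looser check, for example matching on driver branch rather than exact driver version, would sit closer to the "stack comparable" badge.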

Metric guide

Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.

TTFT - Time to first token. This includes prompt processing and server/API overhead.

TPOT / ITL - Time per output token (inter-token latency) after the first token. Lower is better.

Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.

Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.

Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
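
The metrics above combine into a rough single-stream response-time estimate: TTFT covers the first token, and each later token costs one TPOT, which is approximately the inverse of decode tok/s. The sketch below is a back-of-envelope illustration with hypothetical helper names. It also assumes the millisecond column in the power table is TTFT, which the page does not state explicitly.

```python
def tpot_ms(decode_tok_s: float) -> float:
    """Approximate time per output token in milliseconds for a single stream."""
    return 1000.0 / decode_tok_s


def estimated_response_s(ttft_ms: float, decode_tok_s: float, output_tokens: int) -> float:
    """TTFT pays for the first token; every remaining token costs one TPOT."""
    if output_tokens < 1:
        raise ValueError("output_tokens must be at least 1")
    total_ms = ttft_ms + (output_tokens - 1) * tpot_ms(decode_tok_s)
    return total_ms / 1000.0


# 30b Q4_K_M codegen row at the 450 W baseline: 42.6 tok/s, 214 ms.
print(f"TPOT: {tpot_ms(42.6):.1f} ms")
print(f"256 output tokens: {estimated_response_s(214, 42.6, 256):.2f} s")
```

With those inputs the estimate is about 23.5 ms per token and about 6.2 s for 256 output tokens. Real workload rows also include server and cache effects, so treat this as an order-of-magnitude check.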