Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 835 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/



What the tabs show

Leaderboard: Curated model rankings using workload-style decode speed at the selected concurrency.

Hardware: Rig details, drivers, power limits, and hardware microbenchmarks separated from model rankings.

Raw engine: llama-bench style prompt/decode cases for the closest hardware-normalized comparison.

Power: Power-limit sweep rows showing how caps change decode speed and latency.

Explorer: Full row-level dataset with every suite, shape, mode, rerun, and technical metric.
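
The Leaderboard description can be read as a filter-then-rank step. The sketch below is one possible interpretation, not the site's actual code: the `Row` fields, the `leaderboard` function, and the choice of a mean over workloads are assumptions made for illustration. The sample values are copied from Explorer rows on this page (450 W max, llama.cpp cuda-4f13cb7), with the model label shortened to "27B".

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Row:
    # Hypothetical field names; the real schema is not shown on this page.
    model: str
    quant: str
    workload: str
    concurrency: int
    decode_tok_s: float


def leaderboard(rows, concurrency):
    """Rank (model, quant) pairs by mean decode tok/s at one concurrency."""
    groups = {}
    for r in rows:
        if r.concurrency == concurrency:
            groups.setdefault((r.model, r.quant), []).append(r.decode_tok_s)
    ranked = [(key, mean(vals)) for key, vals in groups.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)


sample_rows = [
    Row("27B", "Q2_K", "chat", 1, 47.2),
    Row("27B", "Q2_K", "rag", 1, 46.5),
    Row("27B", "Q6_K", "chat", 1, 33.6),
    Row("27B", "Q6_K", "rag", 1, 33.4),
    Row("27B", "Q2_K", "agent", 4, 45.6),  # excluded when concurrency=1
]

ranking = leaderboard(sample_rows, concurrency=1)
for (model, quant), speed in ranking:
    print(f"{model} {quant}: {speed:.2f} tok/s")
```

Filtering on concurrency before grouping keeps the concurrency-4 agent rows, whose TTFT is far higher, out of a concurrency-1 ranking.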


Full row-level explorer. This is the place for raw shapes, hardware probes, cache/ppfix rows, dense power caps, and reruns.


Model	Quant	Notes	Hardware	Backend	Mode	Workload	Concurrency	Decode tok/s	TTFT	TPOT (ms)
35B-A3B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	123.8	488ms	8.1	—	—	—
35B-A3B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	123.1	481ms	8.1	—	—	—
35B-A3B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	122.7	170ms	8.2	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	47.2	240ms	21.2	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	46.5	881ms	21.5	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	45.9	321ms	21.8	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	45.6	509ms	21.9	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	45.6	18.07s	21.9	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	44.4	258ms	22.5	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	43.8	830ms	22.8	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	43.6	334ms	22.9	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	43.5	588ms	23.0	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	43.5	18.71s	23.0	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	42.6	235ms	23.5	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	42.2	759ms	23.7	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	42.1	312ms	23.7	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	42.0	519ms	23.8	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	41.9	19.50s	23.8	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	41.7	237ms	24.0	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	41.4	857ms	24.2	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	41.3	302ms	24.2	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	41.2	500ms	24.3	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	41.2	19.80s	24.3	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	41.1	255ms	24.3	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	40.8	760ms	24.5	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	40.7	360ms	24.6	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	40.6	514ms	24.6	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	40.5	19.92s	24.7	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	40.3	235ms	24.8	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	40.0	784ms	25.0	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	40.0	311ms	25.0	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	39.9	504ms	25.1	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	39.9	20.43s	25.1	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	39.9	244ms	25.1	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	39.2	778ms	25.5	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	38.7	296ms	25.9	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	38.5	488ms	26.0	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	38.4	21.12s	26.0	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	38.1	256ms	26.2	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	37.8	751ms	26.4	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	37.7	348ms	26.5	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	37.7	235ms	26.5	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	37.6	533ms	26.6	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	37.6	21.75s	26.6	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	37.4	251ms	26.7	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	37.2	802ms	26.9	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	37.1	895ms	26.9	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	37.1	323ms	27.0	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	37.1	344ms	27.0	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	37.0	516ms	27.0	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	37.0	21.54s	27.0	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	37.0	505ms	27.1	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	36.9	21.69s	27.1	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	33.6	238ms	29.7	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	33.4	789ms	30.0	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	33.3	371ms	30.1	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	33.2	24.23s	30.2	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	33.2	533ms	30.2	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	32.9	251ms	30.4	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	32.6	804ms	30.7	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	32.5	326ms	30.8	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	32.5	511ms	30.8	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	32.4	24.79s	30.8	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	chat	1	32.0	271ms	0.1	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	codegen	1	31.8	377ms	0.1	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	agent	1	30.4	616ms	0.0	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	rag	1	29.8	1.05s	0.0	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	26.3	250ms	38.0	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	24.8	947ms	40.3	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	24.5	367ms	40.9	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	24.4	614ms	41.1	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	23.0	255ms	43.5	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	22.9	253ms	43.6	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	chat	1	22.8	266ms	43.9	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	rag	1	22.3	923ms	44.9	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	22.3	259ms	44.9	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	22.2	931ms	45.0	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	21.8	375ms	45.9	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	codegen	1	21.8	345ms	45.9	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	21.7	919ms	46.0	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	21.7	626ms	46.0	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	agent	1	21.6	609ms	46.2	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	21.4	938ms	46.8	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	21.3	358ms	47.0	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	21.2	607ms	47.2	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	21.1	356ms	47.3	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	21.1	595ms	47.4	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	19.9	268ms	50.1	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	19.3	1.20s	51.8	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	19.1	375ms	52.4	—	—	—
27B think	Q5_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	19.0	1.16s	52.7	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	16.2	281ms	61.5	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	15.8	1.25s	63.3	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	15.5	364ms	64.4	—	—	—
27B think	Q6_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	15.5	1.09s	64.5	—	—	—
27B think	Q2_K	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	12.8	4.01s	77.9	—	—	—
27B think	Q3_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	11.5	3.89s	87.3	—	—	—
27B think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	11.4	3.93s	87.5	—	—	—
27B-MTP think	Q4_K_M	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	agent	4	11.2	4.04s	89.5	—	—	—
27B think	Q4_K_XL	legacy · stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	10.9	4.02s	91.7	—	—	—
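
Latency cells in the rows above mix `ms` and `s` suffixes, and the per-token value that follows is roughly 1000 divided by decode tok/s for most rows. The sketch below normalizes the latency strings and checks that relationship. The helper names and the 1% tolerance are assumptions for illustration, not part of the benchmark harness.

```python
import math


def latency_to_ms(text: str) -> float:
    """Convert a latency cell such as '488ms' or '18.07s' to milliseconds."""
    text = text.strip()
    if text.endswith("ms"):  # test 'ms' first, since it also ends with 's'
        return float(text[:-2])
    if text.endswith("s"):
        return float(text[:-1]) * 1000.0
    raise ValueError(f"unrecognized latency cell: {text!r}")


def expected_tpot_ms(decode_tok_s: float) -> float:
    """Per-token time implied by a decode rate, in milliseconds."""
    return 1000.0 / decode_tok_s


def tpot_consistent(decode_tok_s: float, tpot_ms: float, rel_tol: float = 0.01) -> bool:
    """True when the reported per-token time roughly matches 1000 / decode tok/s."""
    return math.isclose(expected_tpot_ms(decode_tok_s), tpot_ms, rel_tol=rel_tol)


# Values copied from rows in the table above.
print(latency_to_ms("488ms"), latency_to_ms("18.07s"))
print(tpot_consistent(123.8, 8.1))   # 35B-A3B, rag
print(tpot_consistent(11.5, 87.3))   # 27B Q3_K_M, agent at concurrency 4
print(tpot_consistent(32.0, 0.1))    # mtp-2-pl-200w row: does not follow the pattern
```

A check like this flags the mtp-2-pl-200w rows, whose per-token values of 0.0 to 0.1 do not match their decode rates; the page does not say why.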

Decode tok/s: Headline speed metric

TTFT / TPOT: Latency context

Raw vs workload: Separate comparison contracts

Notes badge key

hardware comparable: Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable: Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic: Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy: Older workload harness row.

350 W cap: Recorded GPU power limit.

drv 590: GPU driver branch.

reasoning: Reasoning-token model.
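
The hardware comparable rule names six fields that should match before two rows are compared GPU to GPU. A minimal sketch of that check follows. The dictionary keys and the exact-equality test are assumptions: the text says "match closely", which the site may interpret more loosely. The sample values are taken from two 27B Q2_K chat rows in the Explorer table.

```python
# Hypothetical field names for the six attributes listed in the badge text.
MATCH_KEYS = ("model", "quant", "backend", "driver", "power_policy", "shape")


def mismatched_fields(a: dict, b: dict) -> list:
    """Return the comparison fields on which two rows differ."""
    return [key for key in MATCH_KEYS if a.get(key) != b.get(key)]


def hardware_comparable(a: dict, b: dict) -> bool:
    """True only when every listed field matches exactly."""
    return not mismatched_fields(a, b)


row_450w = {
    "model": "27B", "quant": "Q2_K", "backend": "llama.cpp cuda-4f13cb7 (cuda)",
    "driver": "590", "power_policy": "450 W max", "shape": "chat",
}
row_350w = dict(row_450w, power_policy="cap 350 W")

print(hardware_comparable(row_450w, row_450w))
print(mismatched_fields(row_450w, row_350w))
```

Returning the list of mismatched fields, and not just a boolean, makes it easier to see why two rows were excluded from a comparison.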

Metric guide

Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.

TTFT - Time to first token. This includes prompt processing and server/API overhead.

TPOT / ITL - Time per output token after the first token. Lower is better.

Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.

Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.

Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
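
As a rough illustration of the TTFT, TPOT, and decode tok/s definitions above, the sketch below derives all three from a request start time and per-token arrival times. The function name, input layout, and timestamps are made up for the example; the site's harness may compute these differently, for instance by averaging across requests.

```python
def stream_metrics(request_start: float, token_times: list) -> dict:
    """Derive TTFT, TPOT, and decode tok/s from timestamps given in seconds.

    token_times holds the arrival time of each output token, in order.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure TPOT")
    # TTFT covers prompt processing plus any server/API overhead.
    ttft_s = token_times[0] - request_start
    # TPOT averages the intervals after the first token.
    tpot_s = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return {
        "ttft_ms": ttft_s * 1000.0,
        "tpot_ms": tpot_s * 1000.0,
        "decode_tok_s": 1.0 / tpot_s,
    }


# Synthetic timestamps: first token after 250 ms, then one token every 50 ms.
metrics = stream_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(metrics)
```

Excluding the first token from the decode rate keeps prompt-processing time out of the headline number, which matches the split between TTFT and TPOT described in the guide.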