Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows835 matching source rowslatest run May 21, 2026schemas v1-v4source content/benchmarks/runs/

Leaderboard Hardware Raw engine Power Explorer

Explorer: Full row-level dataset with every suite, shape, mode, rerun, and technical metric.

What the tabs show

Leaderboard: Curated model rankings using workload-style decode speed at the selected concurrency.

Hardware: Rig details, drivers, power limits, and hardware microbenchmarks separated from model rankings.

Raw engine: llama-bench style prompt/decode cases for the closest hardware-normalized comparison.

Power: Power-limit sweep rows showing how caps change decode speed and latency.

Explorer: Full row-level dataset with every suite, shape, mode, rerun, and technical metric.

Filters

Advanced filters

Full row-level explorer. This is the place for raw shapes, hardware probes, cache/ppfix rows, dense power caps, and reruns.


4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 280 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-280w	chat	1	174.9	38ms	5.7	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 270 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-270w	chat	1	173.1	38ms	5.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 280 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-280w	codegen	1	171.4	128ms	5.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 260 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-260w	chat	1	171.3	37ms	5.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 280 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-280w	rag	1	170.6	367ms	5.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 270 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-270w	codegen	1	169.6	125ms	5.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 280 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-280w	agent	1	169.6	250ms	5.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 280 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-280w	agent	4	169.6	4.42s	5.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 250 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-250w	chat	1	169.2	37ms	5.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 270 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-270w	agent	1	168.0	229ms	6.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 270 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-270w	rag	1	168.0	361ms	6.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 270 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-270w	agent	4	168.0	4.30s	6.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 260 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-260w	codegen	1	166.9	124ms	6.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 260 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-260w	rag	1	166.7	340ms	6.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 260 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-260w	agent	4	165.1	4.30s	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 260 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-260w	agent	1	165.1	250ms	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 250 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-250w	codegen	1	164.9	131ms	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 240 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-240w	chat	1	164.7	40ms	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 250 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-250w	rag	1	164.6	328ms	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 250 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-250w	agent	4	163.5	4.37s	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 250 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-250w	agent	1	163.4	238ms	6.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 240 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-240w	codegen	1	162.1	117ms	6.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 230 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-230w	chat	1	161.8	38ms	6.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 240 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-240w	rag	1	161.3	389ms	6.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 240 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-240w	agent	1	161.0	230ms	6.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 220 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-220w	chat	1	161.0	39ms	6.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 240 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-240w	agent	4	160.6	3.87s	6.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 230 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-230w	rag	1	158.8	351ms	6.3	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 230 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-230w	codegen	1	158.4	120ms	6.3	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 230 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-230w	agent	4	156.8	4.80s	6.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 230 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-230w	agent	1	156.3	292ms	6.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 210 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-210w	chat	1	153.6	38ms	6.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 220 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-220w	rag	1	153.3	331ms	6.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 220 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-220w	agent	4	152.6	4.69s	6.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 220 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-220w	codegen	1	152.1	131ms	6.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 220 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-220w	agent	1	151.0	231ms	6.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 210 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-210w	rag	1	146.3	344ms	6.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-200w	chat	1	142.9	45ms	7.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 210 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-210w	codegen	1	142.4	129ms	7.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 210 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-210w	agent	4	142.1	5.34s	7.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 210 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-210w	agent	1	141.7	237ms	7.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-200w	rag	1	133.5	340ms	7.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-200w	agent	1	128.7	249ms	7.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-200w	agent	4	128.2	5.41s	7.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-200w	codegen	1	127.9	127ms	7.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 190 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-190w	chat	1	127.0	43ms	7.9	—	—	—
Flash	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	123.7	48ms	8.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 190 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-190w	rag	1	119.6	383ms	8.4	—	—	—
Flash	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	119.3	116ms	8.4	—	—	—
Flash	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	118.9	206ms	8.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 190 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-190w	agent	4	114.5	6.02s	8.7	—	—	—
Flash	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	114.5	237ms	8.7	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 190 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-190w	agent	1	113.9	248ms	8.8	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 190 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-190w	codegen	1	112.9	122ms	8.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 180 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-180w	chat	1	110.9	44ms	9.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 180 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-180w	rag	1	104.2	370ms	9.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 180 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-180w	agent	4	100.7	7.26s	9.9	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 180 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-180w	codegen	1	100.1	124ms	10.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 180 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-180w	agent	1	100.0	300ms	10.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 170 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-170w	chat	1	96.8	45ms	10.3	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 170 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-170w	rag	1	89.8	370ms	11.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 170 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-170w	agent	1	87.5	250ms	11.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 170 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-170w	agent	4	85.7	8.30s	11.7	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 170 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-170w	codegen	1	85.2	127ms	11.7	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 160 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-160w	chat	1	79.8	48ms	12.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 160 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-160w	rag	1	75.4	374ms	13.3	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 160 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-160w	agent	1	71.6	293ms	14.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 160 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-160w	agent	4	71.5	9.85s	14.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 160 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-160w	codegen	1	70.4	123ms	14.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 150 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-150w	chat	1	64.4	55ms	15.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 150 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-150w	rag	1	58.2	448ms	17.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 150 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-150w	agent	1	58.2	287ms	17.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 150 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-150w	codegen	1	58.1	131ms	17.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 150 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-150w	agent	4	58.0	9.98s	17.2	—	—	—
Flash	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	55.1	1.05s	18.2	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 140 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-140w	rag	1	52.6	429ms	19.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 140 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-140w	chat	1	50.9	68ms	19.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 140 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-140w	codegen	1	46.3	131ms	21.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 140 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-140w	agent	4	45.5	13.16s	22.0	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 140 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-140w	agent	1	44.3	314ms	22.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 130 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	chat	1	35.6	106ms	28.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 100 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	chat	1	35.6	116ms	28.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 120 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	chat	1	35.6	117ms	28.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 110 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	chat	1	35.6	116ms	28.1	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 130 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	codegen	1	35.3	137ms	28.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 120 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	codegen	1	35.3	215ms	28.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 100 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	codegen	1	35.3	210ms	28.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 110 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	codegen	1	35.2	216ms	28.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 130 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	agent	1	35.2	299ms	28.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 130 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	agent	4	35.2	17.74s	28.4	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 110 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	agent	1	35.1	352ms	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 110 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	agent	4	35.1	19.35s	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 110 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-110w	rag	1	35.1	563ms	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 120 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	rag	1	35.1	501ms	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 130 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-130w	rag	1	35.1	470ms	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 100 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	rag	1	35.1	503ms	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 100 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	agent	4	35.1	18.29s	28.5	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 120 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	agent	1	35.0	359ms	28.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 120 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-120w	agent	4	34.9	16.01s	28.6	—	—	—
4b-it	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 100 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-100w	agent	1	34.9	348ms	28.6	—	—	—

Decode tok/s

Headline speed metric

TTFT / TPOT

Latency context

Raw vs workload

Separate comparison contracts

Notes badge key

hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacyOlder workload harness row.

350 W capRecorded GPU power limit.

drv 590GPU driver branch.

reasoningReasoning-token model.

Metric guide

Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.

TTFT - Time to first token. This includes prompt processing and server/API overhead.

TPOT / ITL - Time per output token after the first token. Lower is better.

Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.

Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.

Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.