Benchmarks

Local LLM speed results across models, backends, hardware, and power profiles. Decode tok/s is the headline metric; latency, raw engine runs, and workload context stay visible in their own views.

1181 source rows · 1181 matching source rows · latest run May 21, 2026 · schemas v1-v4 · source content/benchmarks/runs/



What the tabs show

Leaderboard: Curated model rankings using workload-style decode speed at the selected concurrency.

Hardware: Rig details, drivers, power limits, and hardware microbenchmarks separated from model rankings.

Raw engine: llama-bench style prompt/decode cases for the closest hardware-normalized comparison.

Power: Power-limit sweep rows showing how caps change decode speed and latency.

Explorer: Full row-level dataset with every suite, shape, mode, rerun, and technical metric.
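
The Leaderboard description above can be illustrated with a small ranking sketch. This is not the site's actual ranking code: the tuple layout and the choice to average across workloads are assumptions made for illustration, and the numbers are copied from RTX 3090 rows (450 W max) in the Explorer table on this page.

```python
from collections import defaultdict

# Hypothetical row layout: (model, quant, workload, concurrency, decode_tok_s).
# Values are copied from RTX 3090 (450 W max) rows in the Explorer table.
ROWS = [
    ("35B-A3B", "Q4_K_M", "chat", 1, 148.5),
    ("35B-A3B", "Q4_K_M", "agent", 4, 146.6),
    ("27B", "Q2_K", "chat", 1, 47.2),
    ("27B", "Q2_K", "agent", 4, 45.6),
    ("27B", "Q4_K_M", "chat", 1, 42.6),
]


def leaderboard(rows, concurrency):
    """Rank (model, quant) pairs by mean decode tok/s at one concurrency level."""
    buckets = defaultdict(list)
    for model, quant, _workload, conc, decode in rows:
        if conc == concurrency:
            buckets[(model, quant)].append(decode)
    ranked = [(key, sum(vals) / len(vals)) for key, vals in buckets.items()]
    # Fastest first.
    return sorted(ranked, key=lambda item: item[1], reverse=True)


for (model, quant), speed in leaderboard(ROWS, concurrency=1):
    print(f"{model} {quant}: {speed:.1f} tok/s")
```

Averaging is only one possible aggregation; a median or a single reference workload would fit the same description.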


Full row-level explorer. This is the place for raw shapes, hardware probes, cache/ppfix rows, dense power caps, and reruns.


Model	Quant	Harness	Notes	Hardware	Engine	Profile	Workload	Concurrency	Decode tok/s	TTFT	TPOT (ms)
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	148.5	125ms	6.7	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	147.6	124ms	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	147.4	159ms	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	147.3	398ms	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	147.0	234ms	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	146.9	356ms	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	146.7	174ms	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	146.6	5.82s	6.8	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	146.0	222ms	6.9	—	—	—
35B-A3B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	145.7	5.83s	6.9	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	126.6	123ms	7.9	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	123.8	488ms	8.1	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	123.1	481ms	8.1	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	122.7	170ms	8.2	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	chat	1	48.9	149ms	20.4	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	codegen	1	48.9	197ms	20.4	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	agent	1	48.9	639ms	20.5	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	rag	1	48.8	631ms	20.5	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	47.2	240ms	21.2	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	46.5	881ms	21.5	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	45.9	321ms	21.8	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	45.6	509ms	21.9	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	45.6	18.07s	21.9	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	44.4	258ms	22.5	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	43.8	830ms	22.8	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	43.6	334ms	22.9	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	43.5	588ms	23.0	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	43.5	18.71s	23.0	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	42.6	235ms	23.5	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	42.2	759ms	23.7	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	42.1	312ms	23.7	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	42.0	519ms	23.8	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	41.9	19.50s	23.8	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	41.7	237ms	24.0	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	41.4	857ms	24.2	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	41.3	302ms	24.2	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	41.2	500ms	24.3	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	41.2	19.80s	24.3	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	39.9	244ms	25.1	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	39.2	778ms	25.5	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	38.7	296ms	25.9	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	38.5	488ms	26.0	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	38.4	21.12s	26.0	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	38.1	256ms	26.2	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	37.8	751ms	26.4	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	37.7	348ms	26.5	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	37.7	235ms	26.5	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	37.6	533ms	26.6	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	37.6	21.75s	26.6	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	37.4	251ms	26.7	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	37.2	802ms	26.9	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	37.1	895ms	26.9	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	37.1	323ms	27.0	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	37.1	344ms	27.0	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	37.0	516ms	27.0	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	37.0	21.54s	27.0	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	37.0	505ms	27.1	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	36.9	21.69s	27.1	—	—	—
27B · think	Q6_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	32.9	251ms	30.4	—	—	—
27B · think	Q6_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	32.6	804ms	30.7	—	—	—
27B · think	Q6_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	32.5	326ms	30.8	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	26.3	250ms	38.0	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	24.8	947ms	40.3	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	24.5	367ms	40.9	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	24.4	614ms	41.1	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	23.0	255ms	43.5	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	22.3	259ms	44.9	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	22.2	931ms	45.0	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	21.8	375ms	45.9	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	21.7	626ms	46.0	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	21.4	938ms	46.8	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	21.1	356ms	47.3	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	21.1	595ms	47.4	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	19.9	268ms	50.1	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	19.3	1.20s	51.8	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	19.1	375ms	52.4	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	19.0	1.16s	52.7	—	—	—
35B-A3B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	agent	4	18.9	1.29s	53.0	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	chat	1	14.4	333ms	69.4	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	codegen	1	14.3	413ms	69.7	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	agent	4	14.3	52.94s	69.9	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	rag	1	14.3	1.53s	70.0	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	agent	1	14.3	277ms	70.0	—	—	—
27B · think	Q2_K	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	12.8	4.01s	77.9	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	chat	1	12.1	352ms	82.4	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	codegen	1	12.1	435ms	82.7	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	agent	1	12.1	289ms	82.8	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	agent	4	12.1	62.69s	82.8	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	rag	1	12.1	1.53s	82.9	—	—	—
27B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	codegen	1	12.0	417ms	83.5	—	—	—
27B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	rag	1	12.0	1.73s	83.6	—	—	—
27B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	agent	1	12.0	1.75s	83.6	—	—	—
27B · think	Q3_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	11.5	3.89s	87.3	—	—	—
27B · think	Q4_K_M	legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	11.4	3.93s	87.5	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	chat	1	10.7	358ms	93.5	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	codegen	1	10.7	440ms	93.8	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	rag	1	10.7	1.61s	93.8	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	agent	4	10.7	70.91s	93.8	—	—	—
27B · think	Q5_K_M	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp rocm-4f13cb7 (rocm)	baseline	agent	1	10.7	294ms	93.8	—	—	—
27B · think	Q4_K_XL	legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b1203 (rocm)	baseline	agent	4	5.6	2.12s	178.9	—	—	—
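
In the rows above, the figure that follows the latency value tracks 1000 divided by the decode rate, which is what a per-output-token time in milliseconds would look like. The sketch below checks that relationship on a few rows and normalizes the mixed `ms`/`s` latency strings. The function names are hypothetical, and the small differences are assumed to come from rounding in the displayed decode rate.

```python
def tpot_ms(decode_tok_s: float) -> float:
    """Per-token time in milliseconds implied by a decode rate in tokens/second."""
    return 1000.0 / decode_tok_s


def parse_latency_s(text: str) -> float:
    """Convert a latency string such as '125ms' or '5.82s' to seconds."""
    text = text.strip()
    if text.endswith("ms"):  # check 'ms' first, since it also ends with 's'
        return float(text[:-2]) / 1000.0
    if text.endswith("s"):
        return float(text[:-1])
    raise ValueError(f"unrecognized latency format: {text!r}")


# (decode tok/s, latency string, per-token ms as published) copied from the table.
SAMPLES = [
    (148.5, "125ms", 6.7),
    (48.9, "149ms", 20.4),
    (12.8, "4.01s", 77.9),
    (5.6, "2.12s", 178.9),
]

for decode, latency, published in SAMPLES:
    implied = tpot_ms(decode)
    print(f"{decode:6.1f} tok/s  latency {parse_latency_s(latency):6.3f} s  "
          f"implied {implied:6.1f} ms  published {published:6.1f} ms")
```

The implied and published values agree to within about one percent on these rows, so either column can be derived from the other for a quick sanity check.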

Decode tok/s: Headline speed metric.

TTFT / TPOT: Latency context.

Raw vs workload: Separate comparison contracts.

Notes badge key

hardware comparable

Use these rows for GPU-to-GPU comparisons when the model, quant, backend, driver family, power policy, and benchmark shape match closely.

stack comparable

Use these rows to compare a similar software stack. They are useful, but backend, server path, driver, cache, or power settings may still influence the number.

stack realistic

Treat these as real workload measurements, not pure hardware rankings. They include prompt mix, API/server overhead, cache behavior, and local software details.

legacy: Older workload harness row.

350 W cap: Recorded GPU power limit.

drv 590: GPU driver branch.

reasoning: Reasoning-token model.
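
The "hardware comparable" rule above lists the attributes that need to match before two rows are compared GPU to GPU. A minimal sketch of that check follows; the `RunKey` type and its field names are hypothetical and do not reflect the site's row schema, and exact string equality stands in for "match closely".

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunKey:
    """Hypothetical subset of row attributes named in the comparability rule."""
    model: str
    quant: str
    backend: str
    driver_family: str
    power_policy: str
    shape: str


def mismatched_fields(a: RunKey, b: RunKey) -> list[str]:
    """Return the names of the attributes on which two rows differ."""
    return [f.name for f in fields(RunKey) if getattr(a, f.name) != getattr(b, f.name)]


def is_close_match(a: RunKey, b: RunKey) -> bool:
    """True when every listed attribute is identical."""
    return not mismatched_fields(a, b)


# Two 27B Q4_K_M chat rows from the table that differ only in power limit.
uncapped = RunKey("27B", "Q4_K_M", "llama.cpp cuda-4f13cb7 (cuda)", "590", "450 W max", "chat")
capped = RunKey("27B", "Q4_K_M", "llama.cpp cuda-4f13cb7 (cuda)", "590", "cap 350 W", "chat")

print(mismatched_fields(uncapped, capped))  # ['power_policy']
print(is_close_match(uncapped, capped))     # False
```

Reporting which fields differ, rather than a bare yes or no, makes it easier to see why a pair of rows falls short of a strict comparison.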

Metric guide

Decode tok/s - Generation rate. Raw rows come from the engine benchmark; API rows use token intervals when available.

TTFT - Time to first token. This includes prompt processing and server/API overhead.

TPOT / ITL - Time per output token after the first token. Lower is better.

Raw Engine - llama-bench style cases intended for hardware-normalized comparison across rigs.

Workload / API - Stack-realistic measurements that include backend, server, cache, driver, and prompt behavior.

Power badges - A cap badge shows the recorded power limit. The row metadata records the cap relative to the recorded max.
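
To make the cap-versus-max relationship concrete, the sketch below compares a 350 W cap against the 450 W max using two pairs of chat rows from the table. It assumes the 450 W figure is the recorded max for the capped rows, and the helper names are hypothetical. The 200 W rows are left out because they were run on a different llama.cpp build.

```python
def cap_fraction(cap_w: float, max_w: float) -> float:
    """Power cap expressed as a fraction of the recorded max."""
    return cap_w / max_w


def speed_retained(capped_tok_s: float, uncapped_tok_s: float) -> float:
    """Fraction of uncapped decode speed kept under the cap."""
    return capped_tok_s / uncapped_tok_s


# (label, decode tok/s at 450 W max, decode tok/s at cap 350 W), chat, concurrency 1.
PAIRS = [
    ("35B-A3B Q4_K_M", 148.5, 147.6),
    ("27B Q4_K_M", 42.6, 41.7),
]

frac = cap_fraction(350, 450)
print(f"cap is {frac:.1%} of max")
for label, uncapped, capped in PAIRS:
    kept = speed_retained(capped, uncapped)
    print(f"{label}: keeps {kept:.1%} of decode speed")
```

On these rows the cap removes a little over a fifth of the power budget while decode speed drops by only a few percent, though a single pair of runs per model is thin evidence for a general trend.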