Qwen3.6 27B-MTP

Q4_K_M · 27B params · GGUF

reasoning

checkpoint: unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M

All runs (127)


legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	codegen	1	63.7	63.7	195.4	—	335ms	0.1	—	62	1000	15.69s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	agent	1	61.7	61.7	1157.5	—	525ms	0.1	—	599	500	8.11s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	chat	1	59.5	59.5	119.7	—	259ms	0.1	—	30	100	1.68s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	codegen	1	59.2	59.2	175.8	—	372ms	0.1	—	62	1000	16.88s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	chat	1	58.7	58.7	123.8	—	259ms	0.1	—	30	100	1.70s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	agent	1	57.8	57.8	1208.4	—	537ms	0.1	—	599	500	8.66s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	rag	1	55.9	55.9	1186.1	—	879ms	0.1	—	842	200	3.58s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	rag	1	54.6	54.6	1096.3	—	934ms	0.1	—	842	200	3.66s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx1k_answer	1	50.2	50.2	656.8	—	1.48s	0.1	—	969	500	9.96s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx1k_answer	1	47.5	47.5	671.1	—	1.45s	0.1	—	969	500	10.54s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	chat	1	42.7	38.7	127.0	—	238ms	23.4	—	30	100	2.58s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	rag	1	42.3	35.6	1303.0	—	738ms	23.6	—	842	200	5.62s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	codegen	1	42.2	40.4	188.9	—	355ms	23.7	—	62	1000	24.76s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	agent	1	42.1	39.3	1196.7	—	505ms	23.8	—	599	500	12.73s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	agent	4	41.9	16.2	35.9	—	19.73s	23.8	—	599	500	32.01s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx1k_probe	1	40.1	6.3	870.2	—	1.09s	24.9	—	951	8	1.27s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx1k_answer	1	39.5	35.2	824.9	—	1.18s	25.3	—	969	500	14.21s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx4k_answer	1	39.1	30.8	1238.8	—	3.05s	25.5	—	3778	500	16.22s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx4k_answer	1	38.0	38.0	827.9	—	4.57s	0.1	—	3778	500	13.15s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx16k_probe	1	37.8	0.7	1350.7	—	11.21s	26.5	—	15136	8	11.40s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx16k_answer	1	37.6	20.2	1344.6	—	11.27s	26.6	—	15154	500	24.81s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx4k_answer	1	36.7	36.7	823.7	—	4.59s	0.1	—	3778	500	13.62s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx4k_probe	1	36.5	2.4	1256.4	—	3.05s	27.4	—	3835	8	3.29s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx32k_probe	1	36.1	0.3	1269.9	—	23.85s	27.7	—	30287	8	24.05s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx32k_answer	1	35.7	13.1	1269.6	—	23.87s	28.0	—	30305	500	38.03s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	chat	1	34.2	34.2	109.6	—	283ms	0.1	—	30	100	2.92s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx64k_answer	1	32.8	7.1	1107.2	—	54.68s	30.5	—	60601	500	70.34s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	chat	1	32.0	32.0	113.4	—	271ms	0.1	—	30	100	3.12s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx64k_probe	1	32.0	0.1	1098.8	—	55.14s	31.2	—	60585	8	55.37s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	codegen	1	31.8	31.8	185.9	—	377ms	0.1	—	62	1000	31.41s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	rag	1	31.2	31.2	856.1	—	1.05s	0.1	—	842	200	6.40s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	agent	1	31.2	31.2	964.4	—	621ms	0.1	—	599	500	16.03s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-3-pl-200w	codegen	1	31.1	31.1	182.4	—	384ms	0.1	—	62	1000	32.18s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	agent	1	30.4	30.4	972.1	—	616ms	0.0	—	599	500	16.44s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx100k_probe	1	29.9	0.1	939.3	—	100.77s	33.4	—	94645	8	101.01s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	mtp-2-pl-200w	rag	1	29.8	29.8	860.6	—	1.05s	0.0	—	842	200	6.71s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx100k_answer	1	29.1	4.2	937.4	—	100.91s	34.4	—	94590	500	118.33s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx128k_answer	1	27.4	3.1	852.8	—	142.02s	36.5	—	121117	500	160.72s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	baseline	ctx128k_probe	1	27.2	0.1	856.6	—	141.38s	36.8	—	121099	8	141.66s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	agent	4	25.3	25.3	57.3	—	12.42s	0.1	—	599	500	20.39s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	agent	4	24.0	24.0	59.7	—	13.43s	0.1	—	599	500	21.54s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	chat	1	22.8	21.1	112.9	—	266ms	43.9	—	30	100	4.74s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	rag	1	22.3	20.0	1035.7	—	923ms	44.9	—	842	200	10.01s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	codegen	1	21.8	21.5	191.7	—	345ms	45.9	—	62	1000	46.44s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	agent	1	21.6	21.0	984.2	—	609ms	46.2	—	599	500	23.77s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx1k_answer	1	21.3	21.3	286.6	—	3.38s	0.0	—	969	500	23.50s	0.013 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	chat	1	21.2	21.2	78.2	—	386ms	3.0	—	30	100	4.72s	0.006 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	codegen	1	20.6	20.6	133.5	—	475ms	0.1	—	62	1000	48.47s	0.041 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	agent	1	20.4	20.4	1984.4	—	302ms	0.0	—	599	500	24.51s	0.022 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	codegen	1	20.0	20.0	136.7	—	460ms	0.1	—	62	1000	49.96s	0.044 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	rag	1	19.9	19.9	426.2	—	1.69s	0.0	—	842	200	10.05s	0.011 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	chat	1	19.7	19.7	81.6	—	375ms	0.0	—	30	100	5.07s	0.006 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	agent	1	19.4	19.4	1949.4	—	315ms	0.0	—	599	500	25.73s	0.024 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx1k_answer	1	19.3	19.3	287.4	—	3.37s	0.0	—	969	500	25.91s	0.013 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	rag	1	18.8	18.8	432.5	—	1.68s	0.0	—	842	200	10.66s	0.012 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx16k_answer	1	17.2	17.2	791.3	—	19.15s	0.1	—	15154	500	29.01s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx16k_answer	1	17.2	17.2	790.1	—	19.18s	0.1	—	15154	500	29.05s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx4k_answer	1	15.3	15.3	303.8	—	12.44s	0.0	—	3778	500	32.64s	0.012 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx4k_answer	1	14.1	14.1	302.7	—	12.48s	0.0	—	3778	500	35.35s	0.014 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	chat	1	12.1	11.7	87.3	—	345ms	82.4	—	30	100	8.52s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	codegen	1	12.1	12.0	146.7	—	428ms	82.7	—	62	1000	83.17s	0.016 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	agent	1	12.1	12.0	2090.1	—	287ms	82.8	—	599	500	41.69s	0.009 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx1k_probe	1	12.1	2.3	336.1	—	2.83s	82.8	—	951	8	3.42s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	agent	4	12.1	5.0	12.6	—	62.69s	82.8	—	599	500	104.08s	0.037 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx1k_answer	1	12.1	11.3	319.1	—	3.04s	82.8	—	969	500	44.42s	0.005 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	rag	1	12.1	11.1	470.6	—	1.51s	82.8	—	842	200	18.01s	0.006 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx4k_answer	1	11.9	9.4	339.2	—	11.14s	83.7	—	3778	500	53.00s	0.006 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx4k_probe	1	11.7	0.7	340.5	—	11.26s	85.8	—	3835	8	11.89s	0.004 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx16k_answer	1	11.5	5.4	311.1	—	48.72s	86.7	—	15154	500	92.05s	0.006 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx16k_probe	1	11.5	0.2	312.4	—	48.44s	87.2	—	15136	8	49.07s	0.002 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 4f13cb7-mtp (cuda)	baseline-pl-200w	agent	4	11.2	9.8	—	—	4.04s	89.5	—	—	341	34.74s	0.040 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx32k_answer	1	11.0	3.2	276.3	—	109.69s	90.9	—	30305	500	155.13s	0.005 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx32k_probe	1	11.0	0.1	276.3	—	109.63s	91.1	—	30287	8	110.28s	0.001 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx64k_answer	1	10.1	1.6	224.9	—	269.20s	99.3	—	60601	500	318.82s	0.005 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx64k_probe	1	10.0	0.0	224.8	—	269.46s	99.5	—	60585	8	270.17s	0.002 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx32k_answer	1	9.9	9.9	737.1	—	41.12s	0.1	—	30305	500	50.70s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx32k_answer	1	9.8	9.8	739.3	—	40.99s	0.1	—	30305	500	51.28s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx100k_answer	1	9.2	0.9	185.8	—	509.02s	108.6	—	94590	500	563.37s	0.006 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx100k_probe	1	9.2	0.0	185.8	—	509.52s	108.8	—	94645	8	510.31s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx128k_answer	1	8.6	0.6	163.8	—	739.48s	115.9	—	121117	500	797.46s	0.005 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	baseline	ctx128k_probe	1	8.6	0.0	163.6	—	740.26s	115.9	—	121099	8	741.09s	0.001 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	agent	4	8.4	8.4	18.8	—	36.60s	0.1	—	599	500	61.49s	0.090 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	agent	4	8.2	8.2	21.9	—	38.32s	0.1	—	599	500	63.14s	0.097 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx16k_answer	1	6.4	6.4	280.1	—	54.12s	0.0	—	15154	500	77.75s	0.013 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx16k_answer	1	6.3	6.3	280.2	—	54.09s	0.0	—	15154	500	79.44s	0.014 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx1k_probe	1	5.4	5.4	693.4	—	1.37s	0.1	—	951	8	1.49s	0.010 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx1k_probe	1	5.3	5.3	694.6	—	1.37s	0.1	—	951	8	1.50s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx64k_answer	1	4.7	4.7	631.4	—	95.98s	0.1	—	60601	500	105.45s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx64k_answer	1	4.7	4.7	632.7	—	95.67s	0.1	—	60601	500	105.88s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx32k_answer	1	3.4	3.4	250.0	—	121.24s	0.0	—	30305	500	147.14s	0.013 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx32k_answer	1	3.4	3.4	249.5	—	121.48s	0.1	—	30305	500	148.56s	0.015 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx100k_answer	1	2.7	2.7	536.0	—	176.47s	0.1	—	94590	500	187.99s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx100k_answer	1	2.7	2.7	535.0	—	176.82s	0.1	—	94590	500	188.56s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx1k_probe	1	2.3	2.3	299.4	—	3.18s	0.0	—	951	8	3.50s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx1k_probe	1	2.3	2.3	299.5	—	3.17s	0.0	—	951	8	3.55s	0.003 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx128k_answer	1	1.9	1.9	478.8	—	252.98s	0.1	—	121117	500	263.24s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx128k_answer	1	1.9	1.9	476.8	—	254.00s	0.1	—	121117	500	265.22s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx4k_probe	1	1.7	1.7	833.5	—	4.60s	0.1	—	3835	8	4.72s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx4k_probe	1	1.7	1.7	821.7	—	4.67s	0.1	—	3835	8	4.81s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx64k_answer	1	1.5	1.5	204.8	—	295.96s	0.0	—	60601	500	322.74s	0.010 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx64k_answer	1	1.5	1.5	204.7	—	296.05s	0.0	—	60601	500	326.27s	0.009 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx100k_answer	1	0.9	0.9	170.1	—	556.16s	0.0	—	94590	500	585.80s	0.010 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx100k_answer	1	0.8	0.8	169.4	—	558.28s	0.0	—	94590	500	590.24s	0.010 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx4k_probe	1	0.6	0.6	302.9	—	12.66s	0.0	—	3835	8	12.98s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx4k_probe	1	0.6	0.6	304.1	—	12.61s	3.6	—	3835	8	13.02s	0.001 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx128k_answer	1	0.6	0.6	150.2	—	806.44s	0.0	—	121117	500	837.93s	0.010 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx128k_answer	1	0.6	0.6	149.5	—	810.10s	0.0	—	121117	500	844.87s	0.010 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx16k_probe	1	0.4	0.4	795.0	—	19.04s	0.1	—	15136	8	19.16s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx16k_probe	1	0.4	0.4	793.2	—	19.08s	0.0	—	15136	8	19.24s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx32k_probe	1	0.2	0.2	739.3	—	40.97s	0.1	—	30287	8	41.10s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx32k_probe	1	0.2	0.2	738.5	—	41.01s	0.0	—	30287	8	41.17s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx16k_probe	1	0.1	0.1	281.4	—	53.80s	0.1	—	15136	8	54.24s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx16k_probe	1	0.1	0.1	279.9	—	54.08s	0.0	—	15136	8	54.42s	0.003 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx64k_probe	1	0.1	0.1	635.1	—	95.39s	0.1	—	60585	8	95.58s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx64k_probe	1	0.1	0.1	633.4	—	95.65s	0.0	—	60585	8	95.82s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx32k_probe	1	0.1	0.1	250.1	—	121.13s	0.0	—	30287	8	121.58s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx32k_probe	1	0.1	0.1	249.8	—	121.26s	0.0	—	30287	8	121.61s	0.002 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx100k_probe	1	0.0	0.0	537.7	—	176.03s	0.1	—	94645	8	176.22s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx100k_probe	1	0.0	0.0	536.4	—	176.45s	0.0	—	94645	8	176.66s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=2	ctx128k_probe	1	0.0	0.0	477.9	—	253.42s	0.1	—	121099	8	253.61s	0.000 GiB
legacy	stack comparable	2× GeForce RTX 3090 · 24 GiB each · cap 200 W × 2 · drv 595	llama.cpp direct (cuda)	MTP n=3	ctx128k_probe	1	0.0	0.0	479.0	—	252.82s	0.0	—	121099	8	252.96s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx64k_probe	1	0.0	0.0	204.8	—	295.80s	0.0	—	60585	8	296.33s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx64k_probe	1	0.0	0.0	204.8	—	295.90s	0.0	—	60585	8	296.29s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx100k_probe	1	0.0	0.0	169.7	—	557.77s	0.0	—	94645	8	558.32s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx100k_probe	1	0.0	0.0	169.9	—	557.01s	0.0	—	94645	8	557.61s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=2	ctx128k_probe	1	0.0	0.0	149.9	—	807.88s	0.0	—	121099	8	808.42s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7	llama.cpp direct (rocm)	MTP n=3	ctx128k_probe	1	0.0	0.0	150.2	—	806.45s	0.0	—	121099	8	806.93s	0.002 GiB
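
The column header row did not survive in this listing, so the meaning of each numeric column has to be inferred. The sketch below checks one such inference against the single RTX 3090 (450 W) codegen rows: it assumes the second rate column is end-to-end throughput, that is, output tokens divided by total wall time, and that the last integer column is the output token count. Those column meanings and the helper names are assumptions, not something the source states.

```python
# Values copied from the single RTX 3090 (450 W) codegen rows above.
# Column meanings are inferred from the arithmetic, not given by the source.
CODEGEN_ROWS = {
    "baseline": {"output_tokens": 1000, "wall_s": 24.76, "reported_tps": 40.4},
    "MTP n=2": {"output_tokens": 1000, "wall_s": 15.69, "reported_tps": 63.7},
    "MTP n=3": {"output_tokens": 1000, "wall_s": 16.88, "reported_tps": 59.2},
}


def end_to_end_tps(row):
    """Output tokens per second over the whole request (assumed definition)."""
    return row["output_tokens"] / row["wall_s"]


def speedup(config, reference="baseline"):
    """Wall-time ratio of the reference run to the given config."""
    return CODEGEN_ROWS[reference]["wall_s"] / CODEGEN_ROWS[config]["wall_s"]


for name, row in CODEGEN_ROWS.items():
    print(f"{name}: {end_to_end_tps(row):.1f} tok/s, {speedup(name):.2f}x vs baseline")
```

Under that reading, the recomputed rates match the listed values to one decimal place. The same wall-time ratio for ctx16k_answer on the dual RTX 3090 rig is 24.81 / 29.01, about 0.86, so the MTP rows are slower than baseline there.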

Environment

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 200 W / 450 W max (44% cap)

hardware probes

copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps

384-bit · 9751 MHz · 82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
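
The "copy 42% of theory" figure can be recomputed from the probe table above. A minimal sketch, assuming it is simply the measured copy bandwidth divided by the theoretical bandwidth; `copy_efficiency` is a hypothetical helper name.

```python
# Probe values copied from the table above: cap -> (theory GB/s, copy GB/s).
PROBES_GBPS = {
    "200 W": (936, 391),
    "300 W": (936, 391),
    "450 W": (936, 391),
}


def copy_efficiency(theory_gbps, copy_gbps):
    """Fraction of theoretical memory bandwidth reached by the copy probe."""
    return copy_gbps / theory_gbps


efficiency_by_cap = {
    cap: copy_efficiency(theory, copy) for cap, (theory, copy) in PROBES_GBPS.items()
}
for cap, eff in efficiency_by_cap.items():
    print(f"{cap}: {eff:.1%} of theory")
```

The ratio is identical at all three caps, which is consistent with the "copy/math flat across caps" note.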

compute: 8.6

backend: llama.cpp 4f13cb7-mtp (cuda)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: 590.48.01

python: 3.12.3

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
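
The runs go through /v1/chat/completions with streaming enabled. The benchmark harness itself is not shown, so the following is only a sketch of how time to first token and throughput could be derived from chunk arrival times. `StreamStats`, `summarize_stream`, and the `(arrival_time_s, tokens_in_chunk)` event format are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_s: float          # time from request start to first chunk
    wall_s: float          # time from request start to last chunk
    output_tokens: int     # total tokens across all chunks
    end_to_end_tps: float  # output_tokens / wall_s
    decode_tps: float      # tokens after the first chunk / time after the first chunk


def summarize_stream(request_start_s, events):
    """Summarize one streamed response.

    events: list of (arrival_time_s, tokens_in_chunk) in arrival order.
    Hypothetical format; a real client would build it while reading the stream.
    """
    if not events:
        raise ValueError("no chunks received")
    first_t, first_n = events[0]
    last_t = events[-1][0]
    total = sum(n for _, n in events)
    wall = last_t - request_start_s
    decode_window = last_t - first_t
    decode_tps = (total - first_n) / decode_window if decode_window > 0 else float("nan")
    return StreamStats(
        ttft_s=first_t - request_start_s,
        wall_s=wall,
        output_tokens=total,
        end_to_end_tps=total / wall,
        decode_tps=decode_tps,
    )


stats = summarize_stream(0.0, [(0.5, 1), (1.0, 2), (1.5, 2)])
print(stats)
```

The sketch counts tokens per chunk instead of assuming one token per chunk, because with speculative decoding such as MTP a single chunk may carry several tokens.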

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 450 W / 450 W max

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1980/2100 MHz · mem 9501 MHz

temp: 38°C idle · 83°C peak

peak draw: 436 W

backend: llama.cpp cuda-4f13cb7 (cuda)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true

2× GeForce RTX 3090 · 24 GiB each

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090 × 2

arch: NVIDIA

vram: 48 GiB (system 64.0 GiB)

power: 200 W × 2 / 450 W × 2 max (44% cap)

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1800/2100 MHz · mem 9501 MHz

temp: 41°C idle · 53°C peak

peak draw: 195 W

backend: llama.cpp direct (cuda)

os: Ubuntu 24.04 LTS

driver: NVIDIA 595.71.05 + CUDA 13.2

libc: 2.39

python: 3.12.3

llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 3

warmups: 1

endpoint: /v1/chat/completions

streaming: true

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S

gpu: AMD Radeon 8060S

arch: Strix Halo (gfx1151)

vram: 96 GiB (system 31.1 GiB, unified)

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1158 MHz · mem 1000 MHz

temp: 47°C idle · 77°C peak

peak draw: 103 W

hardware probes

copy 41% of theory · FP16 peak 30.3 TF

256-bit · 8000 MHz · 20 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
fixed	256 GB/s	106 GB/s	30.3 TF	-
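
The "theory" bandwidth figures appear to follow from the bus width and memory clock shown in the hardware probes. A hedged sketch of that arithmetic: the `transfers_per_clock` factor is an assumption chosen so the listed numbers line up (2 for the RTX 3090's 9751 MHz figure, 1 for the 8000 MHz figure here, which is treated as an effective transfer rate), and the function name is hypothetical.

```python
def theoretical_bandwidth_gbps(bus_width_bits, clock_mhz, transfers_per_clock):
    """Peak memory bandwidth in GB/s from bus width and memory clock.

    transfers_per_clock is an assumption about how the listed clock relates
    to the effective transfer rate; it is not stated by the source.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Bus widths and clocks copied from the hardware probe lines.
rtx_3090 = theoretical_bandwidth_gbps(384, 9751, transfers_per_clock=2)
radeon_8060s = theoretical_bandwidth_gbps(256, 8000, transfers_per_clock=1)

print(f"RTX 3090: {rtx_3090:.0f} GB/s (table lists 936 GB/s)")
print(f"Radeon 8060S: {radeon_8060s:.0f} GB/s (table lists 256 GB/s)")
print(f"Radeon 8060S copy efficiency: {106 / radeon_8060s:.0%}")
```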

compute: 11.5

backend: llama.cpp direct (rocm)

os: Ubuntu 24.04 LTS

driver: ROCm 7.2.3

libc: 2.39

python: 3.12.3

llama.cpp: version: 1 (4f13cb7) built with Clang 22.0.0 for Linux x86_64

build flags: GGML_HIP=ON AMDGPU_TARGETS=gfx1151 CMAKE_BUILD_TYPE=Release

runs/cell: 3

warmups: 1

endpoint: /v1/chat/completions

streaming: true

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S

gpu: AMD Radeon 8060S

arch: Strix Halo (gfx1151)

vram: 96 GiB (system 31.1 GiB, unified)

backend: llama.cpp 4f13cb7-mtp (rocm)

os: Ubuntu 24.04 LTS

kernel: 7.0.2-2-pve

python: 3.12.3

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true