Qwen3.6 35B-A3B-MTP

Q4_K_M · 35B params · GGUF

reasoning

checkpoint: unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M

All runs (30)


legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	codegen	1	169.0	169.0	371.9	—	172ms	0.1	—	62	1000	5.92s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	agent	1	162.7	162.7	2760.7	—	231ms	0.1	—	599	500	3.07s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	agent	1	161.4	161.4	2506.1	—	239ms	0.1	—	599	500	3.10s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	codegen	1	160.4	160.4	332.9	—	177ms	0.1	—	62	1000	6.24s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	chat	1	148.6	120.0	236.4	—	127ms	6.7	—	30	100	834ms	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	codegen	1	148.4	135.6	424.9	—	148ms	6.7	—	62	1000	7.38s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	chat	1	148.3	148.3	208.5	—	149ms	0.1	—	30	100	674ms	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	rag	1	148.1	110.6	2719.8	—	338ms	6.8	—	842	200	1.81s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	agent	1	148.1	126.9	2637.4	—	227ms	6.8	—	599	500	3.94s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	agent	4	147.6	55.0	147.2	—	5.62s	6.8	—	599	500	9.37s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	chat	1	143.2	143.2	213.6	—	140ms	0.1	—	30	100	698ms	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	rag	1	136.6	136.6	2351.2	—	399ms	0.1	—	842	200	1.46s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	rag	1	122.1	122.1	2678.8	—	400ms	0.1	—	842	200	1.64s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	agent	1	70.6	70.6	4976.2	—	122ms	0.0	—	599	500	7.08s	0.014 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	agent	1	70.0	70.0	5621.5	—	107ms	0.0	—	599	500	7.14s	0.017 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	codegen	1	69.9	69.9	282.7	—	227ms	0.0	—	62	1000	14.30s	0.025 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	codegen	1	69.4	69.4	313.9	—	207ms	0.1	—	62	1000	14.41s	0.030 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	chat	1	69.4	69.4	195.1	—	158ms	0.0	—	30	100	1.44s	0.007 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=2	agent	4	68.4	68.4	140.4	—	4.79s	0.1	—	599	500	7.56s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	MTP n=3	agent	4	68.4	68.4	177.6	—	4.57s	0.1	—	599	500	7.52s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	chat	1	65.4	65.4	200.3	—	157ms	0.0	—	30	100	1.53s	0.007 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	rag	1	63.4	63.4	1129.2	—	627ms	0.0	—	842	200	3.15s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	rag	1	60.8	60.8	1153.1	—	610ms	0.0	—	842	200	3.29s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	codegen	1	52.9	52.3	331.9	—	189ms	18.9	—	62	1000	19.12s	0.014 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	agent	4	52.8	21.8	50.0	—	14.42s	19.0	—	599	500	23.90s	0.034 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	agent	1	52.7	52.1	5654.6	—	106ms	19.0	—	599	500	9.59s	0.009 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	chat	1	52.7	49.5	216.8	—	139ms	19.0	—	30	100	2.02s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	baseline	rag	1	52.5	46.2	1279.9	—	538ms	19.0	—	842	200	4.33s	0.020 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=3	agent	4	30.1	30.1	89.5	—	10.24s	0.0	—	599	500	17.10s	0.010 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp 4f13cb7-mtp (rocm)	MTP n=2	agent	4	29.7	29.7	66.0	—	10.48s	0.0	—	599	500	17.41s	0.065 GiB
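
The run table has no header row, so the column meanings used here are inferred rather than stated by the source. For single-request rows, the second throughput figure appears to match output tokens divided by total latency (for example 1000 tokens in 7.38s against a listed 135.6). The sketch below checks that reading against the three RTX 3090 codegen rows and derives an MTP-versus-baseline ratio from it. The names `ROWS`, `api_tok_s`, `end_to_end_rate`, and `speedup` are illustrative and do not come from the dashboard.

```python
# Values copied from the RTX 3090 codegen rows (one request in flight).
# ASSUMPTION: "api_tok_s" is the second throughput column, and the last
# three fields used here are output tokens and total latency in seconds.
ROWS = {
    "baseline": {"api_tok_s": 135.6, "output_tokens": 1000, "total_s": 7.38},
    "MTP n=2": {"api_tok_s": 169.0, "output_tokens": 1000, "total_s": 5.92},
    "MTP n=3": {"api_tok_s": 160.4, "output_tokens": 1000, "total_s": 6.24},
}


def end_to_end_rate(output_tokens: int, total_s: float) -> float:
    """Output tokens per second over the whole request, including prefill."""
    return output_tokens / total_s


def speedup(config: str, reference: str = "baseline") -> float:
    """Ratio of listed throughput for `config` over the reference row."""
    return ROWS[config]["api_tok_s"] / ROWS[reference]["api_tok_s"]


if __name__ == "__main__":
    for name, row in ROWS.items():
        derived = end_to_end_rate(row["output_tokens"], row["total_s"])
        print(
            f"{name}: listed {row['api_tok_s']:.1f} tok/s, "
            f"derived {derived:.1f} tok/s, "
            f"{speedup(name):.2f}x vs baseline"
        )
```

If the inferred reading holds, the listed and derived rates should agree to within rounding. Rows with four concurrent requests fit this reading less closely, so the check is limited to single-request rows.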

Environment

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 450 W / 450 W max

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1980/2100 MHz · mem 9501 MHz

temp: 42°C idle · 69°C peak

peak draw: 383 W

hardware probes

copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps

384-bit · 9751 MHz · 82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
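
The probe summary can be reproduced from the figures listed for this card. A minimal sketch, assuming the theoretical bandwidth is bus width times memory clock times two transfers per clock; that factor is an assumption, chosen because it reproduces the listed 936 GB/s, and the function names are illustrative.

```python
def theoretical_bandwidth_gb_s(
    bus_width_bits: int,
    mem_clock_mhz: float,
    transfers_per_clock: int = 2,  # ASSUMPTION: matches the listed 936 GB/s
) -> float:
    """Peak memory bandwidth in GB/s (decimal gigabytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def copy_efficiency(measured_gb_s: float, theory_gb_s: float) -> float:
    """Fraction of theoretical bandwidth reached by the copy probe."""
    return measured_gb_s / theory_gb_s


if __name__ == "__main__":
    # 384-bit bus and 9751 MHz are taken from the hardware probe line.
    theory = theoretical_bandwidth_gb_s(384, 9751)
    # 391 GB/s is the measured copy rate at every power cap in the table.
    efficiency = copy_efficiency(391, 936)
    print(f"theory {theory:.0f} GB/s, copy efficiency {efficiency:.0%}")
```

Because the copy and math figures are identical at 200 W, 300 W, and 450 W, the same ratio applies to every row of the cap table.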

compute: 8.6

backend: llama.cpp cuda-4f13cb7 (cuda)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
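
The harness settings above (5 runs per cell, 2 warmups, a streaming endpoint) suggest how per-request numbers such as time to first token could be derived and aggregated. The source does not describe the actual harness, so the following is only a sketch under stated assumptions: one timestamp per streamed token, warmup runs executed first and discarded, and the median as the aggregate. `stream_metrics` and `summarize_cell` are hypothetical names.

```python
from statistics import median


def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency figures from arrival times of streamed tokens.

    ASSUMPTION: `token_times` holds one timestamp (seconds) per output token.
    """
    total_s = token_times[-1] - request_start
    return {
        "ttft_s": token_times[0] - request_start,  # time to first token
        "total_s": total_s,                        # full request latency
        "tok_s": len(token_times) / total_s,       # end-to-end output rate
    }


def summarize_cell(runs: list[dict[str, float]], warmups: int = 2) -> dict[str, float]:
    """Drop the leading warmup runs, then take the median of each metric.

    ASSUMPTION: median aggregation; the dashboard's real method is unknown.
    """
    measured = runs[warmups:]
    return {key: median(run[key] for run in measured) for key in measured[0]}


if __name__ == "__main__":
    # Synthetic timings: 2 slow warmups followed by 5 measured runs.
    runs = [stream_metrics(0.0, [0.9, 1.8]) for _ in range(2)]
    runs += [stream_metrics(0.0, [0.1 * i, 0.2 * i]) for i in range(1, 6)]
    print(summarize_cell(runs, warmups=2))
```

Timestamps here are synthetic; in a real run they would come from the arrival of each streamed chunk on `/v1/chat/completions`.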

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S

gpu: AMD Radeon 8060S

arch: Strix Halo (gfx1151)

vram: 96 GiB (system 31.1 GiB, unified)

hardware probes

copy 41% of theory · FP16 peak 30.3 TF

256-bit · 8000 MHz · 20 SM/CU