
Qwen3.6 35B-A3B-MTP

Q4_K_M · 35B params · GGUF
reasoning
checkpoint: unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_M

All runs (30)

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=2 · codegen · 1
169.0
169.0 · 371.9 · 172ms · 0.1 · 62 · 1000 · 5.92s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=2 · agent · 1
162.7
162.7 · 2760.7 · 231ms · 0.1 · 5995 · 500 · 3.07s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=3 · agent · 1
161.4
161.4 · 2506.1 · 239ms · 0.1 · 5995 · 500 · 3.10s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=3 · codegen · 1
160.4
160.4 · 332.9 · 177ms · 0.1 · 62 · 1000 · 6.24s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · chat · 1
148.6
120.0 · 236.4 · 127ms · 6.7 · 30 · 100 · 834ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · codegen · 1
148.4
135.6 · 424.9 · 148ms · 6.7 · 62 · 1000 · 7.38s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=3 · chat · 1
148.3
148.3 · 208.5 · 149ms · 0.1 · 30 · 100 · 674ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · rag · 1
148.1
110.6 · 2719.8 · 338ms · 6.8 · 8422 · 200 · 1.81s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · agent · 1
148.1
126.9 · 2637.4 · 227ms · 6.8 · 5995 · 500 · 3.94s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · agent · 4
147.6
55.0 · 147.2 · 5.62s · 6.8 · 5995 · 500 · 9.37s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=2 · chat · 1
143.2
143.2 · 213.6 · 140ms · 0.1 · 30 · 100 · 698ms · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=3 · rag · 1
136.6
136.6 · 2351.2 · 399ms · 0.1 · 8422 · 200 · 1.46s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=2 · rag · 1
122.1
122.1 · 2678.8 · 400ms · 0.1 · 8422 · 200 · 1.64s · 0.000 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=3 · agent · 1
70.6
70.6 · 4976.2 · 122ms · 0.0 · 5995 · 500 · 7.08s · 0.014 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=2 · agent · 1
70.0
70.0 · 5621.5 · 107ms · 0.0 · 5995 · 500 · 7.14s · 0.017 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=3 · codegen · 1
69.9
69.9 · 282.7 · 227ms · 0.0 · 62 · 1000 · 14.30s · 0.025 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=2 · codegen · 1
69.4
69.4 · 313.9 · 207ms · 0.1 · 62 · 1000 · 14.41s · 0.030 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=3 · chat · 1
69.4
69.4 · 195.1 · 158ms · 0.0 · 30 · 100 · 1.44s · 0.007 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=2 · agent · 4
68.4
68.4 · 140.4 · 4.79s · 0.1 · 5995 · 500 · 7.56s · 0.000 GiB

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · MTP n=3 · agent · 4
68.4
68.4 · 177.6 · 4.57s · 0.1 · 5995 · 500 · 7.52s · 0.000 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=2 · chat · 1
65.4
65.4 · 200.3 · 157ms · 0.0 · 30 · 100 · 1.53s · 0.007 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=3 · rag · 1
63.4
63.4 · 1129.2 · 627ms · 0.0 · 8422 · 200 · 3.15s · 0.002 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=2 · rag · 1
60.8
60.8 · 1153.1 · 610ms · 0.0 · 8422 · 200 · 3.29s · 0.002 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · baseline · codegen · 1
52.9
52.3 · 331.9 · 189ms · 18.9 · 62 · 1000 · 19.12s · 0.014 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · baseline · agent · 4
52.8
21.8 · 50.0 · 14.42s · 19.0 · 5995 · 500 · 23.90s · 0.034 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · baseline · agent · 1
52.7
52.1 · 5654.6 · 106ms · 19.0 · 5995 · 500 · 9.59s · 0.009 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · baseline · chat · 1
52.7
49.5 · 216.8 · 139ms · 19.0 · 30 · 100 · 2.02s · 0.003 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · baseline · rag · 1
52.5
46.2 · 1279.9 · 538ms · 19.0 · 8422 · 200 · 4.33s · 0.020 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=3 · agent · 4
30.1
30.1 · 89.5 · 10.24s · 0.0 · 5995 · 500 · 17.10s · 0.010 GiB

legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp 4f13cb7-mtp (rocm) · MTP n=2 · agent · 4
29.7
29.7 · 66.0 · 10.48s · 0.0 · 5995 · 500 · 17.41s · 0.065 GiB
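To make the MTP-versus-baseline comparison concrete, here is a minimal sketch that computes speedup ratios from the RTX 3090 runs listed above (the single-stream runs, whose workload tag ends in 1). The page does not label its columns, so two things are assumptions here: that the headline figure and the first value on the metrics line are both decode throughput in tokens/s, and that dividing an MTP run by the matching baseline run is a fair comparison. The names `RUNS` and `speedup` are illustrative, not part of any benchmark tool.

```python
# (headline, first metrics-line value) for each RTX 3090 run, copied from
# the listing above. ASSUMPTION: both figures are decode throughput in
# tokens/s; the page itself does not label these columns.
RUNS = {
    "codegen": {"baseline": (148.4, 135.6), "MTP n=2": (169.0, 169.0), "MTP n=3": (160.4, 160.4)},
    "agent":   {"baseline": (148.1, 126.9), "MTP n=2": (162.7, 162.7), "MTP n=3": (161.4, 161.4)},
    "chat":    {"baseline": (148.6, 120.0), "MTP n=2": (143.2, 143.2), "MTP n=3": (148.3, 148.3)},
    "rag":     {"baseline": (148.1, 110.6), "MTP n=2": (122.1, 122.1), "MTP n=3": (136.6, 136.6)},
}


def speedup(runs, workload, mode, column=0):
    """Ratio of one run to the baseline run of the same workload.

    column=0 compares headline figures, column=1 compares the first
    value on the metrics line.
    """
    return runs[workload][mode][column] / runs[workload]["baseline"][column]


if __name__ == "__main__":
    for workload, modes in RUNS.items():
        for mode in modes:
            if mode == "baseline":
                continue
            print(
                f"{workload:8s} {mode}: "
                f"headline x{speedup(RUNS, workload, mode, 0):.3f}, "
                f"metrics line x{speedup(RUNS, workload, mode, 1):.3f}"
            )
```

The two columns can disagree noticeably for the same run, because the baseline cards show a lower first metrics-line value than their headline figure while the MTP cards show the same number in both places. Which ratio is the meaningful one depends on what each column measures, which this page does not state.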

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 42°C idle · 69°C peak
peak draw: 383 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap · theory · copy · fp16 · bf16
200 W · 936 GB/s · 391 GB/s · 65.4 TF · 65.4 TF
300 W · 936 GB/s · 391 GB/s · 65.4 TF · 65.3 TF
450 W · 936 GB/s · 391 GB/s · 65.4 TF · 65.4 TF
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
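The settings above (a streaming `/v1/chat/completions` endpoint, 5 runs per cell, 2 warmups) suggest how per-request latency figures could be derived. The harness behind this page is not shown, so the following is only a sketch under stated assumptions: chunks arrive as OpenAI-style server-sent-event lines with the text in `choices[0].delta.content`, time to first token is the gap between sending the request and the first chunk, and the median is used to aggregate measured runs. The helpers `parse_sse_line`, `stream_metrics`, and `summarize` are hypothetical names.

```python
import json
from statistics import median


def parse_sse_line(line):
    """Return the text delta in one OpenAI-style SSE line, or None.

    ASSUMPTION: the server emits 'data: {json}' lines and ends the stream
    with 'data: [DONE]', as OpenAI-compatible streaming endpoints do.
    """
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if payload == "[DONE]":
        return None
    choices = json.loads(payload).get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def stream_metrics(t_request, chunk_times, completion_tokens):
    """Derive latency figures from chunk arrival timestamps (in seconds)."""
    ttft = chunk_times[0] - t_request
    decode_window = chunk_times[-1] - chunk_times[0]
    # Every token after the first is produced inside the decode window.
    if decode_window > 0:
        decode_tps = (completion_tokens - 1) / decode_window
    else:
        decode_tps = float("nan")
    return {
        "ttft_s": ttft,
        "decode_tps": decode_tps,
        "e2e_s": chunk_times[-1] - t_request,
    }


def summarize(samples, warmups=2):
    """Discard the warmup runs, then aggregate the measured ones.

    ASSUMPTION: median is used here; the page does not say which
    statistic it reports.
    """
    return median(samples[warmups:])
```

With 2 warmups and 5 measured runs, `summarize` would receive seven samples per cell and report the middle value of the last five.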
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theory · FP16 peak 30.3 TF
256-bit · 8000 MHz · 20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap · theory · copy · fp16 · bf16
fixed · 256 GB/s · 106 GB/s · 30.3 TF · -
compute: 11.5
backend: llama.cpp 4f13cb7-mtp (rocm)
os: Ubuntu 24.04 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
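The "theory" and "copy % of theory" figures in both hardware-probe panels can be reproduced from the listed bus widths and memory clocks. The sketch below is a worked check, not the probe's actual code. It assumes that peak bandwidth is bus width in bytes times transfers per second, that the RTX 3090's listed 9751 MHz counts as two transfers per clock, and that the Strix Halo's listed 8000 MHz is already the transfer rate. Those multipliers are inferred because they reproduce the 936 GB/s and 256 GB/s shown above.

```python
def theoretical_bandwidth_gbs(bus_width_bits, transfers_per_second):
    """Peak memory bandwidth in GB/s: bytes per transfer x transfers per second."""
    return bus_width_bits / 8 * transfers_per_second / 1e9


def copy_efficiency(measured_gbs, theory_gbs):
    """Fraction of theoretical bandwidth achieved by the copy microbenchmark."""
    return measured_gbs / theory_gbs


# RTX 3090: 384-bit bus, 9751 MHz listed.
# ASSUMPTION: two transfers per listed clock.
rtx3090_theory = theoretical_bandwidth_gbs(384, 9751e6 * 2)

# Strix Halo: 256-bit bus, 8000 MHz listed.
# ASSUMPTION: the listed figure is already the transfer rate.
strix_theory = theoretical_bandwidth_gbs(256, 8000e6)

if __name__ == "__main__":
    print(f"RTX 3090:   {rtx3090_theory:.1f} GB/s theory, "
          f"{copy_efficiency(391, rtx3090_theory):.0%} achieved by copy")
    print(f"Strix Halo: {strix_theory:.1f} GB/s theory, "
          f"{copy_efficiency(106, strix_theory):.0%} achieved by copy")
```

Both devices land near 41 to 42% of theoretical bandwidth in the copy probe, which matches the percentages shown in their panels.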