Qwen3.5 35B-A3B
Q4_K_M·35B params·GGUF
reasoning
intelligence: see on Artificial Analysis →
checkpoint:
unsloth/Qwen3.5-35B-A3B-GGUF:Q4_K_Mcommit:
bc014a17be43weights 20.50 GiB
All runs (19)
| Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 137.1 | 410.6 | 159ms | 6.8 | 62 | 1000 | 7.26s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 136.1 | 395.9 | 174ms | 6.8 | 62 | 1000 | 7.28s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 129.3 | 2796.0 | 222ms | 6.9 | 599 | 500 | 3.87s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 127.3 | 2599.3 | 234ms | 6.8 | 599 | 500 | 3.93s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 124.5 | 240.2 | 125ms | 6.7 | 30 | 100 | 803ms | 0.000 GiB |
GeForce RTX 3090 · 24 GiB350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 120.1 | 248.3 | 124ms | 6.8 | 30 | 100 | 833ms | 0.000 GiB |
GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 119.3 | 392.9 | 170ms | 8.2 | 62 | 1000 | 8.37s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 110.4 | 2972.1 | 356ms | 6.8 | 842 | 200 | 1.81s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 109.8 | 244.3 | 123ms | 7.9 | 30 | 100 | 911ms | 0.000 GiB |
GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 109.1 | 1245.7 | 481ms | 8.1 | 599 | 500 | 4.58s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 108.3 | 3133.9 | 398ms | 6.8 | 842 | 200 | 1.85s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 94.2 | 1916.5 | 488ms | 8.1 | 842 | 200 | 2.12s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB450 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 54.9 | 119.5 | 5.82s | 6.8 | 599 | 500 | 9.41s | 0.000 GiB |
GeForce RTX 3090 · 24 GiB350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 54.3 | 118.3 | 5.83s | 6.9 | 599 | 500 | 9.52s | 0.000 GiB |
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | codegen | 1 | 48.3 | 352.0 | 197ms | 20.4 | 63 | 1000 | 20.65s | 0.002 GiB |
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | agent | 1 | 46.0 | 762.1 | 639ms | 20.5 | 599 | 500 | 10.87s | 0.005 GiB |
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | chat | 1 | 46.0 | 208.1 | 149ms | 20.4 | 31 | 100 | 2.17s | 0.002 GiB |
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | rag | 1 | 42.4 | 802.8 | 631ms | 20.5 | 1005 | 200 | 4.71s | 0.003 GiB |
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | agent | 4 | 17.6 | 462.9 | 1.29s | 53.0 | 599 | 500 | 28.43s | -0.003 GiB |
Environment
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power350 W / 450 W max(78% cap)
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1980/2100 MHz · mem 9501 MHz
temp41°C idle · 61°C peak
peak draw331 W
backendllama.cpp cuda-4f13cb7 (cuda)
serverlemonade unknown
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
containerizedtrue
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power450 W / 450 W max
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1965/2100 MHz · mem 9501 MHz
temp43°C idle · 67°C peak
peak draw363 W
backendllama.cpp cuda-4f13cb7 (cuda)
serverlemonade unknown
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
containerizedtrue
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power200 W / 450 W max(44% cap)
backendllama.cpp 59778f0 (cuda)
serverlemonade unknown
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driver590.48.01
python3.12.3
containerizedtrue
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpuAMD Radeon 8060S
archStrix Halo (gfx1151)
vram96 GiB (system 31.1 GiB, unified)
backendllama.cpp b1203 (rocm)
serverlemonade 10.4.0
osUbuntu 24.04.4 LTS
kernel7.0.2-2-pve
python3.12.3
containerizedtrue
runs/cell3
warmups1
endpoint/v1/chat/completions
streamingtrue