Qwen3.5 27B
Q4_K_M·27B params·GGUF
reasoning
intelligence: see on Artificial Analysis →
checkpoint:
unsloth/Qwen3.5-27B-GGUF:Q4_K_Mcommit:
3221f178a6b8weights 15.59 GiB
All runs (20)
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB450 W maxdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 42.7 | 37.8 | 116.4 | — | 258ms | 23.4 | — | 30 | 100 | 2.65s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB450 W maxdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 42.4 | 34.1 | 1311.6 | — | 762ms | 23.6 | — | 842 | 200 | 5.86s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB450 W maxdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 42.2 | 40.5 | 219.0 | — | 358ms | 23.7 | — | 62 | 1000 | 24.70s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB450 W maxdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 42.1 | 39.1 | 1180.7 | — | 507ms | 23.7 | — | 599 | 500 | 12.79s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB450 W maxdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 42.0 | 16.5 | 41.1 | — | 19.29s | 23.8 | — | 599 | 500 | 31.61s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 41.7 | 37.9 | 127.3 | — | 236ms | 24.0 | — | 30 | 100 | 2.64s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 41.4 | 35.2 | 1316.8 | — | 785ms | 24.1 | — | 842 | 200 | 5.68s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 41.4 | 39.8 | 202.4 | — | 305ms | 24.2 | — | 62 | 1000 | 25.14s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 41.3 | 38.4 | 1182.5 | — | 506ms | 24.2 | — | 599 | 500 | 13.01s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 350 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 41.3 | 16.1 | 31.4 | — | 19.67s | 24.2 | — | 599 | 500 | 32.24s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 22.9 | 21.2 | 126.7 | — | 253ms | 43.7 | — | 30 | 100 | 4.72s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 21.9 | 19.8 | 1037.8 | — | 912ms | 45.8 | — | 842 | 200 | 10.12s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 21.6 | 21.4 | 199.6 | — | 346ms | 46.3 | — | 62 | 1000 | 46.81s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 21.5 | 20.9 | 980.5 | — | 611ms | 46.5 | — | 599 | 500 | 23.89s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | chat | 1 | 12.0 | 11.6 | 96.9 | — | 330ms | 83.5 | — | 31 | 100 | 8.59s | 0.002 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | codegen | 1 | 12.0 | 11.9 | 172.4 | — | 417ms | 83.5 | — | 63 | 1000 | 84.02s | 0.003 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | rag | 1 | 12.0 | 10.9 | 294.6 | — | 1.73s | 83.6 | — | 1005 | 200 | 18.39s | 0.005 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | agent | 1 | 12.0 | 11.5 | 279.9 | — | 1.75s | 83.6 | — | 599 | 500 | 43.53s | 0.008 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 200 Wdrv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 4 | 10.9 | 9.7 | — | — | 3.91s | 91.5 | — | — | 341 | 35.41s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b1203 (rocm) | baseline | agent | 4 | 5.6 | 5.4 | 232.7 | — | 2.12s | 178.9 | — | 599 | 500 | 93.01s | -0.005 GiB |
Environment
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power350 W / 450 W max(78% cap)
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1965/2100 MHz · mem 9501 MHz
temp45°C idle · 64°C peak
peak draw334 W
hardware probes
copy 42% of theoryFP16 peak 65.4 TFcopy/math flat across caps
384-bit9751 MHz82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
compute: 8.6
backendllama.cpp cuda-4f13cb7 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power450 W / 450 W max
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1980/2100 MHz · mem 9501 MHz
temp40°C idle · 83°C peak
peak draw435 W
backendllama.cpp cuda-4f13cb7 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power200 W / 450 W max(44% cap)
backendllama.cpp 59778f0 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driver590.48.01
python3.12.3
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpuAMD Radeon 8060S
archStrix Halo (gfx1151)
vram96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theoryFP16 peak 30.3 TF
256-bit8000 MHz20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| fixed | 256 GB/s | 106 GB/s | 30.3 TF | - |
compute: 11.5
backendllama.cpp b1203 (rocm)
osUbuntu 24.04.4 LTS
kernel7.0.2-2-pve
python3.12.3
runs/cell3
warmups1
endpoint/v1/chat/completions
streamingtrue