Gemma-4 26B-A4B-it
Q4_K_M·26B params·256K ctx·GGUF
visiontool-callinghottool-callingvisionllamacpp
intelligence: see on Artificial Analysis →
checkpoint:
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_Mcommit:
b68961b3c96eweights 16.82 GiB · on-disk 16.90 GiB
All runs (20)
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 119.5 | 104.1 | 370.0 | — | 103ms | 8.4 | — | 36 | 100 | 960ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | codegen | 1 | 117.3 | 109.6 | 313.2 | — | 235ms | 8.5 | — | 71 | 1000 | 9.12s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | rag | 1 | 116.4 | 84.9 | 1004.8 | — | 634ms | 8.6 | — | 853 | 200 | 2.35s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 1 | 116.3 | 100.1 | 1229.9 | — | 426ms | 8.6 | — | 618 | 500 | 5.00s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 4 | 116.0 | 42.4 | 85.0 | — | 7.71s | 8.6 | — | 618 | 500 | 12.21s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b8940 (vulkan) | baseline | chat | 1 | 52.0 | 47.7 | 151.7 | — | 244ms | 19.2 | — | 37 | 100 | 2.10s | 0.001 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b8940 (vulkan) | baseline | codegen | 1 | 48.3 | 47.9 | 277.2 | — | 296ms | 20.7 | — | 71 | 1000 | 20.88s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b8940 (vulkan) | baseline | agent | 1 | 47.3 | 44.8 | 744.6 | — | 712ms | 21.1 | — | 618 | 500 | 11.15s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b8940 (vulkan) | baseline | rag | 1 | 47.2 | 43.2 | 1212.2 | — | 590ms | 21.2 | — | 1012 | 200 | 4.63s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified | llama.cpp b8940 (vulkan) | baseline | agent | 4 | 24.5 | 18.3 | 84.6 | — | 7.29s | 40.8 | — | 618 | 500 | 27.38s | 0.000 GiB |
Environment
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpuAMD Radeon 8060S
archStrix Halo (gfx1151)
vram96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theoryFP16 peak 30.3 TF
256-bit8000 MHz20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| fixed | 256 GB/s | 106 GB/s | 30.3 TF | - |
compute: 11.5
backendllama.cpp b8940 (cpu)
osUbuntu 24.04.4 LTS
kernel7.0.2-2-pve
python3.12.3
runs/cell3
warmups1
endpoint/v1/chat/completions
streamingtrue
GeForce RTX 3090 · 24 GiB
cpuAMD EPYC 7302P 16-Core Processor
gpuNVIDIA GeForce RTX 3090
archNVIDIA
vram24 GiB (system 64.0 GiB)
power300 W / 450 W max(67% cap)
pcieGen 4 x16 / Gen 4 x16 max
clocksgfx 1920/2100 MHz · mem 9501 MHz
temp52°C idle · 67°C peak
peak draw295 W
hardware probes
copy 42% of theoryFP16 peak 65.4 TFcopy/math flat across caps
384-bit9751 MHz82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
compute: 8.6
backendllama.cpp cuda-4f13cb7 (cuda)
osUbuntu 24.04 LTS
kernel6.17.13-7-pve
driverNVIDIA 590.48.01 + CUDA 13.1
libc2.39
python3.12.3
llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell5
warmups2
endpoint/v1/chat/completions
streamingtrue
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpuAMD Radeon 8060S
archStrix Halo (gfx1151)
vram96 GiB (system 31.1 GiB, unified)
backendllama.cpp b1203 (rocm)
osUbuntu 24.04.4 LTS
kernel7.0.2-2-pve
python3.12.3
runs/cell3
warmups1
endpoint/v1/chat/completions
streamingtrue
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpuAMD Radeon 8060S
archStrix Halo (gfx1151)
vram96 GiB (system 31.1 GiB, unified)
backendllama.cpp b8940 (vulkan)
osUbuntu 24.04.4 LTS
kernel7.0.2-2-pve
python3.12.3
runs/cell3
warmups1
endpoint/v1/chat/completions
streamingtrue