Gemma-4 26B-A4B-it

Q4_K_M·26B params·256K ctx·GGUF
vision·tool-calling·hot·llamacpp
checkpoint: unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_M
commit: b68961b3c96e
weights 16.82 GiB · on-disk 16.90 GiB

All runs (20)

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · chat · 1
119.5
104.1 · 370.0 · 103ms · 8.436 · 100 · 960ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · codegen · 1
117.3
109.6 · 313.2 · 235ms · 8.571 · 1000 · 9.12s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · rag · 1
116.4
84.9 · 1004.8 · 634ms · 8.6853200 · 2.35s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · agent · 1
116.3
100.1 · 1229.9 · 426ms · 8.6618500 · 5.00s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 300 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline · agent · 4
116.0
42.4 · 85.0 · 7.71s · 8.6618500 · 12.21s · 0.000 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (vulkan) · baseline · chat · 1
52.0
47.7 · 151.7 · 244ms · 19.237 · 100 · 2.10s · 0.001 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (vulkan) · baseline · codegen · 1
48.3
47.9 · 277.2 · 296ms · 20.771 · 1000 · 20.88s · 0.000 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (vulkan) · baseline · agent · 1
47.3
44.8 · 744.6 · 712ms · 21.1618500 · 11.15s · 0.000 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (vulkan) · baseline · rag · 1
47.2
43.2 · 1212.2 · 590ms · 21.21012200 · 4.63s · 0.000 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (vulkan) · baseline · agent · 4
24.5
18.3 · 84.6 · 7.29s · 40.8618500 · 27.38s · 0.000 GiB

Environment

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theory · FP16 peak 30.3 TF
256-bit · 8000 MHz · 20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
fixed   256 GB/s   106 GB/s   30.3 TF   -
compute: 11.5
backend: llama.cpp b8940 (cpu)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
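
The probe figures above are consistent with a simple bandwidth calculation. The following is a minimal sketch, assuming the listed 8000 MHz is the effective transfer rate in MT/s; the function names are illustrative and are not part of the benchmark harness.

```python
def theoretical_bandwidth_gbs(bus_width_bits: int, transfer_rate_mts: float) -> float:
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfer_rate_mts * 1e6 / 1e9


def copy_efficiency(measured_gbs: float, theory_gbs: float) -> float:
    """Measured copy bandwidth as a fraction of the theoretical peak."""
    return measured_gbs / theory_gbs


# Strix Halo values from the probe rows: 256-bit bus, 8000 MT/s (assumed), 106 GB/s copy.
theory = theoretical_bandwidth_gbs(256, 8000)
print(f"theory: {theory:.0f} GB/s")                          # theory: 256 GB/s
print(f"copy:   {copy_efficiency(106, theory):.0%} of theory")  # copy:   41% of theory
```

The same ratio applied to the RTX 3090 rows (391 GB/s copy against 936 GB/s theory) gives roughly 42%, matching the figure shown for that card.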
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 300 W / 450 W max (67% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1920/2100 MHz · mem 9501 MHz
temp: 52°C idle · 67°C peak
peak draw: 295 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap     theory     copy       fp16      bf16
200 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
300 W   936 GB/s   391 GB/s   65.4 TF   65.3 TF
450 W   936 GB/s   391 GB/s   65.4 TF   65.4 TF
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
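
The "copy/math flat across caps" note can be checked directly against the per-cap probe rows. This is a small sketch using the values listed above; the 1% tolerance and the helper names are assumptions chosen for illustration, not thresholds taken from the benchmark.

```python
# Probe results per power cap (W), copied from the rows above.
probes = {
    200: {"copy_gbs": 391, "fp16_tf": 65.4, "bf16_tf": 65.4},
    300: {"copy_gbs": 391, "fp16_tf": 65.4, "bf16_tf": 65.3},
    450: {"copy_gbs": 391, "fp16_tf": 65.4, "bf16_tf": 65.4},
}


def relative_spread(values):
    """(max - min) / max for a list of positive measurements."""
    return (max(values) - min(values)) / max(values)


def is_flat(probes, tolerance=0.01):
    """True if every metric varies by less than `tolerance` across caps."""
    metrics = next(iter(probes.values())).keys()
    return all(
        relative_spread([row[m] for row in probes.values()]) < tolerance
        for m in metrics
    )


print(is_flat(probes))      # True
print(f"{300 / 450:.0%}")   # 67%, the cap fraction shown in the power row
```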
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b1203 (rocm)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b8940 (vulkan)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
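
Each cell is measured against the streaming /v1/chat/completions endpoint, with a number of warmup runs before the counted runs. The sketch below shows one way time-to-first-token and decode rate could be derived from chunk arrival times. The function names, the one-token-per-chunk assumption, and the use of a plain mean are all illustrative assumptions, not the harness's actual method.

```python
from statistics import mean


def stream_metrics(request_start, chunk_times):
    """Derive TTFT (s) and decode rate (tok/s) from chunk arrival timestamps.

    Assumes one generated token per streamed chunk.
    """
    if len(chunk_times) < 2:
        raise ValueError("need at least two chunks to compute a decode rate")
    ttft = chunk_times[0] - request_start
    decode_window = chunk_times[-1] - chunk_times[0]
    decode_rate = (len(chunk_times) - 1) / decode_window
    return ttft, decode_rate


def summarize_cell(runs, warmups=1):
    """Drop the first `warmups` runs, then average the remaining ones."""
    measured = runs[warmups:]
    ttfts, rates = zip(*(stream_metrics(start, times) for start, times in measured))
    return mean(ttfts), mean(rates)


# Synthetic example: 0.25 s to first token, then one token every 0.02 s.
run = (0.0, [0.25 + 0.02 * i for i in range(101)])
ttft, rate = summarize_cell([run, run, run, run], warmups=1)
print(f"TTFT {ttft * 1000:.0f} ms, decode {rate:.1f} tok/s")
```

In a real client the timestamps would come from something like `time.perf_counter()` taken as each streamed chunk arrives; the synthetic list here only stands in for that.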