Gemma-4 E4B-it
Q4_K_M · 4B params · 128K ctx · GGUF
vision · tool-calling · llamacpp
checkpoint: unsloth/gemma-4-E4B-it-GGUF:Q4_K_M
commit: ce152932ac27
weights 5.56 GiB · on-disk 5.00 GiB
All runs (35)
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 148.2 | 135.9 | 608.2 | — | 59ms | 6.7 | — | 36 | 100 | 736ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 147.3 | 137.3 | 538.7 | — | 69ms | 6.8 | — | 36 | 100 | 728ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 146.2 | 136.6 | 728.2 | — | 98ms | 6.8 | — | 71 | 1000 | 7.32s | 0.010 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 145.3 | 118.9 | 4744.5 | — | 180ms | 6.9 | — | 853 | 200 | 1.68s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 145.1 | 137.3 | 742.0 | — | 96ms | 6.9 | — | 71 | 1000 | 7.29s | 0.010 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 144.8 | 56.1 | 139.4 | — | 5.62s | 6.9 | — | 618 | 500 | 9.23s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 144.6 | 122.5 | 5377.9 | — | 196ms | 6.9 | — | 853 | 200 | 1.63s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · 450 W max · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 144.4 | 133.4 | 4191.2 | — | 148ms | 6.9 | — | 618 | 500 | 3.75s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 143.9 | 129.8 | 4493.2 | — | 138ms | 6.9 | — | 618 | 500 | 3.85s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 143.8 | 55.2 | 136.9 | — | 5.68s | 7.0 | — | 618 | 500 | 9.39s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | chat | 1 | 129.8 | 124.3 | 612.5 | — | 59ms | 7.7 | — | 36 | 100 | 804ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | codegen | 1 | 128.1 | 126.8 | 1069.1 | — | 67ms | 7.8 | — | 71 | 1000 | 7.89s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | rag | 1 | 127.0 | 120.5 | 6792.3 | — | 118ms | 7.9 | — | 853 | 200 | 1.66s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 1 | 124.8 | 123.2 | 10548.2 | — | 59ms | 8.0 | — | 618 | 500 | 4.06s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 123.6 | 118.4 | 545.3 | — | 66ms | 8.1 | — | 36 | 100 | 845ms | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 119.3 | 101.9 | 3051.7 | — | 307ms | 8.4 | — | 853 | 200 | 1.96s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 118.7 | 117.2 | 703.3 | — | 101ms | 8.4 | — | 71 | 1000 | 8.53s | 0.010 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 116.2 | 111.7 | 3614.7 | — | 171ms | 8.6 | — | 618 | 500 | 4.48s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 75.0 | 59.0 | 294.6 | — | 1.88s | 13.3 | — | 618 | 500 | 8.48s | 0.000 GiB |
| legacy | stack comparable | GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 4 | 64.0 | 52.2 | 460.0 | — | 1.77s | 15.6 | — | 618 | 500 | 9.58s | -0.010 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (vulkan) | baseline | chat | 1 | 55.3 | 52.9 | 252.3 | — | 148ms | 18.1 | — | 37 | 100 | 1.89s | 0.001 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (vulkan) | baseline | codegen | 1 | 54.1 | 53.8 | 457.4 | — | 170ms | 18.5 | — | 71 | 1000 | 18.58s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (vulkan) | baseline | agent | 1 | 53.7 | 51.6 | 1210.8 | — | 446ms | 18.6 | — | 618 | 500 | 9.69s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (vulkan) | baseline | rag | 1 | 53.2 | 50.4 | 2234.5 | — | 347ms | 18.8 | — | 1012 | 200 | 3.97s | 0.000 GiB |
| legacy | stack comparable | Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b8940 (vulkan) | baseline | agent | 4 | 35.1 | 25.5 | 143.3 | — | 5.18s | 28.5 | — | 618 | 500 | 19.63s | 0.001 GiB |
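
The numeric columns in the rows above have no header in this capture. The sketch below checks one plausible reading against the first row. The column meanings are assumptions, not something the source states: the first rate is taken as decode tokens/s, the small value after the latency column as milliseconds per token, and the last count and duration as output tokens and wall time. Under that reading the first row's figures agree with each other.

```python
# Values copied from the first row (RTX 3090, 450 W max, chat, concurrency 1).
# Column meanings are ASSUMED; the captured table has no header row.
decode_tok_s = 148.2   # assumed: decode throughput, tokens/s
ms_per_token = 6.7     # assumed: time per output token, ms
e2e_tok_s = 135.9      # assumed: output tokens divided by wall time
output_tokens = 100    # assumed: generated tokens per request
wall_time_s = 0.736    # assumed: end-to-end request time (736ms)


def ms_per_tok(tok_s: float) -> float:
    """Convert a token rate into milliseconds per token."""
    return 1000.0 / tok_s


def end_to_end_rate(tokens: int, seconds: float) -> float:
    """Tokens per second over the whole request, including prompt processing."""
    return tokens / seconds


# Both derived values round to the figures printed in the row.
print(round(ms_per_tok(decode_tok_s), 1))                     # 6.7
print(round(end_to_end_rate(output_tokens, wall_time_s), 1))  # 135.9
```

If this reading is right, the gap between the two rates is the cost of prompt processing and request overhead, which would explain why it widens on the long-prompt rows.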
Environment
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theory · FP16 peak 30.3 TF
256-bit · 8000 MHz · 20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| fixed | 256 GB/s | 106 GB/s | 30.3 TF | - |
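
The "theory" and "copy 41% of theory" figures for this card can be reproduced from the bus width and memory clock shown with the probes. This is a minimal sketch, assuming the listed 8000 MHz is already the effective transfer rate. The function names and the `transfers_per_clock` parameter are illustrative and do not come from the benchmark tool.

```python
def theoretical_gb_s(bus_bits: int, mem_mhz: float, transfers_per_clock: int = 1) -> float:
    """Peak memory bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_bits / 8
    transfers_per_second = mem_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def copy_efficiency(copy_gb_s: float, theory_gb_s: float) -> float:
    """Fraction of theoretical bandwidth reached by the copy microbenchmark."""
    return copy_gb_s / theory_gb_s


# Strix Halo card: 256-bit bus, 8000 MHz, measured copy 106 GB/s.
theory = theoretical_gb_s(256, 8000)
print(theory)                                      # 256.0
print(round(100 * copy_efficiency(106, theory)))   # 41
```

The RTX 3090 figures elsewhere on this page (384-bit, 9751 MHz, 936 GB/s) only match this formula with `transfers_per_clock=2`, so the clock convention appears to differ between cards.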
compute: 11.5
backend: llama.cpp b8940 (cpu)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
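
The card above lists one warmup and three runs per cell. The page does not say how those runs are combined into a single reported figure, so the following is only a sketch of one common approach: discard the warmup samples, then summarize the rest. The function name and the sample values are hypothetical.

```python
from statistics import mean, median


def summarize(samples: list[float], warmups: int) -> dict[str, float]:
    """Drop the first `warmups` samples and summarize the remaining runs."""
    measured = samples[warmups:]
    if not measured:
        raise ValueError("no measured runs left after discarding warmups")
    return {
        "mean": mean(measured),
        "median": median(measured),
        "min": min(measured),
        "max": max(measured),
    }


# Hypothetical decode rates (tokens/s): 1 warmup followed by 3 measured runs.
samples = [48.0, 55.0, 55.5, 55.4]
stats = summarize(samples, warmups=1)
print(stats["median"])  # 55.4
```

Reporting the median rather than the mean keeps a single slow run from skewing a cell when only three to five runs are taken.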
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 43°C idle · 62°C peak
peak draw: 341 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
| 300 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.3 TF |
| 450 W | 936 GB/s | 391 GB/s | 65.4 TF | 65.4 TF |
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 53°C idle · 74°C peak
peak draw: 406 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b1203 (rocm)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
| cap | theory | copy | fp16 | bf16 |
|---|---|---|---|---|
| 200 W | 672 GB/s | 271 GB/s | 67.9 TF | 68.4 TF |
| 250 W | 672 GB/s | 271 GB/s | 69.5 TF | 68.2 TF |
| 300 W | 672 GB/s | 270 GB/s | 69.6 TF | 68.4 TF |
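
The "copy/math spread 2.5%" summary for this card can be reproduced from the table above if spread is taken as (max − min) / min across the three power caps. That definition is an assumption: the page does not state the formula, and dividing by the mean gives nearly the same figure. The helper name is illustrative.

```python
def spread_pct(values: list[float]) -> float:
    """Relative spread across power caps, as a percentage of the lowest value."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo * 100.0


# RTX 5070 probe results at the 200 W, 250 W, and 300 W caps (from the table).
rtx5070 = {
    "copy_gb_s": [271, 271, 270],
    "fp16_tf": [67.9, 69.5, 69.6],
    "bf16_tf": [68.4, 68.2, 68.4],
}

spreads = {name: round(spread_pct(vals), 1) for name, vals in rtx5070.items()}
print(spreads)                 # {'copy_gb_s': 0.4, 'fp16_tf': 2.5, 'bf16_tf': 0.3}
print(max(spreads.values()))   # 2.5
```

Under this reading the headline number is driven by FP16 math alone, while memory copy is essentially unaffected by the power cap.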
compute: 12
backend: llama.cpp b9174 (vulkan)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b8940 (vulkan)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true