Gemma-4 E4B-it
Q4_K_M · 4B params · 128K ctx · GGUF
vision · tool-calling · llamacpp
checkpoint: unsloth/gemma-4-E4B-it-GGUF:Q4_K_M
commit: ce152932ac27
weights 5.56 GiB · on-disk 5.00 GiB
All runs (25)
| Hardware | Backend | Shape | Conc. | Gen tok/s ↓ | TTFT | TPOT (ms) | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | codegen | 1 | 126.8 | 67ms | 7.8 | 1000 | 7.89s | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | chat | 1 | 124.3 | 59ms | 7.7 | 100 | 804ms | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | agent | 1 | 123.2 | 59ms | 8.0 | 500 | 4.06s | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | rag | 1 | 120.5 | 118ms | 7.9 | 200 | 1.66s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | chat | 1 | 118.4 | 66ms | 8.1 | 100 | 845ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | codegen | 1 | 117.2 | 101ms | 8.4 | 1000 | 8.53s | 0.010 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | agent | 1 | 111.7 | 171ms | 8.6 | 500 | 4.48s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | rag | 1 | 101.9 | 307ms | 8.4 | 200 | 1.96s | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | agent | 4 | 59.0 | 1.88s | 13.3 | 500 | 8.48s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | codegen | 1 | 53.8 | 170ms | 18.5 | 1000 | 18.58s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | chat | 1 | 52.9 | 148ms | 18.1 | 100 | 1.89s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | codegen | 1 | 52.4 | 164ms | 18.9 | 996 | 18.98s | 0.001 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | agent | 4 | 52.2 | 1.77s | 15.6 | 500 | 9.58s | -0.010 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | agent | 1 | 51.6 | 446ms | 18.6 | 500 | 9.69s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | rag | 1 | 50.4 | 347ms | 18.8 | 200 | 3.97s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | chat | 1 | 50.3 | 141ms | 18.6 | 97 | 1.93s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | agent | 1 | 49.8 | 561ms | 19.1 | 497 | 9.98s | 0.005 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | rag | 1 | 48.1 | 382ms | 19.2 | 197 | 4.09s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (vulkan) | agent | 4 | 25.5 | 5.18s | 28.5 | 500 | 19.63s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | agent | 4 | 19.0 | 3.28s | 47.0 | 497 | 26.21s | 0.009 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | codegen | 1 | 10.2 | 1.19s | 96.6 | 1000 | 98.34s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | chat | 1 | 10.0 | 844ms | 94.3 | 100 | 9.99s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | agent | 1 | 9.1 | 6.49s | 97.5 | 500 | 55.13s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | rag | 1 | 8.6 | 4.16s | 99.3 | 200 | 23.25s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (cpu) | agent | 4 | 6.0 | 8.71s | 149.9 | 500 | 83.66s | 0.000 GiB |
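The columns in the table above appear to be related: in the rows checked, Gen tok/s matches Out tok divided by Total, and Total is close to TTFT plus TPOT times the remaining output tokens. The sketch below checks this on three rows copied from the table. The formulas are inferred from the reported numbers rather than taken from the benchmark harness, and the function names are illustrative.

```python
# Illustrative check of how the table columns relate. The formulas are
# inferred from the reported numbers, not taken from the benchmark harness.

def gen_tok_per_s(out_tok: int, total_s: float) -> float:
    """Per-request generation throughput: output tokens over wall time."""
    return out_tok / total_s


def estimated_total_s(ttft_s: float, tpot_ms: float, out_tok: int) -> float:
    """Approximate wall time: first token, then (out_tok - 1) decode steps."""
    return ttft_s + (tpot_ms / 1000.0) * (out_tok - 1)


# (label, out_tok, total_s, ttft_s, tpot_ms, reported_tok_s), copied from the table
rows = [
    ("RTX 5070 cuda codegen conc=1", 1000, 7.89, 0.067, 7.8, 126.8),
    ("RTX 3090 cuda chat conc=1", 100, 0.845, 0.066, 8.1, 118.4),
    ("Strix Halo vulkan agent conc=4", 500, 19.63, 5.18, 28.5, 25.5),
]

for label, out_tok, total_s, ttft_s, tpot_ms, reported in rows:
    computed = gen_tok_per_s(out_tok, total_s)
    approx_total = estimated_total_s(ttft_s, tpot_ms, out_tok)
    print(
        f"{label}: {computed:.1f} tok/s (reported {reported}), "
        f"estimated total {approx_total:.2f}s (reported {total_s}s)"
    )
```

If this reading is right, the concurrency-4 rows report per-request throughput rather than aggregate throughput across the four parallel requests, which would explain why they sit well below the concurrency-1 rows for the same hardware.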
Environment
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b8940 (cpu)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b1203 (rocm)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 11.94 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
backend: llama.cpp b9174 (cuda)
server: lemonade unknown
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
containerized: false
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b8940 (vulkan)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
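Every environment above uses a streaming `/v1/chat/completions` endpoint, but the page does not say how TTFT and TPOT are derived from the stream. A common approach is to timestamp each streamed chunk and compute the metrics from those timestamps. The sketch below assumes that approach and treats one chunk as one token; `StreamMetrics` and `metrics_from_timestamps` are hypothetical names, not part of the benchmark harness.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float         # time to first token, seconds
    tpot_ms: float        # mean time per output token after the first, ms
    total_s: float        # request start to last token, seconds
    gen_tok_per_s: float  # output tokens divided by total wall time


def metrics_from_timestamps(
    request_start: float, token_times: Sequence[float]
) -> StreamMetrics:
    """Derive latency metrics from per-token arrival times.

    Assumption: each streamed chunk carries exactly one token, and all
    times come from the same monotonic clock (e.g. time.perf_counter()).
    """
    if not token_times:
        raise ValueError("no tokens received")
    n = len(token_times)
    ttft_s = token_times[0] - request_start
    total_s = token_times[-1] - request_start
    if total_s <= 0:
        raise ValueError("timestamps must be later than request_start")
    # Mean gap between consecutive tokens; undefined for a single token.
    tpot_ms = (token_times[-1] - token_times[0]) / (n - 1) * 1000.0 if n > 1 else 0.0
    return StreamMetrics(ttft_s, tpot_ms, total_s, n / total_s)


# Synthetic example: first token after 100 ms, then one token every 10 ms.
example_times = [0.1 + 0.01 * i for i in range(101)]
example = metrics_from_timestamps(0.0, example_times)
print(example)
```

In a real client, the timestamps would be recorded as each server-sent event arrives. Chunks that carry several tokens or none would need a tokenizer-based count instead of the one-chunk-one-token assumption.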