Gemma-3 4b-it
Q4_K_M · 4B params · GGUF
checkpoint: ggml-org/gemma-3-4b-it-GGUF:Q4_K_M
All runs (15)
| Hardware | Backend | Shape | Concurrency | Gen tok/s ↓ | TTFT | TPOT (ms) | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | codegen | 1 | 157.9 | 49ms | 6.3 | 1000 | 6.33s | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | chat | 1 | 156.6 | 29ms | 6.1 | 100 | 636ms | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | agent | 1 | 151.3 | 66ms | 6.4 | 330 | 2.17s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | chat | 1 | 142.0 | 39ms | 6.7 | 100 | 703ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | codegen | 1 | 137.0 | 125ms | 7.1 | 1000 | 7.30s | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | rag | 1 | 136.4 | 88ms | 6.4 | 67 | 552ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | agent | 1 | 125.2 | 240ms | 7.1 | 323 | 2.64s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | rag | 1 | 86.1 | 341ms | 6.9 | 70 | 749ms | 0.000 GiB |
| GeForce RTX 5070 · 11.94 GiB | llama.cpp b9174 (cuda) | agent | 4 | 70.6 | 1.62s | 11.3 | 437 | 5.65s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | codegen | 1 | 64.6 | 99ms | 15.4 | 1000 | 15.47s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | chat | 1 | 64.2 | 59ms | 15.1 | 100 | 1.56s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | agent | 1 | 61.3 | 426ms | 15.5 | 354 | 5.97s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | rag | 1 | 55.5 | 325ms | 15.7 | 67 | 1.60s | 0.002 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | agent | 4 | 51.2 | 2.99s | 12.7 | 438 | 8.03s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | agent | 4 | 17.8 | 3.36s | 45.0 | 376 | 21.06s | 0.001 GiB |
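The latency columns can be cross-checked against each other. The sketch below assumes the usual definitions, which this page does not spell out: TTFT is the time to the first token, TPOT is the mean time per output token after the first, and Total is the end-to-end time of one request. The function name `estimate_total_s` is hypothetical.

```python
def estimate_total_s(ttft_ms: float, tpot_ms: float, out_tokens: int) -> float:
    """Estimate end-to-end latency in seconds from TTFT, TPOT, and output length.

    Assumption: the first token arrives at TTFT and each of the remaining
    (out_tokens - 1) tokens takes TPOT on average.
    """
    if out_tokens < 1:
        raise ValueError("out_tokens must be at least 1")
    return (ttft_ms + (out_tokens - 1) * tpot_ms) / 1000.0


# Concurrency-1 rows copied from the table: (label, TTFT ms, TPOT ms, Out tok, Total s)
rows = [
    ("RTX 5070 codegen", 49, 6.3, 1000, 6.33),
    ("RTX 5070 chat", 29, 6.1, 100, 0.636),
    ("RTX 3090 codegen", 125, 7.1, 1000, 7.30),
    ("Strix Halo codegen", 99, 15.4, 1000, 15.47),
]

for label, ttft_ms, tpot_ms, out_tok, reported_s in rows:
    est = estimate_total_s(ttft_ms, tpot_ms, out_tok)
    print(f"{label}: estimated {est:.2f}s, reported {reported_s:.2f}s")
```

For these concurrency-1 rows the estimate lands within a couple of percent of the reported Total, which is what rounded TTFT and TPOT values would allow. The concurrency-4 rows do not fit this simple model, so their columns are presumably aggregated differently.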
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b1203 (rocm)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 11.94 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
backend: llama.cpp b9174 (cuda)
server: lemonade unknown
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
containerized: false
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
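All three environments were measured against a streaming /v1/chat/completions endpoint. As a rough illustration of how TTFT, TPOT, and generation throughput could be derived from the arrival times of streamed tokens, here is a minimal sketch. It is not the benchmark's actual harness: the names `StreamMetrics` and `summarize_stream` are hypothetical, it assumes one token per streamed chunk, and it assumes throughput is output tokens divided by total request time.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float         # time from request start to first token
    tpot_ms: float        # mean gap between tokens after the first
    total_s: float        # time from request start to last token
    gen_tok_per_s: float  # output tokens divided by total time (assumed definition)


def summarize_stream(request_start_s: float, token_times_s: Sequence[float]) -> StreamMetrics:
    """Summarize one streamed response from per-token arrival timestamps.

    Timestamps share a clock with request_start_s (e.g. time.perf_counter()).
    Assumption: each streamed chunk carries exactly one token.
    """
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    n = len(token_times_s)
    first, last = token_times_s[0], token_times_s[-1]
    ttft_s = first - request_start_s
    total_s = last - request_start_s
    # TPOT is only defined when there is at least one inter-token gap.
    tpot_ms = (last - first) / (n - 1) * 1000.0 if n > 1 else 0.0
    gen_tok_per_s = n / total_s if total_s > 0 else float("inf")
    return StreamMetrics(ttft_s, tpot_ms, total_s, gen_tok_per_s)


# Synthetic timestamps for illustration only, not measured data.
m = summarize_stream(0.0, [0.05, 0.06, 0.07, 0.08])
print(m)
```

In a real client the timestamps would be recorded as each server-sent event arrives. With several runs per cell, the per-run metrics would then be aggregated, though the page does not say how.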