Gemma-3 4b-it
Q4_K_M · 4B params · GGUF
checkpoint: unsloth/gemma-3-4b-it-GGUF:gemma-3-4b-it-Q4_K_M.gguf
All runs (40)
| Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 169.4 | 539.9 | 129ms | 5.5 | 64 | 1000 | 5.90s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | codegen | 1 | 168.7 | 2014.9 | 35ms | 5.9 | 64 | 1000 | 5.93s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 167.9 | 788.5 | 38ms | 5.4 | 29 | 100 | 573ms | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | chat | 1 | 166.9 | 1377.8 | 21ms | 5.8 | 29 | 100 | 595ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 166.7 | 502.5 | 127ms | 5.6 | 64 | 1000 | 6.00s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 164.4 | 816.3 | 38ms | 5.5 | 29 | 100 | 582ms | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | agent | 1 | 163.7 | 10203.9 | 60ms | 5.9 | 611 | 376 | 2.29s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | codegen | 1 | 160.9 | 531.5 | 123ms | 5.7 | 64 | 1000 | 6.21s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 160.0 | 807.9 | 38ms | 5.6 | 29 | 100 | 597ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 153.2 | 2267.3 | 228ms | 5.6 | 611 | 436 | 2.81s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 152.8 | 2316.2 | 223ms | 5.6 | 611 | 436 | 2.77s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 1 | 143.7 | 2266.9 | 228ms | 5.8 | 611 | 436 | 2.84s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | rag | 1 | 143.0 | 10212.2 | 83ms | 5.9 | 846 | 67 | 532ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | chat | 1 | 128.5 | 652.1 | 45ms | 7.0 | 29 | 100 | 742ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | codegen | 1 | 120.9 | 540.1 | 127ms | 7.8 | 64 | 1000 | 8.27s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | agent | 1 | 111.1 | 2080.5 | 249ms | 7.8 | 611 | 436 | 3.91s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 101.6 | 1932.8 | 387ms | 5.6 | 846 | 70 | 792ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | rag | 1 | 97.1 | 2050.5 | 335ms | 5.7 | 846 | 70 | 701ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 95.5 | 1871.9 | 339ms | 5.5 | 846 | 70 | 688ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | rag | 1 | 84.0 | 1908.2 | 340ms | 7.5 | 846 | 70 | 803ms | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 70.6 | 539.3 | 1.62s | 11.3 | 611 | 437 | 5.65s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 65.9 | 152.1 | 3.87s | 5.6 | 611 | 436 | 6.77s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | codegen | 1 | 64.6 | — | 99ms | 15.4 | — | 1000 | 15.47s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | chat | 1 | 64.2 | — | 59ms | 15.1 | — | 100 | 1.56s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | agent | 1 | 61.3 | — | 426ms | 15.5 | — | 354 | 5.97s | 0.002 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 4 | 60.8 | 153.6 | 4.20s | 5.8 | 611 | 436 | 7.07s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | rag | 1 | 55.5 | — | 325ms | 15.7 | — | 67 | 1.60s | 0.002 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 54.6 | 168.7 | 4.16s | 5.6 | 611 | 436 | 6.39s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | agent | 4 | 53.0 | 121.7 | 5.41s | 7.8 | 611 | 436 | 9.51s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | agent | 4 | 17.8 | — | 3.36s | 45.0 | — | 376 | 21.06s | 0.001 GiB |
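
The TTFT, TPOT and Gen tok/s columns can be read as per-request latency figures taken from a streamed response. The page does not state how the harness derives them, so the following is only a minimal sketch under common definitions: TTFT is the delay from request start to the first token, TPOT is the mean gap between subsequent tokens, and Gen tok/s is output tokens divided by total wall time. The last definition is at least consistent with the codegen rows (1000 tokens in 5.90 s is about 169.5 tok/s, against a reported 169.4). The names `stream_metrics` and `StreamMetrics` are hypothetical and not part of any benchmark tool.

```python
from typing import NamedTuple, Sequence


class StreamMetrics(NamedTuple):
    ttft_s: float         # time to first token, seconds
    tpot_ms: float        # mean time per output token after the first, milliseconds
    gen_tok_per_s: float  # output tokens divided by total wall time
    total_s: float        # request start to last token, seconds


def stream_metrics(request_start: float, token_times: Sequence[float]) -> StreamMetrics:
    """Derive latency metrics from token arrival timestamps (seconds).

    Assumed definitions, not confirmed by the benchmark page.
    """
    if not token_times:
        raise ValueError("need at least one token timestamp")
    first, last = token_times[0], token_times[-1]
    n = len(token_times)
    total_s = last - request_start
    if total_s <= 0:
        raise ValueError("last token must arrive after request_start")
    ttft_s = first - request_start
    # With a single token there are no inter-token gaps to average.
    tpot_ms = (last - first) / (n - 1) * 1000.0 if n > 1 else 0.0
    return StreamMetrics(ttft_s, tpot_ms, n / total_s, total_s)


if __name__ == "__main__":
    # Four tokens arriving every 0.5 s after a 0.5 s wait.
    print(stream_metrics(0.0, [0.5, 1.0, 1.5, 2.0]))
```

In practice the timestamps would come from a monotonic clock (for example `time.perf_counter()`) sampled as each streamed chunk arrives, and the per-run results would be aggregated over the runs per cell listed under Environment.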
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1965/2100 MHz · mem 9501 MHz
temp: 44°C idle · 46°C peak
peak draw: 196 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 54°C idle · 62°C peak
peak draw: 335 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1965/2100 MHz · mem 9501 MHz
temp: 60°C idle · 77°C peak
peak draw: 433 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 300 W / 450 W max (67% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1950/2100 MHz · mem 9501 MHz
temp: 37°C idle · 64°C peak
peak draw: 291 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.5 GiB)
power: 250 W / 300 W max (83% cap)
pcie: Gen 1 x16 / Gen 4 x16 max
clocks: gfx 180/3090 MHz · mem 405 MHz
temp: 31°C idle · 64°C peak
peak draw: 194 W
backend: llama.cpp cuda-1a68ec9 (cuda)
server: lemonade unknown
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.43
python: 3.14.4
containerized: false
build flags: GGML_CUDA=ON CMAKE_CUDA_ARCHITECTURES=120 CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b1203 (rocm)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.5 GiB)
power: 250 W / 300 W max (83% cap)
pcie: Gen 1 x16 / Gen 4 x16 max
clocks: gfx 180/3090 MHz · mem 405 MHz
temp: 39°C idle · 62°C peak
peak draw: 175 W
backend: llama.cpp vulkan-1a68ec9 (vulkan)
server: lemonade unknown
os: CachyOS
kernel: 7.0.8-1-cachyos
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.43
python: 3.14.4
containerized: false
llama.cpp: version: 1 (1a68ec9) built with GNU 15.2.1 for Linux x86_64
build flags: GGML_VULKAN=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
backend: llama.cpp b9174 (vulkan)
server: lemonade unknown
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
containerized: false
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true