Gemma-4 E2B-it
Q4_K_M · 2B params · GGUF
checkpoint: unsloth/gemma-4-E2B-it-GGUF:Q4_K_M
commit: 90f961834039
weights: 2.89 GiB
All runs (25)
| Hardware | Backend | Mode | Shape | Concurrency | Gen tok/s ↓ | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total time | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | codegen | 1 | 216.7 | 1883.8 | 38ms | 4.6 | 71 | 1000 | 4.62s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | chat | 1 | 211.9 | 1068.2 | 34ms | 4.5 | 36 | 100 | 472ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 211.7 | 1275.4 | 56ms | 4.5 | 71 | 1000 | 4.72s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 1 | 209.5 | 20907.3 | 30ms | 4.7 | 618 | 500 | 2.39s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 208.2 | 13621.0 | 51ms | 4.5 | 618 | 500 | 2.40s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 207.5 | 1309.8 | 54ms | 4.5 | 71 | 1000 | 4.82s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | rag | 1 | 206.5 | 12257.2 | 63ms | 4.6 | 853 | 200 | 969ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 205.2 | 872.9 | 41ms | 4.4 | 36 | 100 | 487ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 204.3 | 12671.7 | 49ms | 4.5 | 618 | 500 | 2.45s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 196.0 | 11135.7 | 83ms | 4.5 | 853 | 200 | 1.02s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 195.1 | 1386.8 | 59ms | 5.1 | 71 | 1000 | 5.13s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 194.9 | 10860.7 | 82ms | 4.5 | 853 | 200 | 1.03s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 193.1 | 827.2 | 44ms | 4.9 | 36 | 100 | 518ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 192.3 | 915.1 | 40ms | 4.4 | 36 | 100 | 520ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 189.3 | 7365.7 | 84ms | 5.1 | 618 | 500 | 2.64s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 181.1 | 7332.5 | 116ms | 5.0 | 853 | 200 | 1.10s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 97.6 | 776.0 | 987ms | 8.2 | 618 | 500 | 5.12s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 87.6 | 216.2 | 3.63s | 4.5 | 618 | 500 | 5.92s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | codegen | 1 | 87.6 | 687.2 | 103ms | 11.3 | 71 | 1000 | 11.41s | 0.001 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 87.3 | 226.1 | 3.64s | 4.5 | 618 | 500 | 5.93s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | chat | 1 | 86.2 | 415.3 | 87ms | 11.2 | 36 | 100 | 1.16s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | agent | 1 | 83.7 | 6107.1 | 101ms | 11.7 | 618 | 500 | 5.97s | 0.003 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 4 | 83.5 | 760.8 | 1.04s | 9.8 | 618 | 500 | 5.99s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | rag | 1 | 76.3 | 2779.0 | 364ms | 11.5 | 1012 | 200 | 2.62s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | agent | 4 | 29.4 | 331.1 | 2.06s | 30.1 | 618 | 500 | 17.02s | 0.008 GiB |
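The latency columns can be cross-checked against each other. The sketch below assumes the conventional definitions, which the page itself does not spell out: total time is roughly TTFT plus one TPOT per output token, and prefill throughput is roughly prompt tokens divided by TTFT. The function names are hypothetical, and the inputs are taken from the RTX 5070 codegen row.

```python
def estimated_total_s(ttft_ms: float, tpot_ms: float, out_tokens: int) -> float:
    """Approximate end-to-end latency: time to first token plus decode time.

    Assumption: every output token costs one TPOT, including the first.
    """
    return (ttft_ms + tpot_ms * out_tokens) / 1000.0


def estimated_prefill_tok_s(prompt_tokens: int, ttft_ms: float) -> float:
    """Approximate prefill throughput, assuming TTFT is dominated by prefill."""
    return prompt_tokens / (ttft_ms / 1000.0)


# Inputs from the RTX 5070 codegen row: TTFT 38 ms, TPOT 4.6 ms,
# 71 prompt tokens, 1000 output tokens.
total_s = estimated_total_s(ttft_ms=38, tpot_ms=4.6, out_tokens=1000)
prefill_tok_s = estimated_prefill_tok_s(prompt_tokens=71, ttft_ms=38)
print(f"estimated total: {total_s:.2f} s, estimated prefill: {prefill_tok_s:.0f} tok/s")
```

Under these assumptions the estimates come out near the reported 4.62s and 1883.8 tok/s for that row. The small gaps are expected if the page averages several runs or measures prefill separately from TTFT.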
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 43°C idle · 59°C peak
peak draw: 316 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade (version unknown)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 55°C idle · 65°C peak
peak draw: 319 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade (version unknown)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
server: lemonade (version unknown)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b8940 (rocm)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
backend: llama.cpp b9174 (vulkan)
server: lemonade (version unknown)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
containerized: false
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
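
Each environment lists a streaming `/v1/chat/completions` endpoint with 2 warmups and 5 runs per cell. The page does not say how per-run measurements are reduced to one table row, so the following is only a sketch of one plausible approach: derive TTFT and TPOT from chunk arrival times, discard the warmup runs, and take a plain mean. It assumes one token per streamed chunk, and every name in it is hypothetical.

```python
from statistics import mean


def request_metrics(start_s: float, chunk_times_s: list[float]) -> dict[str, float]:
    """Derive TTFT and TPOT (both in ms) from token-chunk arrival timestamps."""
    if not chunk_times_s:
        raise ValueError("no chunks received")
    ttft_ms = (chunk_times_s[0] - start_s) * 1000.0
    # TPOT: mean gap between consecutive chunks after the first one.
    gaps = len(chunk_times_s) - 1
    tpot_ms = (chunk_times_s[-1] - chunk_times_s[0]) * 1000.0 / gaps if gaps else 0.0
    return {"ttft_ms": ttft_ms, "tpot_ms": tpot_ms}


def aggregate_cell(runs: list[dict[str, float]], warmups: int = 2) -> dict[str, float]:
    """Drop the leading warmup runs, then average each metric over the rest."""
    measured = runs[warmups:]
    if not measured:
        raise ValueError("no measured runs left after warmups")
    return {key: mean(r[key] for r in measured) for key in measured[0]}


# Synthetic timestamps (seconds), for illustration only; not measured data.
example = request_metrics(start_s=0.0, chunk_times_s=[0.040, 0.045, 0.050])
print(example)
```

A real harness would record the timestamps with a monotonic clock such as `time.perf_counter()` while reading the streamed response, and might prefer a median over a mean to resist outliers.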