Gemma-4 E2B-it

Q4_K_M · 2B params · GGUF
checkpoint: unsloth/gemma-4-E2B-it-GGUF:Q4_K_M
commit: 90f961834039
weights: 2.89 GiB

All runs (25)

| Hardware | Backend | Mode | Shape | Conc. | Gen tok/s | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total | VRAM Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | codegen | 1 | 216.7 | 1883.8 | 38ms | 4.6 | 71 | 1000 | 4.62s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | chat | 1 | 211.9 | 1068.2 | 34ms | 4.5 | 36 | 100 | 472ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 211.7 | 1275.4 | 56ms | 4.5 | 71 | 1000 | 4.72s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 1 | 209.5 | 20907.3 | 30ms | 4.7 | 618 | 500 | 2.39s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 208.2 | 13621.0 | 51ms | 4.5 | 618 | 500 | 2.40s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 207.5 | 1309.8 | 54ms | 4.5 | 71 | 1000 | 4.82s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | rag | 1 | 206.5 | 12257.2 | 63ms | 4.6 | 853 | 200 | 969ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 205.2 | 872.9 | 41ms | 4.4 | 36 | 100 | 487ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 204.3 | 12671.7 | 49ms | 4.5 | 618 | 500 | 2.45s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 196.0 | 11135.7 | 83ms | 4.5 | 853 | 200 | 1.02s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | codegen | 1 | 195.1 | 1386.8 | 59ms | 5.1 | 71 | 1000 | 5.13s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 194.9 | 10860.7 | 82ms | 4.5 | 853 | 200 | 1.03s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | chat | 1 | 193.1 | 827.2 | 44ms | 4.9 | 36 | 100 | 518ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 192.3 | 915.1 | 40ms | 4.4 | 36 | 100 | 520ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 1 | 189.3 | 7365.7 | 84ms | 5.1 | 618 | 500 | 2.64s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | rag | 1 | 181.1 | 7332.5 | 116ms | 5.0 | 853 | 200 | 1.10s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 97.6 | 776.0 | 987ms | 8.2 | 618 | 500 | 5.12s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 87.6 | 216.2 | 3.63s | 4.5 | 618 | 500 | 5.92s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | codegen | 1 | 87.6 | 687.2 | 103ms | 11.3 | 71 | 1000 | 11.41s | 0.001 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 87.3 | 226.1 | 3.64s | 4.5 | 618 | 500 | 5.93s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | chat | 1 | 86.2 | 415.3 | 87ms | 11.2 | 36 | 100 | 1.16s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | agent | 1 | 83.7 | 6107.1 | 101ms | 11.7 | 618 | 500 | 5.97s | 0.003 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 59778f0 (cuda) | baseline | agent | 4 | 83.5 | 760.8 | 1.04s | 9.8 | 618 | 500 | 5.99s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | rag | 1 | 76.3 | 2779.0 | 364ms | 11.5 | 1012 | 200 | 2.62s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b8940 (rocm) | baseline | agent | 4 | 29.4 | 331.1 | 2.06s | 30.1 | 618 | 500 | 17.02s | 0.008 GiB |
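
The columns in the table above appear to be related by simple arithmetic. The sketch below checks one row under assumed definitions that the page itself does not state: TTFT is roughly prompt tokens divided by prefill tok/s, Gen tok/s is roughly 1000 / TPOT (ms), and Total is roughly TTFT plus output tokens times TPOT. The function name `derived_metrics` is hypothetical.

```python
def derived_metrics(prompt_tok, out_tok, prefill_tok_s, tpot_ms):
    """Recompute TTFT, Gen tok/s and Total from the other columns.

    ASSUMPTION: these formulas are a guess at how the columns relate;
    the benchmark harness may define them differently.
    """
    ttft_s = prompt_tok / prefill_tok_s           # time to process the prompt
    gen_tok_s = 1000.0 / tpot_ms                  # decode rate from per-token latency
    total_s = ttft_s + out_tok * tpot_ms / 1000.0 # prefill plus decode time
    return {
        "ttft_ms": ttft_s * 1000.0,
        "gen_tok_s": gen_tok_s,
        "total_s": total_s,
    }


# Row from the table: RTX 5070, vulkan, codegen, concurrency 1.
# Reported values for comparison: TTFT 38ms, Gen 216.7 tok/s, Total 4.62s.
row = derived_metrics(prompt_tok=71, out_tok=1000,
                      prefill_tok_s=1883.8, tpot_ms=4.6)
print(row)
```

The single-stream rows fit these relations closely. The concurrency-4 rows fit less well (for example, Gen tok/s of 97.6 against a TPOT of 8.2 ms on the RTX 5070), so the assumed definitions should not be stretched to them.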

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 43°C idle · 59°C peak
peak draw: 316 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 55°C idle · 65°C peak
peak draw: 319 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b8940 (rocm)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true

GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
backend: llama.cpp b9174 (vulkan)
server: lemonade unknown
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
containerized: false
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
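
Every environment lists the same measurement settings: streaming enabled, 2 warmups, and 5 runs per cell. The sketch below shows one way TTFT and TPOT could be derived from the arrival times of streamed tokens and then aggregated per cell. It is not the harness used for these results. The function names, the choice of the median, and the reading of "5 runs per cell" as 5 measured runs after the warmups are all assumptions.

```python
import statistics


def stream_metrics(request_start, token_times):
    """Return (ttft_s, tpot_s) from token arrival timestamps in seconds.

    TTFT is the delay until the first streamed token.
    TPOT is the mean gap between consecutive tokens after the first.
    """
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_start
    gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / gaps if gaps else 0.0
    return ttft, tpot


def summarize_cell(runs, warmups=2):
    """Discard the warmup runs, then take the median of each metric.

    `runs` is a list of (request_start, token_times) pairs in run order.
    ASSUMPTION: the median is used here for robustness; the source does
    not say which statistic the table reports.
    """
    measured = runs[warmups:]
    if not measured:
        raise ValueError("no measured runs left after warmups")
    ttfts, tpots = zip(*(stream_metrics(start, times) for start, times in measured))
    return statistics.median(ttfts), statistics.median(tpots)


# Synthetic example: 2 slow warmup runs followed by 5 steady runs.
def make_run(ttft, gap, n_tokens=11):
    return (0.0, [ttft + i * gap for i in range(n_tokens)])


example_runs = [make_run(0.5, 0.02)] * 2 + [make_run(0.05, 0.005)] * 5
print(summarize_cell(example_runs, warmups=2))
```

In a real client, `token_times` would be filled with a monotonic clock reading taken as each chunk arrives from the streaming endpoint.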