
Gemma-4 E2B-it

Q4_K_M · 2B params · GGUF
checkpoint: unsloth/gemma-4-E2B-it-GGUF:Q4_K_M
commit: 90f961834039
weights 2.89 GiB

All runs (25)

legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · chat1
226.4
205.2 · 872.9 · 41ms · 4.4 · 36 · 100 · 487ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · chat1
226.2
192.3 · 915.1 · 40ms · 4.4 · 36 · 100 · 520ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · rag1
224.5
196.0 · 11135.7 · 83ms · 4.5 · 853 · 200 · 1.02s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · codegen1
224.3
211.7 · 1275.4 · 56ms · 4.5 · 71 · 1000 · 4.72s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · agent4
223.7
87.6 · 216.2 · 3.63s · 4.5 · 618 · 500 · 5.92s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · codegen1
223.7
207.5 · 1309.8 · 54ms · 4.5 · 71 · 1000 · 4.82s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-350w · agent1
223.3
208.2 · 13621.0 · 51ms · 4.5 · 618 · 500 · 2.40s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · agent4
223.0
87.3 · 226.1 · 3.64s · 4.5 · 618 · 500 · 5.93s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · agent1
222.5
204.3 · 12671.7 · 49ms · 4.5 · 618 · 500 · 2.45s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · 450 W max · drv 590
llama.cpp cuda-4f13cb7 (cuda) · baseline-pl-450w · rag1
222.2
194.9 · 10860.7 · 82ms · 4.5 · 853 · 200 · 1.03s · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp b9174 (vulkan) · baseline · chat1
220.3
211.9 · 1068.2 · 34ms · 4.5 · 36 · 100 · 472ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp b9174 (vulkan) · baseline · codegen1
218.7
216.7 · 1883.8 · 38ms · 4.6 · 71 · 1000 · 4.62s · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp b9174 (vulkan) · baseline · rag1
217.6
206.5 · 12257.2 · 63ms · 4.6 · 853 · 200 · 969ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp b9174 (vulkan) · baseline · agent1
212.6
209.5 · 20907.3 · 30ms · 4.7 · 618 · 500 · 2.39s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp 59778f0 (cuda) · baseline · chat1
202.7
193.1 · 827.2 · 44ms · 4.9 · 36 · 100 · 518ms · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp 59778f0 (cuda) · baseline · rag1
199.5
181.1 · 7332.5 · 116ms · 5.0 · 853 · 200 · 1.10s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp 59778f0 (cuda) · baseline · codegen1
197.2
195.1 · 1386.8 · 59ms · 5.1 · 71 · 1000 · 5.13s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp 59778f0 (cuda) · baseline · agent1
195.1
189.3 · 7365.7 · 84ms · 5.1 · 618 · 500 · 2.64s · 0.000 GiB
legacy · stack comparable
GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595
llama.cpp b9174 (vulkan) · baseline · agent4
121.9
97.6 · 776.0 · 987ms · 8.2 · 618 · 500 · 5.12s · 0.000 GiB
legacy · stack comparable
GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590
llama.cpp 59778f0 (cuda) · baseline · agent4
101.6
83.5 · 760.8 · 1.04s · 9.8 · 618 · 500 · 5.99s · 0.000 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (rocm) · baseline · chat1
89.4
86.2 · 415.3 · 87ms · 11.2 · 36 · 100 · 1.16s · 0.002 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (rocm) · baseline · codegen1
88.1
87.6 · 687.2 · 103ms · 11.3 · 71 · 1000 · 11.41s · 0.001 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (rocm) · baseline · rag1
87.0
76.3 · 2779.0 · 364ms · 11.5 · 1012 · 200 · 2.62s · 0.003 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (rocm) · baseline · agent1
85.7
83.7 · 6107.1 · 101ms · 11.7 · 618 · 500 · 5.97s · 0.003 GiB
legacy · stack comparable
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified
llama.cpp b8940 (rocm) · baseline · agent4
33.3
29.4 · 331.1 · 2.06s · 30.1 · 618 · 500 · 17.02s · 0.008 GiB
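
The column headers for the run rows did not survive in this listing, so what each field means is an inference. One reading that fits the figures: the headline number is decode throughput (1000 divided by the per-token milliseconds), and the first number on the following line is end-to-end throughput (output tokens divided by total time). The Python sketch below checks that reading against the first RTX 3090 chat row. The field names are hypothetical labels, not the page's own, and the match is only approximate because the displayed values are rounded.

```python
# Hypothetical field names: the original column headers are not in the listing.
# Values are copied from the first RTX 3090 chat row (cap 350 W).
row = {
    "headline_tok_s": 226.4,
    "second_tok_s": 205.2,
    "ms_per_token": 4.4,
    "output_tokens": 100,
    "total_s": 0.487,
}


def decode_tok_s(ms_per_token: float) -> float:
    """Tokens per second implied by a per-token latency in milliseconds."""
    return 1000.0 / ms_per_token


def end_to_end_tok_s(output_tokens: int, total_s: float) -> float:
    """Tokens per second over the whole request, prefill included."""
    return output_tokens / total_s


decode = decode_tok_s(row["ms_per_token"])
e2e = end_to_end_tok_s(row["output_tokens"], row["total_s"])

# Relative gap between the implied and the displayed figures.
decode_gap = abs(decode - row["headline_tok_s"]) / row["headline_tok_s"]
e2e_gap = abs(e2e - row["second_tok_s"]) / row["second_tok_s"]

print(f"decode: implied {decode:.1f} vs shown {row['headline_tok_s']}")
print(f"end-to-end: implied {e2e:.1f} vs shown {row['second_tok_s']}")
```

Both gaps stay under one percent for this row, which is within what rounding 4.4 ms and 487 ms would explain. The multi-stream agent4 rows fit the end-to-end reading less closely, so treat the labels as a working assumption.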

Environment

GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 350 W / 450 W max (78% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 43°C idle · 59°C peak
peak draw: 316 W
hardware probes
copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps
384-bit · 9751 MHz · 82 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap · theory · copy · fp16 · bf16
200 W · 936 GB/s · 391 GB/s · 65.4 TF · 65.4 TF
300 W · 936 GB/s · 391 GB/s · 65.4 TF · 65.3 TF
450 W · 936 GB/s · 391 GB/s · 65.4 TF · 65.4 TF
compute: 8.6
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
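
Every configuration on this page drives `/v1/chat/completions` with streaming enabled, which is what lets a client observe time to first token and per-token latency separately. The harness that produced these numbers is not shown, so the following is only a sketch of one common way to derive such figures from the arrival times of streamed tokens. `StreamStats`, `stream_stats`, and the synthetic timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StreamStats:
    ttft_ms: float        # time from request start to first streamed token
    ms_per_token: float   # mean gap between consecutive tokens after the first
    decode_tok_s: float   # decode throughput implied by ms_per_token


def stream_stats(request_start: float, token_times: list[float]) -> StreamStats:
    """Summarise one streamed response.

    request_start and token_times are timestamps in seconds on the same clock;
    token_times holds one arrival time per streamed token, in order.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure a decode rate")
    ttft_s = token_times[0] - request_start
    decode_window_s = token_times[-1] - token_times[0]
    # N tokens span N - 1 gaps, so the first token is excluded from the rate.
    ms_per_token = 1000.0 * decode_window_s / (len(token_times) - 1)
    return StreamStats(
        ttft_ms=1000.0 * ttft_s,
        ms_per_token=ms_per_token,
        decode_tok_s=1000.0 / ms_per_token,
    )


# Synthetic arrival times (not measured data): first token at 41 ms,
# then one token every 4.4 ms, for 100 tokens in total.
example_times = [0.041 + 0.0044 * i for i in range(100)]
example = stream_stats(0.0, example_times)
print(example)
```

Excluding the first token from the decode rate keeps prefill cost out of the per-token figure, so the two numbers can be compared independently across power caps and backends.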
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 55°C idle · 65°C peak
peak draw: 319 W
backend: llama.cpp cuda-4f13cb7 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
hardware probes
copy 41% of theory · FP16 peak 30.3 TF
256-bit · 8000 MHz · 20 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap · theory · copy · fp16 · bf16
fixed · 256 GB/s · 106 GB/s · 30.3 TF · -
compute: 11.5
backend: llama.cpp b8940 (rocm)
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 5070 · 12 GiB
cpu: AMD Ryzen 9 7900 12-Core Processor
gpu: NVIDIA GeForce RTX 5070
arch: NVIDIA
vram: 11.94 GiB (system 30.4 GiB)
power: 250 W / 300 W max (83% cap)
hardware probes
copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%
192-bit · 14001 MHz · 48 SM/CU
Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.
cap · theory · copy · fp16 · bf16
200 W · 672 GB/s · 271 GB/s · 67.9 TF · 68.4 TF
250 W · 672 GB/s · 271 GB/s · 69.5 TF · 68.2 TF
300 W · 672 GB/s · 270 GB/s · 69.6 TF · 68.4 TF
compute: 12
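
The "theory" and "copy % of theory" figures in the hardware probes can be reproduced from the bus width and memory clock listed with each card. The sketch below does that arithmetic in Python. The `transfers_per_clock` multiplier is an assumption chosen to match the page: 2 for the two NVIDIA cards, and 1 for the Radeon 8060S, whose 8000 MHz figure appears to already be an effective transfer rate. The function and key names are illustrative.

```python
def theoretical_gb_s(bus_bits: int, clock_mhz: float, transfers_per_clock: int) -> float:
    """Peak memory bandwidth in GB/s from bus width and memory clock."""
    bytes_per_transfer = bus_bits / 8
    # bytes * MHz gives MB/s; divide by 1000 for GB/s.
    return bytes_per_transfer * clock_mhz * transfers_per_clock / 1000.0


# Bus width, clock, and measured copy bandwidth are taken from the probe rows.
# transfers_per_clock is an assumption (see the note above).
probes = {
    "RTX 3090": {"bus_bits": 384, "clock_mhz": 9751, "transfers_per_clock": 2, "copy_gb_s": 391},
    "RTX 5070": {"bus_bits": 192, "clock_mhz": 14001, "transfers_per_clock": 2, "copy_gb_s": 271},
    "Radeon 8060S": {"bus_bits": 256, "clock_mhz": 8000, "transfers_per_clock": 1, "copy_gb_s": 106},
}

results = {}
for name, p in probes.items():
    theory = theoretical_gb_s(p["bus_bits"], p["clock_mhz"], p["transfers_per_clock"])
    copy_pct = 100.0 * p["copy_gb_s"] / theory
    results[name] = (round(theory), round(copy_pct))
    print(f"{name}: theory {theory:.0f} GB/s, copy {copy_pct:.0f}% of theory")
```

Under these assumptions the results come out at 936, 672, and 256 GB/s with copy efficiencies of 42%, 40%, and 41%, matching the values shown for the three devices.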
backend: llama.cpp b9174 (vulkan)
os: CachyOS
kernel: 7.0.0-1-cachyos
driver: 595.58.03
python: 3.14.4
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true