Gemma-4 E2B-it

Q4_K_M · 2B params · GGUF


checkpoint: unsloth/gemma-4-E2B-it-GGUF:Q4_K_M

commit: 90f961834039

weights: 2.89 GiB

All runs (25)


legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	chat	1	226.4	205.2	872.9	—	41ms	4.4	—	36	100	487ms	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	chat	1	226.2	192.3	915.1	—	40ms	4.4	—	36	100	520ms	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	rag	1	224.5	196.0	11135.7	—	83ms	4.5	—	853	200	1.02s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	codegen	1	224.3	211.7	1275.4	—	56ms	4.5	—	71	1000	4.72s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	4	223.7	87.6	216.2	—	3.63s	4.5	—	618	500	5.92s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	codegen	1	223.7	207.5	1309.8	—	54ms	4.5	—	71	1000	4.82s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 350 W · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-350w	agent	1	223.3	208.2	13621.0	—	51ms	4.5	—	618	500	2.40s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	4	223.0	87.3	226.1	—	3.64s	4.5	—	618	500	5.93s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	agent	1	222.5	204.3	12671.7	—	49ms	4.5	—	618	500	2.45s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · 450 W max · drv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline-pl-450w	rag	1	222.2	194.9	10860.7	—	82ms	4.5	—	853	200	1.03s	0.000 GiB
legacy	stack comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	chat	1	220.3	211.9	1068.2	—	34ms	4.5	—	36	100	472ms	0.000 GiB
legacy	stack comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	codegen	1	218.7	216.7	1883.8	—	38ms	4.6	—	71	1000	4.62s	0.000 GiB
legacy	stack comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	rag	1	217.6	206.5	12257.2	—	63ms	4.6	—	853	200	969ms	0.000 GiB
legacy	stack comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	agent	1	212.6	209.5	20907.3	—	30ms	4.7	—	618	500	2.39s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	chat	1	202.7	193.1	827.2	—	44ms	4.9	—	36	100	518ms	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	rag	1	199.5	181.1	7332.5	—	116ms	5.0	—	853	200	1.10s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	codegen	1	197.2	195.1	1386.8	—	59ms	5.1	—	71	1000	5.13s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	1	195.1	189.3	7365.7	—	84ms	5.1	—	618	500	2.64s	0.000 GiB
legacy	stack comparable	GeForce RTX 5070 · 12 GiB · cap 250 W · drv 595	llama.cpp b9174 (vulkan)	baseline	agent	4	121.9	97.6	776.0	—	987ms	8.2	—	618	500	5.12s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiB · cap 200 W · drv 590	llama.cpp 59778f0 (cuda)	baseline	agent	4	101.6	83.5	760.8	—	1.04s	9.8	—	618	500	5.99s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b8940 (rocm)	baseline	chat	1	89.4	86.2	415.3	—	87ms	11.2	—	36	100	1.16s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b8940 (rocm)	baseline	codegen	1	88.1	87.6	687.2	—	103ms	11.3	—	71	1000	11.41s	0.001 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b8940 (rocm)	baseline	rag	1	87.0	76.3	2779.0	—	364ms	11.5	—	1012	200	2.62s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b8940 (rocm)	baseline	agent	1	85.7	83.7	6107.1	—	101ms	11.7	—	618	500	5.97s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified	llama.cpp b8940 (rocm)	baseline	agent	4	33.3	29.4	331.1	—	2.06s	30.1	—	618	500	17.02s	0.008 GiB
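
The rows above carry no column headers, so the following is only an inference from the numbers: the first throughput-like value in each row (for example 226.4) and the small value later in the same row (4.4) look like tokens per second and its reciprocal in milliseconds per token. A minimal sketch that checks this reading against a few rows, where the helper name and the column interpretation are assumptions rather than anything the page states:

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/s to a per-token latency in ms."""
    return 1000.0 / tokens_per_second


# (assumed tokens/s, listed value later in the same row), copied from the table.
pairs = [
    (226.4, 4.4),   # RTX 3090, cap 350 W, chat
    (121.9, 8.2),   # RTX 5070, agent, concurrency 4
    (101.6, 9.8),   # RTX 3090, cap 200 W, agent, concurrency 4
    (89.4, 11.2),   # Strix Halo, chat
]

for tps, listed in pairs:
    derived = ms_per_token(tps)
    # Listed values are rounded to one decimal, so allow half a unit in that place.
    print(f"{tps:7.1f} tok/s -> {derived:5.2f} ms/token (listed {listed})")
```

If the interpretation holds, each derived value agrees with the listed one to within rounding.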

Environment

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 350 W / 450 W max (78% cap)

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1800/2100 MHz · mem 9501 MHz

temp: 43°C idle · 59°C peak

peak draw: 316 W

hardware probes

copy 42% of theory · FP16 peak 65.4 TF · copy/math flat across caps

384-bit · 9751 MHz · 82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
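
The "theory" and "copy % of theory" figures can be reproduced from the bus width and memory clock listed in the probe summaries. The sketch below is one way to do that arithmetic. The `transfers_per_clock` factor is an assumption chosen so the result matches the listed theory column (2 for the two GeForce cards, 1 for Strix Halo, whose 8000 MHz figure already appears to be a transfer rate); the function names are illustrative.

```python
def theoretical_bandwidth_gbs(bus_width_bits: int, mem_clock_mhz: float,
                              transfers_per_clock: int) -> float:
    """Peak memory bandwidth in GB/s from bus width and memory clock."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = mem_clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9


def copy_efficiency(copy_gbs: float, theory_gbs: float) -> float:
    """Measured copy bandwidth as a fraction of the theoretical peak."""
    return copy_gbs / theory_gbs


# (name, bus bits, mem MHz, assumed transfers/clock, measured copy GB/s, listed theory GB/s)
devices = [
    ("RTX 3090", 384, 9751, 2, 391, 936),
    ("RTX 5070", 192, 14001, 2, 271, 672),
    ("Radeon 8060S", 256, 8000, 1, 106, 256),
]

for name, bits, mhz, tpc, copy, theory in devices:
    derived = theoretical_bandwidth_gbs(bits, mhz, tpc)
    eff = copy_efficiency(copy, theory)
    print(f"{name}: theory {derived:.0f} GB/s, copy {eff:.0%} of theory")
# RTX 3090: theory 936 GB/s, copy 42% of theory
# RTX 5070: theory 672 GB/s, copy 40% of theory
# Radeon 8060S: theory 256 GB/s, copy 41% of theory
```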

compute: 8.6

backend: llama.cpp cuda-4f13cb7 (cuda)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true
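
The run settings above (5 runs per cell, 2 warmups, a streaming chat endpoint) suggest a measurement loop of roughly the following shape. This is a hedged sketch, not the harness that produced these numbers: `measure_cell` and `stream_fn` are hypothetical names, `stream_fn` stands in for whatever client streams chunks from `/v1/chat/completions`, and the page does not say how its metrics are defined or aggregated (the median here is an assumption).

```python
import statistics
import time
from typing import Callable, Iterable, Optional


def measure_cell(stream_fn: Callable[[], Iterable[str]],
                 runs: int = 5, warmups: int = 2) -> dict:
    """Time a streaming request: discard warmups, then record timed runs."""
    # Warmup requests are consumed fully but not recorded.
    for _ in range(warmups):
        for _chunk in stream_fn():
            pass

    ttfts, rates = [], []
    for _ in range(runs):
        start = time.perf_counter()
        first: Optional[float] = None
        count = 0
        for _chunk in stream_fn():
            if first is None:
                first = time.perf_counter()  # first streamed chunk arrived
            count += 1
        end = time.perf_counter()
        if first is None:
            continue  # empty response, nothing to measure
        ttfts.append(first - start)
        decode_time = end - first
        if count > 1 and decode_time > 0:
            # Chunks after the first, divided by the time spent producing them.
            rates.append((count - 1) / decode_time)

    return {
        "runs": len(ttfts),
        "median_ttft_s": statistics.median(ttfts) if ttfts else None,
        "median_chunks_per_s": statistics.median(rates) if rates else None,
    }
```

Passing a generator function that yields the text deltas of one streamed response is enough to use it; counting chunks is only a proxy for counting tokens, since a server may pack more than one token into a chunk.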

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 450 W / 450 W max

pcie: Gen 4 x16 / Gen 4 x16 max

clocks: gfx 1800/2100 MHz · mem 9501 MHz

temp: 55°C idle · 65°C peak

peak draw: 319 W

backend: llama.cpp cuda-4f13cb7 (cuda)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: NVIDIA 590.48.01 + CUDA 13.1

libc: 2.39

python: 3.12.3

llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true

GeForce RTX 3090 · 24 GiB

cpu: AMD EPYC 7302P 16-Core Processor

gpu: NVIDIA GeForce RTX 3090

arch: NVIDIA

vram: 24 GiB (system 64.0 GiB)

power: 200 W / 450 W max (44% cap)

backend: llama.cpp 59778f0 (cuda)

os: Ubuntu 24.04 LTS

kernel: 6.17.13-7-pve

driver: 590.48.01

python: 3.12.3

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S

gpu: AMD Radeon 8060S

arch: Strix Halo (gfx1151)

vram: 96 GiB (system 31.1 GiB, unified)

hardware probes

copy 41% of theory · FP16 peak 30.3 TF

256-bit · 8000 MHz · 20 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
fixed	256 GB/s	106 GB/s	30.3 TF	-

compute: 11.5

backend: llama.cpp b8940 (rocm)

os: Ubuntu 24.04.4 LTS

kernel: 7.0.2-2-pve

python: 3.12.3

runs/cell: 5

warmups: 2

endpoint: /v1/chat/completions

streaming: true

GeForce RTX 5070 · 12 GiB

cpu: AMD Ryzen 9 7900 12-Core Processor

gpu: NVIDIA GeForce RTX 5070

arch: NVIDIA

vram: 11.94 GiB (system 30.4 GiB)

power: 250 W / 300 W max (83% cap)

hardware probes

copy 40% of theory · FP16 peak 69.6 TF · copy/math spread 2.5%

192-bit · 14001 MHz · 48 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	672 GB/s	271 GB/s	67.9 TF	68.4 TF
250 W	672 GB/s	271 GB/s	69.5 TF	68.2 TF
300 W	672 GB/s	270 GB/s	69.6 TF	68.4 TF
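
The page does not define "copy/math spread". One reading that reproduces the 2.5% figure from the rows above is the largest relative gap between the best and worst result across power caps, taken over the copy, fp16 and bf16 columns. The sketch below uses (max − min) / min as that gap; this formula and the function name are assumptions, not the page's stated method.

```python
def relative_spread(values: list[float]) -> float:
    """Gap between the largest and smallest value, relative to the smallest."""
    lo, hi = min(values), max(values)
    return (hi - lo) / lo


# RTX 5070 probe results at the 200 W, 250 W and 300 W caps, from the table above.
rtx5070 = {
    "copy": [271, 271, 270],      # GB/s
    "fp16": [67.9, 69.5, 69.6],   # TF
    "bf16": [68.4, 68.2, 68.4],   # TF
}

spreads = {name: relative_spread(vals) for name, vals in rtx5070.items()}
worst = max(spreads, key=spreads.get)
print(f"largest spread: {worst} at {spreads[worst]:.1%}")
# largest spread: fp16 at 2.5%
```

Under the same definition the RTX 3090 fp16 column (65.4 TF at every cap) has a spread of zero, which is consistent with its "flat across caps" label.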

compute: 12

backendllama.cpp b9174 (vulkan)

osCachyOS

kernel7.0.0-1-cachyos

driver595.58.03

python3.14.4

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue