Gemma-4 31B-it

Q4_K_M·31B params·GGUF

intelligence: see on Artificial Analysis →

checkpoint: unsloth/gemma-4-31B-it-GGUF:Q4_K_M

All runs (10)


legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	chat	1	37.5	34.3	132.5	—	264ms	26.7	—	36	100	2.91s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	codegen	1	36.5	34.9	90.7	—	847ms	27.4	—	71	1000	28.68s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	agent	1	36.2	31.4	275.8	—	1.90s	27.7	—	618	500	15.93s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	agent	4	36.1	13.3	29.7	—	24.88s	27.7	—	618	500	38.84s	0.000 GiB
legacy	stack comparable	GeForce RTX 3090 · 24 GiBcap 300 Wdrv 590	llama.cpp cuda-4f13cb7 (cuda)	baseline	rag	1	36.1	24.5	259.7	—	2.43s	27.7	—	853	200	8.16s	0.000 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified	llama.cpp b1203 (rocm)	baseline	chat	1	10.6	9.9	—	—	740ms	94.4	—	—	97	9.80s	0.002 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified	llama.cpp b1203 (rocm)	baseline	codegen	1	10.3	10.2	—	—	989ms	97.0	—	—	996	97.56s	0.003 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified	llama.cpp b1203 (rocm)	baseline	agent	1	10.2	9.6	—	—	3.26s	98.5	—	—	497	51.86s	0.010 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified	llama.cpp b1203 (rocm)	baseline	rag	1	10.0	9.1	—	—	2.34s	99.5	—	—	197	21.55s	0.007 GiB
legacy	stack comparable	Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)unified	llama.cpp b1203 (rocm)	baseline	agent	4	3.3	3.1	—	—	14.37s	306.3	—	—	497	162.75s	0.016 GiB

Environment

GeForce RTX 3090 · 24 GiB

cpuAMD EPYC 7302P 16-Core Processor

gpuNVIDIA GeForce RTX 3090

archNVIDIA

vram24 GiB (system 64.0 GiB)

power300 W / 450 W max(67% cap)

pcieGen 4 x16 / Gen 4 x16 max

clocksgfx 1920/2100 MHz · mem 9501 MHz

temp53°C idle · 68°C peak

peak draw287 W

hardware probes

copy 42% of theoryFP16 peak 65.4 TFcopy/math flat across caps

384-bit9751 MHz82 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
200 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF
300 W	936 GB/s	391 GB/s	65.4 TF	65.3 TF
450 W	936 GB/s	391 GB/s	65.4 TF	65.4 TF

compute: 8.6

backendllama.cpp cuda-4f13cb7 (cuda)

osUbuntu 24.04 LTS

kernel6.17.13-7-pve

driverNVIDIA 590.48.01 + CUDA 13.1

libc2.39

python3.12.3

llama.cppversion: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64

build flagsGGML_CUDA=ON CMAKE_BUILD_TYPE=Release

runs/cell5

warmups2

endpoint/v1/chat/completions

streamingtrue

Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)

cpuAMD RYZEN AI MAX+ 395 w/ Radeon 8060S

gpuAMD Radeon 8060S

archStrix Halo (gfx1151)

vram96 GiB (system 31.1 GiB, unified)

hardware probes

copy 41% of theoryFP16 peak 30.3 TF

256-bit8000 MHz20 SM/CU

Microbenchmarks for memory copy and tensor math; raw-engine decode and API workload rows measure model-serving speed.

cap	theory	copy	fp16	bf16
fixed	256 GB/s	106 GB/s	30.3 TF	-

compute: 11.5

backendllama.cpp b1203 (rocm)

osUbuntu 24.04.4 LTS

kernel7.0.2-2-pve

python3.12.3

runs/cell3

warmups1

endpoint/v1/chat/completions

streamingtrue