Gemma-4 31B-it
Q4_K_M · 31B params · GGUF
checkpoint: unsloth/gemma-4-31B-it-GGUF:Q4_K_M

All runs (10)
| Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | codegen | 1 | 34.9 | 90.7 | 847 ms | 27.4 | 71 | 1000 | 28.68 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 34.3 | 132.5 | 264 ms | 26.7 | 36 | 100 | 2.91 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 1 | 31.4 | 275.8 | 1.90 s | 27.7 | 618 | 500 | 15.93 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | rag | 1 | 24.5 | 259.7 | 2.43 s | 27.7 | 853 | 200 | 8.16 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 4 | 13.3 | 29.7 | 24.88 s | 27.7 | 618 | 500 | 38.84 s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | codegen | 1 | 10.2 | — | 989 ms | 97.0 | — | 996 | 97.56 s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | chat | 1 | 9.9 | — | 740 ms | 94.4 | — | 97 | 9.80 s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | agent | 1 | 9.6 | — | 3.26 s | 98.5 | — | 497 | 51.86 s | 0.010 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | rag | 1 | 9.1 | — | 2.34 s | 99.5 | — | 197 | 21.55 s | 0.007 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) | llama.cpp b1203 (rocm) | baseline | agent | 4 | 3.1 | — | 14.37 s | 306.3 | — | 497 | 162.75 s | 0.016 GiB |
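
The page does not define its columns, but the concurrency-1 rows are consistent with "Gen tok/s" being an end-to-end rate, that is, Out tok divided by Total (so it includes TTFT), rather than the decode-only rate implied by TPOT. The sketch below checks that reading against values copied from the table. The helper name `end_to_end_tok_s` is hypothetical, and the concurrency-4 rows are left out because the RTX 3090 one fits less closely (about 12.9 derived against 13.3 reported).

```python
# (out_tok, total_s, reported_gen_tok_s), copied from the concurrency-1 rows.
rows = {
    "3090 codegen": (1000, 28.68, 34.9),
    "3090 chat": (100, 2.91, 34.3),
    "3090 agent": (500, 15.93, 31.4),
    "3090 rag": (200, 8.16, 24.5),
    "strix codegen": (996, 97.56, 10.2),
    "strix chat": (97, 9.80, 9.9),
    "strix agent": (497, 51.86, 9.6),
    "strix rag": (197, 21.55, 9.1),
}


def end_to_end_tok_s(out_tok: int, total_s: float) -> float:
    """Assumed definition: output tokens over wall-clock time, TTFT included."""
    return out_tok / total_s


for name, (out_tok, total_s, reported) in rows.items():
    derived = end_to_end_tok_s(out_tok, total_s)
    print(f"{name:14s} derived={derived:5.1f} reported={reported:5.1f}")
```

Under this reading, the lower Gen tok/s of the rag rows largely reflects long prompts paired with short outputs: TPOT stays near 27 to 28 ms on the RTX 3090 while TTFT takes a larger share of the total.
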
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 300 W / 450 W max (67% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1920/2100 MHz · mem 9501 MHz
temp: 53°C idle · 68°C peak
peak draw: 287 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade (version unknown)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version 18 (4f13cb7), built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp b1203 (rocm)
server: lemonade 10.4.0
os: Ubuntu 24.04.4 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
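
Both environments list a streaming endpoint, a number of runs per cell, and a number of warmups, but the page does not say how the harness turns a streamed response into TTFT, TPOT, and tok/s. The following is a minimal sketch under assumed definitions: TTFT is the delay to the first streamed token, TPOT is the mean gap between later tokens, and warmup runs are discarded before averaging. The function names `stream_metrics` and `summarize` are hypothetical and do not come from the benchmark's code.

```python
from statistics import mean


def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Per-request metrics from token arrival timestamps (seconds, ascending)."""
    if not token_times:
        raise ValueError("no tokens received")
    n = len(token_times)
    ttft_s = token_times[0] - request_start
    total_s = token_times[-1] - request_start
    # Assumed TPOT definition: mean inter-token gap after the first token.
    tpot_ms = 1000.0 * (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft_s": ttft_s,
        "tpot_ms": tpot_ms,
        "total_s": total_s,
        "gen_tok_s": n / total_s,  # end-to-end rate, TTFT included
    }


def summarize(cell_runs: list[dict[str, float]], warmups: int) -> dict[str, float]:
    """Drop the first `warmups` runs, then average each metric over the rest."""
    measured = cell_runs[warmups:]
    if not measured:
        raise ValueError("no measured runs left after warmups")
    return {key: mean(run[key] for run in measured) for key in measured[0]}
```

In use, a cell with `warmups: 2` and `runs/cell: 5` would pass seven per-request dictionaries to `summarize(runs, warmups=2)`, so only the five measured runs contribute to the reported averages.
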