Gemma-3 4b-it

Q4_K_M · 4B params · GGUF
checkpoint: unsloth/gemma-3-4b-it-GGUF:gemma-3-4b-it-Q4_K_M.gguf

All runs (40)

| Hardware | Backend | Mode | Shape | Conc. | Gen tok/s | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | codegen | 1 | 169.4 | 539.9 | 129 ms | 5.5 | 64 | 1000 | 5.90 s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | codegen | 1 | 168.7 | 2014.9 | 35 ms | 5.9 | 64 | 1000 | 5.93 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | chat | 1 | 167.9 | 788.5 | 38 ms | 5.4 | 29 | 100 | 573 ms | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | chat | 1 | 166.9 | 1377.8 | 21 ms | 5.8 | 29 | 100 | 595 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | codegen | 1 | 166.7 | 502.5 | 127 ms | 5.6 | 64 | 1000 | 6.00 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | chat | 1 | 164.4 | 816.3 | 38 ms | 5.5 | 29 | 100 | 582 ms | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | agent | 1 | 163.7 | 10203.9 | 60 ms | 5.9 | 611 | 376 | 2.29 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | codegen | 1 | 160.9 | 531.5 | 123 ms | 5.7 | 64 | 1000 | 6.21 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 160.0 | 807.9 | 38 ms | 5.6 | 29 | 100 | 597 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 1 | 153.2 | 2267.3 | 228 ms | 5.6 | 611 | 436 | 2.81 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 1 | 152.8 | 2316.2 | 223 ms | 5.6 | 611 | 436 | 2.77 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 1 | 143.7 | 2266.9 | 228 ms | 5.8 | 611 | 436 | 2.84 s | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp cuda-1a68ec9 (cuda) | baseline | rag | 1 | 143.0 | 10212.2 | 83 ms | 5.9 | 846 | 67 | 532 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | chat | 1 | 128.5 | 652.1 | 45 ms | 7.0 | 29 | 100 | 742 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | codegen | 1 | 120.9 | 540.1 | 127 ms | 7.8 | 64 | 1000 | 8.27 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | agent | 1 | 111.1 | 2080.5 | 249 ms | 7.8 | 611 | 436 | 3.91 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | rag | 1 | 101.6 | 1932.8 | 387 ms | 5.6 | 846 | 70 | 792 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | rag | 1 | 97.1 | 2050.5 | 335 ms | 5.7 | 846 | 70 | 701 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | rag | 1 | 95.5 | 1871.9 | 339 ms | 5.5 | 846 | 70 | 688 ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | rag | 1 | 84.0 | 1908.2 | 340 ms | 7.5 | 846 | 70 | 803 ms | 0.000 GiB |
| GeForce RTX 5070 · 12 GiB · 250 W · drv 595 | llama.cpp b9174 (vulkan) | baseline | agent | 4 | 70.6 | 539.3 | 1.62 s | 11.3 | 611 | 437 | 5.65 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-450w | agent | 4 | 65.9 | 152.1 | 3.87 s | 5.6 | 611 | 436 | 6.77 s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | codegen | 1 | 64.6 | | 99 ms | 15.4 | | 1000 | 15.47 s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | chat | 1 | 64.2 | | 59 ms | 15.1 | | 100 | 1.56 s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | agent | 1 | 61.3 | | 426 ms | 15.5 | | 354 | 5.97 s | 0.002 GiB |
| GeForce RTX 3090 · 24 GiB · 300 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 4 | 60.8 | 153.6 | 4.20 s | 5.8 | 611 | 436 | 7.07 s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | rag | 1 | 55.5 | | 325 ms | 15.7 | | 67 | 1.60 s | 0.002 GiB |
| GeForce RTX 3090 · 24 GiB · 350 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-350w | agent | 4 | 54.6 | 168.7 | 4.16 s | 5.6 | 611 | 436 | 6.39 s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline-pl-200w | agent | 4 | 53.0 | 121.7 | 5.41 s | 7.8 | 611 | 436 | 9.51 s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp b1203 (rocm) | baseline | agent | 4 | 17.8 | | 3.36 s | 45.0 | | 376 | 21.06 s | 0.001 GiB |
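
As a rough way to read the RTX 3090 power-limit sweep above, the sketch below divides the single-stream codegen Gen tok/s by the configured power cap and compares each cap with the fastest entry. The caps and throughputs are copied from the table; the helper names and the choice of the cap (rather than measured draw) as the denominator are assumptions made for illustration, not figures the benchmark itself reports.

```python
# RTX 3090, codegen shape, concurrency 1: (power cap in W, Gen tok/s) from the table.
SWEEP = [(200, 120.9), (300, 160.9), (350, 166.7), (450, 169.4)]


def tok_per_s_per_watt(cap_w: float, gen_tok_s: float) -> float:
    """Hypothetical efficiency figure: generation throughput per watt of power cap."""
    return gen_tok_s / cap_w


def relative_to_max(sweep):
    """Return {cap: fraction of the highest throughput in the sweep} for each cap."""
    best = max(tps for _, tps in sweep)
    return {cap: tps / best for cap, tps in sweep}


if __name__ == "__main__":
    rel = relative_to_max(SWEEP)
    for cap, tps in SWEEP:
        print(
            f"{cap} W: {tps:.1f} tok/s, "
            f"{tok_per_s_per_watt(cap, tps):.3f} tok/s/W, "
            f"{rel[cap]:.0%} of best"
        )
```

On these numbers the 300 W cap keeps roughly 95% of the 450 W codegen throughput, while the 200 W cap falls to a little over 70%.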

Environment

GeForce RTX 3090 · 24 GiB
  cpu: AMD EPYC 7302P 16-Core Processor
  gpu: NVIDIA GeForce RTX 3090
  arch: NVIDIA
  vram: 24 GiB (system 64.0 GiB)
  power: 200 W / 450 W max (44% cap)
  pcie: Gen 4 x16 / Gen 4 x16 max
  clocks: gfx 1965/2100 MHz · mem 9501 MHz
  temp: 44°C idle · 46°C peak
  peak draw: 196 W
  backend: llama.cpp cuda-4f13cb7 (cuda)
  server: lemonade unknown
  os: Ubuntu 24.04 LTS
  kernel: 6.17.13-7-pve
  driver: NVIDIA 590.48.01 + CUDA 13.1
  libc: 2.39
  python: 3.12.3
  containerized: true
  llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
  build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
GeForce RTX 3090 · 24 GiB
  cpu: AMD EPYC 7302P 16-Core Processor
  gpu: NVIDIA GeForce RTX 3090
  arch: NVIDIA
  vram: 24 GiB (system 64.0 GiB)
  power: 350 W / 450 W max (78% cap)
  pcie: Gen 4 x16 / Gen 4 x16 max
  clocks: gfx 1980/2100 MHz · mem 9501 MHz
  temp: 54°C idle · 62°C peak
  peak draw: 335 W
  backend: llama.cpp cuda-4f13cb7 (cuda)
  server: lemonade unknown
  os: Ubuntu 24.04 LTS
  kernel: 6.17.13-7-pve
  driver: NVIDIA 590.48.01 + CUDA 13.1
  libc: 2.39
  python: 3.12.3
  containerized: true
  llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
  build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
GeForce RTX 3090 · 24 GiB
  cpu: AMD EPYC 7302P 16-Core Processor
  gpu: NVIDIA GeForce RTX 3090
  arch: NVIDIA
  vram: 24 GiB (system 64.0 GiB)
  power: 450 W / 450 W max
  pcie: Gen 4 x16 / Gen 4 x16 max
  clocks: gfx 1965/2100 MHz · mem 9501 MHz
  temp: 60°C idle · 77°C peak
  peak draw: 433 W
  backend: llama.cpp cuda-4f13cb7 (cuda)
  server: lemonade unknown
  os: Ubuntu 24.04 LTS
  kernel: 6.17.13-7-pve
  driver: NVIDIA 590.48.01 + CUDA 13.1
  libc: 2.39
  python: 3.12.3
  containerized: true
  llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
  build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
GeForce RTX 3090 · 24 GiB
  cpu: AMD EPYC 7302P 16-Core Processor
  gpu: NVIDIA GeForce RTX 3090
  arch: NVIDIA
  vram: 24 GiB (system 64.0 GiB)
  power: 300 W / 450 W max (67% cap)
  pcie: Gen 4 x16 / Gen 4 x16 max
  clocks: gfx 1950/2100 MHz · mem 9501 MHz
  temp: 37°C idle · 64°C peak
  peak draw: 291 W
  backend: llama.cpp cuda-4f13cb7 (cuda)
  server: lemonade unknown
  os: Ubuntu 24.04 LTS
  kernel: 6.17.13-7-pve
  driver: NVIDIA 590.48.01 + CUDA 13.1
  libc: 2.39
  python: 3.12.3
  containerized: true
  llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
  build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
GeForce RTX 5070 · 12 GiB
  cpu: AMD Ryzen 9 7900 12-Core Processor
  gpu: NVIDIA GeForce RTX 5070
  arch: NVIDIA
  vram: 11.94 GiB (system 30.5 GiB)
  power: 250 W / 300 W max (83% cap)
  pcie: Gen 1 x16 / Gen 4 x16 max
  clocks: gfx 180/3090 MHz · mem 405 MHz
  temp: 31°C idle · 64°C peak
  peak draw: 194 W
  backend: llama.cpp cuda-1a68ec9 (cuda)
  server: lemonade unknown
  os: CachyOS
  kernel: 7.0.8-1-cachyos
  driver: NVIDIA 595.71.05 + CUDA 13.2
  libc: 2.43
  python: 3.14.4
  containerized: false
  build flags: GGML_CUDA=ON CMAKE_CUDA_ARCHITECTURES=120 CMAKE_BUILD_TYPE=Release
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
  cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
  gpu: AMD Radeon 8060S
  arch: Strix Halo (gfx1151)
  vram: 96 GiB (system 31.1 GiB, unified)
  backend: llama.cpp b1203 (rocm)
  server: lemonade 10.4.0
  os: Ubuntu 24.04.4 LTS
  kernel: 7.0.2-2-pve
  python: 3.12.3
  containerized: true
  runs/cell: 3
  warmups: 1
  endpoint: /v1/chat/completions
  streaming: true
GeForce RTX 5070 · 12 GiB
  cpu: AMD Ryzen 9 7900 12-Core Processor
  gpu: NVIDIA GeForce RTX 5070
  arch: NVIDIA
  vram: 11.94 GiB (system 30.5 GiB)
  power: 250 W / 300 W max (83% cap)
  pcie: Gen 1 x16 / Gen 4 x16 max
  clocks: gfx 180/3090 MHz · mem 405 MHz
  temp: 39°C idle · 62°C peak
  peak draw: 175 W
  backend: llama.cpp vulkan-1a68ec9 (vulkan)
  server: lemonade unknown
  os: CachyOS
  kernel: 7.0.8-1-cachyos
  driver: NVIDIA 595.71.05 + CUDA 13.2
  libc: 2.43
  python: 3.14.4
  containerized: false
  llama.cpp: version: 1 (1a68ec9) built with GNU 15.2.1 for Linux x86_64
  build flags: GGML_VULKAN=ON CMAKE_BUILD_TYPE=Release
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
GeForce RTX 5070 · 12 GiB
  cpu: AMD Ryzen 9 7900 12-Core Processor
  gpu: NVIDIA GeForce RTX 5070
  arch: NVIDIA
  vram: 11.94 GiB (system 30.4 GiB)
  power: 250 W / 300 W max (83% cap)
  backend: llama.cpp b9174 (vulkan)
  server: lemonade unknown
  os: CachyOS
  kernel: 7.0.0-1-cachyos
  driver: 595.58.03
  python: 3.14.4
  containerized: false
  runs/cell: 5
  warmups: 2
  endpoint: /v1/chat/completions
  streaming: true
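
The TTFT and TPOT columns are not defined on this page. A common convention is that TTFT is the delay until the first streamed token and TPOT is the mean gap between the tokens that follow it. The sketch below applies that convention to a list of token arrival timestamps and aggregates measured runs after dropping warmups. The function names, the median aggregation, and the assumption that warmup runs come before the measured runs are all illustrative and are not taken from the benchmark harness.

```python
from statistics import median


def ttft_and_tpot_ms(start_s, arrivals_s):
    """Compute (TTFT, TPOT) in milliseconds from token arrival timestamps.

    start_s: time the request was sent; arrivals_s: one timestamp per streamed token.
    TPOT is the mean inter-token gap after the first token (assumed convention).
    """
    if not arrivals_s:
        raise ValueError("no tokens were streamed")
    ttft = (arrivals_s[0] - start_s) * 1000.0
    if len(arrivals_s) == 1:
        return ttft, 0.0
    tpot = (arrivals_s[-1] - arrivals_s[0]) / (len(arrivals_s) - 1) * 1000.0
    return ttft, tpot


def summarize(samples, warmups):
    """Drop the first `warmups` samples and return the median of the rest."""
    measured = samples[warmups:]
    if not measured:
        raise ValueError("no measured runs left after warmups")
    return median(measured)


if __name__ == "__main__":
    # Synthetic stream: first token after 40 ms, then one token every 6 ms.
    arrivals = [0.040 + 0.006 * i for i in range(100)]
    ttft, tpot = ttft_and_tpot_ms(0.0, arrivals)
    print(f"TTFT {ttft:.1f} ms, TPOT {tpot:.2f} ms")
    # Two warmups followed by five measured runs, mirroring warmups 2 and runs/cell 5.
    print(summarize([9.0, 8.0, 5.5, 5.6, 5.4, 5.7, 5.5], warmups=2))
```

In a real measurement the timestamps would come from a monotonic clock read as each streamed chunk arrives from `/v1/chat/completions`; the synthetic list here only keeps the example self-contained.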