NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning
Q4_K_M · 30B params · GGUF
Intelligence: see Artificial Analysis.
Checkpoint: unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF:Q4_K_M
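To reproduce these runs, the checkpoint can be fetched from Hugging Face first. A minimal sketch using huggingface_hub; the `*Q4_K_M*` filename filter is an assumption about the repo layout, so check the repo's file list before relying on it:

```python
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Fetch only the Q4_K_M files from the GGUF repo. The "*Q4_K_M*" pattern
# is an assumption about how the files are named; adjust if needed.
local_dir = snapshot_download(
    repo_id="unsloth/NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF",
    allow_patterns=["*Q4_K_M*"],
)
print(local_dir)  # pass the .gguf inside this directory to llama-server -m
```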
All runs (5)
TTFT = time to first token; TPOT = time per output token.
| Hardware | Backend | Workload | Conc. | Gen tok/s | TTFT | TPOT (ms) | Output tok | Total time | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | codegen | 1 | 134.2 | 217ms | 7.0 | 1000 | 7.45s | 0.010 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | chat | 1 | 123.5 | 156ms | 6.5 | 100 | 810ms | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | agent | 1 | 121.5 | 462ms | 7.0 | 500 | 4.12s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | rag | 1 | 109.6 | 439ms | 6.7 | 200 | 1.82s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB | llama.cpp 59778f0 (cuda) | agent | 4 | 1.2 | 678ms | 0.0 | 1 | 1.08s | 0.000 GiB |
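The columns are internally consistent: total wall time is roughly TTFT plus output tokens times TPOT, and the generation rate is roughly the inverse of TPOT. A quick check in Python against the codegen row (small differences come from rounding and streaming overhead):

```python
# Sanity-check the codegen row: throughput implied by TTFT,
# output tokens, and total wall time (values copied from the table).
ttft_s = 0.217   # time to first token
total_s = 7.45   # total request wall time
out_tok = 1000   # generated tokens

decode_s = total_s - ttft_s                 # time spent in token generation
implied_tok_s = out_tok / decode_s          # ~138 tok/s vs. 134.2 reported
implied_tpot_ms = decode_s / out_tok * 1e3  # ~7.2 ms vs. 7.0 reported

print(f"{implied_tok_s:.1f} tok/s, {implied_tpot_ms:.1f} ms/token")
```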
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 59778f0 (cuda)
server: lemonade (version unknown)
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
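Since the benchmark streams from an OpenAI-compatible /v1/chat/completions endpoint, TTFT and TPOT can be approximated with a short streaming client. A sketch using the openai package; the base URL, port, and model id below are assumptions and must match your lemonade server config:

```python
import time
from openai import OpenAI  # pip install openai

# Assumption: the server listens on localhost:8000 and exposes the model
# under this id; adjust both for your setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

start = time.perf_counter()
first = None
n_chunks = 0

stream = client.chat.completions.create(
    model="NVIDIA-Nemotron-3-Nano-Omni-30B-A3B-Reasoning-GGUF",
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    max_tokens=100,
    stream=True,
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue
    if first is None:
        first = time.perf_counter()  # TTFT: first content chunk arrives
    n_chunks += 1
end = time.perf_counter()

ttft = first - start
# Chunks approximate tokens; average per-token latency over the decode phase.
tpot = (end - first) / max(n_chunks - 1, 1)
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.1f} ms ({n_chunks} chunks)")
```

This measures a single request at concurrency 1; the table's higher-concurrency rows would need several such clients running in parallel.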