Qwen3.6 27B-MTP
Q4_K_M · 27B params · GGUF
reasoning
checkpoint: unsloth/Qwen3.6-27B-MTP-GGUF:Q4_K_M

All runs (127)
| Hardware | Backend | Mode | Shape | Conc. | Gen tok/s ↓ | Prefill tok/s | TTFT | TPOT (ms) | Prompt tok | Out tok | Total | VRAM Δ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | codegen | 1 | 63.7 | 195.4 | 335ms | 0.1 | 62 | 1000 | 15.69s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | agent | 1 | 61.7 | 1157.5 | 525ms | 0.1 | 599 | 500 | 8.11s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | chat | 1 | 59.5 | 119.7 | 259ms | 0.1 | 30 | 100 | 1.68s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | codegen | 1 | 59.2 | 175.8 | 372ms | 0.1 | 62 | 1000 | 16.88s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | chat | 1 | 58.7 | 123.8 | 259ms | 0.1 | 30 | 100 | 1.70s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | agent | 1 | 57.8 | 1208.4 | 537ms | 0.1 | 599 | 500 | 8.66s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | rag | 1 | 55.9 | 1186.1 | 879ms | 0.1 | 842 | 200 | 3.58s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | rag | 1 | 54.6 | 1096.3 | 934ms | 0.1 | 842 | 200 | 3.66s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx1k_answer | 1 | 50.2 | 656.8 | 1.48s | 0.1 | 969 | 500 | 9.96s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx1k_answer | 1 | 47.5 | 671.1 | 1.45s | 0.1 | 969 | 500 | 10.54s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | codegen | 1 | 40.4 | 188.9 | 355ms | 23.7 | 62 | 1000 | 24.76s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 1 | 39.3 | 1196.7 | 505ms | 23.8 | 599 | 500 | 12.73s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | chat | 1 | 38.7 | 127.0 | 238ms | 23.4 | 30 | 100 | 2.58s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx4k_answer | 1 | 38.0 | 827.9 | 4.57s | 0.1 | 3778 | 500 | 13.15s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx4k_answer | 1 | 36.7 | 823.7 | 4.59s | 0.1 | 3778 | 500 | 13.62s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | rag | 1 | 35.6 | 1303.0 | 738ms | 23.6 | 842 | 200 | 5.62s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx1k_answer | 1 | 35.2 | 824.9 | 1.18s | 25.3 | 969 | 500 | 14.21s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | chat | 1 | 34.2 | 109.6 | 283ms | 0.1 | 30 | 100 | 2.92s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | chat | 1 | 32.0 | 113.4 | 271ms | 0.1 | 30 | 100 | 3.12s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | codegen | 1 | 31.8 | 185.9 | 377ms | 0.1 | 62 | 1000 | 31.41s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | rag | 1 | 31.2 | 856.1 | 1.05s | 0.1 | 842 | 200 | 6.40s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | agent | 1 | 31.2 | 964.4 | 621ms | 0.1 | 599 | 500 | 16.03s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-3-pl-200w | codegen | 1 | 31.1 | 182.4 | 384ms | 0.1 | 62 | 1000 | 32.18s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx4k_answer | 1 | 30.8 | 1238.8 | 3.05s | 25.5 | 3778 | 500 | 16.22s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | agent | 1 | 30.4 | 972.1 | 616ms | 0.0 | 599 | 500 | 16.44s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | mtp-2-pl-200w | rag | 1 | 29.8 | 860.6 | 1.05s | 0.0 | 842 | 200 | 6.71s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=2 | agent | 4 | 25.3 | 57.3 | 12.42s | 0.1 | 599 | 500 | 20.39s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | MTP n=3 | agent | 4 | 24.0 | 59.7 | 13.43s | 0.1 | 599 | 500 | 21.54s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | codegen | 1 | 21.5 | 191.7 | 345ms | 45.9 | 62 | 1000 | 46.44s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx1k_answer | 1 | 21.3 | 286.6 | 3.38s | 0.0 | 969 | 500 | 23.50s | 0.013 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | chat | 1 | 21.2 | 78.2 | 386ms | 3.0 | 30 | 100 | 4.72s | 0.006 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | chat | 1 | 21.1 | 112.9 | 266ms | 43.9 | 30 | 100 | 4.74s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | agent | 1 | 21.0 | 984.2 | 609ms | 46.2 | 599 | 500 | 23.77s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | codegen | 1 | 20.6 | 133.5 | 475ms | 0.1 | 62 | 1000 | 48.47s | 0.041 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | agent | 1 | 20.4 | 1984.4 | 302ms | 0.0 | 599 | 500 | 24.51s | 0.022 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx16k_answer | 1 | 20.2 | 1344.6 | 11.27s | 26.6 | 15154 | 500 | 24.81s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | codegen | 1 | 20.0 | 136.7 | 460ms | 0.1 | 62 | 1000 | 49.96s | 0.044 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | rag | 1 | 20.0 | 1035.7 | 923ms | 44.9 | 842 | 200 | 10.01s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | rag | 1 | 19.9 | 426.2 | 1.69s | 0.0 | 842 | 200 | 10.05s | 0.011 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | chat | 1 | 19.7 | 81.6 | 375ms | 0.0 | 30 | 100 | 5.07s | 0.006 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | agent | 1 | 19.4 | 1949.4 | 315ms | 0.0 | 599 | 500 | 25.73s | 0.024 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx1k_answer | 1 | 19.3 | 287.4 | 3.37s | 0.0 | 969 | 500 | 25.91s | 0.013 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | rag | 1 | 18.8 | 432.5 | 1.68s | 0.0 | 842 | 200 | 10.66s | 0.012 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx16k_answer | 1 | 17.2 | 791.3 | 19.15s | 0.1 | 15154 | 500 | 29.01s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx16k_answer | 1 | 17.2 | 790.1 | 19.18s | 0.1 | 15154 | 500 | 29.05s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 450 W · drv 590 | llama.cpp cuda-4f13cb7 (cuda) | baseline | agent | 4 | 16.2 | 35.9 | 19.73s | 23.8 | 599 | 500 | 32.01s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx4k_answer | 1 | 15.3 | 303.8 | 12.44s | 0.0 | 3778 | 500 | 32.64s | 0.012 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx4k_answer | 1 | 14.1 | 302.7 | 12.48s | 0.0 | 3778 | 500 | 35.35s | 0.014 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx32k_answer | 1 | 13.1 | 1269.6 | 23.87s | 28.0 | 30305 | 500 | 38.03s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | codegen | 1 | 12.0 | 146.7 | 428ms | 82.7 | 62 | 1000 | 83.17s | 0.016 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | agent | 1 | 12.0 | 2090.1 | 287ms | 82.8 | 599 | 500 | 41.69s | 0.009 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | chat | 1 | 11.7 | 87.3 | 345ms | 82.4 | 30 | 100 | 8.52s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx1k_answer | 1 | 11.3 | 319.1 | 3.04s | 82.8 | 969 | 500 | 44.42s | 0.005 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | rag | 1 | 11.1 | 470.6 | 1.51s | 82.8 | 842 | 200 | 18.01s | 0.006 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx32k_answer | 1 | 9.9 | 737.1 | 41.12s | 0.1 | 30305 | 500 | 50.70s | 0.000 GiB |
| GeForce RTX 3090 · 24 GiB · 200 W · drv 590 | llama.cpp 4f13cb7-mtp (cuda) | baseline-pl-200w | agent | 4 | 9.8 | — | 4.04s | 89.5 | — | 341 | 34.74s | 0.040 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx32k_answer | 1 | 9.8 | 739.3 | 40.99s | 0.1 | 30305 | 500 | 51.28s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx4k_answer | 1 | 9.4 | 339.2 | 11.14s | 83.7 | 3778 | 500 | 53.00s | 0.006 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=3 | agent | 4 | 8.4 | 18.8 | 36.60s | 0.1 | 599 | 500 | 61.49s | 0.090 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | MTP n=2 | agent | 4 | 8.2 | 21.9 | 38.32s | 0.1 | 599 | 500 | 63.14s | 0.097 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx64k_answer | 1 | 7.1 | 1107.2 | 54.68s | 30.5 | 60601 | 500 | 70.34s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx16k_answer | 1 | 6.4 | 280.1 | 54.12s | 0.0 | 15154 | 500 | 77.75s | 0.013 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx16k_answer | 1 | 6.3 | 280.2 | 54.09s | 0.0 | 15154 | 500 | 79.44s | 0.014 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx1k_probe | 1 | 6.3 | 870.2 | 1.09s | 24.9 | 951 | 8 | 1.27s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx16k_answer | 1 | 5.4 | 311.1 | 48.72s | 86.7 | 15154 | 500 | 92.05s | 0.006 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx1k_probe | 1 | 5.4 | 693.4 | 1.37s | 0.1 | 951 | 8 | 1.49s | 0.010 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx1k_probe | 1 | 5.3 | 694.6 | 1.37s | 0.1 | 951 | 8 | 1.50s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified | llama.cpp 4f13cb7-mtp (rocm) | baseline | agent | 4 | 5.0 | 12.6 | 62.69s | 82.8 | 599 | 500 | 104.08s | 0.037 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx64k_answer | 1 | 4.7 | 631.4 | 95.98s | 0.1 | 60601 | 500 | 105.45s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx64k_answer | 1 | 4.7 | 632.7 | 95.67s | 0.1 | 60601 | 500 | 105.88s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx100k_answer | 1 | 4.2 | 937.4 | 100.91s | 34.4 | 94590 | 500 | 118.33s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx32k_answer | 1 | 3.4 | 250.0 | 121.24s | 0.0 | 30305 | 500 | 147.14s | 0.013 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx32k_answer | 1 | 3.4 | 249.5 | 121.48s | 0.1 | 30305 | 500 | 148.56s | 0.015 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx32k_answer | 1 | 3.2 | 276.3 | 109.69s | 90.9 | 30305 | 500 | 155.13s | 0.005 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx128k_answer | 1 | 3.1 | 852.8 | 142.02s | 36.5 | 121117 | 500 | 160.72s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx100k_answer | 1 | 2.7 | 536.0 | 176.47s | 0.1 | 94590 | 500 | 187.99s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx100k_answer | 1 | 2.7 | 535.0 | 176.82s | 0.1 | 94590 | 500 | 188.56s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx4k_probe | 1 | 2.4 | 1256.4 | 3.05s | 27.4 | 3835 | 8 | 3.29s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx1k_probe | 1 | 2.3 | 336.1 | 2.83s | 82.8 | 951 | 8 | 3.42s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx1k_probe | 1 | 2.3 | 299.4 | 3.18s | 0.0 | 951 | 8 | 3.50s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx1k_probe | 1 | 2.3 | 299.5 | 3.17s | 0.0 | 951 | 8 | 3.55s | 0.003 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx128k_answer | 1 | 1.9 | 478.8 | 252.98s | 0.1 | 121117 | 500 | 263.24s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx128k_answer | 1 | 1.9 | 476.8 | 254.00s | 0.1 | 121117 | 500 | 265.22s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx4k_probe | 1 | 1.7 | 833.5 | 4.60s | 0.1 | 3835 | 8 | 4.72s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx4k_probe | 1 | 1.7 | 821.7 | 4.67s | 0.1 | 3835 | 8 | 4.81s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx64k_answer | 1 | 1.6 | 224.9 | 269.20s | 99.3 | 60601 | 500 | 318.82s | 0.005 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx64k_answer | 1 | 1.5 | 204.8 | 295.96s | 0.0 | 60601 | 500 | 322.74s | 0.010 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx64k_answer | 1 | 1.5 | 204.7 | 296.05s | 0.0 | 60601 | 500 | 326.27s | 0.009 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx100k_answer | 1 | 0.9 | 185.8 | 509.02s | 108.6 | 94590 | 500 | 563.37s | 0.006 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx100k_answer | 1 | 0.9 | 170.1 | 556.16s | 0.0 | 94590 | 500 | 585.80s | 0.010 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx100k_answer | 1 | 0.8 | 169.4 | 558.28s | 0.0 | 94590 | 500 | 590.24s | 0.010 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx16k_probe | 1 | 0.7 | 1350.7 | 11.21s | 26.5 | 15136 | 8 | 11.40s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx4k_probe | 1 | 0.7 | 340.5 | 11.26s | 85.8 | 3835 | 8 | 11.89s | 0.004 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx128k_answer | 1 | 0.6 | 163.8 | 739.48s | 115.9 | 121117 | 500 | 797.46s | 0.005 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx4k_probe | 1 | 0.6 | 302.9 | 12.66s | 0.0 | 3835 | 8 | 12.98s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx4k_probe | 1 | 0.6 | 304.1 | 12.61s | 3.6 | 3835 | 8 | 13.02s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx128k_answer | 1 | 0.6 | 150.2 | 806.44s | 0.0 | 121117 | 500 | 837.93s | 0.010 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx128k_answer | 1 | 0.6 | 149.5 | 810.10s | 0.0 | 121117 | 500 | 844.87s | 0.010 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx16k_probe | 1 | 0.4 | 795.0 | 19.04s | 0.1 | 15136 | 8 | 19.16s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx16k_probe | 1 | 0.4 | 793.2 | 19.08s | 0.0 | 15136 | 8 | 19.24s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx32k_probe | 1 | 0.3 | 1269.9 | 23.85s | 27.7 | 30287 | 8 | 24.05s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx32k_probe | 1 | 0.2 | 739.3 | 40.97s | 0.1 | 30287 | 8 | 41.10s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx32k_probe | 1 | 0.2 | 738.5 | 41.01s | 0.0 | 30287 | 8 | 41.17s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx16k_probe | 1 | 0.2 | 312.4 | 48.44s | 87.2 | 15136 | 8 | 49.07s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx16k_probe | 1 | 0.1 | 281.4 | 53.80s | 0.1 | 15136 | 8 | 54.24s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx16k_probe | 1 | 0.1 | 279.9 | 54.08s | 0.0 | 15136 | 8 | 54.42s | 0.003 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx64k_probe | 1 | 0.1 | 1098.8 | 55.14s | 31.2 | 60585 | 8 | 55.37s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx64k_probe | 1 | 0.1 | 635.1 | 95.39s | 0.1 | 60585 | 8 | 95.58s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx64k_probe | 1 | 0.1 | 633.4 | 95.65s | 0.0 | 60585 | 8 | 95.82s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx100k_probe | 1 | 0.1 | 939.3 | 100.77s | 33.4 | 94645 | 8 | 101.01s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx32k_probe | 1 | 0.1 | 276.3 | 109.63s | 91.1 | 30287 | 8 | 110.28s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx32k_probe | 1 | 0.1 | 250.1 | 121.13s | 0.0 | 30287 | 8 | 121.58s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx32k_probe | 1 | 0.1 | 249.8 | 121.26s | 0.0 | 30287 | 8 | 121.61s | 0.002 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | baseline | ctx128k_probe | 1 | 0.1 | 856.6 | 141.38s | 36.8 | 121099 | 8 | 141.66s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx100k_probe | 1 | 0.0 | 537.7 | 176.03s | 0.1 | 94645 | 8 | 176.22s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx100k_probe | 1 | 0.0 | 536.4 | 176.45s | 0.0 | 94645 | 8 | 176.66s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=2 | ctx128k_probe | 1 | 0.0 | 477.9 | 253.42s | 0.1 | 121099 | 8 | 253.61s | 0.000 GiB |
| 2× GeForce RTX 3090 · 24 GiB each · 200 W × 2 · drv 595 | llama.cpp direct (cuda) | MTP n=3 | ctx128k_probe | 1 | 0.0 | 479.0 | 252.82s | 0.0 | 121099 | 8 | 252.96s | 0.000 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx64k_probe | 1 | 0.0 | 224.8 | 269.46s | 99.5 | 60585 | 8 | 270.17s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx64k_probe | 1 | 0.0 | 204.8 | 295.80s | 0.0 | 60585 | 8 | 296.33s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx64k_probe | 1 | 0.0 | 204.8 | 295.90s | 0.0 | 60585 | 8 | 296.29s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx100k_probe | 1 | 0.0 | 185.8 | 509.52s | 108.8 | 94645 | 8 | 510.31s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx100k_probe | 1 | 0.0 | 169.7 | 557.77s | 0.0 | 94645 | 8 | 558.32s | 0.002 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx100k_probe | 1 | 0.0 | 169.9 | 557.01s | 0.0 | 94645 | 8 | 557.61s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | baseline | ctx128k_probe | 1 | 0.0 | 163.6 | 740.26s | 115.9 | 121099 | 8 | 741.09s | 0.001 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=2 | ctx128k_probe | 1 | 0.0 | 149.9 | 807.88s | 0.0 | 121099 | 8 | 808.42s | 0.003 GiB |
| Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM) · unified · drv 7 | llama.cpp direct (rocm) | MTP n=3 | ctx128k_probe | 1 | 0.0 | 150.2 | 806.45s | 0.0 | 121099 | 8 | 806.93s | 0.002 GiB |
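
The Gen tok/s column appears to be end-to-end throughput, that is, Out tok divided by Total (so it includes TTFT), rather than pure decode speed. The page does not state this; it is inferred from the numbers. The Python sketch below checks that reading against a few rows copied from the table and derives the MTP speedup over baseline. The `Run` class, the short hardware labels, and the function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    hardware: str            # short label, not the full table cell
    mode: str
    shape: str
    out_tok: int
    total_s: float
    reported_gen_tps: float  # "Gen tok/s" as printed in the table


# Single-stream (Conc. = 1) codegen rows copied from the table.
RUNS = [
    Run("RTX 3090 450 W", "baseline", "codegen", 1000, 24.76, 40.4),
    Run("RTX 3090 450 W", "MTP n=2", "codegen", 1000, 15.69, 63.7),
    Run("RTX 3090 450 W", "MTP n=3", "codegen", 1000, 16.88, 59.2),
    Run("Strix Halo", "baseline", "codegen", 1000, 83.17, 12.0),
    Run("Strix Halo", "MTP n=3", "codegen", 1000, 48.47, 20.6),
]


def end_to_end_tps(run: Run) -> float:
    """Assumed definition: output tokens over total wall time (TTFT included)."""
    return run.out_tok / run.total_s


def speedup(runs: list[Run], hardware: str, shape: str, mode: str) -> float:
    """Throughput of `mode` relative to the baseline row on the same hardware and shape."""
    def pick(wanted_mode: str) -> Run:
        return next(
            r for r in runs
            if r.hardware == hardware and r.shape == shape and r.mode == wanted_mode
        )
    return end_to_end_tps(pick(mode)) / end_to_end_tps(pick("baseline"))


if __name__ == "__main__":
    for r in RUNS:
        print(f"{r.hardware:15s} {r.mode:9s} derived={end_to_end_tps(r):5.1f} "
              f"reported={r.reported_gen_tps:5.1f}")
    print(f"RTX 3090 MTP n=2 speedup: {speedup(RUNS, 'RTX 3090 450 W', 'codegen', 'MTP n=2'):.2f}x")
```

Under this reading, the 8-token `*_probe` rows sort to the bottom because their Total is almost entirely TTFT. The speedup is also not uniform across shapes: on the 2× RTX 3090 rig at ctx16k_answer, both MTP rows report lower Gen tok/s than baseline (17.2 vs 20.2), alongside lower prefill throughput (about 790 vs 1344.6 tok/s).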
Environment
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 200 W / 450 W max (44% cap)
backend: llama.cpp 4f13cb7-mtp (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: 590.48.01
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
GeForce RTX 3090 · 24 GiB
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090
arch: NVIDIA
vram: 24 GiB (system 64.0 GiB)
power: 450 W / 450 W max
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1980/2100 MHz · mem 9501 MHz
temp: 38°C idle · 83°C peak
peak draw: 436 W
backend: llama.cpp cuda-4f13cb7 (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 6.17.13-7-pve
driver: NVIDIA 590.48.01 + CUDA 13.1
libc: 2.39
python: 3.12.3
containerized: true
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
2× GeForce RTX 3090 · 24 GiB each
cpu: AMD EPYC 7302P 16-Core Processor
gpu: NVIDIA GeForce RTX 3090 × 2
arch: NVIDIA
vram: 48 GiB (system 64.0 GiB)
power: 200 W × 2 / 450 W × 2 max (44% cap)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1800/2100 MHz · mem 9501 MHz
temp: 41°C idle · 53°C peak
peak draw: 195 W
backend: llama.cpp direct (cuda)
server: lemonade unknown
os: Ubuntu 24.04 LTS
driver: NVIDIA 595.71.05 + CUDA 13.2
libc: 2.39
python: 3.12.3
llama.cpp: version: 18 (4f13cb7) built with GNU 13.3.0 for Linux x86_64
build flags: GGML_CUDA=ON CMAKE_BUILD_TYPE=Release
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
pcie: Gen 4 x16 / Gen 4 x16 max
clocks: gfx 1158 MHz · mem 1000 MHz
temp: 47°C idle · 77°C peak
peak draw: 103 W
backend: llama.cpp direct (rocm)
server: lemonade unknown
os: Ubuntu 24.04 LTS
driver: ROCm 7.2.3
libc: 2.39
python: 3.12.3
llama.cpp: version: 1 (4f13cb7) built with Clang 22.0.0 for Linux x86_64
build flags: GGML_HIP=ON AMDGPU_TARGETS=gfx1151 CMAKE_BUILD_TYPE=Release
runs/cell: 3
warmups: 1
endpoint: /v1/chat/completions
streaming: true
Strix Halo · Radeon 8060S · 128 GiB unified (96 GiB VRAM)
cpu: AMD RYZEN AI MAX+ 395 w/ Radeon 8060S
gpu: AMD Radeon 8060S
arch: Strix Halo (gfx1151)
vram: 96 GiB (system 31.1 GiB, unified)
backend: llama.cpp 4f13cb7-mtp (rocm)
server: lemonade unknown
os: Ubuntu 24.04 LTS
kernel: 7.0.2-2-pve
python: 3.12.3
containerized: true
runs/cell: 5
warmups: 2
endpoint: /v1/chat/completions
streaming: true
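
Each environment lists a streaming endpoint plus runs/cell and warmups counts, which suggests that TTFT and TPOT are taken from token arrival times and combined across several runs. The harness itself is not shown, so the following Python sketch is only one plausible way to compute such figures. The function names, the use of a mean, and the choice to treat the first `warmups` entries as discarded runs are all assumptions.

```python
from statistics import mean


def stream_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Derive latency metrics from the arrival time (in seconds) of each streamed token."""
    if not token_times:
        raise ValueError("need at least one token timestamp")
    n = len(token_times)
    ttft_s = token_times[0] - request_start
    total_s = token_times[-1] - request_start
    # TPOT taken here as the mean gap between consecutive tokens after the first.
    tpot_ms = 1000.0 * (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "ttft_s": ttft_s,
        "tpot_ms": tpot_ms,
        "total_s": total_s,
        "gen_tps": n / total_s,  # end-to-end tokens per second
    }


def aggregate_cell(runs: list[dict[str, float]], warmups: int) -> dict[str, float]:
    """Average each metric over the runs that follow the warmup runs (assumed policy)."""
    measured = runs[warmups:]
    if not measured:
        raise ValueError("no measured runs left after discarding warmups")
    return {key: mean(r[key] for r in measured) for key in measured[0]}


if __name__ == "__main__":
    # Synthetic timestamps: first token after 0.5 s, then one token every 0.1 s.
    times = [0.5 + 0.1 * i for i in range(5)]
    print(stream_metrics(0.0, times))
```

With multi-token drafts, several tokens can arrive in a single streamed chunk, so a statistic based on per-chunk gaps could collapse toward zero. Whether that explains the 0.0 to 0.1 ms TPOT values on the MTP rows is a guess, not something the page confirms.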