Benchmark Setup
All hot-cache PP measurements use the Qwen3.6 35B A3B Q6_K_XL model,
pp-bench-conversation-code.txt, CUDA0, -cmoe,
--moe-hot-cache-auto-reserve-mib 3000,
--moe-hot-cache-max-mib -1, and the same
qwen36 expert list unless noted otherwise.
The tg128 rows are synthetic llama-bench
generation runs. They are useful for checking that PP changes do not
move decode behavior, but their absolute values depend on whether the
generated tokens hit experts represented in the hot-cache list.
Baseline PP
75.67 t/s with standard llama.cpp placement using -ncmoe 31.
Best Kept PP
108.58 t/s after compact cold reduce, memory cleanup, and stack offset tables.
Net PP Change
+43.5% compared with the standard baseline in this benchmark path.
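The net change can be reproduced from the two throughput figures above. The following is a minimal sketch; the helper name percent_change is illustrative and not part of any benchmark tooling.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

baseline_pp = 75.67   # t/s, standard placement with -ncmoe 31
best_kept_pp = 108.58  # t/s, best kept hot-cache configuration

net_change = percent_change(baseline_pp, best_kept_pp)
print(f"{net_change:+.1f}%")
```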
Coverage Sweep
A separate fixed-reserve sweep measured how much Qwen3.6 PP depends on real hot-slot hit rate. Raw expert coverage is only a rough proxy; the useful signal is how many actually selected expert slots hit the hot cache after routing.
| Cache | Hot Experts | Raw Expert Coverage | Hot-Slot Hit Rate | PP Throughput |
|---|---|---|---|---|
| 512 MiB | 148 / 10240 | 1.45% | 12.01% | 60.42 t/s |
| 1024 MiB | 337 / 10240 | 3.29% | 21.65% | 67.63 t/s |
| 1536 MiB | 525 / 10240 | 5.13% | 28.08% | 72.42 t/s |
| 1792 MiB | 620 / 10240 | 6.05% | 30.26% | 74.45 t/s |
| 2048 MiB | 714 / 10240 | 6.97% | 32.62% | 76.45 t/s |
| 3072 MiB | 1091 / 10240 | 10.65% | 39.81% | 83.87 t/s |
| 4096 MiB | 1468 / 10240 | 14.34% | 45.40% | 90.16 t/s |
In this sweep, break-even against the -ncmoe 31 baseline
was around 31-32% hot-slot hit rate. That corresponds to only about 7%
raw expert coverage for this Qwen3.6 run.
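The break-even estimate can be checked by interpolating between the two sweep rows that bracket the baseline (1792 MiB and 2048 MiB). This is only a sketch: it assumes throughput varies linearly between neighboring rows, which the sweep itself does not establish.

```python
def interpolate(x0: float, y0: float, x1: float, y1: float, y: float) -> float:
    """Return x where the line through (x0, y0) and (x1, y1) reaches y."""
    return x0 + (y - y0) * (x1 - x0) / (y1 - y0)

baseline_pp = 75.67  # t/s with -ncmoe 31

# (hot-slot hit rate %, raw expert coverage %, PP t/s) from the sweep table
below = (30.26, 6.05, 74.45)  # 1792 MiB row
above = (32.62, 6.97, 76.45)  # 2048 MiB row

break_even_hit_rate = interpolate(below[0], below[2], above[0], above[2], baseline_pp)
break_even_coverage = interpolate(below[1], below[2], above[1], above[2], baseline_pp)

print(f"hit rate ~{break_even_hit_rate:.1f}%, raw coverage ~{break_even_coverage:.1f}%")
```

Under that linear assumption the estimate falls inside the 31-32% hit-rate range quoted above, at a little under 7% raw coverage.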
Static Coverage Guard
The optional LLAMA_MOE_HOT_CACHE_PP_MIN_HOT_EXPERT_RATIO
guard bypasses the hot-cache graph during prompt processing when a
layer has too few hot experts. Decode and warmup are not affected.
| Ratio | Cache | PP Throughput | Interpretation |
|---|---|---|---|
| 0.07 | 1536 MiB | 60.69 t/s | Too aggressive; bypasses useful hot-cache work. |
| 0.07 | 1792 MiB | 60.69 t/s | Still below the useful hot-cache path. |
| 0.07 | 2048 MiB | 72.46 t/s | Improves with more cache, but remains a blunt guard. |
| 0.02 | 512 MiB | 60.68 t/s | Acts as a small-cache safety guard. |
| 0.02 | 1024 MiB | 67.53 t/s | Leaves behavior effectively unchanged at this size. |
Runtime hot-slot hit rate is the better signal, but it is only known after top-k routing and dispatch. The static ratio guard is therefore only a coarse pre-routing fallback.
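The guard logic can be pictured roughly as follows. This is a hypothetical sketch, not the actual implementation: only the environment variable name comes from this page. The per-layer ratio definition (hot experts divided by total experts in the layer), the treatment of an unset variable as "guard off", and both function names are assumptions made for illustration.

```python
import os

ENV_NAME = "LLAMA_MOE_HOT_CACHE_PP_MIN_HOT_EXPERT_RATIO"

def min_hot_ratio(env=os.environ) -> float:
    """Parse the threshold. Assumption: unset or invalid means 0.0 (guard off)."""
    try:
        return float(env.get(ENV_NAME, "0"))
    except ValueError:
        return 0.0

def use_hot_cache_for_pp(hot_experts: int, total_experts: int, threshold: float) -> bool:
    """Decide before routing whether a layer takes the hot-cache PP graph.

    Assumption: the ratio is hot experts over total experts in the layer.
    """
    if threshold <= 0.0:
        return True  # guard disabled
    return hot_experts / total_experts >= threshold

# Illustrative layer sizes, not measured values.
print(use_hot_cache_for_pp(10, 256, 0.07))  # ratio ~0.039, layer is bypassed
print(use_hot_cache_for_pp(20, 256, 0.07))  # ratio ~0.078, layer uses hot cache
```

Because the decision uses only the static expert list, it cannot see which experts the router will actually select, which is why it remains a coarse fallback.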
PP Throughput Progression
Bars marked as removed were measured but not kept because the runtime gain did not justify the complexity or rebase risk.
Experiment Timeline
| Step | PP t/s | TG t/s | Status | Decision |
|---|---|---|---|---|
| Standard llama.cpp baseline, -ncmoe 31 | 75.67 +/- 0.01 | 22.14 +/- 0.10 | baseline | Reference point for PP and synthetic TG. |
| Hot-cache without PP reduce merge | 86.61 +/- 0.31 | 18.47 +/- 0.17 | kept baseline | Shows PP gain; TG is limited by synthetic expert mismatch. |
| PP reduce merge on | 93.55 +/- 0.70 | 18.56 +/- 0.25 | kept | Clear PP win, TG stable. |
| Expert-major PP worklist | 94.13 +/- 0.29 | 18.48 +/- 0.17 | kept | Small PP win, decode remains token-major. |
| Runtime test: -ub 1024 | 95.15 +/- 0.77 | 18.55 +/- 0.24 | runtime option | Useful for large prompts, not a code change. |
| Compact cold reduce | 106.58 +/- 0.52 | 18.56 +/- 0.08 | kept | Largest PP code win, local to hot-cache code. |
| Hot branch scatter-add | 106.96 +/- 0.34 | 18.50 +/- 0.16 | removed | Too little gain, touched ggml core, high rebase risk. |
| Compact cold reduce memory cleanup | 108.51 +/- 0.39 | 18.76 +/- 0.36 | kept | Small but real PP win from better memory access. |
| Compact PP worklist field flags | 108.65 +/- 0.44 | 18.73 +/- 0.37 | removed | No measurable gain, added graph/template complexity. |
| Stack offset tables | 108.58 +/- 0.20 | 18.68 +/- 0.31 | kept as cleanup | No speed claim, but removes allocator activity from PP worklist path. |
Kept Changes
PP Reduce Merge
Reduces branch outputs earlier during PP. This was the first
clear PP-specific win and remains gated by
--moe-hot-cache-pp-reduce-merge.
Expert-Major Worklist
Groups PP slots by expert to improve locality. Decode and warmup stay token-major.
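The difference between the two orderings can be illustrated with a toy example. This is a sketch of the general idea only: the slot representation as (token, expert) pairs and both function names are assumptions, and the real worklist lives in the hot-cache code rather than in Python.

```python
from itertools import groupby

def token_major_slots(topk):
    """Flatten top-k routing results into (token, expert) slots, token by token."""
    return [(token, expert) for token, experts in enumerate(topk) for expert in experts]

def expert_major_worklist(slots):
    """Group slots by expert so each expert's weights are visited once."""
    ordered = sorted(slots, key=lambda slot: slot[1])  # stable sort keeps token order
    return {
        expert: [token for token, _ in group]
        for expert, group in groupby(ordered, key=lambda slot: slot[1])
    }

# Three tokens, top-2 routing; expert ids are illustrative.
topk = [[3, 7], [7, 1], [3, 1]]
slots = token_major_slots(topk)
worklist = expert_major_worklist(slots)
print(worklist)
```

In token-major order the same expert is revisited at scattered positions; grouping by expert lets all tokens routed to one expert be processed together.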
Compact Cold Reduce
Avoids materializing a large cold slot tensor before merge. This was the largest measured PP improvement.
Memory Cleanup
Uses contiguous zeroing, direct row pointers, and field-wise worklist initialization. Small measured gain, low risk.
Stack Offset Tables
Replaces per-layer offset vectors with fixed local arrays. No measurable speedup, but less allocator noise and simpler hot-path state.
Explicit PP/TG Phase
Passes a graph phase from llama_context through
llm_graph_params. A one-token PP tail stays on the PP
path, while tiny multi-token TG stays on the decode path.
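A small sketch shows why an explicit phase is safer than inferring it from batch size. The names GraphPhase, phase_from_token_count, and phase_from_params are hypothetical and do not mirror the actual llama_context or llm_graph_params fields.

```python
from enum import Enum

class GraphPhase(Enum):
    PP = "prompt_processing"
    TG = "token_generation"

def phase_from_token_count(n_tokens: int) -> GraphPhase:
    """Fragile heuristic: guesses the phase from batch size alone."""
    return GraphPhase.TG if n_tokens == 1 else GraphPhase.PP

def phase_from_params(explicit_phase: GraphPhase, n_tokens: int) -> GraphPhase:
    """Explicit signal: the caller's phase wins regardless of batch size."""
    return explicit_phase

# A one-token PP tail is misread as decode by the heuristic...
print(phase_from_token_count(1))
# ...but stays on the PP path when the phase travels with the parameters.
print(phase_from_params(GraphPhase.PP, 1))
# A tiny multi-token TG batch likewise stays on the decode path.
print(phase_from_params(GraphPhase.TG, 4))
```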
Removed Experiments
Hot Branch Scatter-Add
This experiment added accumulate behavior to set_rows for
the hot branch. The gain was only about 0.4%, and the change touched ggml
CPU/CUDA core files, so the rebase risk was not justified.
Compact Worklist Field Flags
This experiment skipped unused worklist fields in the PP compact path. The result was within measurement noise and the change added graph dispatch complexity, so it was removed.
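One simple way to judge whether a step's result is separable from noise is to check whether its +/- range overlaps the range of the step before it. The sketch below applies that check to adjacent rows of the timeline. Treating the +/- figures as symmetric error bars and using overlap as the criterion are assumptions of this example, not the documented decision rule.

```python
def intervals_overlap(mean_a: float, err_a: float, mean_b: float, err_b: float) -> bool:
    """True when [mean - err, mean + err] of the two measurements intersect."""
    return abs(mean_a - mean_b) <= err_a + err_b

# (step, PP mean t/s, +/-) taken from the experiment timeline
hot_cache_no_merge = (86.61, 0.31)
pp_reduce_merge = (93.55, 0.70)
compact_cold_reduce = (106.58, 0.52)
hot_branch_scatter_add = (106.96, 0.34)
memory_cleanup = (108.51, 0.39)
field_flags = (108.65, 0.44)

print(intervals_overlap(*hot_cache_no_merge, *pp_reduce_merge))        # clear win
print(intervals_overlap(*compact_cold_reduce, *hot_branch_scatter_add))  # removed
print(intervals_overlap(*memory_cleanup, *field_flags))                # removed
```

Under this check both removed experiments overlap their predecessors, while the PP reduce merge step does not.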
What We Learned
- PP cost is driven far more by large intermediate slot tensors and merge steps than by small builder details.
- The cold branch was the right first target because compact cold reduction removed a large materialization step.
- Hot-slot hit rate predicts PP wins better than raw expert coverage.
- Small CPU cleanups can help, but only when they reduce memory traffic in a measured path.
- Changes that touch ggml core need a much larger win to justify rebase risk.
- Token count alone is not a safe phase signal; graph phase must travel with the graph parameters.
- The lower tg128 values are not caused by the PP work; the synthetic TG tokens hit experts that the measured hot-cache list does not cover well.
- For future large gains, the likely direction is CUDA-side routing/reduction or a better PP-specific expert selection strategy.
The detailed raw Markdown record remains in
qwen36-pp-benchmark-path.md. This page is the visual
summary intended for browsing and review.