MoE Hot Cache PP Journey

Benchmark Setup

All hot-cache PP (prompt processing) measurements use the Qwen3.6 35B A3B Q6_K_XL model, pp-bench-conversation-code.txt, CUDA0, -cmoe, --moe-hot-cache-auto-reserve-mib 3000, --moe-hot-cache-max-mib -1, and the same qwen36 expert list unless noted otherwise. The tg128 rows are synthetic llama-bench generation runs. They are useful for checking that PP changes do not shift decode behavior, but their absolute values depend on whether the generated tokens hit experts that are in the hot-cache list.

Baseline PP

75.67 t/s with standard llama.cpp placement using -ncmoe 31.

Best Kept PP

108.58 t/s after compact cold reduce, memory cleanup, and stack offset tables.

Net PP Change

+43.5% compared with the standard baseline in this benchmark path.
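
The net change follows directly from the two headline numbers above. A minimal check of that arithmetic (the variable names are illustrative only):

```python
baseline_pp = 75.67  # t/s, standard placement with -ncmoe 31
best_pp = 108.58     # t/s, best kept hot-cache configuration

# Relative change of the best kept result against the baseline, in percent.
net_change_pct = (best_pp - baseline_pp) / baseline_pp * 100.0
print(f"{net_change_pct:+.1f}%")
```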

Coverage Sweep

A separate fixed-reserve sweep measured how much Qwen3.6 PP depends on real hot-slot hit rate. Raw expert coverage is only a rough proxy; the useful signal is how many actually selected expert slots hit the hot cache after routing.

Cache	Hot Experts	Raw Expert Coverage	Hot-Slot Hit Rate	PP Throughput
512 MiB	148 / 10240	1.45%	12.01%	60.42 t/s
1024 MiB	337 / 10240	3.29%	21.65%	67.63 t/s
1536 MiB	525 / 10240	5.13%	28.08%	72.42 t/s
1792 MiB	620 / 10240	6.05%	30.26%	74.45 t/s
2048 MiB	714 / 10240	6.97%	32.62%	76.45 t/s
3072 MiB	1091 / 10240	10.65%	39.81%	83.87 t/s
4096 MiB	1468 / 10240	14.34%	45.40%	90.16 t/s

In this sweep, break-even against the -ncmoe 31 baseline (75.67 t/s) falls between the 1792 MiB and 2048 MiB rows, at around 31-32% hot-slot hit rate. That corresponds to only about 6-7% raw expert coverage for this Qwen3.6 run.
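
One way to locate the break-even point is to interpolate linearly between the two sweep rows that bracket the baseline throughput. The sketch below does that with the table values; linear interpolation between rows is an assumption made here for illustration, not something the sweep itself measured.

```python
# (cache MiB, hot experts, hot-slot hit rate %, PP t/s), copied from the sweep table
SWEEP = [
    (512, 148, 12.01, 60.42),
    (1024, 337, 21.65, 67.63),
    (1536, 525, 28.08, 72.42),
    (1792, 620, 30.26, 74.45),
    (2048, 714, 32.62, 76.45),
    (3072, 1091, 39.81, 83.87),
    (4096, 1468, 45.40, 90.16),
]
TOTAL_EXPERTS = 10240
BASELINE_PP = 75.67  # t/s, -ncmoe 31 baseline


def break_even(rows, baseline):
    """Interpolate hit rate and raw coverage where PP crosses the baseline."""
    for lo, hi in zip(rows, rows[1:]):
        if lo[3] <= baseline <= hi[3]:
            t = (baseline - lo[3]) / (hi[3] - lo[3])
            hit_rate = lo[2] + t * (hi[2] - lo[2])
            hot_experts = lo[1] + t * (hi[1] - lo[1])
            return hit_rate, 100.0 * hot_experts / TOTAL_EXPERTS
    return None


hit_rate, coverage = break_even(SWEEP, BASELINE_PP)
print(f"break-even: hit rate ~{hit_rate:.1f}%, raw coverage ~{coverage:.1f}%")
```

Under this assumption the crossing lands inside the 1792 MiB to 2048 MiB interval, which is where the table shows PP moving from below to above the baseline.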

Static Coverage Guard

The optional LLAMA_MOE_HOT_CACHE_PP_MIN_HOT_EXPERT_RATIO guard bypasses the hot-cache graph during prompt processing when a layer has too few hot experts. Decode and warmup are not affected.

Ratio	Cache	PP Throughput	Interpretation
0.07	1536 MiB	60.69 t/s	Too aggressive; bypasses useful hot-cache work.
0.07	1792 MiB	60.69 t/s	Still below the useful hot-cache path.
0.07	2048 MiB	72.46 t/s	Improves with more cache, but remains a blunt guard.
0.02	512 MiB	60.68 t/s	Acts as a small-cache safety guard.
0.02	1024 MiB	67.53 t/s	Leaves behavior effectively unchanged at this size.

Runtime hot-slot hit rate is the better signal, but it is only known after top-k routing and dispatch. The static ratio guard is therefore only a coarse pre-routing fallback.
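
The difference between the two signals can be illustrated with a toy layer. This is a sketch only: the function names, the `>=` comparison, and the per-layer definition of the ratio are assumptions for illustration and are not taken from the hot-cache code.

```python
def static_hot_ratio(hot_experts, experts_per_layer):
    """Pre-routing signal: share of a layer's experts that are in the hot cache."""
    return len(hot_experts) / experts_per_layer


def use_hot_cache_for_pp(hot_experts, experts_per_layer, min_ratio):
    """Hypothetical guard: bypass the hot-cache graph when the static ratio is too low."""
    return static_hot_ratio(hot_experts, experts_per_layer) >= min_ratio


def hot_slot_hit_rate(routed_experts, hot_experts):
    """Post-routing signal: share of selected (token, slot) pairs that hit a hot expert."""
    slots = [expert for token in routed_experts for expert in token]
    return sum(expert in hot_experts for expert in slots) / len(slots)


# Toy layer: 16 experts, 2 of them hot, top-2 routing for 4 tokens.
hot = {3, 7}
routed = [[3, 7], [3, 1], [7, 3], [3, 9]]

ratio = static_hot_ratio(hot, 16)          # 2 of 16 experts are hot
hit_rate = hot_slot_hit_rate(routed, hot)  # 6 of 8 selected slots are hot
print(ratio, hit_rate)
```

In the toy case only 12.5% of the experts are hot, yet 75% of the routed slots hit them, which is why a static ratio can misjudge a layer whose few hot experts are the ones the router actually selects.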

PP Throughput Progression

Bars marked as removed were measured but not kept because the runtime gain did not justify the complexity or rebase risk.

Experiment Timeline

Step	PP t/s	TG t/s	Status	Decision
Standard llama.cpp baseline, `-ncmoe 31`	75.67 +/- 0.01	22.14 +/- 0.10	baseline	Reference point for PP and synthetic TG.
Hot-cache without PP reduce merge	86.61 +/- 0.31	18.47 +/- 0.17	kept baseline	Shows PP gain; TG is limited by synthetic expert mismatch.
PP reduce merge on	93.55 +/- 0.70	18.56 +/- 0.25	kept	Clear PP win, TG stable.
Expert-major PP worklist	94.13 +/- 0.29	18.48 +/- 0.17	kept	Small PP win, decode remains token-major.
Runtime test: `-ub 1024`	95.15 +/- 0.77	18.55 +/- 0.24	runtime option	Useful for large prompts, not a code change.
Compact cold reduce	106.58 +/- 0.52	18.56 +/- 0.08	kept	Largest PP code win, local to hot-cache code.
Hot branch scatter-add	106.96 +/- 0.34	18.50 +/- 0.16	removed	Too little gain, touched ggml core, high rebase risk.
Compact cold reduce memory cleanup	108.51 +/- 0.39	18.76 +/- 0.36	kept	Small but real PP win from better memory access.
Compact PP worklist field flags	108.65 +/- 0.44	18.73 +/- 0.37	removed	No measurable gain, added graph/template complexity.
Stack offset tables	108.58 +/- 0.20	18.68 +/- 0.31	kept as cleanup	No speed claim, but removes allocator activity from PP worklist path.

Kept Changes

PP Reduce Merge (kept)

Reduces branch outputs earlier during PP. This was the first clear PP-specific win and remains gated by --moe-hot-cache-pp-reduce-merge.

Expert-Major Worklist (kept)

Groups PP slots by expert to improve locality. Decode and warmup stay token-major.
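
The regrouping idea can be sketched in a few lines. The tuple layout `(token, slot, expert)` and both function names are hypothetical; the real worklist structure is not described on this page.

```python
from collections import defaultdict


def token_major_worklist(routed_experts):
    """One (token, slot, expert) entry per selected slot, in token order."""
    return [
        (tok, slot, expert)
        for tok, experts in enumerate(routed_experts)
        for slot, expert in enumerate(experts)
    ]


def expert_major_worklist(routed_experts):
    """Same entries, regrouped so that all slots of one expert are adjacent."""
    by_expert = defaultdict(list)
    for tok, slot, expert in token_major_worklist(routed_experts):
        by_expert[expert].append((tok, slot, expert))
    return [entry for expert in sorted(by_expert) for entry in by_expert[expert]]


routed = [[3, 7], [1, 3], [7, 3]]  # top-2 expert ids for three tokens
print(expert_major_worklist(routed))
```

With entries for the same expert adjacent, that expert's weights are touched in one contiguous run instead of being revisited once per token.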

Compact Cold Reduce (kept)

Avoids materializing a large cold slot tensor before merge. This was the largest measured PP improvement.
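
The following sketch contrasts the two approaches on plain Python lists. It assumes the merge is a per-token sum over slots and that only cold slots contribute rows here; the actual tensor layout and kernels are not described on this page, so treat every name below as hypothetical.

```python
def reduce_via_dense_slots(n_tokens, top_k, dim, cold_rows):
    """Materialize a full [n_tokens][top_k][dim] slot tensor, then sum over slots."""
    slots = [[[0.0] * dim for _ in range(top_k)] for _ in range(n_tokens)]
    for tok, slot, row in cold_rows:
        slots[tok][slot] = list(row)
    return [
        [sum(slots[t][s][d] for s in range(top_k)) for d in range(dim)]
        for t in range(n_tokens)
    ]


def reduce_compact(n_tokens, dim, cold_rows):
    """Accumulate each cold row straight into its token's output row."""
    out = [[0.0] * dim for _ in range(n_tokens)]
    for tok, _slot, row in cold_rows:
        for d in range(dim):
            out[tok][d] += row[d]
    return out


# Three tokens, top-2 routing, 2-dimensional rows; only cold slots produce rows.
cold_rows = [(0, 1, [1.0, 2.0]), (2, 0, [0.5, 0.5]), (2, 1, [1.5, -0.5])]
dense = reduce_via_dense_slots(3, 2, 2, cold_rows)
compact = reduce_compact(3, 2, cold_rows)
print(dense == compact)
```

Both paths give the same per-token result, but the compact path never allocates the `n_tokens * top_k * dim` intermediate, which is the cost the description above refers to.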

Memory Cleanup (kept)

Uses contiguous zeroing, direct row pointers, and field-wise worklist initialization. Small measured gain, low risk.

Stack Offset Tables (cleanup)

Replaces per-layer offset vectors with fixed local arrays. No measurable speedup, but less allocator noise and simpler hotpath state.

Explicit PP/TG Phase (kept)

Passes a graph phase from llama_context through llm_graph_params. A one-token PP tail stays on the PP path, while tiny multi-token TG stays on the decode path.
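
The two edge cases named above show why a token-count heuristic is fragile. The sketch uses hypothetical Python stand-ins (`GraphPhase`, `GraphParams`) for the phase value and the graph parameters; they are not the actual llama.cpp types.

```python
from dataclasses import dataclass
from enum import Enum


class GraphPhase(Enum):
    PROMPT = "pp"
    DECODE = "tg"


@dataclass
class GraphParams:
    n_tokens: int
    phase: GraphPhase  # set by the caller, which knows what the batch is for


def phase_from_token_count(n_tokens):
    """Fragile heuristic: treat single-token batches as decode."""
    return GraphPhase.DECODE if n_tokens == 1 else GraphPhase.PROMPT


def phase_from_params(params):
    """Explicit signal: use the phase carried with the graph parameters."""
    return params.phase


pp_tail = GraphParams(n_tokens=1, phase=GraphPhase.PROMPT)  # last token of a prompt
tiny_tg = GraphParams(n_tokens=4, phase=GraphPhase.DECODE)  # small multi-token decode

for params in (pp_tail, tiny_tg):
    print(phase_from_token_count(params.n_tokens), phase_from_params(params))
```

The heuristic misclassifies both batches, while the explicit phase keeps the one-token PP tail on the PP path and the small multi-token batch on the decode path.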

Removed Experiments

Hot Branch Scatter-Add (removed)

This experiment added accumulate behavior to set_rows for the hot branch. The gain was only about 0.4%, and the change touched ggml CPU/CUDA core files, so the rebase risk was not justified.

Compact Worklist Field Flags (removed)

This experiment skipped unused worklist fields in the PP compact path. The result overlapped with measurement noise and added graph dispatch complexity, so it was removed.
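
A crude way to see which steps stand out from noise is to compare the reported "+/-" ranges of a step and the row it was measured against. The sketch assumes each step is compared with the timeline row directly above it and treats the "+/-" value as a simple symmetric range; neither assumption is stated in the benchmark record.

```python
def intervals_overlap(a_mean, a_err, b_mean, b_err):
    """True when the [mean - err, mean + err] ranges of two runs intersect."""
    return a_mean - a_err <= b_mean + b_err and b_mean - b_err <= a_mean + a_err


def gain_pct(before, after):
    """Relative PP change in percent."""
    return (after - before) / before * 100.0


# (label, before mean, before +/-, after mean, after +/-), PP t/s from the timeline
steps = [
    ("PP reduce merge on", 86.61, 0.31, 93.55, 0.70),
    ("Hot branch scatter-add", 106.58, 0.52, 106.96, 0.34),
    ("Compact PP worklist field flags", 108.51, 0.39, 108.65, 0.44),
]

results = {
    label: (gain_pct(b, a), intervals_overlap(b, b_err, a, a_err))
    for label, b, b_err, a, a_err in steps
}
for label, (gain, overlaps) in results.items():
    print(f"{label}: {gain:+.2f}% overlap={overlaps}")
```

By this check the PP reduce merge gain is clearly separated from its predecessor, while both removed experiments overlap the row above them.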

What We Learned

PP cost is dominated by large intermediate slot tensors and merge work rather than by small builder details.
The cold branch was the right first target because compact cold reduction removed a large materialization step.
Hot-slot hit rate predicts PP wins better than raw expert coverage.
Small CPU cleanups can help, but only when they reduce memory traffic in a measured path.
Changes that touch ggml core need a much larger win to justify rebase risk.
Token count alone is not a safe phase signal; graph phase must travel with the graph parameters.
The lower tg128 values are not caused by the PP work; the synthetic TG tokens hit experts that the measured hot-cache list does not cover well.
For future large gains, the likely direction is CUDA-side routing/reduction or a better PP-specific expert selection strategy.

The detailed raw Markdown record remains in qwen36-pp-benchmark-path.md. This page is the visual summary intended for browsing and review.