Benchmark Setup

All hot-cache prompt-processing (PP) measurements use the Qwen3.6 35B A3B Q6_K_XL model, the pp-bench-conversation-code.txt input, CUDA0, and the flags -cmoe, --moe-hot-cache-auto-reserve-mib 3000, and --moe-hot-cache-max-mib -1, with the same qwen36 expert list unless noted otherwise. The tg128 rows are synthetic llama-bench token-generation (TG) runs. They are useful for checking that PP changes do not shift decode behavior, but their absolute values depend on whether the generated tokens hit experts represented in the hot-cache list.

Baseline PP

75.67 t/s with standard llama.cpp placement using -ncmoe 31.

Best Kept PP

108.58 t/s after compact cold reduce, memory cleanup, and stack offset tables.

Net PP Change

+43.5% compared with the standard baseline in this benchmark path.
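The net change follows directly from the two throughput figures above. A minimal sketch of the arithmetic, where the helper name percent_change is only illustrative:

```python
def percent_change(baseline: float, measured: float) -> float:
    """Relative change of `measured` over `baseline`, in percent."""
    return (measured - baseline) / baseline * 100.0


baseline_pp = 75.67    # t/s, standard llama.cpp placement with -ncmoe 31
best_kept_pp = 108.58  # t/s, best kept hot-cache configuration

print(f"{percent_change(baseline_pp, best_kept_pp):+.1f}%")  # prints +43.5%
```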

Coverage Sweep

A separate fixed-reserve sweep measured how much Qwen3.6 PP depends on real hot-slot hit rate. Raw expert coverage is only a rough proxy; the useful signal is how many actually selected expert slots hit the hot cache after routing.

| Cache | Hot Experts | Raw Expert Coverage | Hot-Slot Hit Rate | PP Throughput |
|---|---|---|---|---|
| 512 MiB | 148 / 10240 | 1.45% | 12.01% | 60.42 t/s |
| 1024 MiB | 337 / 10240 | 3.29% | 21.65% | 67.63 t/s |
| 1536 MiB | 525 / 10240 | 5.13% | 28.08% | 72.42 t/s |
| 1792 MiB | 620 / 10240 | 6.05% | 30.26% | 74.45 t/s |
| 2048 MiB | 714 / 10240 | 6.97% | 32.62% | 76.45 t/s |
| 3072 MiB | 1091 / 10240 | 10.65% | 39.81% | 83.87 t/s |
| 4096 MiB | 1468 / 10240 | 14.34% | 45.40% | 90.16 t/s |
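The difference between the two coverage columns can be illustrated with a toy example. This is a sketch under assumptions, not the measurement code: experts are identified by hypothetical (layer, expert) pairs, and the routed slots are made up to show skewed routing.

```python
def raw_expert_coverage(hot_experts, n_layers, n_experts_per_layer):
    """Fraction of all (layer, expert) pairs that are resident in the hot cache."""
    return len(hot_experts) / (n_layers * n_experts_per_layer)


def hot_slot_hit_rate(routed_slots, hot_experts):
    """Fraction of selected expert slots (after top-k routing) that hit the hot cache."""
    total = 0
    hits = 0
    for slot in routed_slots:
        total += 1
        if slot in hot_experts:
            hits += 1
    return hits / total if total else 0.0


# Toy model: 2 layers with 4 experts each, 2 hot experts in total.
hot = {(0, 1), (1, 2)}

# One (layer, expert) entry per selected slot; routing favors the hot experts.
routed = [(0, 1), (0, 3), (1, 2), (1, 0), (0, 1), (0, 2), (1, 2), (1, 3)]

print(raw_expert_coverage(hot, n_layers=2, n_experts_per_layer=4))  # 0.25
print(hot_slot_hit_rate(routed, hot))                               # 0.5
```

Because routing is not uniform, a small set of frequently selected experts can yield a hit rate well above the raw coverage, which matches the pattern in the sweep.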

In this sweep, break-even against the -ncmoe 31 baseline was around 31-32% hot-slot hit rate. That corresponds to only about 7% raw expert coverage for this Qwen3.6 run.
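The break-even estimate can be reproduced from the sweep rows. The sketch below assumes linear interpolation between the two neighboring measurements, which is a simplification; the function name break_even is illustrative.

```python
# (raw expert coverage %, hot-slot hit rate %, PP t/s) from the sweep table
SWEEP = [
    (1.45, 12.01, 60.42),
    (3.29, 21.65, 67.63),
    (5.13, 28.08, 72.42),
    (6.05, 30.26, 74.45),
    (6.97, 32.62, 76.45),
    (10.65, 39.81, 83.87),
    (14.34, 45.40, 90.16),
]
BASELINE_PP = 75.67  # t/s, -ncmoe 31 baseline


def break_even(xs, pps, target):
    """Linearly interpolate the x value at which throughput crosses `target`."""
    for i in range(len(pps) - 1):
        y0, y1 = pps[i], pps[i + 1]
        if y0 <= target <= y1 and y1 != y0:
            frac = (target - y0) / (y1 - y0)
            return xs[i] + frac * (xs[i + 1] - xs[i])
    return None  # target lies outside the measured range


coverage = [row[0] for row in SWEEP]
hit_rate = [row[1] for row in SWEEP]
pp = [row[2] for row in SWEEP]

print(f"hit rate: {break_even(hit_rate, pp, BASELINE_PP):.1f}%")  # hit rate: 31.7%
print(f"coverage: {break_even(coverage, pp, BASELINE_PP):.1f}%")  # coverage: 6.6%
```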

Static Coverage Guard

The optional LLAMA_MOE_HOT_CACHE_PP_MIN_HOT_EXPERT_RATIO guard bypasses the hot-cache graph during prompt processing when a layer has too few hot experts. Decode and warmup are not affected.

| Ratio | Cache | PP Throughput | Interpretation |
|---|---|---|---|
| 0.07 | 1536 MiB | 60.69 t/s | Too aggressive; bypasses useful hot-cache work. |
| 0.07 | 1792 MiB | 60.69 t/s | Still below the useful hot-cache path. |
| 0.07 | 2048 MiB | 72.46 t/s | Improves with more cache, but remains a blunt guard. |
| 0.02 | 512 MiB | 60.68 t/s | Acts as a small-cache safety guard. |
| 0.02 | 1024 MiB | 67.53 t/s | Leaves behavior effectively unchanged at this size. |

Runtime hot-slot hit rate is the better signal, but it is only known after top-k routing and dispatch. The static ratio guard is therefore only a coarse pre-routing fallback.
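The static guard described in this section can be sketched as a per-layer check. This is an illustration only: the Phase enum, the function name, the per-layer expert count of 256, the use of >= for the comparison, and the treatment of a non-positive ratio as "guard disabled" are all assumptions, not the actual implementation.

```python
from enum import Enum


class Phase(Enum):
    PROMPT = "pp"
    DECODE = "tg"
    WARMUP = "warmup"


def use_hot_cache_graph(phase, hot_in_layer, experts_in_layer, min_ratio):
    """Return False when the static guard would bypass the hot-cache graph."""
    if phase is not Phase.PROMPT:
        return True  # decode and warmup are not affected by the guard
    if min_ratio <= 0.0:
        return True  # assumption: a non-positive ratio disables the guard
    # Static, pre-routing signal: share of this layer's experts that are hot.
    return hot_in_layer / experts_in_layer >= min_ratio


# Hypothetical layer with 256 experts, 12 of them hot (about 4.7%).
print(use_hot_cache_graph(Phase.PROMPT, 12, 256, 0.07))  # False: bypassed
print(use_hot_cache_graph(Phase.PROMPT, 12, 256, 0.02))  # True: hot-cache path
print(use_hot_cache_graph(Phase.DECODE, 12, 256, 0.07))  # True: guard ignored
```

The check never sees which experts the router selects, which is why it can only approximate the runtime hot-slot hit rate.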

PP Throughput Progression

Bars marked as removed were measured but not kept because the runtime gain did not justify the complexity or rebase risk.

[Bar chart: PP throughput in t/s, axis from 70 to 110. Bars in order: baseline 75.67, hot-cache 86.61, PP reduce 93.55, expert-major 94.13, ubatch 1024 95.15, cold reduce 106.58, scatter drop 106.96, memory cleanup 108.51, field flags drop 108.65, stack arrays 108.58.]

Experiment Timeline

| Step | PP t/s | TG t/s | Status | Decision |
|---|---|---|---|---|
| Standard llama.cpp baseline, -ncmoe 31 | 75.67 +/- 0.01 | 22.14 +/- 0.10 | baseline | Reference point for PP and synthetic TG. |
| Hot-cache without PP reduce merge | 86.61 +/- 0.31 | 18.47 +/- 0.17 | kept baseline | Shows PP gain; TG is limited by synthetic expert mismatch. |
| PP reduce merge on | 93.55 +/- 0.70 | 18.56 +/- 0.25 | kept | Clear PP win, TG stable. |
| Expert-major PP worklist | 94.13 +/- 0.29 | 18.48 +/- 0.17 | kept | Small PP win, decode remains token-major. |
| Runtime test: -ub 1024 | 95.15 +/- 0.77 | 18.55 +/- 0.24 | runtime option | Useful for large prompts, not a code change. |
| Compact cold reduce | 106.58 +/- 0.52 | 18.56 +/- 0.08 | kept | Largest PP code win, local to hot-cache code. |
| Hot branch scatter-add | 106.96 +/- 0.34 | 18.50 +/- 0.16 | removed | Too little gain, touched ggml core, high rebase risk. |
| Compact cold reduce memory cleanup | 108.51 +/- 0.39 | 18.76 +/- 0.36 | kept | Small but real PP win from better memory access. |
| Compact PP worklist field flags | 108.65 +/- 0.44 | 18.73 +/- 0.37 | removed | No measurable gain, added graph/template complexity. |
| Stack offset tables | 108.58 +/- 0.20 | 18.68 +/- 0.31 | kept as cleanup | No speed claim, but removes allocator activity from PP worklist path. |

Kept Changes

kept

PP Reduce Merge

Reduces branch outputs earlier during PP. This was the first clear PP-specific win and remains gated by --moe-hot-cache-pp-reduce-merge.

kept

Expert-Major Worklist

Groups PP slots by expert to improve locality. Decode and warmup stay token-major.
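The reordering idea can be sketched with a toy worklist. The tuple layout (token, slot, expert) and both function names are hypothetical; the real worklist carries more state than this.

```python
from collections import defaultdict


def token_major_worklist(topk):
    """topk[token] is the list of expert ids selected for that token."""
    return [(t, s, e) for t, experts in enumerate(topk) for s, e in enumerate(experts)]


def expert_major_worklist(topk):
    """Same slots, grouped so that all work for one expert is contiguous."""
    groups = defaultdict(list)
    for item in token_major_worklist(topk):
        groups[item[2]].append(item)
    return [item for expert in sorted(groups) for item in groups[expert]]


# Three tokens, top-2 routing.
topk = [[3, 1], [1, 2], [3, 2]]

print(token_major_worklist(topk))
# [(0, 0, 3), (0, 1, 1), (1, 0, 1), (1, 1, 2), (2, 0, 3), (2, 1, 2)]
print(expert_major_worklist(topk))
# [(0, 1, 1), (1, 0, 1), (1, 1, 2), (2, 1, 2), (0, 0, 3), (2, 0, 3)]
```

In expert-major order each expert's weights are touched in one contiguous run, which is the locality benefit the PP path relies on.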

kept

Compact Cold Reduce

Avoids materializing a large cold slot tensor before merge. This was the largest measured PP improvement.
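One way to picture the saving is to compare a reduce that first fills a full per-slot buffer with one that accumulates cold results directly into the per-token output. The sketch below is a conceptual model in plain Python with hypothetical names and data layout; it does not reflect how the actual graph or kernels are organized.

```python
def materialized_reduce(n_tokens, top_k, dim, cold_items):
    """Fill a full [n_tokens][top_k][dim] slot buffer, then sum over slots."""
    slots = [[[0.0] * dim for _ in range(top_k)] for _ in range(n_tokens)]
    for token, slot, vec in cold_items:
        slots[token][slot] = list(vec)
    return [
        [sum(slots[t][s][d] for s in range(top_k)) for d in range(dim)]
        for t in range(n_tokens)
    ]


def compact_reduce(n_tokens, dim, cold_items):
    """Accumulate only the cold slot outputs straight into the token rows."""
    out = [[0.0] * dim for _ in range(n_tokens)]
    for token, _slot, vec in cold_items:
        row = out[token]
        for d in range(dim):
            row[d] += vec[d]
    return out


# Three tokens, top-2 routing, 2-dimensional outputs; only three slots are cold.
cold_items = [(0, 1, [1.0, 2.0]), (2, 0, [0.5, 0.5]), (2, 1, [1.5, 2.5])]

full = materialized_reduce(3, 2, 2, cold_items)
compact = compact_reduce(3, 2, cold_items)
print(compact)  # [[1.0, 2.0], [0.0, 0.0], [2.0, 3.0]]
```

Both functions give the same result here, but the compact version never allocates or zeroes the n_tokens x top_k x dim buffer, which is the kind of intermediate that grows with prompt length.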

kept

Memory Cleanup

Uses contiguous zeroing, direct row pointers, and field-wise worklist initialization. Small measured gain, low risk.

cleanup

Stack Offset Tables

Replaces per-layer offset vectors with fixed local arrays. No measurable speedup, but less allocator activity and simpler hot-path state.

kept

Explicit PP/TG Phase

Passes a graph phase from llama_context through llm_graph_params. A one-token PP tail stays on the PP path, while tiny multi-token TG stays on the decode path.
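The two edge cases named above show why a token-count heuristic is fragile. In this sketch, GraphPhase, GraphParams, and both functions are hypothetical stand-ins; they are not the fields or types of llama_context or llm_graph_params.

```python
from dataclasses import dataclass
from enum import Enum


class GraphPhase(Enum):
    PROMPT = 1
    DECODE = 2


@dataclass(frozen=True)
class GraphParams:
    n_tokens: int
    phase: GraphPhase  # set by the caller that knows what it is doing


def phase_from_token_count(n_tokens):
    """Fragile heuristic: a single-token batch is assumed to be decode."""
    return GraphPhase.DECODE if n_tokens == 1 else GraphPhase.PROMPT


def phase_from_params(params):
    """Explicit phase: the graph builder reads what the context passed in."""
    return params.phase


pp_tail = GraphParams(n_tokens=1, phase=GraphPhase.PROMPT)  # last chunk of a prompt
tiny_tg = GraphParams(n_tokens=2, phase=GraphPhase.DECODE)  # small multi-token decode

for params in (pp_tail, tiny_tg):
    guessed = phase_from_token_count(params.n_tokens)
    print(params.n_tokens, guessed.name, phase_from_params(params).name)
# 1 DECODE PROMPT
# 2 PROMPT DECODE
```

In both cases the heuristic picks the wrong path, while the explicit phase stays correct regardless of batch size.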

Removed Experiments

removed

Hot Branch Scatter-Add

This experiment added accumulate behavior to set_rows for the hot branch. The gain was only about 0.4% (106.58 to 106.96 t/s), and the change touched ggml CPU/CUDA core files, so the rebase risk was not justified.
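The semantic difference between overwriting rows and accumulating into them can be modeled in a few lines. This is a plain-Python model of the idea with hypothetical function names; it is not the ggml operator or its signature.

```python
def set_rows(dst, row_ids, src):
    """Overwrite semantics: the last write to a row wins."""
    for r, vec in zip(row_ids, src):
        dst[r] = list(vec)


def set_rows_accumulate(dst, row_ids, src):
    """Scatter-add semantics: repeated writes to a row are summed."""
    for r, vec in zip(row_ids, src):
        dst[r] = [a + b for a, b in zip(dst[r], vec)]


src = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
row_ids = [0, 0, 1]  # two source rows target destination row 0

overwritten = [[0.0, 0.0], [0.0, 0.0]]
set_rows(overwritten, row_ids, src)
print(overwritten)  # [[2.0, 2.0], [3.0, 3.0]]

accumulated = [[0.0, 0.0], [0.0, 0.0]]
set_rows_accumulate(accumulated, row_ids, src)
print(accumulated)  # [[3.0, 3.0], [3.0, 3.0]]
```

With accumulation, several slots for the same token could be merged during the write itself, which is presumably where the small gain came from.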

removed

Compact Worklist Field Flags

This experiment skipped unused worklist fields in the PP compact path. The result (108.65 +/- 0.44 t/s against 108.51 +/- 0.39 t/s before it) was within measurement noise, and the change added graph dispatch complexity, so it was removed.
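A simple way to check whether a step stands out from noise is to compare the reported ranges of consecutive timeline rows. The sketch treats each "+/-" value as a symmetric interval and uses interval overlap as the criterion; both are simplifying assumptions, not the method used for the decisions on this page.

```python
def intervals_overlap(mean_a, err_a, mean_b, err_b):
    """True when [mean - err, mean + err] of both measurements intersect."""
    return (mean_a - err_a) <= (mean_b + err_b) and (mean_b - err_b) <= (mean_a + err_a)


# Field flags (108.65 +/- 0.44) against the preceding memory cleanup (108.51 +/- 0.39).
print(intervals_overlap(108.65, 0.44, 108.51, 0.39))  # True: within noise

# Compact cold reduce (106.58 +/- 0.52) against the preceding -ub 1024 run (95.15 +/- 0.77).
print(intervals_overlap(106.58, 0.52, 95.15, 0.77))   # False: clear gain
```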

What We Learned

  • PP is dominated by large intermediate slot and merge costs more than by tiny builder details.
  • The cold branch was the right first target because compact cold reduction removed a large materialization step.
  • Hot-slot hit rate predicts PP wins better than raw expert coverage.
  • Small CPU cleanups can help, but only when they reduce memory traffic in a measured path.
  • Changes that touch ggml core need a much larger win to justify rebase risk.
  • Token count alone is not a safe phase signal; graph phase must travel with the graph parameters.
  • The lower tg128 values are not caused by the PP work; the synthetic TG tokens hit experts that the measured hot-cache list does not cover well.
  • For future large gains, the likely direction is CUDA-side routing/reduction or a better PP-specific expert selection strategy.

The detailed raw Markdown record remains in qwen36-pp-benchmark-path.md. This page is the visual summary intended for browsing and review.