What The PP Changes Do
Prompt processing (PP) handles many tokens at once. In MoE models this creates many routed expert slots per layer. The PP work focuses on reducing the cost of sorting, splitting, materializing, and merging those slots while keeping decode behavior separate.
Non-Goals
These changes do not alter model weights, training, routing logits, or expert math. They only change how the hot-cache graph organizes work after the router has selected experts.
The implementation intentionally stays in src/moe-hot-cache
where possible. The graph still has to connect tensors, but policy,
worklist construction, and compact reduction live outside model
core files.
Phase Model
The hot-cache graph no longer decides between prompt processing (PP)
and token generation (TG) from n_tokens alone. llama_context infers
one llm_graph_phase for the whole llama_decode() call and passes it
through llm_graph_params.gphase. Every ubatch inside that call
inherits the same phase.
This matters for two edge cases: a prompt can end with a one-token ubatch, and some generation paths can verify multiple generated tokens in one decode call. The phase keeps both cases on the intended path.
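The following is a minimal sketch of the "one phase per decode call" idea, not the actual infer_decode_graph_phase code. The names graph_phase, infer_phase, plan_ubatches, and the threshold k_max_decode_tokens are hypothetical, and the size-based rule is an assumption made only for illustration. The real inputs to the inference are not described here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for llm_graph_phase (the real enum lives in llama-graph.h).
enum class graph_phase { warmup, prompt_processing, decode };

// Assumed heuristic: a call with at most this many tokens counts as decode.
// The real rule and threshold are not specified in this document.
constexpr uint32_t k_max_decode_tokens = 8;

// Infer the phase once, from the size of the whole decode call.
inline graph_phase infer_phase(uint32_t n_tokens_total, bool warmup) {
    if (warmup) {
        return graph_phase::warmup;
    }
    return n_tokens_total > k_max_decode_tokens ? graph_phase::prompt_processing
                                                : graph_phase::decode;
}

struct ubatch_plan {
    uint32_t    n_tokens;
    graph_phase phase;
};

// Split a call into ubatches. Every ubatch inherits the call-level phase,
// so a one-token tail of a prompt is still prompt processing.
inline std::vector<ubatch_plan> plan_ubatches(uint32_t n_tokens_total,
                                              uint32_t n_ubatch,
                                              bool     warmup) {
    assert(n_ubatch > 0);
    const graph_phase phase = infer_phase(n_tokens_total, warmup);
    std::vector<ubatch_plan> plans;
    for (uint32_t done = 0; done < n_tokens_total; done += n_ubatch) {
        const uint32_t left = n_tokens_total - done;
        plans.push_back({left < n_ubatch ? left : n_ubatch, phase});
    }
    return plans;
}
```

With these assumptions, plan_ubatches(513, 512, false) yields ubatches of 512 and 1 tokens that both carry the prompt_processing phase, while plan_ubatches(4, 512, false) yields one decode ubatch.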
Warmup
Warmup keeps conservative behavior. Worklist order remains token-major and PP-specific reduce paths stay off.
Prompt Processing
Prompt batches use the PP phase even if the final ubatch has only one token. Large PP shapes can use expert-major worklists, PP reduce merge, dense PP graph forms, and compact cold reduce.
Decode
Decode keeps token-major worklists. One-token decode can use direct merge, prefix merge, repeat hot input, and shared-row shortcuts. Tiny multi-token decode stays decode but only uses the shortcuts whose profile explicitly allows that shape.
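The three phase rules above can be sketched as a small policy function. This is only an illustration: pp_plan, build_plan, and the k_large_pp_tokens threshold are hypothetical names and values, and the sketch ignores model profiles, so multi-token decode simply gets no shortcuts here.

```cpp
#include <cassert>
#include <cstdint>

enum class phase { warmup, prompt_processing, decode };
enum class worklist_order { token_major, expert_major };

// Hypothetical execution plan; the real plan type has more fields.
struct pp_plan {
    worklist_order order;
    bool           pp_reduce_merge;
    bool           decode_shortcuts;
};

// Assumed cutoff for a "large PP shape"; the real value is not given here.
constexpr uint32_t k_large_pp_tokens = 32;

inline pp_plan build_plan(phase p, uint32_t n_tokens) {
    assert(n_tokens > 0);
    // Conservative default, used as-is for warmup.
    pp_plan plan{worklist_order::token_major, false, false};
    switch (p) {
        case phase::warmup:
            break;
        case phase::prompt_processing:
            // Only large PP shapes switch to expert-major order and PP reduce merge.
            if (n_tokens >= k_large_pp_tokens) {
                plan.order           = worklist_order::expert_major;
                plan.pp_reduce_merge = true;
            }
            break;
        case phase::decode:
            // Simplification: shortcuts only for one-token decode.
            plan.decode_shortcuts = (n_tokens == 1);
            break;
    }
    return plan;
}
```

Because the decision takes the phase as an explicit input, a one-token PP tail (phase::prompt_processing, n_tokens == 1) never enables the decode shortcuts, even though its token count matches one-token decode.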
Routing Prework
The standard MoE path and the hot-cache path should agree
mathematically until routing has produced selected_experts
and top-k weights. The paths should diverge only after
those tensors exist: standard MoE consumes them directly, while the
hot-cache path sends them through the worklist dispatcher into hot and
cold lanes.
Standard Path
build_moe_ffn builds router logits, selected experts,
normalized top-k weights, and then immediately continues into
normal expert execution.
Hot-Cache Path
build_layer_ffn_hot builds equivalent routing
tensors, creates a hot/cold worklist, evaluates both lanes, and
then reduces or merges the branch outputs.
| Stage | Standard MoE | Hot Cache |
|---|---|---|
| Router | build_lora_mm(ffn_gate_inp, cur) | build_lora_mm(ffn_gate_inp, cur) |
| Selection | softmax(all logits), then top-k on probabilities | top-k directly on logits |
| Weights | gather top-k weights, normalize, and scale if needed | gather top-k logits, softmax over the top-k set, and scale if needed |
| Consumer | normal expert execution | worklist dispatcher, hot lane, cold lane, reduce, merge |
For softmax routing, top-k on logits and top-k on softmax(logits) select the same experts in the same order, because softmax is strictly monotonic. After renormalizing over the top-k set, the global softmax denominator cancels out, so the weights match as well. That makes the two routing variants mathematically equivalent for Qwen3.6, even though the graph node sequence differs.
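The equivalence can be checked numerically with a small standalone sketch. It does not use ggml; route_standard and route_hot are hypothetical helper names that mirror the two table columns, and the optional scaling step is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct routing {
    std::vector<int>   ids;      // selected expert ids, best first
    std::vector<float> weights;  // weights normalized over the selected set
};

// Indices of the k largest values, ties broken by lower index.
inline std::vector<int> top_k(const std::vector<float> & v, size_t k) {
    assert(k > 0 && k <= v.size());
    std::vector<int> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](int a, int b) { return v[a] > v[b]; });
    idx.resize(k);
    return idx;
}

inline std::vector<float> softmax(const std::vector<float> & v) {
    const float vmax = *std::max_element(v.begin(), v.end());
    std::vector<float> out(v.size());
    float sum = 0.0f;
    for (size_t i = 0; i < v.size(); ++i) {
        out[i] = std::exp(v[i] - vmax);  // subtract max for numerical stability
        sum += out[i];
    }
    for (float & x : out) {
        x /= sum;
    }
    return out;
}

// Standard column: softmax over all logits, top-k on probabilities, renormalize.
inline routing route_standard(const std::vector<float> & logits, size_t k) {
    const std::vector<float> probs = softmax(logits);
    routing r{top_k(probs, k), {}};
    float sum = 0.0f;
    for (int id : r.ids) {
        sum += probs[id];
    }
    for (int id : r.ids) {
        r.weights.push_back(probs[id] / sum);
    }
    return r;
}

// Hot-cache column: top-k on logits, then softmax over the selected logits only.
inline routing route_hot(const std::vector<float> & logits, size_t k) {
    routing r{top_k(logits, k), {}};
    std::vector<float> sel;
    for (int id : r.ids) {
        sel.push_back(logits[id]);
    }
    r.weights = softmax(sel);
    return r;
}
```

For distinct logits, both functions return the same ids and weights that agree up to floating-point rounding. Exact ties, or logits so close that their probabilities round to the same float, can be ordered differently by the two variants.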
Dense PP Graph
Dense PP is the default prompt-processing graph for the supported hot-cache adapters. The name means the graph keeps the original selected-slot grid available for merge semantics while the worklist decides which slots are hot, which are cold, and which expert lane should process each hot slot.
This is separate from --moe-hot-cache-pp-reduce-merge.
Dense PP decides whether a real multi-token prompt-processing batch
can enter the hot-cache graph. PP reduce-merge decides whether hot
and cold branch outputs are reduced to [n_embd, n_tokens]
before the final add. Both can be benchmarked independently.
Adapter Default
llama_moe_hot_cache_graph_profile::pp_dense enables
dense PP for Qwen35Moe, Qwen3Next, Gemma4, Mellum, GPT-OSS,
DeepSeek2, and GLM4 MoE profiles.
Cold Placement
pp_primary_cold_backend keeps dense PP cold-branch
graph work on the primary backend by default. The uncached expert
tensors still follow the normal CPU/RAM MoE path.
Multi-Lane Counters
The multi-lane dense worklist exposes fixed HOT,
HOT1, and HOT2 fields. Perf readback
sums those lane counts so the UI hit rate matches all cached
expert lanes, not only the first lane.
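As a rough illustration of the readback arithmetic, the sketch below sums the three lane counters and derives a hit rate. The struct layout, the function names, and the definition of hit rate as hot slots divided by all routed slots are assumptions for this example and are not taken from the perf reader.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical mirror of the fixed HOT, HOT1, and HOT2 worklist fields.
struct lane_counts {
    uint64_t hot;
    uint64_t hot1;
    uint64_t hot2;
};

// Slots served by any cached expert lane.
inline uint64_t hot_total(const lane_counts & c) {
    return c.hot + c.hot1 + c.hot2;
}

// Assumed definition: hot slots over all routed slots, where
// total_slots would be n_tokens * n_expert_used for one layer.
inline double hit_rate(const lane_counts & c, uint64_t total_slots) {
    const uint64_t hot = hot_total(c);
    assert(hot <= total_slots);
    return total_slots == 0 ? 0.0 : static_cast<double>(hot) / static_cast<double>(total_slots);
}
```

Counting only the first lane would under-report the rate whenever HOT1 or HOT2 is nonzero, which is the mismatch the summed readback avoids.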
For A/B testing, set LLAMA_MOE_HOT_CACHE_PP_DENSE=0 to
disable dense PP or LLAMA_MOE_HOT_CACHE_PP_COLD_BACKEND=cpu
to force the older cold-backend placement. These switches are intended
for diagnosis, not normal profile files.
CUDA Fusion Detail
The standard MoE helper expands weights early so CUDA can detect the
topk-moe routing pattern. The relevant backend code lives
around ggml/src/ggml-cuda/topk-moe.cu and
ggml/src/ggml-cuda/ggml-cuda.cu.
The hot-cache graph must feed a dispatcher instead of the normal expert
consumer, so it uses a different node sequence. The math can be the
same while the backend optimization pattern is not. This is why routing
prework is a future refactor target rather than a trivial call into
build_moe_ffn.
Class And Module Diagram
The PP feature is split into policy, worklist building, graph glue, and compact branch reduction.
Runtime Flow
This is the hot-cache PP path after the router logits are available. The worklist is the dispatcher: it maps selected experts into hot and cold work and records how the branch outputs must be merged.
New Method Reference
These are the main methods and types introduced or changed for the PP work. File names refer to files in the local repository.
| Method or type | File | Responsibility | Why it helps |
|---|---|---|---|
| llm_graph_phase | llama-graph.h | Stores warmup, prompt_processing, or decode as graph topology input. | Prevents graph reuse and hot-cache policy from mixing PP and TG shapes that happen to have the same token count. |
| infer_decode_graph_phase | llama-context.cpp | Infers the phase once for a full llama_decode() call before ubatch splitting. | Keeps a one-token PP tail in PP and keeps tiny multi-token TG in decode. |
| llama_moe_hot_cache_graph_phase_from_llm | llama-moe-hot-cache-graph.cpp | Converts the core graph phase into the hot-cache local phase enum. | Keeps llama core independent from hot-cache policy types. |
| llama_moe_hot_cache_pp_policy::build | llama-moe-hot-cache-pp.cpp | Creates the execution plan from an explicit phase, token count, capacity, cold-lane state, and model profile. | Keeps phase-sensitive PP/TG decisions out of the graph glue and makes them unit-testable. |
| reduce_merge_enabled | llama-moe-hot-cache-pp.cpp | Reads the PP reduce-merge mode and decides off, on, or auto. | Provides the main runtime gate for the PP merge optimization. |
| compact_cold_reduce_enabled | llama-moe-hot-cache-pp.cpp | Controls the compact cold reduce sub-feature. | Allows the large PP cold-path optimization to be disabled independently. |
| dense_enabled | llama-moe-hot-cache-pp.cpp | Combines graph phase, token count, adapter profile, and LLAMA_MOE_HOT_CACHE_PP_DENSE. | Lets real prompt-processing batches use the dense hot-cache graph while decode remains isolated. |
| cold_backend | llama-moe-hot-cache-pp.cpp | Chooses CPU or primary backend placement for dense PP cold graph work. | Makes the new cold-placement behavior profile-driven and overrideable for benchmarks. |
| worklist_order | llama-moe-hot-cache-pp.cpp | Selects token-major or expert-major worklist order. | Lets PP use expert grouping while decode remains token-major. |
| llama_moe_hot_cache_build_worklist | llama-moe-hot-cache-worklist.cpp | Builds hot/cold worklists from selected expert IDs and weights. | Central dispatcher for split, source slots, token IDs, weights, and counts. |
| llama_moe_hot_cache_build_worklist_from_logits | llama-moe-hot-cache-worklist.cpp | Builds worklists directly from logits for CPU decode routing paths. | Keeps logits-based routing in the same builder module. |
| llama_moe_hot_cache_build_compact_cold_reduce | llama-moe-hot-cache-branch-reduce.cpp | Adds the compact cold reduce op to the graph. | Avoids materializing a large cold slot tensor during PP. |
| llama_moe_hot_cache_reduce_cold_token_rows | llama-moe-hot-cache-branch-reduce.cpp | CPU callback that accumulates compact cold slots into token rows. | Implements the actual compact reduction with direct row pointers. |
| build_layer_ffn_hot graph glue | llama-moe-hot-cache-graph.cpp | Connects policy, worklist, hot lane, cold lane, and merge tensors. | Contains glue only; feature decisions and low-level work stay in helpers. |
| llama_moe_hot_cache_build_moe_hot_pp_dense_from_logits | llama-moe-hot-cache-graph.cpp | Builds the single-lane dense PP graph from router logits. | Keeps prompt-processing hot-cache execution in the generic logits adapter path. |
| llama_moe_hot_cache_build_moe_hot_multi_pp_dense_from_logits | llama-moe-hot-cache-graph.cpp | Builds dense PP for up to three hot-cache expert lanes plus the cold lane. | Lets multi-GPU hot-cache profiles use PP instead of falling back to the default llama path. |
| ffn_moe_worklist_multi_pp_dense perf handling | perf-reader.cpp and perf-json.cpp | Reads multi-lane dense PP worklists and derives hot/cold slot totals when direct counters are absent. | Keeps Web UI hit-rate and update inputs correct for multi-lane PP runs. |
Separation Of Concerns
Policy
llama-moe-hot-cache-pp decides what should happen.
It does not build tensors.
Worklist
llama-moe-hot-cache-worklist owns worklist layout,
token-major/expert-major ordering, and hot/cold split data.
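A minimal sketch of the kind of data such a worklist might hold is shown below. It is not the real llama_moe_hot_cache_build_worklist: the work_item fields, the is_hot lookup, and the flat selected-expert layout are assumptions chosen to illustrate the hot/cold split and the two orderings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class worklist_order { token_major, expert_major };

// Hypothetical worklist entry: one routed (token, expert) slot.
struct work_item {
    int32_t token;     // token row this slot belongs to
    int32_t src_slot;  // index into the original selected-slot grid
    int32_t expert;    // selected expert id
    float   weight;    // routing weight for this slot
};

struct worklist {
    std::vector<work_item> hot;
    std::vector<work_item> cold;
};

// selected and weights are flat [n_tokens * n_expert_used] arrays in token-major order.
// is_hot[e] says whether expert e is currently cached.
inline worklist build_worklist(const std::vector<int32_t> & selected,
                               const std::vector<float>   & weights,
                               int32_t                      n_expert_used,
                               const std::vector<bool>    & is_hot,
                               worklist_order               order) {
    assert(n_expert_used > 0 && selected.size() == weights.size());
    worklist wl;
    for (size_t i = 0; i < selected.size(); ++i) {
        const int32_t e = selected[i];
        assert(e >= 0 && static_cast<size_t>(e) < is_hot.size());
        const work_item item{static_cast<int32_t>(i / static_cast<size_t>(n_expert_used)),
                             static_cast<int32_t>(i), e, weights[i]};
        (is_hot[e] ? wl.hot : wl.cold).push_back(item);
    }
    if (order == worklist_order::expert_major) {
        // A stable sort groups slots by expert and keeps token order inside each group.
        auto by_expert = [](const work_item & a, const work_item & b) { return a.expert < b.expert; };
        std::stable_sort(wl.hot.begin(), wl.hot.end(), by_expert);
        std::stable_sort(wl.cold.begin(), wl.cold.end(), by_expert);
    }
    return wl;
}
```

Keeping src_slot in every entry is what lets a later merge step put branch outputs back into the original selected-slot grid, whichever order the lanes processed them in.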
Branch Reduce
llama-moe-hot-cache-branch-reduce owns the compact
cold reduce operation and its CPU callback.
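The accumulation step can be pictured with the standalone sketch below. It is not the real llama_moe_hot_cache_reduce_cold_token_rows callback: the function name, the flat std::vector buffers, and the assumption that cold rows arrive already weighted are all illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Accumulate compact cold slot rows into per-token rows.
// cold_rows holds one n_embd-wide row per cold slot (assumed already weighted),
// token_ids[s] is the token that slot s belongs to.
// The result has n_tokens rows of n_embd contiguous floats.
inline std::vector<float> reduce_cold_rows(const std::vector<float>   & cold_rows,
                                           const std::vector<int32_t> & token_ids,
                                           size_t                       n_embd,
                                           size_t                       n_tokens) {
    assert(cold_rows.size() == token_ids.size() * n_embd);
    std::vector<float> out(n_tokens * n_embd, 0.0f);
    for (size_t s = 0; s < token_ids.size(); ++s) {
        assert(token_ids[s] >= 0 && static_cast<size_t>(token_ids[s]) < n_tokens);
        // Direct row pointers: no intermediate per-slot tensor is built.
        const float * src = cold_rows.data() + s * n_embd;
        float *       dst = out.data() + static_cast<size_t>(token_ids[s]) * n_embd;
        for (size_t j = 0; j < n_embd; ++j) {
            dst[j] += src[j];
        }
    }
    return out;
}
```

Tokens with no cold slots keep an all-zero row, so the result can be added to the hot-branch output without special cases.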
Graph Glue
llama-moe-hot-cache-graph wires tensors together.
It should avoid owning policy or low-level reduce logic.
Future Routing Refactor
A clean next step is to extract routing prework from
build_moe_ffn into a read-only helper that returns router
logits, selected experts, and normalized weights. The standard path
would keep its current expert execution, while the hot-cache path would
consume the same routing result through the worklist builder.
Expected Benefits
- Less duplicated routing logic.
- Lower risk of math drift between standard and hot-cache routing.
- Better chance to preserve CUDA topk-moe fusion patterns.
- Cleaner model onboarding because routing behavior is centralized.
Risks
- build_moe_ffn is shared by many architectures.
- CUDA fusion depends on exact graph node patterns.
- Model-specific bias, group routing, sigmoid gating, and scaling must remain unchanged.
- Decode small-token shortcuts must stay explicit.
| Step | Purpose |
|---|---|
| Extract read-only routing helper | Keep standard behavior unchanged while exposing reusable routing tensors. |
| Verify standard path first | Check PP and same-workload TG. For synthetic TG, inspect hot-cache hit rate because generated tokens can select a different expert distribution. |
| Rewire hot-cache PP | Consume the shared routing result in the worklist builder. |
| Re-run fixed-cache Qwen3.6 PP tests | Compare 512, 1024, 1536, 1792, 2048, 3072, and 4096 MiB cache sizes. |
| Check CUDA fusion | Confirm the standard path still triggers the intended top-k MoE fusion. |
The immediate goal is not to make dynamic graph decisions after top-k. First make routing shared and reliable; later threshold work can build on the same optimized routing path as standard llama.cpp.
Resolved Edge Cases
One-Token PP Tail
If a prompt is split as 512 + 1, the final ubatch has
n_tokens == 1, but the enclosing
llama_decode() call is still classified as prompt
processing. Decode-only shortcuts stay off for that tail.
Multi-Token Decode
Models or modes that decode multiple tokens at once can have
n_tokens > 1 during TG. Small verification batches
are classified as decode, so they do not accidentally use PP
worklist and reduce rules.