MoE Hot Cache PP Architecture

What The PP Changes Do

Prompt processing (PP) handles many tokens at once. In MoE models this creates many routed expert slots per layer: one slot for each token and each expert selected for it. The PP work focuses on reducing the cost of sorting, splitting, materializing, and merging those slots while keeping decode behavior separate.

PP reduce merge
Dense PP worklist
Compact cold reduce
Expert-major worklist
Multi-lane PP accounting
Decode stays token-major

Non-Goals

These changes do not alter model weights, training, routing logits, or expert math. They only change how the hot-cache graph organizes work after the router has selected experts.

The implementation intentionally stays in src/moe-hot-cache where possible. The graph still has to connect tensors, but policy, worklist construction, and compact reduction live outside model core files.

Phase Model

The hot-cache graph no longer decides PP versus TG (text generation, the decode side) from n_tokens alone. llama_context infers one llm_graph_phase for the whole llama_decode() call and passes it through llm_graph_params.gphase. Every ubatch inside that call inherits the same phase.

This matters for two edge cases: a prompt can end with a one-token ubatch, and some generation paths can verify multiple generated tokens in one decode call. The phase keeps both cases on the intended path.
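The two edge cases can be illustrated with a minimal sketch. The phase names follow the text; `graph_phase`, `ubatch_plan`, `phase_from_n_tokens`, and `split_call` are hypothetical stand-ins, not the real llama.cpp types, and the token-count heuristic is only an assumed example of a rule based on n_tokens alone. How the real code infers the call phase is not shown here, so the sketch takes it as an input.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Phase names follow the text; everything else is a hypothetical stand-in.
enum class graph_phase { warmup, prompt_processing, decode };

struct ubatch_plan {
    int32_t     n_tokens;
    graph_phase phase;
};

// Assumed example of a token-count-only rule: one token means decode.
graph_phase phase_from_n_tokens(int32_t n_tokens) {
    return n_tokens == 1 ? graph_phase::decode : graph_phase::prompt_processing;
}

// Phase model: the phase is fixed once per decode call, and every ubatch
// produced by the split inherits it unchanged.
std::vector<ubatch_plan> split_call(int32_t n_tokens_total, int32_t n_ubatch,
                                    graph_phase call_phase) {
    assert(n_tokens_total > 0 && n_ubatch > 0);
    std::vector<ubatch_plan> plans;
    for (int32_t done = 0; done < n_tokens_total; done += n_ubatch) {
        const int32_t left = n_tokens_total - done;
        plans.push_back({ left < n_ubatch ? left : n_ubatch, call_phase });
    }
    return plans;
}
```

With a 513-token prompt and a 512-token ubatch, `split_call` yields a one-token tail that still carries the prompt-processing phase, whereas `phase_from_n_tokens(1)` would label it decode. A four-token verification batch passed in with the decode phase stays decode, whereas the token-count rule would label it prompt processing.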

Warmup

Warmup keeps conservative behavior. Worklist order remains token-major and PP-specific reduce paths stay off.

Prompt Processing

Prompt batches use the PP phase even if the final ubatch has only one token. Large PP shapes can use expert-major worklists, PP reduce merge, dense PP graph forms, and compact cold reduce.

Decode

Decode keeps token-major worklists. One-token decode can use direct merge, prefix merge, repeat hot input, and shared-row shortcuts. Tiny multi-token decode is still classified as decode, but it only uses the shortcuts that the profile explicitly allows for that shape.
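The three phases can be summarized as a small decision function. This is a sketch under stated assumptions, not the actual `llama_moe_hot_cache_pp_policy::build`: the `pp_plan` struct, `plan_for`, and the `min_expert_major_tokens` threshold are hypothetical, since the text only says that "large" PP shapes qualify without giving a number.

```cpp
#include <cassert>
#include <cstdint>

enum class graph_phase { warmup, prompt_processing, decode };
enum class worklist_order { token_major, expert_major };

// Hypothetical plan: only the fields needed to illustrate the phase rules.
struct pp_plan {
    worklist_order order;
    bool           pp_reduce_paths;   // PP-specific reduce paths allowed
    bool           decode_shortcuts;  // decode-only shortcuts may be considered
};

// min_expert_major_tokens is a made-up threshold for "large PP shapes".
pp_plan plan_for(graph_phase phase, int32_t n_tokens, int32_t min_expert_major_tokens) {
    assert(n_tokens > 0);
    switch (phase) {
        case graph_phase::prompt_processing: {
            // A one-token PP tail lands here too and never gets decode shortcuts.
            const bool large = n_tokens >= min_expert_major_tokens;
            return { large ? worklist_order::expert_major : worklist_order::token_major,
                     large, false };
        }
        case graph_phase::decode:
            // Which shortcuts actually apply still depends on the profile.
            return { worklist_order::token_major, false, true };
        case graph_phase::warmup:
            break;
    }
    // Warmup: conservative defaults.
    return { worklist_order::token_major, false, false };
}
```

Keeping this decision in one pure function is what makes the policy unit-testable without building a graph.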

Routing Prework

The standard MoE path and the hot-cache path should agree mathematically until routing has produced selected_experts and top-k weights. The paths should diverge only after those tensors exist: standard MoE consumes them directly, while the hot-cache path sends them through the worklist dispatcher into hot and cold lanes.

Standard Path

build_moe_ffn builds router logits, selected experts, normalized top-k weights, and then immediately continues into normal expert execution.

Hot-Cache Path

build_layer_ffn_hot builds equivalent routing tensors, creates a hot/cold worklist, evaluates both lanes, and then reduces or merges the branch outputs.

Stage	Standard MoE	Hot Cache
Router	`build_lora_mm(ffn_gate_inp, cur)`	`build_lora_mm(ffn_gate_inp, cur)`
Selection	`softmax(all logits)`, then top-k on probabilities	top-k directly on logits
Weights	gather top-k weights, normalize, and scale if needed	gather top-k logits, softmax over the top-k set, and scale if needed
Consumer	normal expert execution	worklist dispatcher, hot lane, cold lane, reduce, merge

For softmax routing, softmax is strictly increasing in each logit, so top-k on logits and top-k on softmax(logits) select the same experts in the same order (ties aside). After top-k renormalization, the global softmax denominator cancels out, so the final weights also match. That makes the two routing variants equivalent for Qwen3.6, even though the graph node sequence differs.
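The cancellation can be written out explicitly. The symbols are chosen here for illustration: z_i is the router logit of expert i, E is the number of experts, and S is the selected top-k set. Any later scaling is applied identically in both paths and is left out.

```latex
\begin{align*}
p_i &= \frac{e^{z_i}}{Z}, \qquad Z = \sum_{j=1}^{E} e^{z_j} > 0 \\
w_i^{\text{standard}} &= \frac{p_i}{\sum_{j \in S} p_j}
  = \frac{e^{z_i}/Z}{\sum_{j \in S} e^{z_j}/Z}
  = \frac{e^{z_i}}{\sum_{j \in S} e^{z_j}}
  = w_i^{\text{hot}}, \qquad i \in S
\end{align*}
```

The left-hand form is "softmax over all logits, gather, normalize"; the right-hand form is "gather top-k logits, softmax over the top-k set". In floating point the two can differ by rounding, so exact bitwise equality should not be assumed.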

Dense PP Graph

Dense PP is the default prompt-processing graph for the supported hot-cache adapters. The name means the graph keeps the original selected-slot grid available for merge semantics while the worklist decides which slots are hot, which are cold, and which expert lane should process each hot slot.

This is separate from --moe-hot-cache-pp-reduce-merge. Dense PP decides whether a real multi-token prompt-processing batch can enter the hot-cache graph. PP reduce-merge decides whether hot and cold branch outputs are reduced to [n_embd, n_tokens] before the final add. Both can be benchmarked independently.

Adapter Default

llama_moe_hot_cache_graph_profile::pp_dense enables dense PP for Qwen35Moe, Qwen3Next, Gemma4, Mellum, GPT-OSS, DeepSeek2, and GLM4 MoE profiles.

Cold Placement

pp_primary_cold_backend keeps dense PP cold-branch graph work on the primary backend by default. The uncached expert tensors still follow the normal CPU/RAM MoE path.

Multi-Lane Counters

The multi-lane dense worklist exposes fixed HOT, HOT1, and HOT2 fields. Perf readback sums those lane counts so the UI hit rate matches all cached expert lanes, not only the first lane.
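A minimal sketch of the lane summation follows. The lane names HOT, HOT1, and HOT2 come from the text; the `lane_counts` struct, the `cold` field, and the definition of hit rate as hot slots divided by all routed slots are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter record; lane names follow the text.
struct lane_counts {
    int64_t hot;
    int64_t hot1;
    int64_t hot2;
    int64_t cold;
};

// Assumed definition: hit rate = (HOT + HOT1 + HOT2) / all routed slots.
double hit_rate(const lane_counts & c) {
    assert(c.hot >= 0 && c.hot1 >= 0 && c.hot2 >= 0 && c.cold >= 0);
    const int64_t hot_total = c.hot + c.hot1 + c.hot2;
    const int64_t total     = hot_total + c.cold;
    return total == 0 ? 0.0 : static_cast<double>(hot_total) / static_cast<double>(total);
}
```

Counting only the first lane in a run with counts {2, 1, 1, 4} would report 2/8 instead of 4/8, which is the under-reporting that summing the lanes avoids.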

For A/B testing, set LLAMA_MOE_HOT_CACHE_PP_DENSE=0 to disable dense PP or LLAMA_MOE_HOT_CACHE_PP_COLD_BACKEND=cpu to force the older cold-backend placement. These switches are intended for diagnosis, not normal profile files.
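One way such overrides could be layered on top of a profile default is sketched below. Only the two variable names and the values `0` and `cpu` come from the text; the helper names and the handling of any other value (fall back to the profile default) are assumptions, and the real parser may behave differently.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

enum class cold_backend { primary, cpu };

// Assumption: only the literal "0" disables dense PP; an unset variable or
// any other value leaves the profile default untouched.
bool dense_pp_from_env(const char * value, bool profile_default) {
    if (value != nullptr && std::strcmp(value, "0") == 0) {
        return false;
    }
    return profile_default;
}

// Assumption: only the literal "cpu" forces the older cold-backend placement.
cold_backend cold_backend_from_env(const char * value, cold_backend profile_default) {
    if (value != nullptr && std::strcmp(value, "cpu") == 0) {
        return cold_backend::cpu;
    }
    return profile_default;
}

// Thin wrappers that read the process environment.
bool dense_pp_enabled(bool profile_default) {
    return dense_pp_from_env(std::getenv("LLAMA_MOE_HOT_CACHE_PP_DENSE"), profile_default);
}

cold_backend pp_cold_backend(cold_backend profile_default) {
    return cold_backend_from_env(std::getenv("LLAMA_MOE_HOT_CACHE_PP_COLD_BACKEND"),
                                 profile_default);
}
```

Separating the string parsing from the `std::getenv` call keeps the override logic testable without mutating the environment.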

CUDA Fusion Detail

The standard MoE helper expands weights early so CUDA can detect the topk-moe routing pattern. The relevant backend code lives around ggml/src/ggml-cuda/topk-moe.cu and ggml/src/ggml-cuda/ggml-cuda.cu.

The hot-cache graph must feed a dispatcher instead of the normal expert consumer, so it uses a different node sequence. The math can be the same while the backend optimization pattern is not. This is why routing prework is a future refactor target rather than a trivial call into build_moe_ffn.

Class And Module Diagram

The PP feature is split into policy, worklist building, graph glue, and compact branch reduction.

Runtime Flow

This is the hot-cache PP path after the router logits are available. The worklist is the dispatcher: it maps selected experts into hot and cold work and records how the branch outputs must be merged.

New Method Reference

These are the main methods/classes introduced or changed for the PP work. File links point at the local repository structure.

Method or type	File	Responsibility	Why it helps
`llm_graph_phase`	llama-graph.h	Stores `warmup`, `prompt_processing`, or `decode` as graph topology input.	Prevents graph reuse and hot-cache policy from mixing PP and TG shapes that happen to have the same token count.
`infer_decode_graph_phase`	llama-context.cpp	Infers the phase once for a full `llama_decode()` call before ubatch splitting.	Keeps a one-token PP tail in PP and keeps tiny multi-token TG in decode.
`llama_moe_hot_cache_graph_phase_from_llm`	llama-moe-hot-cache-graph.cpp	Converts the core graph phase into the hot-cache local phase enum.	Keeps llama core independent from hot-cache policy types.
`llama_moe_hot_cache_pp_policy::build`	llama-moe-hot-cache-pp.cpp	Creates the execution plan from an explicit phase, token count, capacity, cold-lane state, and model profile.	Keeps phase-sensitive PP/TG decisions out of the graph glue and makes them unit-testable.
`reduce_merge_enabled`	llama-moe-hot-cache-pp.cpp	Reads the PP reduce-merge mode and decides off, on, or auto.	Provides the main runtime gate for the PP merge optimization.
`compact_cold_reduce_enabled`	llama-moe-hot-cache-pp.cpp	Controls the compact cold reduce sub-feature.	Allows the large PP cold-path optimization to be disabled independently.
`dense_enabled`	llama-moe-hot-cache-pp.cpp	Combines graph phase, token count, adapter profile, and `LLAMA_MOE_HOT_CACHE_PP_DENSE`.	Lets real prompt-processing batches use the dense hot-cache graph while decode remains isolated.
`cold_backend`	llama-moe-hot-cache-pp.cpp	Chooses CPU or primary backend placement for dense PP cold graph work.	Makes the new cold-placement behavior profile-driven and overrideable for benchmarks.
`worklist_order`	llama-moe-hot-cache-pp.cpp	Selects token-major or expert-major worklist order.	Lets PP use expert grouping while decode remains token-major.
`llama_moe_hot_cache_build_worklist`	llama-moe-hot-cache-worklist.cpp	Builds hot/cold worklists from selected expert IDs and weights.	Central dispatcher for split, source slots, token IDs, weights, and counts.
`llama_moe_hot_cache_build_worklist_from_logits`	llama-moe-hot-cache-worklist.cpp	Builds worklists directly from logits for CPU decode routing paths.	Keeps logits-based routing in the same builder module.
`llama_moe_hot_cache_build_compact_cold_reduce`	llama-moe-hot-cache-branch-reduce.cpp	Adds the compact cold reduce op to the graph.	Avoids materializing a large cold slot tensor during PP.
`llama_moe_hot_cache_reduce_cold_token_rows`	llama-moe-hot-cache-branch-reduce.cpp	CPU callback that accumulates compact cold slots into token rows.	Implements the actual compact reduction with direct row pointers.
`build_layer_ffn_hot` graph glue	llama-moe-hot-cache-graph.cpp	Connects policy, worklist, hot lane, cold lane, and merge tensors.	Contains glue only; feature decisions and low-level work stay in helpers.
`llama_moe_hot_cache_build_moe_hot_pp_dense_from_logits`	llama-moe-hot-cache-graph.cpp	Builds the single-lane dense PP graph from router logits.	Keeps prompt-processing hot-cache execution in the generic logits adapter path.
`llama_moe_hot_cache_build_moe_hot_multi_pp_dense_from_logits`	llama-moe-hot-cache-graph.cpp	Builds dense PP for up to three hot-cache expert lanes plus the cold lane.	Lets multi-GPU hot-cache profiles use PP instead of falling back to the default llama path.
`ffn_moe_worklist_multi_pp_dense` perf handling	perf-reader.cpp and perf-json.cpp	Reads multi-lane dense PP worklists and derives hot/cold slot totals when direct counters are absent.	Keeps Web UI hit-rate and update inputs correct for multi-lane PP runs.

Separation Of Concerns

Policy

llama-moe-hot-cache-pp decides what should happen. It does not build tensors.

Worklist

llama-moe-hot-cache-worklist owns worklist layout, token-major/expert-major ordering, and hot/cold split data.
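The sketch below illustrates the hot/cold split and the two orderings on plain vectors. It is not the real `llama_moe_hot_cache_build_worklist`: the `slot` and `worklist` structs, the `is_hot` lookup, and the token-major input layout are assumptions for illustration, and the real builder works on graph tensors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class worklist_order { token_major, expert_major };

// One routed slot: a token, the expert selected for it, and its weight.
struct slot {
    int32_t token;
    int32_t expert;
    int32_t source;  // index into the original selected-slot grid
    float   weight;
};

struct worklist {
    std::vector<slot> hot;
    std::vector<slot> cold;
};

// selected and weights are laid out token-major: index = token * top_k + k.
// is_hot[e] says whether expert e is resident in the hot cache.
worklist build_worklist(const std::vector<int32_t> & selected,
                        const std::vector<float>   & weights,
                        const std::vector<bool>    & is_hot,
                        int32_t top_k, worklist_order order) {
    assert(top_k > 0 && selected.size() == weights.size());
    assert(selected.size() % static_cast<std::size_t>(top_k) == 0);
    worklist wl;
    for (std::size_t i = 0; i < selected.size(); ++i) {
        const int32_t e = selected[i];
        assert(e >= 0 && static_cast<std::size_t>(e) < is_hot.size());
        const slot s = { static_cast<int32_t>(i / static_cast<std::size_t>(top_k)), e,
                         static_cast<int32_t>(i), weights[i] };
        (is_hot[e] ? wl.hot : wl.cold).push_back(s);
    }
    if (order == worklist_order::expert_major) {
        // stable_sort groups slots by expert and keeps token order inside each group.
        const auto by_expert = [](const slot & a, const slot & b) { return a.expert < b.expert; };
        std::stable_sort(wl.hot.begin(),  wl.hot.end(),  by_expert);
        std::stable_sort(wl.cold.begin(), wl.cold.end(), by_expert);
    }
    return wl;
}
```

Recording `source` for every slot is what lets a later merge step place each branch output back at its position in the original selected-slot grid, regardless of the worklist order.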

Branch Reduce

llama-moe-hot-cache-branch-reduce owns the compact cold reduce operation and its CPU callback.
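The idea behind the compact reduce can be sketched on plain vectors: each compact cold slot is accumulated straight into its token row, so no full slot-grid tensor has to be filled first. The function name, signature, and the choice to apply the routing weight inside the reduce are assumptions for illustration; the real callback operates on ggml tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// cold_out holds one n_embd-wide row per compact cold slot.
// token_ids[i] and weights[i] describe cold slot i.
// dst becomes n_tokens rows of n_embd floats, zero where a token has no cold slot.
void reduce_cold_token_rows(const std::vector<float>   & cold_out,
                            const std::vector<int32_t> & token_ids,
                            const std::vector<float>   & weights,
                            int32_t n_embd, int32_t n_tokens,
                            std::vector<float> & dst) {
    assert(n_embd > 0 && n_tokens > 0);
    assert(token_ids.size() == weights.size());
    assert(cold_out.size() == token_ids.size() * static_cast<std::size_t>(n_embd));
    dst.assign(static_cast<std::size_t>(n_tokens) * n_embd, 0.0f);
    for (std::size_t i = 0; i < token_ids.size(); ++i) {
        assert(token_ids[i] >= 0 && token_ids[i] < n_tokens);
        // Direct row pointers: read slot i, accumulate into its token row.
        const float * src = cold_out.data() + i * n_embd;
        float *       row = dst.data() + static_cast<std::size_t>(token_ids[i]) * n_embd;
        for (int32_t j = 0; j < n_embd; ++j) {
            row[j] += weights[i] * src[j];
        }
    }
}
```

The working set scales with the number of cold slots rather than with the full selected-slot grid, which is where the saving for large PP batches would come from.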

Graph Glue

llama-moe-hot-cache-graph wires tensors together. It should avoid owning policy or low-level reduce logic.

Future Routing Refactor

A clean next step is to extract routing prework from build_moe_ffn into a read-only helper that returns router logits, selected experts, and normalized weights. The standard path would keep its current expert execution, while the hot-cache path would consume the same routing result through the worklist builder.

Expected Benefits

Less duplicated routing logic.
Lower risk of math drift between standard and hot-cache routing.
Better chance to preserve CUDA topk-moe fusion patterns.
Cleaner model onboarding because routing behavior is centralized.

Risks

build_moe_ffn is shared by many architectures.
CUDA fusion depends on exact graph node patterns.
Model-specific bias, group routing, sigmoid gating, and scaling must remain unchanged.
Decode small-token shortcuts must stay explicit.

Step	Purpose
Extract read-only routing helper	Keep standard behavior unchanged while exposing reusable routing tensors.
Verify standard path first	Check PP and same-workload TG. For synthetic TG, inspect hot-cache hit rate because generated tokens can select a different expert distribution.
Rewire hot-cache PP	Consume the shared routing result in the worklist builder.
Re-run fixed-cache Qwen3.6 PP tests	Compare 512, 1024, 1536, 1792, 2048, 3072, and 4096 MiB cache sizes.
Check CUDA fusion	Confirm the standard path still triggers the intended top-k MoE fusion.

The immediate goal is not to make dynamic graph decisions after top-k. First make routing shared and reliable; later threshold work can build on the same optimized routing path as standard llama.cpp.

Resolved Edge Cases

One-Token PP Tail

If a prompt is split as 512 + 1, the final ubatch has n_tokens == 1, but the enclosing llama_decode() call is still classified as prompt processing. Decode-only shortcuts stay off for that tail.

Multi-Token Decode

Models or modes that decode multiple tokens at once can have n_tokens > 1 during TG. Small verification batches are classified as decode, so they do not accidentally use PP worklist and reduce rules.