What The PP Changes Do

Prompt processing (PP) handles many tokens at once. In mixture-of-experts (MoE) models this creates many routed expert slots per layer. The PP work focuses on reducing the cost of sorting, splitting, materializing, and merging those slots while keeping decode behavior separate.

  • PP reduce merge
  • Dense PP worklist
  • Compact cold reduce
  • Expert-major worklist
  • Multi-lane PP accounting
  • Decode stays token-major

Non-Goals

These changes do not alter model weights, training, routing logits, or expert math. They only change how the hot-cache graph organizes work after the router has selected experts.

The implementation intentionally stays in src/moe-hot-cache where possible. The graph still has to connect tensors, but policy, worklist construction, and compact reduction live outside model core files.

Phase Model

The hot-cache graph no longer decides PP versus token generation (TG) from n_tokens alone. llama_context infers one llm_graph_phase for the whole llama_decode() call and passes it through llm_graph_params.gphase. Every ubatch inside that call inherits the same phase.

This matters for two edge cases: a prompt can end with a one-token ubatch, and some generation paths can verify multiple generated tokens in one decode call. The phase keeps both cases on the intended path.
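The inheritance rule can be illustrated with a minimal sketch. The enum spelling, the ubatch_info struct, and split_with_phase are hypothetical names used only for illustration; how the real code infers the call-level phase is not shown here, so the phase is simply passed in by the caller.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the three phases named in the text; the spelling is hypothetical.
enum class graph_phase { warmup, prompt_processing, decode };

// Hypothetical ubatch record: token count plus the phase inherited from the call.
struct ubatch_info {
    int32_t     n_tokens;
    graph_phase phase;
};

// Split one decode call into ubatches. The phase is decided once by the caller
// and copied into every ubatch, so a one-token tail never re-derives it.
std::vector<ubatch_info> split_with_phase(int32_t n_tokens_total, int32_t n_ubatch, graph_phase call_phase) {
    assert(n_tokens_total >= 0 && n_ubatch > 0);
    std::vector<ubatch_info> out;
    for (int32_t done = 0; done < n_tokens_total; done += n_ubatch) {
        out.push_back({std::min(n_ubatch, n_tokens_total - done), call_phase});
    }
    return out;
}
```

With 513 prompt tokens and a ubatch size of 512, the second ubatch has one token but still carries the prompt-processing phase, which is the first edge case described above.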

Warmup

Warmup keeps conservative behavior. Worklist order remains token-major and PP-specific reduce paths stay off.

Prompt Processing

Prompt batches use the PP phase even if the final ubatch has only one token. Large PP shapes can use expert-major worklists, PP reduce merge, dense PP graph forms, and compact cold reduce.

Decode

Decode keeps token-major worklists. One-token decode can use direct merge, prefix merge, repeat hot input, and shared-row shortcuts. Tiny multi-token decode stays decode but only uses the shortcuts whose profile explicitly allows that shape.
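The three phase rules above could be summarized in a small policy function. This is a sketch under stated assumptions, not the actual policy code: execution_plan, make_plan, and the kLargePpTokens threshold are invented for illustration, and small PP shapes are assumed to fall back to token-major order.

```cpp
#include <cassert>
#include <cstdint>

enum class graph_phase { warmup, prompt_processing, decode };
enum class worklist_order { token_major, expert_major };

// Hypothetical plan; the real execution plan likely has more fields.
struct execution_plan {
    worklist_order order;
    bool pp_reduce_paths;   // PP reduce merge / compact cold reduce allowed
    bool decode_shortcuts;  // direct merge, prefix merge, etc. allowed
};

// ASSUMPTION: the threshold for a "large" PP shape is invented for illustration.
constexpr int32_t kLargePpTokens = 32;

execution_plan make_plan(graph_phase phase, int32_t n_tokens, bool profile_allows_multi_token_shortcuts) {
    assert(n_tokens > 0);
    switch (phase) {
        case graph_phase::prompt_processing: {
            // PP never enables decode shortcuts, even for a one-token tail.
            const bool large = n_tokens >= kLargePpTokens;
            return { large ? worklist_order::expert_major : worklist_order::token_major, large, false };
        }
        case graph_phase::decode:
            // Decode stays token-major; multi-token decode needs profile opt-in.
            return { worklist_order::token_major, false,
                     n_tokens == 1 || profile_allows_multi_token_shortcuts };
        case graph_phase::warmup:
            break;
    }
    // Warmup: conservative defaults, everything PP- or decode-specific off.
    return { worklist_order::token_major, false, false };
}
```

Keeping the decision in a pure function of phase, token count, and profile flags is what makes this kind of policy easy to unit-test without building a graph.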

Routing Prework

The standard MoE path and the hot-cache path should agree mathematically until routing has produced selected_experts and top-k weights. The paths should diverge only after those tensors exist: standard MoE consumes them directly, while the hot-cache path sends them through the worklist dispatcher into hot and cold lanes.

Standard Path

build_moe_ffn builds router logits, selected experts, normalized top-k weights, and then immediately continues into normal expert execution.

Hot-Cache Path

build_layer_ffn_hot builds equivalent routing tensors, creates a hot/cold worklist, evaluates both lanes, and then reduces or merges the branch outputs.

| Stage | Standard MoE | Hot Cache |
| --- | --- | --- |
| Router | build_lora_mm(ffn_gate_inp, cur) | build_lora_mm(ffn_gate_inp, cur) |
| Selection | softmax(all logits), then top-k on probabilities | top-k directly on logits |
| Weights | gather top-k weights, normalize, and scale if needed | gather top-k logits, softmax over the top-k set, and scale if needed |
| Consumer | normal expert execution | worklist dispatcher, hot lane, cold lane, reduce, merge |

For softmax routing, top-k on logits and top-k on softmax(logits) produce the same ordering, because softmax is strictly increasing in each logit. After top-k renormalization, the global softmax denominator cancels out, so the normalized weights match as well. That makes the two routing variants equivalent for Qwen3.6, even though the graph node sequence differs.
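The equivalence can be checked numerically with a small standalone sketch. The routing struct and both function names are hypothetical; they model the two orders of operations on a plain vector of logits and are not the graph-building code.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

struct routing {
    std::vector<int>   ids;      // selected expert ids, best first
    std::vector<float> weights;  // normalized over the selected set
};

static std::vector<float> softmax(const std::vector<float> & x) {
    assert(!x.empty());
    const float m = *std::max_element(x.begin(), x.end());
    std::vector<float> p(x.size());
    float sum = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) { p[i] = std::exp(x[i] - m); sum += p[i]; }
    for (float & v : p) v /= sum;
    return p;
}

// Indices of the k largest values, largest first (ties keep the lower index).
static std::vector<int> top_k(const std::vector<float> & v, int k) {
    assert(k > 0 && (size_t) k <= v.size());
    std::vector<int> idx(v.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](int a, int b) { return v[a] > v[b]; });
    idx.resize(k);
    return idx;
}

// Standard order: softmax over all logits, top-k on probabilities, renormalize.
routing route_softmax_then_topk(const std::vector<float> & logits, int k) {
    const std::vector<float> p = softmax(logits);
    routing r{top_k(p, k), {}};
    float sum = 0.0f;
    for (int id : r.ids) sum += p[id];
    for (int id : r.ids) r.weights.push_back(p[id] / sum);
    return r;
}

// Hot-cache order: top-k on logits, then softmax over the selected logits only.
routing route_topk_then_softmax(const std::vector<float> & logits, int k) {
    routing r{top_k(logits, k), {}};
    std::vector<float> sel;
    for (int id : r.ids) sel.push_back(logits[id]);
    r.weights = softmax(sel);
    return r;
}
```

Both functions should return the same ids and, up to floating-point rounding, the same weights. Logits that are nearly equal can round to identical probabilities, so exact ties are an edge case where the two selections are not guaranteed to agree bit for bit.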

Dense PP Graph

Dense PP is the default prompt-processing graph for the supported hot-cache adapters. The name means the graph keeps the original selected-slot grid available for merge semantics while the worklist decides which slots are hot, which are cold, and which expert lane should process each hot slot.

This is separate from --moe-hot-cache-pp-reduce-merge. Dense PP decides whether a real multi-token prompt-processing batch can enter the hot-cache graph. PP reduce-merge decides whether hot and cold branch outputs are reduced to [n_embd, n_tokens] before the final add. Both can be benchmarked independently.

Adapter Default

llama_moe_hot_cache_graph_profile::pp_dense enables dense PP for Qwen35Moe, Qwen3Next, Gemma4, Mellum, GPT-OSS, DeepSeek2, and GLM4 MoE profiles.

Cold Placement

pp_primary_cold_backend keeps dense PP cold-branch graph work on the primary backend by default. The uncached expert tensors still follow the normal CPU/RAM MoE path.

Multi-Lane Counters

The multi-lane dense worklist exposes fixed HOT, HOT1, and HOT2 fields. Perf readback sums those lane counts so the UI hit rate matches all cached expert lanes, not only the first lane.
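A minimal sketch of the lane summation follows. The lane_counts layout and the cold field are assumptions made for illustration; the text only names the HOT, HOT1, and HOT2 fields.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical counter layout for one perf readback.
struct lane_counts {
    int64_t hot;   // first hot lane
    int64_t hot1;  // second hot lane
    int64_t hot2;  // third hot lane
    int64_t cold;  // slots routed to the cold lane (assumed field)
};

// All cached lanes count as hits, not only the first one.
int64_t hot_total(const lane_counts & c) {
    return c.hot + c.hot1 + c.hot2;
}

// Fraction of routed slots served by any cached lane; 0 when nothing was routed.
double hit_rate(const lane_counts & c) {
    assert(c.hot >= 0 && c.hot1 >= 0 && c.hot2 >= 0 && c.cold >= 0);
    const int64_t total = hot_total(c) + c.cold;
    return total == 0 ? 0.0 : static_cast<double>(hot_total(c)) / static_cast<double>(total);
}
```

For counts of 300, 200, and 100 hot slots against 400 cold slots, summing all lanes gives a hit rate of 0.6, whereas reading only the first lane would report 0.3.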

For A/B testing, set LLAMA_MOE_HOT_CACHE_PP_DENSE=0 to disable dense PP or LLAMA_MOE_HOT_CACHE_PP_COLD_BACKEND=cpu to force the older cold-backend placement. These switches are intended for diagnosis, not normal profile files.
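How such a switch might combine with the phase and profile is sketched below. Only the environment variable name comes from the text; the function names, the rule that only the literal value "0" disables the feature, and the n_tokens > 1 reading of "multi-token" are assumptions.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

enum class graph_phase { warmup, prompt_processing, decode };

// ASSUMPTION: only the literal value "0" disables dense PP; an unset variable
// or any other value keeps the profile default.
bool env_allows_dense(const char * value) {
    return value == nullptr || std::strcmp(value, "0") != 0;
}

// Pure decision function so the rule can be tested without touching the environment.
// ASSUMPTION: "multi-token" is read as n_tokens > 1; how the real code treats a
// one-token PP tail is not stated in the text.
bool dense_pp_enabled(graph_phase phase, int n_tokens, bool profile_pp_dense, const char * env_value) {
    assert(n_tokens > 0);
    return phase == graph_phase::prompt_processing
        && n_tokens > 1
        && profile_pp_dense
        && env_allows_dense(env_value);
}

// Thin wrapper that reads the diagnostic switch named in the text.
bool dense_pp_enabled_from_env(graph_phase phase, int n_tokens, bool profile_pp_dense) {
    return dense_pp_enabled(phase, n_tokens, profile_pp_dense, std::getenv("LLAMA_MOE_HOT_CACHE_PP_DENSE"));
}
```

Separating the pure decision from the getenv call keeps A/B runs reproducible in tests: the same inputs can be checked with and without the override.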

CUDA Fusion Detail

The standard MoE helper expands weights early so CUDA can detect the topk-moe routing pattern. The relevant backend code lives around ggml/src/ggml-cuda/topk-moe.cu and ggml/src/ggml-cuda/ggml-cuda.cu.

The hot-cache graph must feed a dispatcher instead of the normal expert consumer, so it uses a different node sequence. The math can be the same while the backend optimization pattern is not. This is why routing prework is a future refactor target rather than a trivial call into build_moe_ffn.

Class And Module Diagram

The PP feature is split into policy, worklist building, graph glue, and compact branch reduction.

  • Model Adapter/Profile (llama_moe_hot_cache_model_adapter, llama_moe_hot_cache_graph_profile): declares decode and graph capabilities.
  • PP Policy (llama_moe_hot_cache_pp_policy, llama_moe_hot_cache_pp_execution_plan): uses the explicit graph phase, then chooses reduce merge and worklist order.
  • Graph Glue (llama-moe-hot-cache-graph.cpp): builds tensors and connects hot lane, cold lane, and merge.
  • Worklist Builder (llama_moe_hot_cache_build_worklist, build_worklist_from_logits): creates hot/cold ids, token ids, weights, counts, and source slots.
  • Compact Cold Reduce (llama_moe_hot_cache_build_compact_cold_reduce, llama_moe_hot_cache_reduce_cold_token_rows): accumulates compact cold slots directly into token rows.

Runtime Flow

This is the hot-cache PP path after the router logits are available. The worklist is the dispatcher: it maps selected experts into hot and cold work and records how the branch outputs must be merged.

  • PP UBatch: phase = PP
  • Router: logits per token
  • Top-K: ids + weights
  • Worklist: expert-major PP
  • Hot Lane: cached experts on CUDA0
  • Cold Lane: CPU/RAM expert path
  • Branch Merge: hot + compact cold
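The dispatcher role of the worklist can be sketched in a few lines. The slot and worklist structs and the build_worklist signature here are hypothetical simplifications; the fields follow the data the text says the worklist records (token ids, weights, source slots), but the real layout is not shown in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <unordered_set>
#include <vector>

// One routed slot: which token asked for which expert, with what weight,
// and where it sat in the original [n_tokens, k] selection grid.
struct slot {
    int32_t token;
    int32_t expert;
    float   weight;
    int32_t source_slot;  // token * k + position within the token's top-k
};

struct worklist {
    std::vector<slot> hot;   // experts present in the cache
    std::vector<slot> cold;  // experts that must use the uncached path
};

// selected and weights are flattened [n_tokens * k] arrays in token-major order.
worklist build_worklist(const std::vector<int32_t> & selected, const std::vector<float> & weights,
                        int32_t k, const std::unordered_set<int32_t> & cached, bool expert_major) {
    assert(k > 0 && selected.size() == weights.size());
    assert(selected.size() % static_cast<size_t>(k) == 0);
    worklist wl;
    for (int32_t i = 0; i < static_cast<int32_t>(selected.size()); ++i) {
        const slot s{i / k, selected[i], weights[i], i};
        (cached.count(s.expert) ? wl.hot : wl.cold).push_back(s);
    }
    if (expert_major) {
        // Group slots by expert; stable_sort keeps token order inside each group.
        auto by_expert = [](const slot & a, const slot & b) { return a.expert < b.expert; };
        std::stable_sort(wl.hot.begin(),  wl.hot.end(),  by_expert);
        std::stable_sort(wl.cold.begin(), wl.cold.end(), by_expert);
    }
    return wl;
}
```

Because every slot keeps its source_slot, the merge step can still place each branch output back into the original selection grid after expert-major reordering.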

New Method Reference

These are the main methods/classes introduced or changed for the PP work. File links point at the local repository structure.

| Method or type | File | Responsibility | Why it helps |
| --- | --- | --- | --- |
| llm_graph_phase | llama-graph.h | Stores warmup, prompt_processing, or decode as graph topology input. | Prevents graph reuse and hot-cache policy from mixing PP and TG shapes that happen to have the same token count. |
| infer_decode_graph_phase | llama-context.cpp | Infers the phase once for a full llama_decode() call before ubatch splitting. | Keeps a one-token PP tail in PP and keeps tiny multi-token TG in decode. |
| llama_moe_hot_cache_graph_phase_from_llm | llama-moe-hot-cache-graph.cpp | Converts the core graph phase into the hot-cache local phase enum. | Keeps llama core independent from hot-cache policy types. |
| llama_moe_hot_cache_pp_policy::build | llama-moe-hot-cache-pp.cpp | Creates the execution plan from an explicit phase, token count, capacity, cold-lane state, and model profile. | Keeps phase-sensitive PP/TG decisions out of the graph glue and makes them unit-testable. |
| reduce_merge_enabled | llama-moe-hot-cache-pp.cpp | Reads the PP reduce-merge mode and decides off, on, or auto. | Provides the main runtime gate for the PP merge optimization. |
| compact_cold_reduce_enabled | llama-moe-hot-cache-pp.cpp | Controls the compact cold reduce sub-feature. | Allows the large PP cold-path optimization to be disabled independently. |
| dense_enabled | llama-moe-hot-cache-pp.cpp | Combines graph phase, token count, adapter profile, and LLAMA_MOE_HOT_CACHE_PP_DENSE. | Lets real prompt-processing batches use the dense hot-cache graph while decode remains isolated. |
| cold_backend | llama-moe-hot-cache-pp.cpp | Chooses CPU or primary backend placement for dense PP cold graph work. | Makes the new cold-placement behavior profile-driven and overrideable for benchmarks. |
| worklist_order | llama-moe-hot-cache-pp.cpp | Selects token-major or expert-major worklist order. | Lets PP use expert grouping while decode remains token-major. |
| llama_moe_hot_cache_build_worklist | llama-moe-hot-cache-worklist.cpp | Builds hot/cold worklists from selected expert IDs and weights. | Central dispatcher for split, source slots, token IDs, weights, and counts. |
| llama_moe_hot_cache_build_worklist_from_logits | llama-moe-hot-cache-worklist.cpp | Builds worklists directly from logits for CPU decode routing paths. | Keeps logits-based routing in the same builder module. |
| llama_moe_hot_cache_build_compact_cold_reduce | llama-moe-hot-cache-branch-reduce.cpp | Adds the compact cold reduce op to the graph. | Avoids materializing a large cold slot tensor during PP. |
| llama_moe_hot_cache_reduce_cold_token_rows | llama-moe-hot-cache-branch-reduce.cpp | CPU callback that accumulates compact cold slots into token rows. | Implements the actual compact reduction with direct row pointers. |
| build_layer_ffn_hot graph glue | llama-moe-hot-cache-graph.cpp | Connects policy, worklist, hot lane, cold lane, and merge tensors. | Contains glue only; feature decisions and low-level work stay in helpers. |
| llama_moe_hot_cache_build_moe_hot_pp_dense_from_logits | llama-moe-hot-cache-graph.cpp | Builds the single-lane dense PP graph from router logits. | Keeps prompt-processing hot-cache execution in the generic logits adapter path. |
| llama_moe_hot_cache_build_moe_hot_multi_pp_dense_from_logits | llama-moe-hot-cache-graph.cpp | Builds dense PP for up to three hot-cache expert lanes plus the cold lane. | Lets multi-GPU hot-cache profiles use PP instead of falling back to the default llama path. |
| ffn_moe_worklist_multi_pp_dense perf handling | perf-reader.cpp and perf-json.cpp | Reads multi-lane dense PP worklists and derives hot/cold slot totals when direct counters are absent. | Keeps Web UI hit-rate and update inputs correct for multi-lane PP runs. |

Separation Of Concerns

Policy

llama-moe-hot-cache-pp decides what should happen. It does not build tensors.

Worklist

llama-moe-hot-cache-worklist owns worklist layout, token-major/expert-major ordering, and hot/cold split data.

Branch Reduce

llama-moe-hot-cache-branch-reduce owns the compact cold reduce operation and its CPU callback.

Graph Glue

llama-moe-hot-cache-graph wires tensors together. It should avoid owning policy or low-level reduce logic.

Future Routing Refactor

A clean next step is to extract routing prework from build_moe_ffn into a read-only helper that returns router logits, selected experts, and normalized weights. The standard path would keep its current expert execution, while the hot-cache path would consume the same routing result through the worklist builder.
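The shape of such a helper might look like the following toy model. Everything here is hypothetical: routing_result, compute_routing, the two consumer functions, and the scalar toy_expert stand in for graph tensors and real expert FFNs, and the helper omits router logits, bias, group routing, and scaling for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <unordered_set>
#include <vector>

// Read-only routing result shared by both consumers (hypothetical type).
struct routing_result {
    std::vector<int>   selected;  // top-k expert ids, best first
    std::vector<float> weights;   // normalized over the selected set
};

// Single place where routing is computed: top-k on logits, softmax over the set.
routing_result compute_routing(const std::vector<float> & logits, int k) {
    assert(k > 0 && static_cast<size_t>(k) <= logits.size());
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(), [&](int a, int b) { return logits[a] > logits[b]; });
    idx.resize(k);
    routing_result r{idx, {}};
    float sum = 0.0f;
    for (int id : idx) {
        r.weights.push_back(std::exp(logits[id] - logits[idx[0]]));
        sum += r.weights.back();
    }
    for (float & w : r.weights) w /= sum;
    return r;
}

// Toy stand-in for an expert FFN: expert e scales its input by (e + 1).
static float toy_expert(int e, float x) { return static_cast<float>(e + 1) * x; }

// Standard consumer: weighted sum over all selected experts.
float standard_consume(const routing_result & r, float x) {
    float y = 0.0f;
    for (size_t i = 0; i < r.selected.size(); ++i) y += r.weights[i] * toy_expert(r.selected[i], x);
    return y;
}

// Hot-cache consumer: same routing, split into hot and cold lanes, then merged.
float hot_cache_consume(const routing_result & r, float x, const std::unordered_set<int> & cached) {
    float hot = 0.0f, cold = 0.0f;
    for (size_t i = 0; i < r.selected.size(); ++i) {
        float & lane = cached.count(r.selected[i]) ? hot : cold;
        lane += r.weights[i] * toy_expert(r.selected[i], x);
    }
    return hot + cold;
}
```

Since both consumers take the same const routing_result, any routing change lands in one function, and the two outputs can differ only by floating-point summation order.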

Expected Benefits

  • Less duplicated routing logic.
  • Lower risk of math drift between standard and hot-cache routing.
  • Better chance to preserve CUDA topk-moe fusion patterns.
  • Cleaner model onboarding because routing behavior is centralized.

Risks

  • build_moe_ffn is shared by many architectures.
  • CUDA fusion depends on exact graph node patterns.
  • Model-specific bias, group routing, sigmoid gating, and scaling must remain unchanged.
  • Decode small-token shortcuts must stay explicit.

| Step | Purpose |
| --- | --- |
| Extract read-only routing helper | Keep standard behavior unchanged while exposing reusable routing tensors. |
| Verify standard path first | Check PP and same-workload TG. For synthetic TG, inspect hot-cache hit rate because generated tokens can select a different expert distribution. |
| Rewire hot-cache PP | Consume the shared routing result in the worklist builder. |
| Re-run fixed-cache Qwen3.6 PP tests | Compare 512, 1024, 1536, 1792, 2048, 3072, and 4096 MiB cache sizes. |
| Check CUDA fusion | Confirm the standard path still triggers the intended top-k MoE fusion. |

The immediate goal is not to make dynamic graph decisions after top-k. First make routing shared and reliable; later threshold work can build on the same optimized routing path as standard llama.cpp.

Resolved Edge Cases

One-Token PP Tail

If a prompt is split as 512 + 1, the final ubatch has n_tokens == 1, but the enclosing llama_decode() call is still classified as prompt processing. Decode-only shortcuts stay off for that tail.

Multi-Token Decode

Models or modes that decode multiple tokens at once can have n_tokens > 1 during TG. Small verification batches are classified as decode, so they do not accidentally use PP worklist and reduce rules.