MoE Hot-Cache Architecture Explainer

Overall Architecture

[Diagram: four blocks labeled GPU Hot, CPU Cold, Metrics/Update, and Graph/Scheduler. The sections below explain why each block exists and which file contains the relevant code.]

Block Explanation

Overview

This feature is an additive path. Without the hot cache, llama.cpp runs normally. With the hot cache, the original MoE experts stay in the model and selected experts are additionally copied into VRAM.

Start: --cpu-moe --moe-hot-cache file.json --moe-hot-cache-max-mib -1

Learn Run

A regular run records which experts are actually used per layer.

Server Start

The JSON is parsed, experts are scored, and the best candidates are packed into a VRAM budget.

Decode

For each token, the worklist separates hot slots and cold slots into two parallel lanes.

Update

After a request, weak cache entries can be exchanged for better experts.

Where Does Each Part Live?

llama.cpp Touch Points

Model Hooks: Small guards call the hot path only when the adapter allows this graph kind.

Scheduler: Recognizes the marked hot/cold region and overlaps the branches when possible.

Server Hook: Exposes perf JSON and triggers optional post-request cache updates.

Hot-Cache Package

Adapter: Owns model support, graph kind, FFN op, and profile defaults.

Parser/Planner: Reads perf JSON, scores experts, sizes candidates, and fits the budget.

Builder/Runtime: Creates cache tensors, copies expert slices, and builds worklists.

Perf and Update

Perf State: Collects hits, hot/cold slots, timings, fallback reasons, and expert arrays.

Updater: Exchanges cache entries inside the existing capacity after a request.

Tests: Component tests cover parser, weighting, planner, builder, worklist, adapter, budget, and perf.

Hardware

CUDA Cache: Computes selected hot expert slices from the extra VRAM cache.

CPU/RAM: Keeps the original experts and computes cold misses when --cpu-moe is used.

VRAM Budget: Limits cache size and is controlled by explicit MiB or auto sizing with reserve.

Current rule: most feature logic lives under src/moe-hot-cache. The llama.cpp model files should stay as small guarded hooks, so upstream rebases remain manageable.

Before and After

Without Feature

Router: Selects experts.

MoE FFN: All selected experts run through the normal path.

Output: The result moves to the next layer.

With Hot Cache

Worklist: Splits expert slots into hot and cold work.

Parallel: GPU and CPU work at the same time.

Merge: Both results are added back together.

Performance improves only when enough slots are hot and the cold lane does not trail too far behind.

Class Diagram

The boxes show the main building blocks and the direction in which data flows.

The parser lives in llama_moe_hot_cache_perf_json_parser. It is deliberately separated so learn files, /moe-layer-perf, manual /moe-hot-cache applies, and automatic updates share the same JSON interpretation.

Data Models

These structs are the feature's internal data contract. Read them from top to bottom as JSON observations becoming a plan, then a runtime cache, then perf data for updates.

`llama_moe_hot_cache_entry`

A ranked expert reference used by weighting and planning.

Field	Meaning
`layer`	Layer index that owns the expert.
`expert`	Expert id inside that layer.
`hit_count`	Score used for sorting. It may be raw hits or a weighted score.

`llama_moe_hot_cache_expert_observation`

One expert row parsed from a learn file or from /moe-layer-perf.

Field	Meaning
`expert`	Expert id in the layer.
`hot`	How often this expert was served from the hot cache.
`cold`	How often this expert fell through to the cold path.
`raw`	Original count when the source has no hot/cold split, usually from a learn run.

`llama_moe_hot_cache_layer_observation`

Layer-level input for scoring. This is where hit data and timing pressure meet.

Field	Meaning
`layer`	Layer index.
`experts`	All observed expert rows for this layer.
`has_branch_counts`	Whether hot/cold branch data is available.
`cold_slots_per_call`	Average number of cold slots per layer call.
`parallel_*_time_per_call_us`	Join wait, hot lane, cold lane, and total MoE timings normalized per call.
`wait_per_cold_slot_us`	Pressure signal used to decide which cold misses hurt most.

`llama_moe_hot_cache_plan`

The planner output. It says what was considered and what fits into the cache budget.

Field	Meaning
`observed`	Ranked expert entries after parsing and weighting.
`selected`	Experts selected for the VRAM cache, including byte size.
`budget_bytes`	Maximum allowed cache size.
`used_bytes`	Actual size used by selected experts.
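
The relationship between `observed`, `selected`, `budget_bytes`, and `used_bytes` can be sketched as a greedy fit. This is a minimal illustration under stated assumptions, not the code in the planner: `ranked_expert`, `pack_result`, and `pack_into_budget` are hypothetical names, and the sketch ignores the per-layer dummy padding that `llama_moe_hot_cache_select()` accounts for.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a ranked candidate together with its byte cost.
struct ranked_expert {
    int      layer;
    int      expert;
    uint64_t score;  // raw hits or a weighted score
    uint64_t bytes;  // estimated cost of caching this expert
};

// Hypothetical stand-in for the planner output.
struct pack_result {
    std::vector<ranked_expert> selected;
    uint64_t budget_bytes = 0;
    uint64_t used_bytes   = 0;
};

// Walks the candidates in rank order (best first) and keeps every one that still fits.
pack_result pack_into_budget(const std::vector<ranked_expert> & ranked, uint64_t budget_bytes) {
    pack_result out;
    out.budget_bytes = budget_bytes;
    for (const ranked_expert & e : ranked) {
        // used_bytes never exceeds budget_bytes, so the subtraction cannot wrap
        if (e.bytes <= budget_bytes - out.used_bytes) {
            out.selected.push_back(e);
            out.used_bytes += e.bytes;
        }
    }
    assert(out.used_bytes <= out.budget_bytes);
    return out;
}
```

A candidate that does not fit is skipped rather than ending the loop, so a smaller expert further down the ranking can still use the remaining bytes. Whether the real planner skips or stops at that point is not stated here.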

`llama_moe_hot_cache_expert_size`

Memory accounting for a single expert candidate.

Field	Meaning
`layer`	Layer index.
`expert`	Expert id inside that layer.
`bytes`	Estimated bytes needed to cache this expert's tensor slices.

`llama_moe_hot_cache_weighting_config`

Configuration that controls how observations become cache scores.

Field	Meaning
`mode`	Scoring strategy: `flat`, `pressure`, `smooth_pressure`, `time`, or `balanced`.
`layer_curve`	How strongly the scoring curve reshapes layer priority.

`llama_moe_hot_cache_layer`

Runtime cache state for one model layer.

Field	Meaning
`ffn_*_exps`	Cached expert tensors for gate/up/down variants, depending on model layout.
`hot_id_map`	Device-side map from original expert id to cache slot id.
`hot_mask`, `cold_mask`	Device-side masks used by graph operations to split hot and cold work.
`hot_id_map_host`	Host copy of the expert-to-cache-slot map, used by CPU routing and updates.
`n_hot`, `n_expert`	Number of cached experts and total experts in the layer.
`expert_weights_scale`	Optional model scale applied to expert weights.

`llama_moe_hot_cache_worklist_field`

Column layout of the worklist tensor consumed by graph operations.

Field Group	Meaning
`HOT_ID`, `HOT_SRC_SLOT`, `HOT_TOKEN_ID`, `HOT_WEIGHT`	Compact hot-lane job description.
`COLD_ID`, `COLD_SRC_SLOT`, `COLD_TOKEN_ID`, `COLD_WEIGHT`	Compact cold-lane job description.
`HOT_EXPERT_ID`	Original expert id for hot-cache accounting and update data.
`HOT_COUNT`, `COLD_COUNT`	Per-call counts used to size and skip branch work.

`llm_graph_phase`

Core graph phase passed through llm_graph_params.

Value	Meaning
`LLM_GRAPH_PHASE_WARMUP`	Conservative warmup graph, no PP-specific hot-cache shortcuts.
`LLM_GRAPH_PHASE_PROMPT_PROCESSING`	Prompt-processing call. All ubatches from that call stay PP, including a one-token tail.
`LLM_GRAPH_PHASE_DECODE`	Token-generation call. Tiny multi-token verification batches remain TG instead of falling into PP.
`LLM_GRAPH_PHASE_UNKNOWN`	Fallback for callers that do not provide a phase; hot-cache code keeps the old token-count fallback only there.
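
The phase rules in this table amount to "explicit phase wins, token count is a last resort". A minimal sketch, assuming a hypothetical mirror enum (`graph_phase`), a hypothetical policy enum (`hot_policy`), and a one-token threshold for the fallback that is chosen purely for illustration:

```cpp
#include <cassert>

// Hypothetical mirror of the phase values listed above; the real enum is llm_graph_phase.
enum class graph_phase { warmup, prompt_processing, decode, unknown };

// Hypothetical policy classes the hot-cache code might pick between.
enum class hot_policy { conservative, pp, tg };

// Explicit phases win regardless of batch size. Only `unknown` falls back to a
// token-count guess; the threshold of one token is an assumption.
hot_policy resolve_policy(graph_phase phase, int n_tokens) {
    assert(n_tokens > 0);
    switch (phase) {
        case graph_phase::warmup:            return hot_policy::conservative;
        case graph_phase::prompt_processing: return hot_policy::pp;
        case graph_phase::decode:            return hot_policy::tg;
        case graph_phase::unknown:           break;
    }
    return n_tokens > 1 ? hot_policy::pp : hot_policy::tg;
}
```

With this shape, a one-token tail of a prompt-processing call stays PP and a four-token verification batch during generation stays TG, which is the behavior the table describes.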

`llama_moe_hot_cache`

Top-level runtime object attached to the model.

Field	Meaning
`layers`	Per-layer cache state.
`ctxs`	GGML contexts owning the cache tensors.
`bufs`	Backend buffers that hold the actual memory.
`active()`	Returns true when at least one layer has a usable cache.

`llama_moe_hot_cache_model_adapter`

The model compatibility entry. It keeps model-specific behavior out of generic code.

Field	Meaning
`arch`	llama architecture enum this adapter supports.
`name`	Human-readable adapter name for code review and diagnostics.
`graph_kind`	Which hot graph shape the model is allowed to use.
`ffn_op`	Activation operation used by the MoE FFN path.
`profile()`	Returns the optimization profile for this architecture.

`llama_moe_hot_cache_graph_profile`

Per-model graph optimization flags.

Field	Meaning
`cpu_decode_routing`	Route tiny decode batches on CPU to reduce graph overhead.
`decode_direct_merge`, `merge_sum_rows`	Choose faster merge forms when graph shape allows it.
`cold_prefix_sum`, `cold_prefix_weighted_sum`	Reduce only compact cold prefixes instead of full slot tensors.
`branch_reduce_merge`	Let branches reduce before the final merge, useful for some models.
`cpu_decode_routing_max_tokens`	Maximum tiny-batch size that may use CPU routing.
`prefix_reduce_tasks_max`	Upper bound for CPU tasks used in prefix reduction.
`pp_dense`	Allows real multi-token prompt-processing batches to use the dense hot-cache graph.
`pp_primary_cold_backend`	Places dense PP cold-branch graph work on the primary backend by default.

`llama_moe_hot_cache_update_stats`

Summary logged after dynamic cache replacement.

Field	Meaning
`active`	Whether an update was attempted.
`update_rate`	Configured fraction of hot experts that may be exchanged.
`hit_rate`	Hot-slot ratio observed in the completed request.
`hot_slots`, `cold_slots`	Total hot and cold expert slots observed.
`candidates`, `max_exchange`	Available replacements and configured exchange budget.
`exchanged`, `layers_changed`	How much the cache actually changed.

`llama_moe_layer_perf_layer`

Per-layer counters exposed through /moe-layer-perf.

Field Group	Meaning
`calls`, `expert_hits_total`	How often the layer ran and how many expert slots were selected.
`hot_slots_total`, `cold_slots_total`	Hot/cold split totals used for hit rate and update decisions.
`*_time_us`	MoE, routing, worklist, matmul, branch, gather/scatter, and merge timings.
`parallel_*`	Scheduler wall time, lane launches, overlap estimate, join wait, and fallback reasons.
`experts`, `hot_experts`, `cold_experts`	Per-expert counts for visualization, learning, manual applies, and automatic updates.

`llama_moe_layer_perf_state`

Thread-safe owner for all live MoE performance counters.

Field	Meaning
`mutex`	Protects updates to counters and layer vectors.
`n_expert`, `n_expert_used`	Shape metadata for the currently measured model.
`updates`, `overflow_resets`	Bookkeeping for counter lifecycle and overflow protection.
`active`	Whether performance collection is currently enabled.
`layers`	Vector of per-layer counters.
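
A mutex-guarded counter owner of this kind can be sketched in a few lines. The class and member names below (`perf_state`, `layer_counters`, `record`, `hit_rate`) are hypothetical, and the hit-rate formula hot / (hot + cold) is an assumption based on the description of `hit_rate` as a hot-slot ratio.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical subset of the per-layer counters.
struct layer_counters {
    uint64_t calls            = 0;
    uint64_t hot_slots_total  = 0;
    uint64_t cold_slots_total = 0;
};

class perf_state {
public:
    explicit perf_state(size_t n_layers) : layers_(n_layers) {}

    // Called once per layer execution; the lock keeps concurrent writers consistent.
    void record(size_t layer, uint64_t hot_slots, uint64_t cold_slots) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(layer < layers_.size());
        layer_counters & c = layers_[layer];
        c.calls++;
        c.hot_slots_total  += hot_slots;
        c.cold_slots_total += cold_slots;
    }

    // hot / (hot + cold) over all layers; 0 when nothing was recorded.
    double hit_rate() const {
        std::lock_guard<std::mutex> lock(mutex_);
        uint64_t hot = 0, cold = 0;
        for (const layer_counters & c : layers_) {
            hot  += c.hot_slots_total;
            cold += c.cold_slots_total;
        }
        return hot + cold == 0 ? 0.0 : (double) hot / (double) (hot + cold);
    }

private:
    mutable std::mutex          mutex_;
    std::vector<layer_counters> layers_;
};
```

Taking the lock on every `record()` call is the kind of cost that the `update` and `off` perf modes exist to reduce.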

Code Reference

This section supersedes the old Markdown code reference. The old file was written before the separation into src/moe-hot-cache/; the tables below describe the current ownership boundaries and entry points.

`llama-moe-hot-cache.cpp`

Top-level lifecycle orchestration. It wires parser, weighting, budget, planner, builder, and updater together.

Entry Point	Responsibility
`llama_moe_hot_cache_init()`	Reads the JSON, scores observations, computes the budget, selects experts, and calls the builder.
`llama_moe_hot_cache_init_after_model_load()`	Initializes fixed-size caches before context memory when `max_mib > 0`.
`llama_moe_hot_cache_init_after_context_memory()`	Initializes auto-sized caches after context memory when `max_mib == -1`.
`llama_moe_hot_cache_update_from_perf_json()`	Parses live perf data, scores current observations, and delegates replacement to the updater.
`llama_moe_hot_cache_apply_json()`	Parses a manual `/moe-hot-cache` payload and applies the full delta budget independent of the automatic update rate.
`llama_moe_hot_cache_layer_active_for_graph()`	Checks adapter support, layer bounds, runtime cache presence, and graph kind compatibility.

`llama-moe-hot-cache-parser.*`

The JSON parser class shared by learn files, /moe-layer-perf, manual /moe-hot-cache applies, and automatic updates.

Method	Responsibility
`parse_observations()`	Reads layers, `experts`, `hot_experts`, `cold_experts`, and timing fields into typed observations.
`parse_enabled_layer_slots()`	Reads only the hot/cold slot totals needed to report hit rate and update stats.
`llama_moe_hot_cache_perf_json_layer_slots`	Compact layer, hot-slot, cold-slot tuple used by the updater.

`llama-moe-hot-cache-weighting.cpp`

Turns observations into sortable expert scores. This is model-neutral and used by initial fill and dynamic update.

Method / Mode	Responsibility
`parse_mode()`, `mode_name()`	Maps CLI/env names to canonical weighting modes.
`default_config()`	Builds the default config; current default is `flat` with layer curve `0.5`.
`score_observations()`	Public scoring entry used by cache init and update.
`flat`	Interleaves best experts across layers and is the current default because it gives stable layer coverage under tight VRAM.
`pressure`, `smooth_pressure`, `time`, `balanced`	Alternative curves that favor slow layers, smooth outliers, total MoE time, or stronger per-layer distribution.
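
One plausible reading of "interleaves best experts across layers" is a round-robin over per-layer rankings: every layer's best expert first, then every layer's second best, and so on. The sketch below implements that reading only; `expert_hits`, `ranked_ref`, and `interleave_flat` are hypothetical names, and the layer curve is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct expert_hits { int expert; uint64_t hits; };
struct ranked_ref  { int layer; int expert; uint64_t hits; };

// per_layer[l] holds the observed experts of layer l (taken by value so it can be sorted).
std::vector<ranked_ref> interleave_flat(std::vector<std::vector<expert_hits>> per_layer) {
    size_t max_rank = 0;
    for (auto & experts : per_layer) {
        // best expert of each layer first; stable so ties keep their input order
        std::stable_sort(experts.begin(), experts.end(),
                         [](const expert_hits & a, const expert_hits & b) { return a.hits > b.hits; });
        max_rank = std::max(max_rank, experts.size());
    }
    std::vector<ranked_ref> out;
    for (size_t rank = 0; rank < max_rank; ++rank) {
        for (size_t l = 0; l < per_layer.size(); ++l) {
            if (rank < per_layer[l].size()) {
                out.push_back({(int) l, per_layer[l][rank].expert, per_layer[l][rank].hits});
            }
        }
    }
    return out;
}
```

Under a tight budget this ordering gives every layer one cached expert before any layer gets a second, which is one way to obtain the stable layer coverage the `flat` mode is credited with.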

`llama-moe-hot-cache-planner.cpp` and `llama-moe-hot-cache-budget.cpp`

Memory accounting and budget selection. These files decide what can physically fit.

Method	Responsibility
`llama_moe_hot_cache_tensor_expert_bytes()`	Computes the byte cost of one expert slice from a tensor.
`llama_moe_hot_cache_collect_expert_sizes()`	Collects candidate expert sizes from model layers that actually have MoE expert tensors.
`llama_moe_hot_cache_select()`	Packs ranked experts into the byte budget while accounting for per-layer dummy padding.
`llama_moe_hot_cache_select_multi()`	Packs ranked experts into up to three fixed expert lanes. `warm` fills lanes in order; `hot-even` balances selected experts per layer across lanes while enforcing each lane budget.
`llama_moe_hot_cache_select_gpu_dev()`	Selects the GPU backend device that should own the hot-cache buffer.
`llama_moe_hot_cache_resolve_gpu_dev()`	Resolves optional expert-lane device names independently from normal model offload devices.
`llama_moe_hot_cache_auto_budget_bytes()`	Computes remaining VRAM for `--moe-hot-cache-max-mib -1` after subtracting a safety reserve. The primary lane keeps the conservative `1024 MiB` default reserve; second and third worker lanes default to a `512 MiB` reserve.
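
The auto-sizing arithmetic can be sketched as "free VRAM minus reserve, never below zero". The function below is a hypothetical illustration, not the real signature; it treats the 1024 MiB and 512 MiB defaults as reserve sizes, and it leaves out how free VRAM is queried from the backend.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t MiB = 1024ull * 1024ull;

// free_bytes:  VRAM still free on the lane's device after context memory is allocated.
// reserve_mib: safety reserve for this lane (1024 for the primary lane, 512 for worker lanes
//              under the defaults described above).
uint64_t auto_budget_bytes(uint64_t free_bytes, uint64_t reserve_mib) {
    const uint64_t reserve = reserve_mib * MiB;
    // clamp at zero instead of wrapping when the reserve exceeds what is free
    return free_bytes > reserve ? free_bytes - reserve : 0;
}
```

For example, a device with 3000 MiB free and the primary-lane reserve yields a 1976 MiB budget, while a worker lane with only 400 MiB free yields no cache at all.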

`llama-moe-hot-cache-builder.*`

Converts a plan into live cache tensors and fills the VRAM copy.

Method	Responsibility
`group_selected_by_layer()`	Groups selected experts by layer so each layer can get its own compact cache tensors.
`summarize_selected_layers()`	Creates startup log stats such as active layers, min/max hot experts, and average hot experts.
`copy_expert_slice()`, `copy_scale_slice()`	Copies expert and quantization-scale slices from the original model tensors into the cache buffer.
`set_tensor_i32_1d()`, `set_tensor_f32_1d()`	Updates map and mask tensors during build and dynamic replacement.
`llama_moe_hot_cache_build()`	Allocates contexts, buffers, per-layer tensors, maps, masks, dummy slots, and host maps.
`llama_moe_hot_cache_build_multi()`	Allocates one context and buffer per expert lane, records expert-lane maps, and exposes the lane devices to the runtime scheduler without adding them to normal layer offload.

`llama-moe-hot-cache-worklist.*`

Builds the compact per-layer job list that tells the graph which slots are hot and which are cold.

Method	Responsibility
`llama_moe_hot_cache_build_worklist()`	Consumes already selected Top-K IDs and weights, then writes compact hot/cold slot lists.
`llama_moe_hot_cache_build_worklist_from_logits()`	For tiny decode batches, computes routing directly from logits on CPU and writes the same worklist shape.
`LLAMA_MOE_HOT_CACHE_WORKLIST_FIELD_*`	Defines the tensor fields for hot IDs, cold IDs, source slots, token IDs, weights, and counts. Multi-device decode adds fixed hot1/hot2 fields rather than a generic dynamic lane layout.

`llama-moe-hot-cache-graph.cpp`

Builds the GGML hot/cold graph and keeps the hot and cold branches numerically equivalent to the normal MoE path.

Function	Responsibility
`*_build_worklist_op()`	Wraps the C++ worklist builders as custom GGML nodes.
`llama_moe_hot_cache_graph_phase_from_llm()`	Maps the core `llm_graph_phase` to the hot-cache policy phase.
`sum_prefix_rows`, `sum_weighted_prefix_rows`, `first_row_input`	Decode and prefix-reduce helpers that reduce cold-lane merge overhead.
`set_mul_mat_id_flags()`	Stores hot-cache flags in `ggml_mul_mat_id` operation params.
`build_lora_mm_id()`	Builds LoRA-compatible `mul_mat_id` nodes for selected expert IDs.
`build_moe_ffn_with_ids()`	Common FFN core for hot and cold branches, including gate/up/down variants, activation, scales, weights, and reduce.
`build_moe_hot_from_logits()`	Generic logits-based hot-cache graph used by Gemma4, Qwen3Next, Mellum, GPT-OSS, DeepSeek2-family exports, GLM-DSA, and GLM4 MoE.
`build_moe_hot_multi_from_logits()`	Decode multi-lane graph. Each expert GPU lane computes and locally reduces its assigned cached experts to `[n_embd,n_tokens]`; the cold CPU fallback remains available and the final add follows the normal layer device, which is the primary graph GPU in split-mode none setups.
`llama_moe_hot_cache_build_moe_hot_pp_dense_from_logits()`	Single-lane dense PP graph for logits adapters. It keeps PP hot-cache execution separate from one-token decode shortcuts.
`llama_moe_hot_cache_build_moe_hot_multi_pp_dense_from_logits()`	Multi-lane dense PP graph for up to three cached expert lanes plus the cold lane.
`qwen35 build_layer_ffn_hot()`	Qwen35-specific graph that builds router logits inside the hot-cache path.

`llama-moe-hot-cache-adapter.*`

The compatibility boundary. New models opt in here; unsupported models keep the normal llama.cpp graph.

Item	Responsibility
`ADAPTERS`	Central allow-list for `QWEN35MOE`, `QWEN3NEXT`, `GEMMA4`, `MELLUM`, `OPENAI_MOE`, `DEEPSEEK2`, `GLM_DSA`, and `GLM4_MOE`.
`qwen35_profile()`	Qwen35 profile: qwen-specific graph kind, single-token CPU decode routing, proven decode shortcuts, dense PP, and primary-backend PP cold placement.
`qwen3next_profile()`	Qwen3Next profile: logits graph, tiny multi-token routing up to four tokens, conservative prefix tasks, and dense PP.
`gemma4_profile()`	Gemma4 profile: logits graph, GELU FFN, branch-reduce option, decode merge shortcuts, and dense PP.
`mellum_profile()`	Mellum profile: logits graph, SILU FFN, conservative Qwen-style decode shortcuts, and dense PP.
`openai_moe_profile()`	GPT-OSS profile: logits graph, OpenAI SwiGLU MoE FFN op, single-token decode routing, and shared dense PP behavior.
`deepseek2_profile()`	DeepSeek2-family profile: logits graph with hparams-driven routing semantics, optional expert bias, and dense PP.
`glm_dsa_profile()`	Experimental GLM-DSA profile for GLM-5.2 GGUFs: reuses the DeepSeek2 graph hook while preserving sigmoid, bias, normalization, and scale routing semantics. Validated only with the small GLM-5.2-0.8B-A0.8B GGUF so far; the full GLM-5.2 model is untested.
`glm4_moe_profile()`	Native GLM4 MoE profile: logits graph, SILU experts, GLM routing semantics handled before the generic dense PP worklist.
`llama_moe_hot_cache_graph_tweaks`	Reads env-controlled graph toggles such as parallel mode, merge mode, CPU routing, prefix reduction, and PP reduce merge.
`llama_moe_hot_cache_pp_policy`	Reads PP dense and PP cold-backend overrides and combines them with adapter profile defaults.
`find_model_adapter()`, `supports_graph_kind()`	Enforces that a model can only enter the graph kind it registered for.
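
The allow-list check can be sketched with a small static table. The enums and the struct below are hypothetical stand-ins for the real architecture, graph-kind, and adapter types; the FFN op and the profile are omitted to keep the sketch short, and only the lookup logic is meant to carry over.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-ins for the real architecture and graph-kind enums.
enum class arch { qwen35moe, gemma4, newmodel, other };
enum class graph_kind { qwen35_ffn, logits };

struct model_adapter {
    arch         model_arch;
    const char * name;
    graph_kind   kind;
};

// Central allow-list: one entry per supported model.
static const model_adapter ADAPTERS[] = {
    { arch::qwen35moe, "qwen35moe", graph_kind::qwen35_ffn },
    { arch::gemma4,    "gemma4",    graph_kind::logits     },
    { arch::newmodel,  "newmodel",  graph_kind::logits     },
};

// Returns the adapter for an architecture, or nullptr when the model is unsupported.
const model_adapter * find_model_adapter(arch a) {
    for (const model_adapter & ad : ADAPTERS) {
        if (ad.model_arch == a) {
            return &ad;
        }
    }
    return nullptr;
}

// A model may only enter the graph kind it registered for.
bool supports_graph_kind(arch a, graph_kind k) {
    const model_adapter * ad = find_model_adapter(a);
    return ad != nullptr && ad->kind == k;
}
```

An unsupported architecture and a mismatched graph kind both return `false`, so the caller falls through to the normal llama.cpp graph in either case.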

`llama-moe-hot-cache-perf.cpp`, `llama-moe-hot-cache-perf-state.cpp`, `llama-moe-hot-cache-perf-json.cpp`

Performance collection pipeline for learning, UI visualization, manual applies, and automatic updates.

Component	Responsibility
`perf-state`	Owns thread-safe counters, shape initialization, overflow protection, and per-layer hit/timing storage.
`perf-nodes`	Classifies GGML nodes by name into routing, worklist, branch, matmul, gather/scatter, merge, and update categories.
`perf-reader`	Reads Top-K and worklist tensors back from backend memory to count experts and slots, including all hot lanes in dense PP multi-lane worklists.
`perf-json`	Serializes the current state into `/moe-layer-perf` JSON, including disabled mode and derived hot/cold slot totals when branch counters are the best available source.
`perf.cpp`	Coordinates modes, eval callback, graph-compute begin/end, scheduler metric collection, and public C API functions.
`full`, `update`, `off`	Runtime modes: full diagnostics, update-only counters, or no hot-cache perf counters.

`llama-moe-hot-cache-updater.*`

Automatic post-request replacement and manual runtime apply. It changes contents and maps, not tensor shapes.

Method	Responsibility
`current_hot_experts()`	Reconstructs which original experts are currently cached in a layer.
`plan_layer_replacements()`	Builds evict/add candidates from current cache contents and observed expert scores.
`sort_replacement_candidates()`	Sorts candidates by gain so high-value replacements happen first.
`update_max_exchange()`	Caps automatic updates by the configured rate; manual `/moe-hot-cache` applies use rate `1.0`.
`update_from_scored_observations()`	Copies new expert slices into existing cache slots and updates hot/cold maps and masks for both automatic and manual replacement.
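
The planning part of an update can be sketched for a single layer: cap the number of swaps, pair the weakest cached experts with the strongest uncached ones, and keep only swaps that gain something. All names below are hypothetical, rounding the cap down is an assumption, and the real updater may rank candidates across layers rather than per layer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

struct scored_expert { int expert; uint64_t score; };
struct replacement   { int evict; int add; uint64_t gain; };

// Upper bound on swaps: a fraction of the cached experts, rounded down (assumed rounding).
size_t max_exchange(size_t n_hot, double update_rate) {
    assert(update_rate >= 0.0 && update_rate <= 1.0);
    return (size_t) ((double) n_hot * update_rate);
}

// Pairs the weakest cached experts with the strongest uncached ones, keeps only
// swaps with positive gain, and stops at the exchange cap. The slot count never changes.
std::vector<replacement> plan_replacements(std::vector<scored_expert> cached,
                                           std::vector<scored_expert> uncached,
                                           double update_rate) {
    const size_t cap = max_exchange(cached.size(), update_rate);
    std::sort(cached.begin(), cached.end(),
              [](const scored_expert & a, const scored_expert & b) { return a.score < b.score; });
    std::sort(uncached.begin(), uncached.end(),
              [](const scored_expert & a, const scored_expert & b) { return a.score > b.score; });
    std::vector<replacement> out;
    const size_t n = std::min({cap, cached.size(), uncached.size()});
    for (size_t i = 0; i < n; ++i) {
        if (uncached[i].score <= cached[i].score) {
            break;  // every remaining pair can only be worse
        }
        out.push_back({cached[i].expert, uncached[i].expert, uncached[i].score - cached[i].score});
    }
    return out;
}
```

Passing `update_rate = 1.0` corresponds to the manual `/moe-hot-cache` apply described above: the cap no longer limits the exchange, but swaps without positive gain are still skipped.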

Model, Server, and Scheduler Hooks

Small integration points connect the separated hot-cache package to llama.cpp runtime.

File	Responsibility
`src/models/qwen35moe.cpp`	Uses the guarded `qwen35_ffn` path when the adapter and layer cache are active.
`src/models/qwen3next.cpp`	Builds router logits, then calls the generic logits hot-cache path when active.
`src/models/gemma4.cpp`	Calls the generic logits hot-cache path with Gemma's GELU FFN operation.
`src/models/mellum.cpp`	Builds router logits, then calls the generic logits hot-cache path with Mellum's SILU FFN operation.
`src/models/deepseek2.cpp`	Builds DeepSeek2-family router logits and bias-adjusted selection inputs, then calls the generic logits hot-cache path when active.
`src/models/glm-dsa.cpp`	Loads GLM-DSA tensors for GLM-5.2 and reuses the DeepSeek2 graph hook guarded by the GLM-DSA adapter. The hot-cache path requires a single expert group; `expert_group_count > 1` is bypassed until grouped router masking is implemented and tested.
`src/models/glm4-moe.cpp`	Builds GLM4 MoE router logits and selection inputs, then calls the generic logits hot-cache path when active.
`src/models/openai-moe.cpp`	Builds the GPT-OSS MoE hot path through the OpenAI MoE graph adapter.
`tools/server/server.cpp`, `server-models.cpp`	Expose GET/POST `/moe-layer-perf` and `/moe-hot-cache`, then route requests to the loaded model.
`tools/server/server-context.cpp`	Writes perf output, marks automatic updates pending, queues manual applies until slots are idle, skips automatic updates while a manual apply is pending, and runs cache replacement.
`ggml-backend-moe-hot-cache.*`	Scheduler-side support for marked hot/cold parallel regions, metrics, and fallback counters.

Unit Tests

Each extracted responsibility has a focused test file so future model additions can be reviewed without broad runtime experiments first.

Test Group	Coverage
`test-moe-hot-cache-parser`	Perf JSON parsing, enabled layer slot parsing, and learn/live JSON compatibility.
`test-moe-hot-cache-weighting`	Mode parsing, flat scoring, pressure/time behavior, and layer-curve effects.
`test-moe-hot-cache-planner`, `budget`, `builder`	Byte accounting, budget fit, grouping, selected layer stats, tensor copy helpers, and cache build behavior.
`test-moe-hot-cache-worklist`	Hot/cold split, compact slot layout, fixed multi-lane routing fields, counts, and logits-based routing output.
`test-moe-hot-cache-adapter`	Registered models, graph-kind isolation, profile defaults, and unsupported graph rejection.
`test-moe-hot-cache-perf-*`	State lifecycle, node classification, dense PP multi-lane tensor readback, and JSON serialization.
`test-moe-hot-cache-updater`	Candidate planning, exchange limits, sorting, and map/mask-safe replacement.
`test-moe-hot-cache-runtime-apply`	Manual runtime apply semantics, full-rate exchange limits, and parser/updater compatibility.
`test-moe-hot-cache`	Aggregate CMake and CTest target for the complete hot-cache test group.

Runtime Flow Per Layer

This flow is why adapters and graph kinds matter: a model may only use the graph path it explicitly registered for.

Prompt processing can send many tokens through the same hot/cold path. Decode is usually one token, while Qwen3Next has shown small batches of up to four tokens.

Senior Engineering Notes

These are the invariants and risk points that matter when reviewing or extending the hot-cache path.

The adapter is the compatibility boundary. Core hooks should only ask whether arch + graph_kind is supported. Model-specific graph decisions belong in llama-moe-hot-cache-adapter.cpp, not in scattered model code.
The scheduler relies on graph shape. The split must remain hot lane, cold lane, then join. If node order changes, the parallel region should fall back instead of silently producing a wrong graph.
The cache is additive memory. The original CPU/RAM tensors are still required. The VRAM cache holds selected expert slices plus maps and masks; dynamic update mutates contents and maps, not tensor shapes.
Prompt processing and decode have different economics. PP amortizes routing and merge work over many tokens. Decode is overhead-sensitive, so tiny shortcuts such as CPU routing and direct merge can matter more than raw matmul speed. The graph phase is explicit, so a one-token PP tail and a tiny multi-token decode batch do not accidentally swap paths.
Perf collection changes runtime cost. Full counters are best for diagnosis. Update mode keeps only what dynamic replacement needs. Off mode should remove measurable counter overhead from the hot path.
Dynamic update must preserve capacity. Updates should exchange experts inside the existing budget. Reallocation during request processing would make latency and OOM behavior much harder to reason about.

Hard graph contract: Hot branch, cold branch, join. Any mismatch increments fallback counters and should be visible in /moe-layer-perf.

Model isolation: Qwen, Gemma, Qwen3Next, Mellum, GPT-OSS, DeepSeek2, GLM-DSA, and GLM4 MoE use adapter profiles so one model's shortcut does not leak into another model.

Memory safety: Auto sizing must leave enough reserve for KV, compute buffers, warmup, and optional draft/MTP contexts.

Correctness first: Gibberish output is usually a graph-shape, expert-order, weight, or shared-expert integration bug, not a cache hit-rate issue.

Review targets: Check adapter registration, graph phase propagation, layer-active guard, worklist construction, merge shape, fallback counters, and unit tests.

Recommended validation: Run a learn pass, a hot-cache pass with perf full, a pass with perf off, and compare output quality plus TG/PP rates.

Add a New MoE Model

The goal is one new adapter and a very small model hook, with no scattered logic in the llama core.

1. Confirm the MoE shape. In the model code, locate router logits, n_expert, n_expert_used, and the gate, up, and down expert tensors.
2. Choose the graph kind. Use logits when the model already has router logits before the MoE FFN. Use a custom graph kind only when the hot graph must build the router logits itself.
3. Register the adapter. Add the model to ADAPTERS and provide arch, name, graph_kind, and ffn_op. Example entry: `{ LLM_ARCH_NEWMODEL, "newmodel", llama_moe_hot_cache_graph_kind::logits, LLM_FFN_SILU }`
4. Encapsulate the profile. Put CPU routing limits, direct merge, branch reduce, dense PP, and PP cold-backend decisions into the adapter profile.
5. Keep the model hook small. Model code should only check whether the layer is active for this graph kind. The actual hot/cold logic stays in the hot-cache folder. Example guard: `if (llama_moe_hot_cache_layer_active_for_graph(model, il, llama_moe_hot_cache_graph_kind::logits)) { cur_moe = build_layer_moe_hot(cur_moe, logits, il); }`
6. Extend tests. At minimum, tests/test-moe-hot-cache-adapter.cpp must verify the new adapter. Add specialized tests if JSON parsing, weighting, or planning behaves differently.
7. Validate with a learn run. First write an expert list without the hot cache, then start with the hot cache and inspect /moe-layer-perf for hits, fallbacks, and lane timings.

Source of Truth: llama-moe-hot-cache-adapter.cpp decides which model may use which hot graph.

No Side Effects: A Qwen-specific graph kind must not automatically apply to Gemma or a new model.

Fallback Is Normal: If no adapter matches or the layer is inactive, the existing model path continues.

PP Differs From Decode: Prompt processing can have large batches and decode usually has tiny batches, but the phase is now explicit instead of being derived only from token count.

Minimum Test: ctest --test-dir build -R '^test-moe-hot-cache$' --output-on-failure

Build: cmake --build build -j8 --target llama-server

LLM Agent Runbook: Add a Model

This section is written for coding agents. Follow it in order and keep the patch small, reviewable, and isolated.

1. Identify the MoE implementation before editing. Search the model file for build_moe_ffn, ffn_gate_inp, n_expert, n_expert_used, and expert tensors. Do not assume that Qwen, Gemma, Mellum, GPT-OSS, DeepSeek2, GLM, and the new model share a tensor layout.
2. Choose the smallest supported graph kind. Prefer llama_moe_hot_cache_graph_kind::logits when router logits are already available. Create a new graph kind only if the model cannot call the generic logits builder safely.
3. Add exactly one adapter entry. Edit src/moe-hot-cache/llama-moe-hot-cache-adapter.cpp. Add the LLM_ARCH_*, adapter name, graph kind, and LLM_FFN_* operation. Put model-specific shortcut choices in the profile function.
4. Add one tiny model hook. In the model file, guard the hot path with llama_moe_hot_cache_layer_active_for_graph(model, il, graph_kind). The hook should route to a hot-cache builder and otherwise leave the existing path unchanged.
5. Preserve non-MoE branches. Shared experts, dense FFN branches, residual adds, norms, and multimodal side paths must remain exactly where they were unless the model specifically requires otherwise.
6. Add focused tests. Update tests/test-moe-hot-cache-adapter.cpp so unsupported graph kinds are rejected and the new adapter profile has expected defaults. Add a specialized test only if parser, planner, weighting, or worklist behavior changes.
7. Validate in three stages. First build and run unit tests. Then run a learn pass to create JSON. Then run hot-cache inference and inspect /moe-layer-perf for hit rate, fallbacks, lane timings, and output quality.
8. Stop when correctness is unclear. If output becomes gibberish, fallbacks climb, graph shape changes unexpectedly, or shared-expert behavior is uncertain, stop and document the failure instead of adding more shortcuts.

Preferred edit locations: src/moe-hot-cache/*, one model hook file, and targeted tests/test-moe-hot-cache-*.cpp.

Avoid broad core edits: Do not modify scheduler, GGML CUDA kernels, or unrelated model code unless the requested model cannot work without it.

Required guard: Use llama_moe_hot_cache_layer_active_for_graph, not a loose architecture check.

Build command: cmake --build build -j8 --target test-moe-hot-cache llama-server

Test command: ctest --test-dir build -R '^test-moe-hot-cache$' --output-on-failure

Runtime checks: Watch parallel_fallbacks, hot_slots_total, cold_slots_total, branch timings, and whether generated text remains sane.

Patch review question: Could this change alter an existing supported adapter when the new model is not loaded? If yes, isolate it further.

Files for Orientation

File	Responsibility	Plain-English Explanation
`src/moe-hot-cache/llama-moe-hot-cache.cpp`	Lifecycle orchestration and public hot-cache entry points	Coordinates parser, weighting, budget, planner, builder, updater, and layer-active checks.
`src/moe-hot-cache/llama-moe-hot-cache-parser.cpp`	Parser class for perf JSON and slot totals	Turns JSON into typed observations used by cache construction and updates.
`src/moe-hot-cache/llama-moe-hot-cache-adapter.cpp`	Model adapter, graph kind, profile, and tweak defaults	This is the central allow-list: only registered models may use the hot graph.
`src/moe-hot-cache/llama-moe-hot-cache-pp.cpp`	Prompt-processing policy	Combines graph phase, adapter profile, dense PP, cold backend, worklist order, and PP reduce decisions.
`src/moe-hot-cache/llama-moe-hot-cache.h`	Data structures for cache, layers, worklists, and update stats	This is the shared vocabulary of the feature.
`src/moe-hot-cache/llama-moe-hot-cache-graph.cpp`	Hot/cold graph, worklist nodes, and merge paths	Turns the feature design into a GGML graph that can compute.
`src/moe-hot-cache/llama-moe-hot-cache-builder.cpp`	Create cache tensors and copy expert slices	Translates the plan into real VRAM buffers.
`src/moe-hot-cache/llama-moe-hot-cache-worklist.cpp`	Build hot/cold slot lists from router results	Creates the per-layer list consumed by the hot and cold lanes.
`src/moe-hot-cache/llama-moe-hot-cache-perf.cpp`	Perf mode, graph callbacks, and scheduler metric collection	Coordinates when counters are active and connects GGML execution back to perf state.
`src/moe-hot-cache/llama-moe-hot-cache-perf-state.cpp`	Thread-safe state for performance counters	Collects hits, hot/cold slots, timings, and fallback counters.
`src/moe-hot-cache/llama-moe-hot-cache-perf-json.cpp`	JSON serialization for `/moe-layer-perf`	Turns the current perf snapshot into the JSON used by the UI, learn files, manual applies, and automatic updates.
`src/moe-hot-cache/llama-moe-hot-cache-perf-nodes.cpp`	GGML node classification	Maps tensor names to routing, worklist, branch, matmul, gather/scatter, merge, and update timing buckets.
`src/moe-hot-cache/llama-moe-hot-cache-perf-reader.cpp`	Backend tensor readback for perf counters	Reads Top-K and worklist tensors after execution so experts and hot/cold slots can be counted.
`src/moe-hot-cache/llama-moe-hot-cache-weighting.cpp`	Generic expert weighting and layer curves	Turns hits and layer pressure into a cache ranking.
`src/moe-hot-cache/llama-moe-hot-cache-planner.cpp`	Expert sizes and budget selection	Decides which experts fit into the given MiB budget.
`src/moe-hot-cache/llama-moe-hot-cache-budget.cpp`	GPU selection and automatic VRAM budget	When `-1` is used, this computes how much VRAM remains after reserves.
`src/moe-hot-cache/llama-moe-hot-cache-updater.cpp`	Automatic and manual cache replacement	Decides which hot experts are evicted and which experts move in without changing cache tensor shapes.
`src/models/qwen35moe-hot-cache.cpp`	Qwen compatibility wrapper for weighting	Old Qwen API names remain stable while the logic lives in the hot-cache folder.
`src/models/gemma4-hot-cache.cpp`	Gemma4 weighting wrapper	Uses generic weighting while allowing the Gemma-specific layer-curve environment override.
`src/models/qwen35moe.cpp`, `src/models/gemma4.cpp`, `src/models/qwen3next.cpp`, `src/models/mellum.cpp`, `src/models/deepseek2.cpp`, `src/models/glm-dsa.cpp`, `src/models/glm4-moe.cpp`, `src/models/openai-moe.cpp`	Small model hooks	These files should only check whether the adapter allows the matching graph kind.
`ggml/src/ggml-backend-moe-hot-cache.inc`	Scheduler extension for parallel hot/cold regions	Connects the parallel execution path to GGML.
`ggml/include/ggml-backend-moe-hot-cache.h`	Public scheduler metric structures	Defines the parallel-region perf data read by the llama-side perf collector.
`tools/server/server-context.cpp`	Perf file output and runtime cache updates	Marks automatic updates pending, logs hit rate, writes perf output, queues manual applies until slots are idle, and runs slot replacement.
`tools/server/server.cpp`, `tools/server/server-models.cpp`	`/moe-layer-perf` and `/moe-hot-cache` routing	Expose the live perf JSON, runtime perf-mode switch, and manual hot-cache apply endpoint through HTTP.
`common/common.h`, `include/llama.h`	CLI and public parameter storage	Carry hot-cache path, max MiB, auto reserve, weighting, layer curve, and perf output settings into model initialization.
`tests/test-moe-hot-cache-*.cpp`	Specialized unit tests per component	New models should at least get adapter tests, plus targeted tests for special logic. The aggregate target is `test-moe-hot-cache`.