Learn Run
A regular run records which experts are actually used per layer.
This feature is an additive path. Without the hot cache, llama.cpp runs normally. With the hot cache, the original MoE experts stay in the model and selected experts are additionally copied into VRAM.
Start: --cpu-moe --moe-hot-cache file.json --moe-hot-cache-max-mib -1
The JSON is parsed, experts are scored, and the best candidates are packed into a VRAM budget.
For each token, the worklist separates hot slots and cold slots into two parallel lanes.
After a request, weak cache entries can be exchanged for better experts.
--cpu-moe is used.
src/moe-hot-cache. The llama.cpp model files should stay as small guarded hooks, so upstream rebases remain manageable.
The boxes show the main building blocks and the direction in which data flows.
llama_moe_hot_cache_perf_json_parser.
It is deliberately separated so learn files, /moe-layer-perf, manual /moe-hot-cache applies, and automatic updates share the same JSON interpretation.
These structs are the feature's internal data contract. Read them from top to bottom as JSON observations becoming a plan, then a runtime cache, then perf data for updates.
llama_moe_hot_cache_entry
A ranked expert reference used by weighting and planning.
| Field | Meaning |
|---|---|
layer | Layer index that owns the expert. |
expert | Expert id inside that layer. |
hit_count | Score used for sorting. It may be raw hits or a weighted score. |
llama_moe_hot_cache_expert_observation
One expert row parsed from a learn file or from /moe-layer-perf.
| Field | Meaning |
|---|---|
expert | Expert id in the layer. |
hot | How often this expert was served from the hot cache. |
cold | How often this expert fell through to the cold path. |
raw | Original count when the source has no hot/cold split, usually from a learn run. |
llama_moe_hot_cache_layer_observation
Layer-level input for scoring. This is where hit data and timing pressure meet.
| Field | Meaning |
|---|---|
layer | Layer index. |
experts | All observed expert rows for this layer. |
has_branch_counts | Whether hot/cold branch data is available. |
cold_slots_per_call | Average number of cold slots per layer call. |
parallel_*_time_per_call_us | Join wait, hot lane, cold lane, and total MoE timings normalized per call. |
wait_per_cold_slot_us | Pressure signal used to decide which cold misses hurt most. |
llama_moe_hot_cache_plan
The planner output. It says what was considered and what fits into the cache budget.
| Field | Meaning |
|---|---|
observed | Ranked expert entries after parsing and weighting. |
selected | Experts selected for the VRAM cache, including byte size. |
budget_bytes | Maximum allowed cache size. |
used_bytes | Actual size used by selected experts. |
llama_moe_hot_cache_expert_size
Memory accounting for a single expert candidate.
| Field | Meaning |
|---|---|
layer | Layer index. |
expert | Expert id inside that layer. |
bytes | Estimated bytes needed to cache this expert's tensor slices. |
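
The plan and size structs above suggest a greedy packing step: walk the ranked experts from best to worst and keep each one that still fits the byte budget. The following is a minimal sketch under that assumption. The type and function names (`ranked_expert`, `selected_expert`, `cache_plan`, `pack_into_budget`) are hypothetical simplifications, not the real definitions, and the per-layer dummy padding that the real selection accounts for is deliberately left out.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical simplified types; the names echo the tables above
// but are not the feature's real struct definitions.
struct ranked_expert   { int layer; int expert; };
struct selected_expert { int layer; int expert; uint64_t bytes; };

struct cache_plan {
    std::vector<selected_expert> selected;
    uint64_t budget_bytes = 0;
    uint64_t used_bytes   = 0;
};

// Walk the ranked list (best first) and keep every expert that still fits.
// Experts without a known size are skipped. Per-layer padding is ignored.
inline cache_plan pack_into_budget(
        const std::vector<ranked_expert> & ranked,
        const std::map<std::pair<int, int>, uint64_t> & sizes,
        uint64_t budget_bytes) {
    cache_plan plan;
    plan.budget_bytes = budget_bytes;
    for (const ranked_expert & r : ranked) {
        const auto it = sizes.find(std::make_pair(r.layer, r.expert));
        if (it == sizes.end()) {
            continue; // no size information for this candidate
        }
        if (plan.used_bytes + it->second > budget_bytes) {
            continue; // does not fit; a smaller later candidate still might
        }
        plan.selected.push_back({r.layer, r.expert, it->second});
        plan.used_bytes += it->second;
    }
    assert(plan.used_bytes <= plan.budget_bytes);
    return plan;
}
```

With a 100-byte budget and ranked candidates of 60, 50, and 40 bytes, this sketch keeps the first and third and reports `used_bytes == 100`.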
llama_moe_hot_cache_weighting_config
Configuration that controls how observations become cache scores.
| Field | Meaning |
|---|---|
mode | Scoring strategy: flat, pressure, smooth_pressure, time, or balanced. |
layer_curve | How strongly the scoring curve reshapes layer priority. |
llama_moe_hot_cache_layer
Runtime cache state for one model layer.
| Field | Meaning |
|---|---|
ffn_*_exps | Cached expert tensors for gate/up/down variants, depending on model layout. |
hot_id_map | Device-side map from original expert id to cache slot id. |
hot_mask, cold_mask | Device-side masks used by graph operations to split hot and cold work. |
hot_id_map_host | Host copy of the expert-to-cache-slot map, used by CPU routing and updates. |
n_hot, n_expert | Number of cached experts and total experts in the layer. |
expert_weights_scale | Optional model scale applied to expert weights. |
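
To illustrate the expert-to-slot map described above, here is a small sketch of a host-side lookup. It assumes the map holds one entry per original expert id and uses a negative value for "not cached"; the actual encoding is not stated in this reference. Both helper names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Build a host map for n_expert experts from the list of cached expert ids.
// Assumption: slot ids are assigned in list order and -1 means "not cached".
inline std::vector<int32_t> build_hot_id_map(int32_t n_expert,
                                             const std::vector<int32_t> & cached) {
    std::vector<int32_t> map(n_expert > 0 ? (size_t) n_expert : 0, -1);
    for (size_t slot = 0; slot < cached.size(); ++slot) {
        const int32_t expert = cached[slot];
        if (expert >= 0 && (size_t) expert < map.size()) {
            map[(size_t) expert] = (int32_t) slot;
        }
    }
    return map;
}

// Return the cache slot for an expert, or -1 when it must take the cold path.
inline int32_t hot_slot_for_expert(const std::vector<int32_t> & hot_id_map_host,
                                   int32_t expert) {
    if (expert < 0 || (size_t) expert >= hot_id_map_host.size()) {
        return -1;
    }
    return hot_id_map_host[(size_t) expert];
}
```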
llama_moe_hot_cache_worklist_field
Column layout of the worklist tensor consumed by graph operations.
| Field Group | Meaning |
|---|---|
HOT_ID, HOT_SRC_SLOT, HOT_TOKEN_ID, HOT_WEIGHT | Compact hot-lane job description. |
COLD_ID, COLD_SRC_SLOT, COLD_TOKEN_ID, COLD_WEIGHT | Compact cold-lane job description. |
HOT_EXPERT_ID | Original expert id for hot-cache accounting and update data. |
HOT_COUNT, COLD_COUNT | Per-call counts used to size and skip branch work. |
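
The field groups above can be pictured as two compact job lists plus their counts. The sketch below splits already selected Top-K ids into a hot and a cold list using a host-side expert-to-slot map. It is only an illustration: the real worklist is a tensor, and `lane_job`, `worklist`, and `split_worklist` are hypothetical names. Storing the cache slot id for hot jobs and the original expert id for cold jobs is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical compact job record standing in for the HOT_* / COLD_* columns.
struct lane_job {
    int32_t id;        // assumption: cache slot id (hot) or original expert id (cold)
    int32_t src_slot;  // position inside the token's Top-K selection
    int32_t token_id;  // token index in the batch
    float   weight;    // router weight for this selection
};

struct worklist {
    std::vector<lane_job> hot;   // hot.size() plays the role of HOT_COUNT
    std::vector<lane_job> cold;  // cold.size() plays the role of COLD_COUNT
};

// topk_ids and topk_weights are laid out token-major: n_expert_used entries per token.
inline worklist split_worklist(const std::vector<int32_t> & topk_ids,
                               const std::vector<float>   & topk_weights,
                               int32_t                      n_expert_used,
                               const std::vector<int32_t> & hot_id_map_host) {
    worklist w;
    if (n_expert_used <= 0 || topk_ids.size() != topk_weights.size()) {
        return w; // malformed input: return empty lists
    }
    const size_t k = (size_t) n_expert_used;
    for (size_t i = 0; i < topk_ids.size(); ++i) {
        const int32_t expert   = topk_ids[i];
        const int32_t token    = (int32_t) (i / k);
        const int32_t src      = (int32_t) (i % k);
        const bool    in_range = expert >= 0 && (size_t) expert < hot_id_map_host.size();
        const int32_t slot     = in_range ? hot_id_map_host[(size_t) expert] : -1;
        if (slot >= 0) {
            w.hot.push_back({slot, src, token, topk_weights[i]});
        } else {
            w.cold.push_back({expert, src, token, topk_weights[i]});
        }
    }
    return w;
}
```

Keeping both lists compact means each lane only iterates over its own jobs, and an empty list lets the corresponding branch be skipped.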
llama_moe_hot_cache
Top-level runtime object attached to the model.
| Field | Meaning |
|---|---|
layers | Per-layer cache state. |
ctxs | GGML contexts owning the cache tensors. |
bufs | Backend buffers that hold the actual memory. |
active() | Returns true when at least one layer has a usable cache. |
llama_moe_hot_cache_model_adapter
The model compatibility entry. It keeps model-specific behavior out of generic code.
| Field | Meaning |
|---|---|
arch | llama architecture enum this adapter supports. |
name | Human-readable adapter name for code review and diagnostics. |
graph_kind | Which hot graph shape the model is allowed to use. |
ffn_op | Activation operation used by the MoE FFN path. |
profile() | Returns the optimization profile for this architecture. |
llama_moe_hot_cache_graph_profile
Per-model graph optimization flags.
| Field | Meaning |
|---|---|
cpu_decode_routing | Route tiny decode batches on CPU to reduce graph overhead. |
decode_direct_merge, merge_sum_rows | Choose faster merge forms when graph shape allows it. |
cold_prefix_sum, cold_prefix_weighted_sum | Reduce only compact cold prefixes instead of full slot tensors. |
branch_reduce_merge | Let branches reduce before the final merge, useful for some models. |
cpu_decode_routing_max_tokens | Maximum tiny-batch size that may use CPU routing. |
prefix_reduce_tasks_max | Upper bound for CPU tasks used in prefix reduction. |
llama_moe_hot_cache_update_stats
Summary logged after dynamic cache replacement.
| Field | Meaning |
|---|---|
active | Whether an update was attempted. |
update_rate | Configured fraction of hot experts that may be exchanged. |
hit_rate | Hot-slot ratio observed in the completed request. |
hot_slots, cold_slots | Total hot and cold expert slots observed. |
candidates, max_exchange | Available replacements and configured exchange budget. |
exchanged, layers_changed | How much the cache actually changed. |
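
The relationships between these fields can be written as a few lines of arithmetic. This is a sketch, not the logged implementation: it assumes the hit rate is the hot share of all observed slots, and that the exchange budget is the configured fraction of currently cached experts, rounded down. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>

// Hot-slot ratio for a completed request; 0 when nothing was observed.
inline double hit_rate(uint64_t hot_slots, uint64_t cold_slots) {
    const uint64_t total = hot_slots + cold_slots;
    return total == 0 ? 0.0 : (double) hot_slots / (double) total;
}

// Assumption: the budget is update_rate * n_hot, rounded down,
// with the rate clamped to [0, 1].
inline uint64_t max_exchange(uint64_t n_hot, double update_rate) {
    const double rate = std::min(1.0, std::max(0.0, update_rate));
    return (uint64_t) ((double) n_hot * rate);
}

// The number actually exchanged can never exceed candidates or budget.
inline uint64_t exchanged(uint64_t candidates, uint64_t budget) {
    return std::min(candidates, budget);
}
```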
llama_moe_layer_perf_layer
Per-layer counters exposed through /moe-layer-perf.
| Field Group | Meaning |
|---|---|
calls, expert_hits_total | How often the layer ran and how many expert slots were selected. |
hot_slots_total, cold_slots_total | Hot/cold split totals used for hit rate and update decisions. |
*_time_us | MoE, routing, worklist, matmul, branch, gather/scatter, and merge timings. |
parallel_* | Scheduler wall time, lane launches, overlap estimate, join wait, and fallback reasons. |
experts, hot_experts, cold_experts | Per-expert counts for visualization, learning, manual applies, and automatic updates. |
llama_moe_layer_perf_state
Thread-safe owner for all live MoE performance counters.
| Field | Meaning |
|---|---|
mutex | Protects updates to counters and layer vectors. |
n_expert, n_expert_used | Shape metadata for the currently measured model. |
updates, overflow_resets | Bookkeeping for counter lifecycle and overflow protection. |
active | Whether performance collection is currently enabled. |
layers | Vector of per-layer counters. |
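
As a rough picture of how a mutex-protected counter owner like this can look, the sketch below guards a vector of per-layer counters with a single `std::mutex`. The class and member names are hypothetical, and only three of the counters from the tables above are modeled.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical subset of the per-layer counters.
struct layer_counters {
    uint64_t calls            = 0;
    uint64_t hot_slots_total  = 0;
    uint64_t cold_slots_total = 0;
};

class perf_state {
public:
    explicit perf_state(size_t n_layers) : layers_(n_layers) {}

    void set_active(bool on) {
        std::lock_guard<std::mutex> lock(mutex_);
        active_ = on;
    }

    // Record one layer call; ignored when collection is off or the index is invalid.
    void record(size_t il, uint64_t hot, uint64_t cold) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!active_ || il >= layers_.size()) {
            return;
        }
        layers_[il].calls            += 1;
        layers_[il].hot_slots_total  += hot;
        layers_[il].cold_slots_total += cold;
    }

    // Return a copy so callers never read counters while they are being updated.
    layer_counters snapshot(size_t il) const {
        std::lock_guard<std::mutex> lock(mutex_);
        return il < layers_.size() ? layers_[il] : layer_counters{};
    }

private:
    mutable std::mutex          mutex_;
    bool                        active_ = true;
    std::vector<layer_counters> layers_;
};
```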
This section supersedes the old Markdown code reference. The old file was written before
the separation into src/moe-hot-cache/; the tables below describe the current
ownership boundaries and entry points.
llama-moe-hot-cache.cpp
Top-level lifecycle orchestration. It wires parser, weighting, budget, planner, builder, and updater together.
| Entry Point | Responsibility |
|---|---|
llama_moe_hot_cache_init() | Reads the JSON, scores observations, computes the budget, selects experts, and calls the builder. |
llama_moe_hot_cache_init_after_model_load() | Initializes fixed-size caches before context memory when max_mib > 0. |
llama_moe_hot_cache_init_after_context_memory() | Initializes auto-sized caches after context memory when max_mib == -1. |
llama_moe_hot_cache_update_from_perf_json() | Parses live perf data, scores current observations, and delegates replacement to the updater. |
llama_moe_hot_cache_apply_json() | Parses a manual /moe-hot-cache payload and applies the full delta budget independent of the automatic update rate. |
llama_moe_hot_cache_layer_active_for_graph() | Checks adapter support, layer bounds, runtime cache presence, and graph kind compatibility. |
llama-moe-hot-cache-parser.*
The JSON parser class shared by learn files, /moe-layer-perf, manual /moe-hot-cache applies, and automatic updates.
| Method | Responsibility |
|---|---|
parse_observations() | Reads layers, experts, hot_experts, cold_experts, and timing fields into typed observations. |
parse_enabled_layer_slots() | Reads only the hot/cold slot totals needed to report hit rate and update stats. |
llama_moe_hot_cache_perf_json_layer_slots | Compact layer, hot-slot, cold-slot tuple used by the updater. |
llama-moe-hot-cache-weighting.cpp
Turns observations into sortable expert scores. This is model-neutral and used by initial fill and dynamic update.
| Method / Mode | Responsibility |
|---|---|
parse_mode(), mode_name() | Maps CLI/env names to canonical weighting modes. |
default_config() | Builds the default config; current default is flat with layer curve 0.5. |
score_observations() | Public scoring entry used by cache init and update. |
flat | Interleaves best experts across layers and is the current default because it gives stable layer coverage under tight VRAM. |
pressure, smooth_pressure, time, balanced | Alternative curves that favor slow layers, smooth outliers, total MoE time, or stronger per-layer distribution. |
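
One way to read "interleaves best experts across layers" is a round-robin over per-layer rankings: the best expert of every layer comes first, then the second best of every layer, and so on. The sketch below implements that reading only. It is an assumption about the flat mode, it does not model the layer curve, and all names in it are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct expert_hits { int expert; uint64_t hits; };
struct ranked_ref  { int layer;  int expert;    };

// Input: one vector of observed experts per layer (index = layer).
// Output: a single ranking that takes rank 0 from every layer, then rank 1, ...
inline std::vector<ranked_ref> flat_interleave(std::vector<std::vector<expert_hits>> layers) {
    size_t max_len = 0;
    for (auto & layer : layers) {
        // Highest hit count first; stable sort keeps input order for ties.
        std::stable_sort(layer.begin(), layer.end(),
                         [](const expert_hits & a, const expert_hits & b) {
                             return a.hits > b.hits;
                         });
        max_len = std::max(max_len, layer.size());
    }
    std::vector<ranked_ref> out;
    for (size_t rank = 0; rank < max_len; ++rank) {
        for (size_t il = 0; il < layers.size(); ++il) {
            if (rank < layers[il].size()) {
                out.push_back({(int) il, layers[il][rank].expert});
            }
        }
    }
    return out;
}
```

Under a tight budget, a consumer that takes a prefix of this ranking covers every layer before any layer gets a second expert, which matches the stated motivation of stable layer coverage.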
planner.cpp and budget.cpp
Memory accounting and budget selection. These files decide what can physically fit.
| Method | Responsibility |
|---|---|
llama_moe_hot_cache_tensor_expert_bytes() | Computes the byte cost of one expert slice from a tensor. |
llama_moe_hot_cache_collect_expert_sizes() | Collects candidate expert sizes from model layers that actually have MoE expert tensors. |
llama_moe_hot_cache_select() | Packs ranked experts into the byte budget while accounting for per-layer dummy padding. |
llama_moe_hot_cache_select_gpu_dev() | Selects the GPU backend device that should own the hot-cache buffer. |
llama_moe_hot_cache_auto_budget_bytes() | Computes remaining VRAM for --moe-hot-cache-max-mib -1, including safety reserve. |
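
The two budget modes can be summarized in a few lines. This sketch assumes a positive value means a fixed size in MiB, that -1 means "free VRAM minus a safety reserve", and that any other value disables the cache. The last point and the function name `resolve_budget_bytes` are assumptions, and how free VRAM and the reserve are actually measured is not shown.

```cpp
#include <cstdint>

// max_mib > 0  : fixed budget in MiB
// max_mib == -1: automatic budget from remaining VRAM minus a reserve
// otherwise    : treated as disabled here (assumption)
inline uint64_t resolve_budget_bytes(int64_t  max_mib,
                                     uint64_t free_vram_bytes,
                                     uint64_t reserve_bytes) {
    constexpr uint64_t MiB = 1024ull * 1024ull;
    if (max_mib > 0) {
        return (uint64_t) max_mib * MiB;
    }
    if (max_mib == -1) {
        // Clamp at zero so a reserve larger than free memory cannot underflow.
        return free_vram_bytes > reserve_bytes ? free_vram_bytes - reserve_bytes : 0;
    }
    return 0;
}
```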
llama-moe-hot-cache-builder.*
Converts a plan into live cache tensors and fills the VRAM copy.
| Method | Responsibility |
|---|---|
group_selected_by_layer() | Groups selected experts by layer so each layer can get its own compact cache tensors. |
summarize_selected_layers() | Creates startup log stats such as active layers, min/max hot experts, and average hot experts. |
copy_expert_slice(), copy_scale_slice() | Copies expert and quantization-scale slices from the original model tensors into the cache buffer. |
set_tensor_i32_1d(), set_tensor_f32_1d() | Updates map and mask tensors during build and dynamic replacement. |
llama_moe_hot_cache_build() | Allocates contexts, buffers, per-layer tensors, maps, masks, dummy slots, and host maps. |
llama-moe-hot-cache-worklist.*
Builds the compact per-layer job list that tells the graph which slots are hot and which are cold.
| Method | Responsibility |
|---|---|
llama_moe_hot_cache_build_worklist() | Consumes already selected Top-K IDs and weights, then writes compact hot/cold slot lists. |
llama_moe_hot_cache_build_worklist_from_logits() | For tiny decode batches, computes routing directly from logits on CPU and writes the same worklist shape. |
LLAMA_MOE_HOT_CACHE_WORKLIST_FIELD_* | Defines the tensor fields for hot IDs, cold IDs, source slots, token IDs, weights, and counts. |
llama-moe-hot-cache-graph.cpp
Builds the GGML hot/cold graph and keeps the hot and cold branches numerically equivalent to the normal MoE path.
| Function | Responsibility |
|---|---|
*_build_worklist_op() | Wraps the C++ worklist builders as custom GGML nodes. |
sum_prefix_rows, sum_weighted_prefix_rows, first_row_input | Decode and prefix-reduce helpers that reduce cold-lane merge overhead. |
set_mul_mat_id_flags() | Stores hot-cache flags in ggml_mul_mat_id operation params. |
build_lora_mm_id() | Builds LoRA-compatible mul_mat_id nodes for selected expert IDs. |
build_moe_ffn_with_ids() | Common FFN core for hot and cold branches, including gate/up/down variants, activation, scales, weights, and reduce. |
build_moe_hot_from_logits() | Generic logits-based hot-cache graph used by Gemma4 and Qwen3Next. |
qwen35 build_layer_ffn_hot() | Qwen35-specific graph that builds router logits inside the hot-cache path. |
llama-moe-hot-cache-adapter.*
The compatibility boundary. New models opt in here; unsupported models keep the normal llama.cpp graph.
| Item | Responsibility |
|---|---|
ADAPTERS | Central allow-list for QWEN35MOE, QWEN3NEXT, and GEMMA4. |
qwen35_profile() | Qwen35 profile: qwen-specific graph kind, single-token CPU decode routing, and proven decode shortcuts. |
qwen3next_profile() | Qwen3Next profile: logits graph, tiny multi-token routing up to four tokens, and conservative prefix tasks. |
gemma4_profile() | Gemma4 profile: logits graph, GELU FFN, branch-reduce option, and decode merge shortcuts. |
llama_moe_hot_cache_graph_tweaks | Reads env-controlled graph toggles such as parallel mode, merge mode, CPU routing, prefix reduction, and PP reduce merge. |
find_model_adapter(), supports_graph_kind() | Enforces that a model can only enter the graph kind it registered for. |
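
The allow-list idea can be sketched as a static table plus two lookups. The enums below are hypothetical stand-ins for the real architecture, graph-kind, and FFN-op enums. The graph kinds follow the profile descriptions above, while the SiLU entries for the two Qwen models are an assumption made only for illustration.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the real enums.
enum class arch_id    { qwen35moe, qwen3next, gemma4, other };
enum class graph_kind { qwen35_ffn, logits };
enum class ffn_op     { silu, gelu };

struct model_adapter {
    arch_id      arch;
    const char * name;
    graph_kind   kind;
    ffn_op       op;
};

// Central allow-list: a model that is not listed never enters the hot graph.
static const model_adapter ADAPTERS[] = {
    { arch_id::qwen35moe, "qwen35moe", graph_kind::qwen35_ffn, ffn_op::silu }, // op is an assumption
    { arch_id::qwen3next, "qwen3next", graph_kind::logits,     ffn_op::silu }, // op is an assumption
    { arch_id::gemma4,    "gemma4",    graph_kind::logits,     ffn_op::gelu },
};

inline const model_adapter * find_model_adapter(arch_id arch) {
    for (const model_adapter & a : ADAPTERS) {
        if (a.arch == arch) {
            return &a;
        }
    }
    return nullptr;
}

// A model may only use the graph kind it registered for.
inline bool supports_graph_kind(arch_id arch, graph_kind kind) {
    const model_adapter * a = find_model_adapter(arch);
    return a != nullptr && a->kind == kind;
}
```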
perf.cpp, perf-state.cpp, perf-json.cpp
Performance collection pipeline for learning, UI visualization, manual applies, and automatic updates.
| Component | Responsibility |
|---|---|
perf-state | Owns thread-safe counters, shape initialization, overflow protection, and per-layer hit/timing storage. |
perf-nodes | Classifies GGML nodes by name into routing, worklist, branch, matmul, gather/scatter, merge, and update categories. |
perf-reader | Reads Top-K and worklist tensors back from backend memory to count experts and slots. |
perf-json | Serializes the current state into /moe-layer-perf JSON, including disabled mode. |
perf.cpp | Coordinates modes, eval callback, graph-compute begin/end, scheduler metric collection, and public C API functions. |
full, update, off | Runtime modes: full diagnostics, update-only counters, or no hot-cache perf counters. |
llama-moe-hot-cache-updater.*
Automatic post-request replacement and manual runtime apply. It changes contents and maps, not tensor shapes.
| Method | Responsibility |
|---|---|
current_hot_experts() | Reconstructs which original experts are currently cached in a layer. |
plan_layer_replacements() | Builds evict/add candidates from current cache contents and observed expert scores. |
sort_replacement_candidates() | Sorts candidates by gain so high-value replacements happen first. |
update_max_exchange() | Caps automatic updates by the configured rate; manual /moe-hot-cache applies use rate 1.0. |
update_from_scored_observations() | Copies new expert slices into existing cache slots and updates hot/cold maps and masks for both automatic and manual replacement. |
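
A simplified view of the replacement planning for one layer: pair the weakest cached experts with the strongest uncached ones, keep only pairs with positive gain, and stop at the exchange budget. The sketch below follows that outline under stated assumptions. The pairing rule, the gain definition (score difference), and all names are hypothetical, and copying tensor slices or updating maps and masks is not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct scored_expert { int expert; double score; };
struct replacement   { int evict;  int add; double gain; };

// cached:   experts currently in the layer's hot cache, with observed scores
// uncached: observed experts that are not cached
inline std::vector<replacement> plan_replacements(std::vector<scored_expert> cached,
                                                  std::vector<scored_expert> uncached,
                                                  size_t max_exchange) {
    // Weakest cached expert first, strongest uncached expert first.
    std::sort(cached.begin(), cached.end(),
              [](const scored_expert & a, const scored_expert & b) { return a.score < b.score; });
    std::sort(uncached.begin(), uncached.end(),
              [](const scored_expert & a, const scored_expert & b) { return a.score > b.score; });

    std::vector<replacement> out;
    const size_t n = std::min({cached.size(), uncached.size(), max_exchange});
    for (size_t i = 0; i < n; ++i) {
        const double gain = uncached[i].score - cached[i].score;
        if (gain <= 0.0) {
            break; // later pairs can only be worse because both lists are sorted
        }
        out.push_back({cached[i].expert, uncached[i].expert, gain});
    }
    return out;
}
```

Because each replacement reuses the evicted expert's slot, the number of cached experts per layer stays constant, which is consistent with the note above that tensor shapes do not change.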
Small integration points connect the separated hot-cache package to llama.cpp runtime.
| File | Responsibility |
|---|---|
src/models/qwen35moe.cpp | Uses the guarded qwen35_ffn path when the adapter and layer cache are active. |
src/models/qwen3next.cpp | Builds router logits, then calls the generic logits hot-cache path when active. |
src/models/gemma4.cpp | Calls the generic logits hot-cache path with Gemma's GELU FFN operation. |
tools/server/server.cpp, server-models.cpp | Expose GET/POST /moe-layer-perf and /moe-hot-cache, then route requests to the loaded model. |
tools/server/server-context.cpp | Writes perf output, marks automatic updates pending, queues manual applies until slots are idle, skips automatic updates while a manual apply is pending, and runs cache replacement. |
ggml-backend-moe-hot-cache.* | Scheduler-side support for marked hot/cold parallel regions, metrics, and fallback counters. |
Each extracted responsibility has a focused test file so future model additions can be reviewed without broad runtime experiments first.
| Test Group | Coverage |
|---|---|
test-moe-hot-cache-parser | Perf JSON parsing, enabled layer slot parsing, and learn/live JSON compatibility. |
test-moe-hot-cache-weighting | Mode parsing, flat scoring, pressure/time behavior, and layer-curve effects. |
test-moe-hot-cache-planner, budget, builder | Byte accounting, budget fit, grouping, selected layer stats, tensor copy helpers, and cache build behavior. |
test-moe-hot-cache-worklist | Hot/cold split, compact slot layout, counts, and logits-based routing output. |
test-moe-hot-cache-adapter | Registered models, graph-kind isolation, profile defaults, and unsupported graph rejection. |
test-moe-hot-cache-perf-* | State lifecycle, node classification, tensor readback, and JSON serialization. |
test-moe-hot-cache-updater | Candidate planning, exchange limits, sorting, and map/mask-safe replacement. |
test-moe-hot-cache-runtime-apply | Manual runtime apply semantics, full-rate exchange limits, and parser/updater compatibility. |
test-moe-hot-cache | Aggregate CMake and CTest target for the complete hot-cache test group. |
This flow is why adapters and graph kinds matter: a model may only use the graph path it explicitly registered for.
These are the invariants and risk points that matter when reviewing or extending the hot-cache path.
arch + graph_kind is supported. Model-specific graph decisions belong in llama-moe-hot-cache-adapter.cpp, not in scattered model code.
/moe-layer-perf.
The goal is one new adapter and a very small model hook, with no scattered logic in the llama core.
n_expert, n_expert_used, and the gate, up, and down expert tensors.
logits when the model already has router logits before the MoE FFN. Use a custom graph kind only when the hot graph must build the router logits itself.
ADAPTERS and provide arch, name, graph_kind, and ffn_op.
{ LLM_ARCH_NEWMODEL, "newmodel", llama_moe_hot_cache_graph_kind::logits, LLM_FFN_SILU }
if (llama_moe_hot_cache_layer_active_for_graph(model, il, llama_moe_hot_cache_graph_kind::logits)) {
    cur_moe = build_layer_moe_hot(cur_moe, logits, il);
}
tests/test-moe-hot-cache-adapter.cpp must verify the new adapter. Add specialized tests if JSON parsing, weighting, or planning behaves differently.
/moe-layer-perf for hits, fallbacks, and lane timings.
llama-moe-hot-cache-adapter.cpp decides which model may use which hot graph.
ctest --test-dir build -R '^test-moe-hot-cache$' --output-on-failure
cmake --build build -j8 --target llama-server
This section is written for coding agents. Follow it in order and keep the patch small, reviewable, and isolated.
build_moe_ffn, ffn_gate_inp, n_expert, n_expert_used, and expert tensors. Do not assume Qwen, Gemma, and the new model share tensor layout.
llama_moe_hot_cache_graph_kind::logits when router logits are already available. Create a new graph kind only if the model cannot call the generic logits builder safely.
src/moe-hot-cache/llama-moe-hot-cache-adapter.cpp. Add the LLM_ARCH_*, adapter name, graph kind, and LLM_FFN_* operation. Put model-specific shortcut choices in the profile function.
llama_moe_hot_cache_layer_active_for_graph(model, il, graph_kind). The hook should route to a hot-cache builder and otherwise leave the existing path unchanged.
tests/test-moe-hot-cache-adapter.cpp so unsupported graph kinds are rejected and the new adapter profile has expected defaults. Add a specialized test only if parser, planner, weighting, or worklist behavior changes.
/moe-layer-perf for hit rate, fallbacks, lane timings, and output quality.
src/moe-hot-cache/*, one model hook file, and targeted tests/test-moe-hot-cache-*.cpp.
llama_moe_hot_cache_layer_active_for_graph, not a loose architecture check.
cmake --build build -j8 --target test-moe-hot-cache llama-server
ctest --test-dir build -R '^test-moe-hot-cache$' --output-on-failure
parallel_fallbacks, hot_slots_total, cold_slots_total, branch timings, and whether generated text remains sane.

| File | Responsibility | Plain-English Explanation |
|---|---|---|
| src/moe-hot-cache/llama-moe-hot-cache.cpp | Lifecycle orchestration and public hot-cache entry points | Coordinates parser, weighting, budget, planner, builder, updater, and layer-active checks. |
| src/moe-hot-cache/llama-moe-hot-cache-parser.cpp | Parser class for perf JSON and slot totals | Turns JSON into typed observations used by cache construction and updates. |
| src/moe-hot-cache/llama-moe-hot-cache-adapter.cpp | Model adapter, graph kind, profile, and tweak defaults | This is the central allow-list: only registered models may use the hot graph. |
| src/moe-hot-cache/llama-moe-hot-cache.h | Data structures for cache, layers, worklists, and update stats | This is the shared vocabulary of the feature. |
| src/moe-hot-cache/llama-moe-hot-cache-graph.cpp | Hot/cold graph, worklist nodes, and merge paths | Turns the feature design into a GGML graph that can compute. |
| src/moe-hot-cache/llama-moe-hot-cache-builder.cpp | Create cache tensors and copy expert slices | Translates the plan into real VRAM buffers. |
| src/moe-hot-cache/llama-moe-hot-cache-worklist.cpp | Build hot/cold slot lists from router results | Creates the per-layer list consumed by the hot and cold lanes. |
| src/moe-hot-cache/llama-moe-hot-cache-perf.cpp | Perf mode, graph callbacks, and scheduler metric collection | Coordinates when counters are active and connects GGML execution back to perf state. |
| src/moe-hot-cache/llama-moe-hot-cache-perf-state.cpp | Thread-safe state for performance counters | Collects hits, hot/cold slots, timings, and fallback counters. |
| src/moe-hot-cache/llama-moe-hot-cache-perf-json.cpp | JSON serialization for /moe-layer-perf | Turns the current perf snapshot into the JSON used by the UI, learn files, manual applies, and automatic updates. |
| src/moe-hot-cache/llama-moe-hot-cache-perf-nodes.cpp | GGML node classification | Maps tensor names to routing, worklist, branch, matmul, gather/scatter, merge, and update timing buckets. |
| src/moe-hot-cache/llama-moe-hot-cache-perf-reader.cpp | Backend tensor readback for perf counters | Reads Top-K and worklist tensors after execution so experts and hot/cold slots can be counted. |
| src/moe-hot-cache/llama-moe-hot-cache-weighting.cpp | Generic expert weighting and layer curves | Turns hits and layer pressure into a cache ranking. |
| src/moe-hot-cache/llama-moe-hot-cache-planner.cpp | Expert sizes and budget selection | Decides which experts fit into the given MiB budget. |
| src/moe-hot-cache/llama-moe-hot-cache-budget.cpp | GPU selection and automatic VRAM budget | When -1 is used, this computes how much VRAM remains after reserves. |
| src/moe-hot-cache/llama-moe-hot-cache-updater.cpp | Automatic and manual cache replacement | Decides which hot experts are evicted and which experts move in without changing cache tensor shapes. |
| src/models/qwen35moe-hot-cache.cpp | Qwen compatibility wrapper for weighting | Old Qwen API names remain stable while the logic lives in the hot-cache folder. |
| src/models/gemma4-hot-cache.cpp | Gemma4 weighting wrapper | Uses generic weighting while allowing the Gemma-specific layer-curve environment override. |
| src/models/qwen35moe.cpp, src/models/gemma4.cpp, src/models/qwen3next.cpp | Small model hooks | These files should only check whether the adapter allows the matching graph kind. |
| ggml/src/ggml-backend-moe-hot-cache.inc | Scheduler extension for parallel hot/cold regions | Connects the parallel execution path to GGML. |
| ggml/include/ggml-backend-moe-hot-cache.h | Public scheduler metric structures | Defines the parallel-region perf data read by the llama-side perf collector. |
| tools/server/server-context.cpp | Perf file output and runtime cache updates | Marks automatic updates pending, logs hit rate, writes perf output, queues manual applies until slots are idle, and runs slot replacement. |
| tools/server/server.cpp, tools/server/server-models.cpp | /moe-layer-perf and /moe-hot-cache routing | Expose the live perf JSON, runtime perf-mode switch, and manual hot-cache apply endpoint through HTTP. |
| common/common.h, include/llama.h | CLI and public parameter storage | Carry hot-cache path, max MiB, auto reserve, weighting, layer curve, and perf output settings into model initialization. |
| tests/test-moe-hot-cache-*.cpp | Specialized unit tests per component | New models should at least get adapter tests, plus targeted tests for special logic. The aggregate target is test-moe-hot-cache. |