Overall Architecture

Architecture diagram (legend: GPU Hot, CPU Cold, Metrics/Update, Graph/Scheduler). Blocks:

Perf JSON: /moe-layer-perf expert hits.
Parser: reads layers and experts.
Planner: scores experts, fits budget.
Hot Cache: selected slices in VRAM.
Token: current input for one layer.
Router: chooses top-k expert ids.
Worklist: dispatches hot/cold slots.
Hot Lane: CUDA0, cached experts.
Cold Lane: CPU/RAM, original experts.
Join/Merge: parallel region, then add output.
Next layer.
Counters: hits + timings.
Update: exchange slots.
Normal fallback path.

Block Explanation

Overview

This feature is an additive path. Without the hot cache, llama.cpp runs normally. With the hot cache, the original MoE experts stay in the model and selected experts are additionally copied into VRAM.

Start: --cpu-moe --moe-hot-cache file.json --moe-hot-cache-max-mib -1
  1. Learn Run. A regular run records which experts are actually used per layer.
  2. Server Start. The JSON is parsed, experts are scored, and the best candidates are packed into a VRAM budget.
  3. Decode. For each token, the worklist separates hot slots and cold slots into two parallel lanes.
  4. Update. After a request, weak cache entries can be exchanged for better experts.

Where Does Each Part Live?

llama.cpp Touch Points
Model Hooks: Small guards call the hot path only when the adapter allows this graph kind.
Scheduler: Recognizes the marked hot/cold region and overlaps the branches when possible.
Server Hook: Exposes perf JSON and triggers optional post-request cache updates.

Hot-Cache Package
Adapter: Owns model support, graph kind, FFN op, and profile defaults.
Parser/Planner: Reads perf JSON, scores experts, sizes candidates, and fits the budget.
Builder/Runtime: Creates cache tensors, copies expert slices, and builds worklists.

Perf and Update
Perf State: Collects hits, hot/cold slots, timings, fallback reasons, and expert arrays.
Updater: Exchanges cache entries inside the existing capacity after a request.
Tests: Component tests cover parser, weighting, planner, builder, worklist, adapter, budget, and perf.

Hardware
CUDA Cache: Computes selected hot expert slices from the extra VRAM cache.
CPU/RAM: Keeps the original experts and computes cold misses when --cpu-moe is used.
VRAM Budget: Limits cache size and is controlled by explicit MiB or auto sizing with reserve.

Current rule: most feature logic lives under src/moe-hot-cache. The llama.cpp model files should stay as small guarded hooks, so upstream rebases remain manageable.

Before and After

Without Feature
Router: Selects experts.
MoE FFN: All selected experts run through the normal path.
Output: The result moves to the next layer.

With Hot Cache
Worklist: Splits expert slots into hot and cold work.
Parallel: GPU and CPU work at the same time.
Merge: Both results are added back together.

Performance improves only when enough slots are hot and the cold lane does not trail too far behind.
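
The condition above can be made concrete with a small timing model. This is a minimal sketch, assuming the two lanes overlap perfectly and the join waits for the slower lane; the struct and function names are hypothetical and do not come from the hot-cache sources.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical per-layer sample; field names are illustrative only.
struct lane_sample {
    int    hot_slots;   // expert slots served from the VRAM cache
    int    cold_slots;  // expert slots that fell through to CPU/RAM
    double hot_us;      // time spent in the hot lane
    double cold_us;     // time spent in the cold lane
    double merge_us;    // time spent joining and adding both results
};

// Fraction of expert slots served from the hot cache.
static double hit_rate(const lane_sample & s) {
    const int total = s.hot_slots + s.cold_slots;
    return total > 0 ? double(s.hot_slots) / double(total) : 0.0;
}

// Idealized wall time: full overlap, the join waits for the slower lane.
static double parallel_wall_us(const lane_sample & s) {
    assert(s.hot_us >= 0.0 && s.cold_us >= 0.0 && s.merge_us >= 0.0);
    return std::max(s.hot_us, s.cold_us) + s.merge_us;
}

// How long the hot lane idles at the join while the cold lane finishes.
static double join_wait_us(const lane_sample & s) {
    return std::max(0.0, s.cold_us - s.hot_us);
}
```

In this model a high hit rate alone is not enough: if the remaining cold slots still take longer than the hot lane, the join wait dominates and the layer is no faster than its cold lane.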

Class Diagram

The diagram's boxes are the main building blocks. Each entry lists the box name, its file or type, its key members, and its role:

Perf JSON Parser (parser.h/.cpp): parse_observations(), parse_enabled_layer_slots(). Reads learn + live JSON.
Weighting (weighting.cpp): score_observations(). Modes: flat, pressure, time. Ranks experts by value.
Planner (planner.h/.cpp): collect_expert_sizes(), llama_moe_hot_cache_select(). Fits ranking into MiB budget.
Builder (builder.h/.cpp): build(), copy_expert_slice(). Fills cache tensors.
Model Adapter (adapter.h/.cpp): arch, name, graph_kind, ffn_op, profile(). Central model allow-list.
Graph Profile (graph_profile struct): cpu_decode_routing, direct_merge, branch_reduce. Model-specific optimizations.
Hot Cache Runtime (llama_moe_hot_cache): layers, ctxs, bufs, hot_id_map, masks. VRAM copy per layer.
Graph Builder (graph.cpp): build worklist, hot lane, cold lane, merge output.
Perf State (perf-state.h/.cpp): experts, hot_experts, timings, fallbacks. Basis for UI + update.
Updater (updater.h/.cpp): plan_layer_replacements(), update_from_scored(). Exchanges cache slots.

The parser lives in llama_moe_hot_cache_perf_json_parser. It is deliberately separated so learn files, /moe-layer-perf, manual /moe-hot-cache applies, and automatic updates share the same JSON interpretation.

Data Models

These structs are the feature's internal data contract. Read them from top to bottom as JSON observations becoming a plan, then a runtime cache, then perf data for updates.

llama_moe_hot_cache_entry

A ranked expert reference used by weighting and planning.

Field | Meaning
layer | Layer index that owns the expert.
expert | Expert id inside that layer.
hit_count | Score used for sorting. It may be raw hits or a weighted score.
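
As an illustration of how such entries might be ordered, here is a minimal sketch. The struct mirrors the three fields above, but the field types and the tie-breaking rule are assumptions, and `rank_entries` is a hypothetical helper rather than a function from the hot-cache package.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Mirrors the documented fields; the concrete types are assumed.
struct cache_entry {
    int32_t  layer;
    int32_t  expert;
    uint64_t hit_count; // raw hits or an already weighted score
};

// Sorts by descending score; ties fall back to (layer, expert) so the
// resulting order is deterministic across runs.
static void rank_entries(std::vector<cache_entry> & entries) {
    std::sort(entries.begin(), entries.end(),
              [](const cache_entry & a, const cache_entry & b) {
                  if (a.hit_count != b.hit_count) return a.hit_count > b.hit_count;
                  if (a.layer != b.layer)         return a.layer < b.layer;
                  return a.expert < b.expert;
              });
    assert(std::is_sorted(entries.begin(), entries.end(),
                          [](const cache_entry & a, const cache_entry & b) {
                              return a.hit_count > b.hit_count;
                          }));
}
```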

llama_moe_hot_cache_expert_observation

One expert row parsed from a learn file or from /moe-layer-perf.

Field | Meaning
expert | Expert id in the layer.
hot | How often this expert was served from the hot cache.
cold | How often this expert fell through to the cold path.
raw | Original count when the source has no hot/cold split, usually from a learn run.

llama_moe_hot_cache_layer_observation

Layer-level input for scoring. This is where hit data and timing pressure meet.

Field | Meaning
layer | Layer index.
experts | All observed expert rows for this layer.
has_branch_counts | Whether hot/cold branch data is available.
cold_slots_per_call | Average number of cold slots per layer call.
parallel_*_time_per_call_us | Join wait, hot lane, cold lane, and total MoE timings, normalized per call.
wait_per_cold_slot_us | Pressure signal used to decide which cold misses hurt most.

llama_moe_hot_cache_plan

The planner output. It says what was considered and what fits into the cache budget.

Field | Meaning
observed | Ranked expert entries after parsing and weighting.
selected | Experts selected for the VRAM cache, including byte size.
budget_bytes | Maximum allowed cache size.
used_bytes | Actual size used by selected experts.

llama_moe_hot_cache_expert_size

Memory accounting for a single expert candidate.

Field | Meaning
layer | Layer index.
expert | Expert id inside that layer.
bytes | Estimated bytes needed to cache this expert's tensor slices.

llama_moe_hot_cache_weighting_config

Configuration that controls how observations become cache scores.

Field | Meaning
mode | Scoring strategy: flat, pressure, smooth_pressure, time, or balanced.
layer_curve | How strongly the scoring curve reshapes layer priority.

llama_moe_hot_cache_layer

Runtime cache state for one model layer.

Field | Meaning
ffn_*_exps | Cached expert tensors for gate/up/down variants, depending on model layout.
hot_id_map | Device-side map from original expert id to cache slot id.
hot_mask, cold_mask | Device-side masks used by graph operations to split hot and cold work.
hot_id_map_host | Host copy of the expert-to-cache-slot map, used by CPU routing and updates.
n_hot, n_expert | Number of cached experts and total experts in the layer.
expert_weights_scale | Optional model scale applied to expert weights.
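
To show how the map and the two masks relate to each other, here is a host-side sketch. It assumes 32-bit integer maps, float masks, and -1 as the "not cached" sentinel; none of these details are confirmed by the table above, and `layer_maps` and `build_layer_maps` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side picture of the per-layer routing metadata (assumed layout).
struct layer_maps {
    std::vector<int32_t> hot_id_map; // expert id -> cache slot, -1 if not cached (assumed sentinel)
    std::vector<float>   hot_mask;   // 1.0f for cached experts, 0.0f otherwise
    std::vector<float>   cold_mask;  // complement of hot_mask
};

// cached[slot] is the original expert id stored in cache slot `slot`.
static layer_maps build_layer_maps(int32_t n_expert, const std::vector<int32_t> & cached) {
    assert(n_expert >= 0);
    layer_maps m;
    m.hot_id_map.assign((size_t) n_expert, -1);
    m.hot_mask.assign((size_t) n_expert, 0.0f);
    m.cold_mask.assign((size_t) n_expert, 1.0f);
    for (size_t slot = 0; slot < cached.size(); ++slot) {
        const int32_t e = cached[slot];
        assert(e >= 0 && e < n_expert);          // expert id must be valid
        assert(m.hot_id_map[(size_t) e] == -1);  // each expert is cached at most once
        m.hot_id_map[(size_t) e] = (int32_t) slot;
        m.hot_mask[(size_t) e]   = 1.0f;
        m.cold_mask[(size_t) e]  = 0.0f;
    }
    return m;
}
```

Because n_hot is simply `cached.size()` here, replacing one cached expert with another only rewrites entries in these arrays; their lengths never change.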

llama_moe_hot_cache_worklist_field

Column layout of the worklist tensor consumed by graph operations.

Field Group | Meaning
HOT_ID, HOT_SRC_SLOT, HOT_TOKEN_ID, HOT_WEIGHT | Compact hot-lane job description.
COLD_ID, COLD_SRC_SLOT, COLD_TOKEN_ID, COLD_WEIGHT | Compact cold-lane job description.
HOT_EXPERT_ID | Original expert id for hot-cache accounting and update data.
HOT_COUNT, COLD_COUNT | Per-call counts used to size and skip branch work.

llama_moe_hot_cache

Top-level runtime object attached to the model.

Field | Meaning
layers | Per-layer cache state.
ctxs | GGML contexts owning the cache tensors.
bufs | Backend buffers that hold the actual memory.
active() | Returns true when at least one layer has a usable cache.

llama_moe_hot_cache_model_adapter

The model compatibility entry. It keeps model-specific behavior out of generic code.

Field | Meaning
arch | llama architecture enum this adapter supports.
name | Human-readable adapter name for code review and diagnostics.
graph_kind | Which hot graph shape the model is allowed to use.
ffn_op | Activation operation used by the MoE FFN path.
profile() | Returns the optimization profile for this architecture.

llama_moe_hot_cache_graph_profile

Per-model graph optimization flags.

Field | Meaning
cpu_decode_routing | Route tiny decode batches on CPU to reduce graph overhead.
decode_direct_merge, merge_sum_rows | Choose faster merge forms when graph shape allows it.
cold_prefix_sum, cold_prefix_weighted_sum | Reduce only compact cold prefixes instead of full slot tensors.
branch_reduce_merge | Let branches reduce before the final merge, useful for some models.
cpu_decode_routing_max_tokens | Maximum tiny-batch size that may use CPU routing.
prefix_reduce_tasks_max | Upper bound for CPU tasks used in prefix reduction.

llama_moe_hot_cache_update_stats

Summary logged after dynamic cache replacement.

Field | Meaning
active | Whether an update was attempted.
update_rate | Configured fraction of hot experts that may be exchanged.
hit_rate | Hot-slot ratio observed in the completed request.
hot_slots, cold_slots | Total hot and cold expert slots observed.
candidates, max_exchange | Available replacements and configured exchange budget.
exchanged, layers_changed | How much the cache actually changed.

llama_moe_layer_perf_layer

Per-layer counters exposed through /moe-layer-perf.

Field Group | Meaning
calls, expert_hits_total | How often the layer ran and how many expert slots were selected.
hot_slots_total, cold_slots_total | Hot/cold split totals used for hit rate and update decisions.
*_time_us | MoE, routing, worklist, matmul, branch, gather/scatter, and merge timings.
parallel_* | Scheduler wall time, lane launches, overlap estimate, join wait, and fallback reasons.
experts, hot_experts, cold_experts | Per-expert counts for visualization, learning, manual applies, and automatic updates.

llama_moe_layer_perf_state

Thread-safe owner for all live MoE performance counters.

Field | Meaning
mutex | Protects updates to counters and layer vectors.
n_expert, n_expert_used | Shape metadata for the currently measured model.
updates, overflow_resets | Bookkeeping for counter lifecycle and overflow protection.
active | Whether performance collection is currently enabled.
layers | Vector of per-layer counters.
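
A mutex-guarded owner of per-layer counters could look roughly like the sketch below. It is a simplified illustration, not the actual perf-state class: the class and method names are hypothetical, only three counters are modeled, and overflow protection is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Reduced set of per-layer counters for illustration.
struct layer_counters {
    uint64_t calls            = 0;
    uint64_t hot_slots_total  = 0;
    uint64_t cold_slots_total = 0;
};

// Hypothetical thread-safe owner; every access takes the same mutex.
class perf_state_sketch {
public:
    // Resets all counters and enables collection.
    void init(size_t n_layers) {
        std::lock_guard<std::mutex> lock(mutex_);
        layers_.assign(n_layers, layer_counters{});
        active_ = true;
    }

    // Records one layer call; ignored when inactive or out of range.
    void record(size_t layer, uint64_t hot, uint64_t cold) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!active_ || layer >= layers_.size()) {
            return;
        }
        layer_counters & c = layers_[layer];
        c.calls            += 1;
        c.hot_slots_total  += hot;
        c.cold_slots_total += cold;
    }

    // Returns a copy so callers never hold references into guarded state.
    layer_counters snapshot(size_t layer) const {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(layer < layers_.size());
        return layers_[layer];
    }

private:
    mutable std::mutex          mutex_;
    bool                        active_ = false;
    std::vector<layer_counters> layers_;
};
```

Returning snapshots by value keeps the JSON serializer and the updater from reading counters while a compute thread is still writing them.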

Code Reference

This section supersedes the old Markdown code reference. The old file was written before the separation into src/moe-hot-cache/; the tables below describe the current ownership boundaries and entry points.

llama-moe-hot-cache.cpp

Top-level lifecycle orchestration. It wires parser, weighting, budget, planner, builder, and updater together.

Entry Point | Responsibility
llama_moe_hot_cache_init() | Reads the JSON, scores observations, computes the budget, selects experts, and calls the builder.
llama_moe_hot_cache_init_after_model_load() | Initializes fixed-size caches before context memory when max_mib > 0.
llama_moe_hot_cache_init_after_context_memory() | Initializes auto-sized caches after context memory when max_mib == -1.
llama_moe_hot_cache_update_from_perf_json() | Parses live perf data, scores current observations, and delegates replacement to the updater.
llama_moe_hot_cache_apply_json() | Parses a manual /moe-hot-cache payload and applies the full delta budget independent of the automatic update rate.
llama_moe_hot_cache_layer_active_for_graph() | Checks adapter support, layer bounds, runtime cache presence, and graph kind compatibility.
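
The two init entry points differ in how the byte budget is derived. The sketch below illustrates that split under stated assumptions: `resolve_budget_bytes` is a hypothetical helper, the free-VRAM and reserve values are passed in by the caller, and treating any other max_mib value as "cache disabled" is an assumption rather than documented behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper: turns the --moe-hot-cache-max-mib value into bytes.
static uint64_t resolve_budget_bytes(int64_t max_mib,
                                     uint64_t free_vram_bytes,
                                     uint64_t reserve_bytes) {
    constexpr uint64_t MIB = 1024ull * 1024ull;
    if (max_mib > 0) {
        // Fixed size: known before context memory is allocated.
        assert((uint64_t) max_mib <= std::numeric_limits<uint64_t>::max() / MIB);
        return (uint64_t) max_mib * MIB;
    }
    if (max_mib == -1) {
        // Auto size: whatever is left after context memory, minus the reserve.
        return free_vram_bytes > reserve_bytes ? free_vram_bytes - reserve_bytes : 0;
    }
    return 0; // assumed: any other value disables the cache
}
```

This also shows why the auto-sized path has to run after context memory is allocated: the free-VRAM figure is only meaningful once KV and compute buffers already exist.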

llama-moe-hot-cache-parser.*

The JSON parser class shared by learn files, /moe-layer-perf, manual /moe-hot-cache applies, and automatic updates.

Method | Responsibility
parse_observations() | Reads layers, experts, hot_experts, cold_experts, and timing fields into typed observations.
parse_enabled_layer_slots() | Reads only the hot/cold slot totals needed to report hit rate and update stats.
llama_moe_hot_cache_perf_json_layer_slots | Compact layer, hot-slot, cold-slot tuple used by the updater.

llama-moe-hot-cache-weighting.cpp

Turns observations into sortable expert scores. This is model-neutral and used by initial fill and dynamic update.

Method / Mode | Responsibility
parse_mode(), mode_name() | Maps CLI/env names to canonical weighting modes.
default_config() | Builds the default config; current default is flat with layer curve 0.5.
score_observations() | Public scoring entry used by cache init and update.
flat | Interleaves best experts across layers and is the current default because it gives stable layer coverage under tight VRAM.
pressure, smooth_pressure, time, balanced | Alternative curves that favor slow layers, smooth outliers, total MoE time, or stronger per-layer distribution.
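
One plausible reading of "interleaves best experts across layers" is a round-robin over per-layer rankings. The sketch below implements that reading only; the real flat mode may differ (for example in how layer_curve is applied), and `expert_ref` and `interleave_flat` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct expert_ref {
    int32_t  layer;
    int32_t  expert;
    uint64_t hits;
};

// per_layer[i] holds the observed experts of one layer in any order.
// Output: every layer's best expert first, then every layer's second best, ...
static std::vector<expert_ref> interleave_flat(std::vector<std::vector<expert_ref>> per_layer) {
    size_t max_len = 0;
    size_t total   = 0;
    for (auto & layer : per_layer) {
        std::stable_sort(layer.begin(), layer.end(),
                         [](const expert_ref & a, const expert_ref & b) {
                             return a.hits > b.hits;
                         });
        max_len = std::max(max_len, layer.size());
        total  += layer.size();
    }
    std::vector<expert_ref> ranked;
    ranked.reserve(total);
    for (size_t rank = 0; rank < max_len; ++rank) {
        for (const auto & layer : per_layer) {
            if (rank < layer.size()) {
                ranked.push_back(layer[rank]);
            }
        }
    }
    assert(ranked.size() == total); // every observation appears exactly once
    return ranked;
}
```

With this ordering, a layer with modest hit counts still gets its best expert cached before a busy layer gets its second one, which is the kind of stable layer coverage the table attributes to flat.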

planner.cpp and budget.cpp

Memory accounting and budget selection. These files decide what can physically fit.

Method | Responsibility
llama_moe_hot_cache_tensor_expert_bytes() | Computes the byte cost of one expert slice from a tensor.
llama_moe_hot_cache_collect_expert_sizes() | Collects candidate expert sizes from model layers that actually have MoE expert tensors.
llama_moe_hot_cache_select() | Packs ranked experts into the byte budget while accounting for per-layer dummy padding.
llama_moe_hot_cache_select_gpu_dev() | Selects the GPU backend device that should own the hot-cache buffer.
llama_moe_hot_cache_auto_budget_bytes() | Computes remaining VRAM for --moe-hot-cache-max-mib -1, including safety reserve.
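
Budget selection can be pictured as a greedy pass over the ranked list. This is a simplified sketch: it ignores the per-layer dummy padding mentioned above, it skips candidates that do not fit instead of stopping (an assumption), and the type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct candidate {
    int32_t  layer;
    int32_t  expert;
    uint64_t bytes; // estimated cost of caching this expert
};

struct plan_sketch {
    std::vector<candidate> selected;
    uint64_t budget_bytes = 0;
    uint64_t used_bytes   = 0;
};

// `ranked` must already be ordered from most to least valuable.
static plan_sketch select_within_budget(const std::vector<candidate> & ranked,
                                        uint64_t budget_bytes) {
    plan_sketch p;
    p.budget_bytes = budget_bytes;
    for (const candidate & c : ranked) {
        const uint64_t remaining = p.budget_bytes - p.used_bytes;
        if (c.bytes == 0 || c.bytes > remaining) {
            continue; // does not fit; a smaller, lower-ranked expert still might
        }
        p.selected.push_back(c);
        p.used_bytes += c.bytes;
    }
    assert(p.used_bytes <= p.budget_bytes);
    return p;
}
```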

llama-moe-hot-cache-builder.*

Converts a plan into live cache tensors and fills the VRAM copy.

Method | Responsibility
group_selected_by_layer() | Groups selected experts by layer so each layer can get its own compact cache tensors.
summarize_selected_layers() | Creates startup log stats such as active layers, min/max hot experts, and average hot experts.
copy_expert_slice(), copy_scale_slice() | Copies expert and quantization-scale slices from the original model tensors into the cache buffer.
set_tensor_i32_1d(), set_tensor_f32_1d() | Updates map and mask tensors during build and dynamic replacement.
llama_moe_hot_cache_build() | Allocates contexts, buffers, per-layer tensors, maps, masks, dummy slots, and host maps.

llama-moe-hot-cache-worklist.*

Builds the compact per-layer job list that tells the graph which slots are hot and which are cold.

Method | Responsibility
llama_moe_hot_cache_build_worklist() | Consumes already selected Top-K IDs and weights, then writes compact hot/cold slot lists.
llama_moe_hot_cache_build_worklist_from_logits() | For tiny decode batches, computes routing directly from logits on CPU and writes the same worklist shape.
LLAMA_MOE_HOT_CACHE_WORKLIST_FIELD_* | Defines the tensor fields for hot IDs, cold IDs, source slots, token IDs, weights, and counts.
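
The hot/cold split itself can be sketched on the host with plain vectors. The version below is an illustration only: it assumes HOT_ID means the cache slot, COLD_ID means the original expert id, the source slot is the flat Top-K index, and -1 marks uncached experts in the host map. `job`, `worklist_sketch`, and `sketch_build_worklist` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct job {
    int32_t id;        // hot: cache slot id; cold: original expert id (assumed meaning)
    int32_t src_slot;  // index into the flat top-k arrays
    int32_t token;     // token that selected this expert
    float   weight;    // router weight for this slot
    int32_t expert;    // original expert id, kept for accounting
};

struct worklist_sketch {
    std::vector<job> hot;
    std::vector<job> cold;
};

// topk_ids/topk_weights are laid out token-major: n_used entries per token.
static worklist_sketch sketch_build_worklist(const std::vector<int32_t> & topk_ids,
                                             const std::vector<float> & topk_weights,
                                             size_t n_used,
                                             const std::vector<int32_t> & hot_id_map) {
    assert(n_used > 0);
    assert(topk_ids.size() == topk_weights.size());
    assert(topk_ids.size() % n_used == 0);
    worklist_sketch wl;
    for (size_t i = 0; i < topk_ids.size(); ++i) {
        const int32_t expert = topk_ids[i];
        assert(expert >= 0 && (size_t) expert < hot_id_map.size());
        const int32_t slot = hot_id_map[(size_t) expert];
        const bool is_hot = slot >= 0;
        const job j = { is_hot ? slot : expert, (int32_t) i, (int32_t) (i / n_used),
                        topk_weights[i], expert };
        (is_hot ? wl.hot : wl.cold).push_back(j);
    }
    // Every selected slot lands in exactly one lane.
    assert(wl.hot.size() + wl.cold.size() == topk_ids.size());
    return wl;
}
```

The sizes of the two lists correspond to the HOT_COUNT and COLD_COUNT idea: a lane with zero jobs can be skipped, and together they never exceed tokens times experts used per token.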

llama-moe-hot-cache-graph.cpp

Builds the GGML hot/cold graph and keeps the hot and cold branches numerically equivalent to the normal MoE path.

Function | Responsibility
*_build_worklist_op() | Wraps the C++ worklist builders as custom GGML nodes.
sum_prefix_rows, sum_weighted_prefix_rows, first_row_input | Decode and prefix-reduce helpers that reduce cold-lane merge overhead.
set_mul_mat_id_flags() | Stores hot-cache flags in ggml_mul_mat_id operation params.
build_lora_mm_id() | Builds LoRA-compatible mul_mat_id nodes for selected expert IDs.
build_moe_ffn_with_ids() | Common FFN core for hot and cold branches, including gate/up/down variants, activation, scales, weights, and reduce.
build_moe_hot_from_logits() | Generic logits-based hot-cache graph used by Gemma4 and Qwen3Next.
qwen35 build_layer_ffn_hot() | Qwen35-specific graph that builds router logits inside the hot-cache path.

llama-moe-hot-cache-adapter.*

The compatibility boundary. New models opt in here; unsupported models keep the normal llama.cpp graph.

Item | Responsibility
ADAPTERS | Central allow-list for QWEN35MOE, QWEN3NEXT, and GEMMA4.
qwen35_profile() | Qwen35 profile: qwen-specific graph kind, single-token CPU decode routing, and proven decode shortcuts.
qwen3next_profile() | Qwen3Next profile: logits graph, tiny multi-token routing up to four tokens, and conservative prefix tasks.
gemma4_profile() | Gemma4 profile: logits graph, GELU FFN, branch-reduce option, and decode merge shortcuts.
llama_moe_hot_cache_graph_tweaks | Reads env-controlled graph toggles such as parallel mode, merge mode, CPU routing, prefix reduction, and PP reduce merge.
find_model_adapter(), supports_graph_kind() | Enforces that a model can only enter the graph kind it registered for.
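
The allow-list idea can be shown with a small standalone table. All enum names, the struct layout, and the two lookup functions below are hypothetical stand-ins; the graph kinds follow the table above, while the SiLU activation for the two Qwen entries is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for the real architecture, graph-kind, and FFN enums.
enum class arch_id    { qwen35moe, qwen3next, gemma4, other };
enum class graph_kind { logits, qwen35_ffn };
enum class ffn_act    { silu, gelu };

struct adapter_entry {
    arch_id      arch;
    const char * name;
    graph_kind   kind;
    ffn_act      op;
};

// Central allow-list: one row per supported model.
static const adapter_entry ADAPTER_TABLE[] = {
    { arch_id::qwen35moe, "qwen35moe", graph_kind::qwen35_ffn, ffn_act::silu }, // op assumed
    { arch_id::qwen3next, "qwen3next", graph_kind::logits,     ffn_act::silu }, // op assumed
    { arch_id::gemma4,    "gemma4",    graph_kind::logits,     ffn_act::gelu },
};

// Returns the registered adapter, or nullptr for unsupported architectures.
static const adapter_entry * sketch_find_adapter(arch_id arch) {
    for (const adapter_entry & a : ADAPTER_TABLE) {
        if (a.arch == arch) {
            return &a;
        }
    }
    return nullptr;
}

// A model may only enter the graph kind it registered for.
static bool sketch_supports_graph_kind(arch_id arch, graph_kind kind) {
    const adapter_entry * a = sketch_find_adapter(arch);
    return a != nullptr && a->kind == kind;
}
```

A model hook that asks `sketch_supports_graph_kind(arch, kind)` instead of comparing architectures directly gets the fallback behavior for free: an unregistered model simply receives false and keeps the normal graph.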

perf.cpp, perf-state.cpp, perf-json.cpp

Performance collection pipeline for learning, UI visualization, manual applies, and automatic updates.

Component | Responsibility
perf-state | Owns thread-safe counters, shape initialization, overflow protection, and per-layer hit/timing storage.
perf-nodes | Classifies GGML nodes by name into routing, worklist, branch, matmul, gather/scatter, merge, and update categories.
perf-reader | Reads Top-K and worklist tensors back from backend memory to count experts and slots.
perf-json | Serializes the current state into /moe-layer-perf JSON, including disabled mode.
perf.cpp | Coordinates modes, eval callback, graph-compute begin/end, scheduler metric collection, and public C API functions.
full, update, off | Runtime modes: full diagnostics, update-only counters, or no hot-cache perf counters.

llama-moe-hot-cache-updater.*

Automatic post-request replacement and manual runtime apply. It changes contents and maps, not tensor shapes.

Method | Responsibility
current_hot_experts() | Reconstructs which original experts are currently cached in a layer.
plan_layer_replacements() | Builds evict/add candidates from current cache contents and observed expert scores.
sort_replacement_candidates() | Sorts candidates by gain so high-value replacements happen first.
update_max_exchange() | Caps automatic updates by the configured rate; manual /moe-hot-cache applies use rate 1.0.
update_from_scored_observations() | Copies new expert slices into existing cache slots and updates hot/cold maps and masks for both automatic and manual replacement.
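
The replacement planning can be sketched for a single layer as follows. This is an illustration under assumptions: the exchange cap is taken as floor(update_rate * n_hot), gain is the score difference between the incoming and the evicted expert, and the `sketch_*` functions are hypothetical, not the updater's real signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

struct replacement {
    int32_t evict; // expert leaving the cache
    int32_t add;   // expert entering the same slot
    double  gain;  // score(add) - score(evict)
};

// Assumed cap: a fraction of the currently cached experts, rounded down.
static size_t sketch_max_exchange(size_t n_hot, double update_rate) {
    assert(update_rate >= 0.0 && update_rate <= 1.0);
    return (size_t) std::floor(update_rate * (double) n_hot);
}

// scores[e] is the observed score of expert e; cached lists cached expert ids.
static std::vector<replacement> sketch_plan_replacements(const std::vector<double> & scores,
                                                         const std::vector<int32_t> & cached,
                                                         size_t max_exchange) {
    std::vector<bool> is_cached(scores.size(), false);
    for (int32_t e : cached) {
        assert(e >= 0 && (size_t) e < scores.size());
        is_cached[(size_t) e] = true;
    }
    std::vector<int32_t> evict = cached;
    std::vector<int32_t> add;
    for (size_t e = 0; e < scores.size(); ++e) {
        if (!is_cached[e]) add.push_back((int32_t) e);
    }
    // Weakest cached experts first, strongest uncached experts first.
    std::stable_sort(evict.begin(), evict.end(),
                     [&](int32_t a, int32_t b) { return scores[(size_t) a] < scores[(size_t) b]; });
    std::stable_sort(add.begin(), add.end(),
                     [&](int32_t a, int32_t b) { return scores[(size_t) a] > scores[(size_t) b]; });
    std::vector<replacement> out;
    for (size_t i = 0; i < evict.size() && i < add.size() && out.size() < max_exchange; ++i) {
        const double gain = scores[(size_t) add[i]] - scores[(size_t) evict[i]];
        if (gain <= 0.0) break; // later pairs can only be worse
        out.push_back({ evict[i], add[i], gain });
    }
    return out;
}
```

Pairing the weakest cached expert with the strongest uncached one yields replacements already ordered by decreasing gain, and every replacement reuses an existing slot, so the cache capacity is unchanged.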

Model, Server, and Scheduler Hooks

Small integration points connect the separated hot-cache package to llama.cpp runtime.

File | Responsibility
src/models/qwen35moe.cpp | Uses the guarded qwen35_ffn path when the adapter and layer cache are active.
src/models/qwen3next.cpp | Builds router logits, then calls the generic logits hot-cache path when active.
src/models/gemma4.cpp | Calls the generic logits hot-cache path with Gemma's GELU FFN operation.
tools/server/server.cpp, server-models.cpp | Expose GET/POST /moe-layer-perf and /moe-hot-cache, then route requests to the loaded model.
tools/server/server-context.cpp | Writes perf output, marks automatic updates pending, queues manual applies until slots are idle, skips automatic updates while a manual apply is pending, and runs cache replacement.
ggml-backend-moe-hot-cache.* | Scheduler-side support for marked hot/cold parallel regions, metrics, and fallback counters.

Unit Tests

Each extracted responsibility has a focused test file so future model additions can be reviewed without broad runtime experiments first.

Test Group | Coverage
test-moe-hot-cache-parser | Perf JSON parsing, enabled layer slot parsing, and learn/live JSON compatibility.
test-moe-hot-cache-weighting | Mode parsing, flat scoring, pressure/time behavior, and layer-curve effects.
test-moe-hot-cache-planner, budget, builder | Byte accounting, budget fit, grouping, selected layer stats, tensor copy helpers, and cache build behavior.
test-moe-hot-cache-worklist | Hot/cold split, compact slot layout, counts, and logits-based routing output.
test-moe-hot-cache-adapter | Registered models, graph-kind isolation, profile defaults, and unsupported graph rejection.
test-moe-hot-cache-perf-* | State lifecycle, node classification, tensor readback, and JSON serialization.
test-moe-hot-cache-updater | Candidate planning, exchange limits, sorting, and map/mask-safe replacement.
test-moe-hot-cache-runtime-apply | Manual runtime apply semantics, full-rate exchange limits, and parser/updater compatibility.
test-moe-hot-cache | Aggregate CMake and CTest target for the complete hot-cache test group.

Runtime Flow Per Layer

This flow is why adapters and graph kinds matter: a model may only use the graph path it explicitly registered for.

Per-layer flow diagram (stages: model hook, parallel compute, after request). Blocks in order:

Layer hook: small core touch.
Adapter Gate: arch + graph_kind.
Router: logits, top-k experts.
Worklist: capacity = tokens * used.
Hot lane: VRAM cache, CUDA0, cached experts.
Cold lane: original experts, usually CPU/RAM.
Merge: add.
Perf counters: hits, slots, lane time.
Update: optional slot exchange.

Prompt processing can send many tokens through the same hot/cold path. Decode is usually one token, while Qwen3Next has shown small batches of up to four tokens.

Senior Engineering Notes

These are the invariants and risk points that matter when reviewing or extending the hot-cache path.

  1. The adapter is the compatibility boundary. Core hooks should only ask whether arch + graph_kind is supported. Model-specific graph decisions belong in llama-moe-hot-cache-adapter.cpp, not in scattered model code.
  2. The scheduler relies on graph shape. The split must remain hot lane, cold lane, then join. If node order changes, the parallel region should fall back instead of silently producing a wrong graph.
  3. The cache is additive memory. The original CPU/RAM tensors are still required. The VRAM cache holds selected expert slices plus maps and masks; dynamic update mutates contents and maps, not tensor shapes.
  4. Prompt processing and decode have different economics. PP amortizes routing and merge work over many tokens. Decode is overhead-sensitive, so tiny shortcuts such as CPU routing and direct merge can matter more than raw matmul speed.
  5. Perf collection changes runtime cost. Full counters are best for diagnosis. Update mode keeps only what dynamic replacement needs. Off mode should remove measurable counter overhead from the hot path.
  6. Dynamic update must preserve capacity. Updates should exchange experts inside the existing budget. Reallocation during request processing would make latency and OOM behavior much harder to reason about.

Hard graph contract: Hot branch, cold branch, join. Any mismatch increments fallback counters and should be visible in /moe-layer-perf.
Model isolation: Qwen, Gemma, and Qwen3Next use adapter profiles so one model's shortcut does not leak into another model.
Memory safety: Auto sizing must leave enough reserve for KV, compute buffers, warmup, and optional draft/MTP contexts.
Correctness first: Gibberish output is usually a graph-shape, expert-order, weight, or shared-expert integration bug, not a cache hit-rate issue.
Review targets: Check adapter registration, layer-active guard, worklist construction, merge shape, fallback counters, and unit tests.
Recommended validation: Run a learn pass, a hot-cache pass with perf full, a pass with perf off, and compare output quality plus TG/PP rates.

Add a New MoE Model

The goal is one new adapter and a very small model hook, with no scattered logic in the llama core.

  1. Confirm the MoE shape. In the model code, locate router logits, n_expert, n_expert_used, and the gate, up, and down expert tensors.
  2. Choose the graph kind. Use logits when the model already has router logits before the MoE FFN. Use a custom graph kind only when the hot graph must build the router logits itself.
  3. Register the adapter. Add the model to ADAPTERS and provide arch, name, graph_kind, and ffn_op:
       { LLM_ARCH_NEWMODEL, "newmodel", llama_moe_hot_cache_graph_kind::logits, LLM_FFN_SILU }
  4. Encapsulate the profile. Put CPU routing limits, direct merge, branch reduce, and PP reduce decisions into the adapter profile.
  5. Keep the model hook small. Model code should only check whether the layer is active for this graph kind. The actual hot/cold logic stays in the hot-cache folder:
       if (llama_moe_hot_cache_layer_active_for_graph(model, il, llama_moe_hot_cache_graph_kind::logits)) {
           cur_moe = build_layer_moe_hot(cur_moe, logits, il);
       }
  6. Extend tests. At minimum, tests/test-moe-hot-cache-adapter.cpp must verify the new adapter. Add specialized tests if JSON parsing, weighting, or planning behaves differently.
  7. Validate with a learn run. First write an expert list without the hot cache, then start with the hot cache and inspect /moe-layer-perf for hits, fallbacks, and lane timings.

Source of Truth: llama-moe-hot-cache-adapter.cpp decides which model may use which hot graph.
No Side Effects: A Qwen-specific graph kind must not automatically apply to Gemma or a new model.
Fallback Is Normal: If no adapter matches or the layer is inactive, the existing model path continues.
PP Differs From Decode: Prompt processing can have large batches. Decode usually has tiny batches and is more overhead-sensitive.
Minimum Test: ctest --test-dir build -R '^test-moe-hot-cache$' --output-on-failure
Build: cmake --build build -j8 --target llama-server

LLM Agent Runbook: Add a Model

This section is written for coding agents. Follow it in order and keep the patch small, reviewable, and isolated.

  1. Identify the MoE implementation before editing. Search the model file for build_moe_ffn, ffn_gate_inp, n_expert, n_expert_used, and expert tensors. Do not assume Qwen, Gemma, and the new model share tensor layout.
  2. Choose the smallest supported graph kind. Prefer llama_moe_hot_cache_graph_kind::logits when router logits are already available. Create a new graph kind only if the model cannot call the generic logits builder safely.
  3. Add exactly one adapter entry. Edit src/moe-hot-cache/llama-moe-hot-cache-adapter.cpp. Add the LLM_ARCH_*, adapter name, graph kind, and LLM_FFN_* operation. Put model-specific shortcut choices in the profile function.
  4. Add one tiny model hook. In the model file, guard the hot path with llama_moe_hot_cache_layer_active_for_graph(model, il, graph_kind). The hook should route to a hot-cache builder and otherwise leave the existing path unchanged.
  5. Preserve non-MoE branches. Shared experts, dense FFN branches, residual adds, norms, and multimodal side paths must remain exactly where they were unless the model specifically requires otherwise.
  6. Add focused tests. Update tests/test-moe-hot-cache-adapter.cpp so unsupported graph kinds are rejected and the new adapter profile has expected defaults. Add a specialized test only if parser, planner, weighting, or worklist behavior changes.
  7. Validate in three stages. First build and run unit tests. Then run a learn pass to create JSON. Then run hot-cache inference and inspect /moe-layer-perf for hit rate, fallbacks, lane timings, and output quality.
  8. Stop when correctness is unclear. If output becomes gibberish, fallbacks climb, graph shape changes unexpectedly, or shared-expert behavior is uncertain, stop and document the failure instead of adding more shortcuts.

Preferred edit locations: src/moe-hot-cache/*, one model hook file, and targeted tests/test-moe-hot-cache-*.cpp.
Avoid broad core edits: Do not modify scheduler, GGML CUDA kernels, or unrelated model code unless the requested model cannot work without it.
Required guard: Use llama_moe_hot_cache_layer_active_for_graph, not a loose architecture check.
Build command: cmake --build build -j8 --target test-moe-hot-cache llama-server
Test command: ctest --test-dir build -R '^test-moe-hot-cache$' --output-on-failure
Runtime checks: Watch parallel_fallbacks, hot_slots_total, cold_slots_total, branch timings, and whether generated text remains sane.
Patch review question: Could this change alter Qwen, Gemma, or Qwen3Next when the new model is not loaded? If yes, isolate it further.

Files for Orientation

File | Responsibility | Plain-English Explanation
src/moe-hot-cache/llama-moe-hot-cache.cpp | Lifecycle orchestration and public hot-cache entry points | Coordinates parser, weighting, budget, planner, builder, updater, and layer-active checks.
src/moe-hot-cache/llama-moe-hot-cache-parser.cpp | Parser class for perf JSON and slot totals | Turns JSON into typed observations used by cache construction and updates.
src/moe-hot-cache/llama-moe-hot-cache-adapter.cpp | Model adapter, graph kind, profile, and tweak defaults | This is the central allow-list: only registered models may use the hot graph.
src/moe-hot-cache/llama-moe-hot-cache.h | Data structures for cache, layers, worklists, and update stats | This is the shared vocabulary of the feature.
src/moe-hot-cache/llama-moe-hot-cache-graph.cpp | Hot/cold graph, worklist nodes, and merge paths | Turns the feature design into a GGML graph that can compute.
src/moe-hot-cache/llama-moe-hot-cache-builder.cpp | Create cache tensors and copy expert slices | Translates the plan into real VRAM buffers.
src/moe-hot-cache/llama-moe-hot-cache-worklist.cpp | Build hot/cold slot lists from router results | Creates the per-layer list consumed by the hot and cold lanes.
src/moe-hot-cache/llama-moe-hot-cache-perf.cpp | Perf mode, graph callbacks, and scheduler metric collection | Coordinates when counters are active and connects GGML execution back to perf state.
src/moe-hot-cache/llama-moe-hot-cache-perf-state.cpp | Thread-safe state for performance counters | Collects hits, hot/cold slots, timings, and fallback counters.
src/moe-hot-cache/llama-moe-hot-cache-perf-json.cpp | JSON serialization for /moe-layer-perf | Turns the current perf snapshot into the JSON used by the UI, learn files, manual applies, and automatic updates.
src/moe-hot-cache/llama-moe-hot-cache-perf-nodes.cpp | GGML node classification | Maps tensor names to routing, worklist, branch, matmul, gather/scatter, merge, and update timing buckets.
src/moe-hot-cache/llama-moe-hot-cache-perf-reader.cpp | Backend tensor readback for perf counters | Reads Top-K and worklist tensors after execution so experts and hot/cold slots can be counted.
src/moe-hot-cache/llama-moe-hot-cache-weighting.cpp | Generic expert weighting and layer curves | Turns hits and layer pressure into a cache ranking.
src/moe-hot-cache/llama-moe-hot-cache-planner.cpp | Expert sizes and budget selection | Decides which experts fit into the given MiB budget.
src/moe-hot-cache/llama-moe-hot-cache-budget.cpp | GPU selection and automatic VRAM budget | When -1 is used, this computes how much VRAM remains after reserves.
src/moe-hot-cache/llama-moe-hot-cache-updater.cpp | Automatic and manual cache replacement | Decides which hot experts are evicted and which experts move in without changing cache tensor shapes.
src/models/qwen35moe-hot-cache.cpp | Qwen compatibility wrapper for weighting | Old Qwen API names remain stable while the logic lives in the hot-cache folder.
src/models/gemma4-hot-cache.cpp | Gemma4 weighting wrapper | Uses generic weighting while allowing the Gemma-specific layer-curve environment override.
src/models/qwen35moe.cpp, src/models/gemma4.cpp, src/models/qwen3next.cpp | Small model hooks | These files should only check whether the adapter allows the matching graph kind.
ggml/src/ggml-backend-moe-hot-cache.inc | Scheduler extension for parallel hot/cold regions | Connects the parallel execution path to GGML.
ggml/include/ggml-backend-moe-hot-cache.h | Public scheduler metric structures | Defines the parallel-region perf data read by the llama-side perf collector.
tools/server/server-context.cpp | Perf file output and runtime cache updates | Marks automatic updates pending, logs hit rate, writes perf output, queues manual applies until slots are idle, and runs slot replacement.
tools/server/server.cpp, tools/server/server-models.cpp | /moe-layer-perf and /moe-hot-cache routing | Expose the live perf JSON, runtime perf-mode switch, and manual hot-cache apply endpoint through HTTP.
common/common.h, include/llama.h | CLI and public parameter storage | Carry hot-cache path, max MiB, auto reserve, weighting, layer curve, and perf output settings into model initialization.
tests/test-moe-hot-cache-*.cpp | Specialized unit tests per component | New models should at least get adapter tests, plus targeted tests for special logic. The aggregate target is test-moe-hot-cache.