From slow CPU-bound MoE decode to a gated hot/cold graph
The work started with one practical problem: large MoE models such as Qwen3.5/Qwen3.6 could run on a small GPU only when most experts stayed in RAM. The GPU was not the only bottleneck. Decode waited on CPU expert work, routing overhead, merge overhead, and scheduler synchronization. The final direction was therefore not "put everything on the GPU", but "copy the right experts into VRAM, keep the original experts on CPU, split the work, and make the split cheap enough to win".
Performance Arc
These are development observations, not a clean benchmark suite. They show the shape of the progress.
Simple prompt before the hot/cold parallel path became useful.
The graph worked, but the selected experts were wrong for the workload.
The Qwen3.6 path became faster than the original router setup.
Router-mode reference for the Snake coding prompt at 100k context.
The hot cache needed roughly this real hit rate to beat the baseline.
That made overhead reduction as important as selecting more hot experts.
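A simplified cost model helps show why overhead mattered as much as hit rate. The sketch below is an illustration only: it assumes the two lanes overlap perfectly, that the cold lane is the longer one, and that all split costs (routing, worklist, merge, sync) collapse into one fixed number. The function names and every number are hypothetical, not measurements from these runs.

```python
def moe_call_time(hit_rate, slots, gpu_us, cpu_us, overhead_us):
    """Time for one MoE call when hot and cold lanes run in parallel."""
    hot_lane = hit_rate * slots * gpu_us            # expert slots served from VRAM
    cold_lane = (1.0 - hit_rate) * slots * cpu_us   # expert slots served by the CPU
    # The join waits for the slower lane, then the fixed split cost is added.
    return max(hot_lane, cold_lane) + overhead_us


def break_even_hit_rate(slots, cpu_us, overhead_us):
    """Hit rate above which the split beats an all-CPU baseline.

    Derived from (1 - h) * slots * cpu_us + overhead_us < slots * cpu_us,
    which only holds while the cold lane is the longer lane.
    """
    return overhead_us / (slots * cpu_us)


# Hypothetical numbers: 8 expert slots, 100 us per CPU expert, 400 us overhead.
baseline_us = 8 * 100.0
needed = break_even_hit_rate(slots=8, cpu_us=100.0, overhead_us=400.0)
at_break_even = moe_call_time(needed, 8, 10.0, 100.0, 400.0)
```

In this model, halving the overhead halves the hit rate needed to break even, which is the same trade-off the observations above point to.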
Timeline
The path was iterative: first correctness, then visibility, then targeted overhead removal.
CPU/RAM experts made decode CPU-bound
The target hardware could not keep all useful MoE expert tensors in VRAM. The useful split was GPU for hot expert copies and CPU for the unavoidable cold misses.
Selected expert slices were copied into VRAM
The original expert tensors stayed in the model. The hot cache became an additional VRAM copy, not a replacement. This kept cold fallback and dynamic updates possible.
Top-k expert slots were split into hot and cold work
The graph needed a compact worklist with IDs, source slots, token IDs, weights, and counts. Without this split, the scheduler could not run the two lanes independently.
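As a rough illustration of the worklist described above, the following sketch partitions each token's top-k slots into a hot lane and a cold lane. It is written in plain Python for readability and is not the tensor layout used in the graph; the names Lane and split_topk are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    expert_ids: list = field(default_factory=list)    # which expert to run
    source_slots: list = field(default_factory=list)  # position in the token's top-k list
    token_ids: list = field(default_factory=list)     # which token the slot belongs to
    weights: list = field(default_factory=list)       # router weight used at merge time

    @property
    def count(self):
        return len(self.expert_ids)


def split_topk(topk_ids, topk_weights, hot_experts):
    """Assign every (token, slot) pair to the hot or the cold lane."""
    hot, cold = Lane(), Lane()
    for token, (ids, weights) in enumerate(zip(topk_ids, topk_weights)):
        for slot, (expert, weight) in enumerate(zip(ids, weights)):
            lane = hot if expert in hot_experts else cold
            lane.expert_ids.append(expert)
            lane.source_slots.append(slot)
            lane.token_ids.append(token)
            lane.weights.append(weight)
    return hot, cold


# Two tokens, top-2 routing, expert 7 is the only hot expert in this layer.
hot, cold = split_topk([[3, 7], [7, 1]], [[0.6, 0.4], [0.7, 0.3]], {7})
```

Keeping the source slot and token ID with each entry is what lets the join put every branch result back in the right place without consulting the router again.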
Hot branch, cold branch, join became a strict region
The scheduler only parallelizes the graph when node order and backend placement match the expected hot/cold/join shape. Bad regions fall back instead of silently producing wrong output.
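The strict-region check can be pictured as a shape match that returns a reason on failure. This is a heavily reduced sketch: a real region contains many nodes per branch, and the role names, backend names, and functions here are assumptions made for illustration.

```python
# Expected region shape as (role, backend) pairs in graph order (hypothetical labels).
EXPECTED_SHAPE = [("hot", "gpu"), ("cold", "cpu"), ("join", "gpu")]


def validate_region(nodes):
    """Return (ok, reason) for a list of (role, backend) pairs in graph order."""
    if len(nodes) != len(EXPECTED_SHAPE):
        return False, f"expected {len(EXPECTED_SHAPE)} nodes, got {len(nodes)}"
    for index, (node, expected) in enumerate(zip(nodes, EXPECTED_SHAPE)):
        role, backend = node
        if role != expected[0]:
            return False, f"node {index}: expected role {expected[0]}, got {role}"
        if backend != expected[1]:
            return False, f"node {index}: {role} placed on {backend}, expected {expected[1]}"
    return True, "ok"


def run_region(nodes):
    """Parallelize only a proven region; otherwise report why it fell back."""
    ok, reason = validate_region(nodes)
    return "parallel" if ok else f"fallback: {reason}"
```

For example, a region whose cold branch landed on the GPU is rejected with a reason string rather than being run in parallel.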
Perf JSON changed from visualization to optimization data
Useful tuning required hit rates, branch timings, overlap, join wait, fallback reasons, and expert lists. Full counters became optional because they also cost performance.
Flat weighting and conservative auto-sizing became safer defaults
Even layer coverage proved more stable than over-favoring a few layers. Auto sizing needed a reserve to avoid CUDA OOM during warmup and transient compute buffers.
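A minimal sketch of both defaults, assuming per-layer hit counts from the perf data and a fixed size per expert copy. The function names, the input layout, and the sizes in the example are hypothetical; only the idea (subtract a reserve first, spread slots evenly across layers, then pick the most-hit experts inside each layer) comes from the notes.

```python
def auto_budget_mib(free_vram_mib, reserve_mib):
    """VRAM left for the hot cache after holding back a safety reserve."""
    return max(0, free_vram_mib - reserve_mib)


def flat_plan(hits_per_layer, budget_mib, expert_mib):
    """Spread slots evenly across layers, then pick each layer's most-hit experts.

    hits_per_layer: {layer_index: {expert_id: hit_count}}
    """
    layers = sorted(hits_per_layer)
    if not layers:
        return {}
    total_slots = int(budget_mib // expert_mib)
    per_layer, extra = divmod(total_slots, len(layers))
    plan = {}
    for position, layer in enumerate(layers):
        # Leftover slots go to the first layers, one each.
        slots = per_layer + (1 if position < extra else 0)
        ranked = sorted(hits_per_layer[layer].items(), key=lambda kv: (-kv[1], kv[0]))
        plan[layer] = [expert for expert, _ in ranked[:slots]]
    return plan


# Hypothetical numbers: two layers, 10 MiB per expert copy.
hits = {0: {1: 10, 2: 50, 3: 5}, 1: {4: 1, 5: 9, 6: 3}}
budget = auto_budget_mib(free_vram_mib=1064, reserve_mib=1024)
plan = flat_plan(hits, budget, expert_mib=10)
```

Sorting ties by expert ID keeps the plan deterministic for identical perf data.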
The feature was refactored into a hot-cache package
Parser, weighting, planner, budget, builder, worklist, adapter, perf, and updater moved behind focused components to reduce rebase conflicts and model side effects.
Architecture Lessons
The stable design rules that survived the experiments.
Hot experts are extra VRAM copies. RAM use remains because cold experts and dynamic replacement still need the original tensors.
--cpu-moe is part of the model
The intended graph expects cold experts on the CPU/RAM path. Random offload layouts do not provide the same controlled hot/cold split.
For one-token decode, routing, worklist creation, views, and merge work can dominate enough that CPU decode routing and direct merge shortcuts matter.
The useful target was not perfect cache coverage. With a practical ceiling near 70%, the cold lane had to become cheaper.
Prompt processing can push many tokens through the path at once. PP reduce-merge helps by reducing branch outputs before the final merge.
Each supported model registers its graph kind and profile. Qwen, Gemma, and Qwen3Next can tune behavior independently.
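One way to picture the per-model registration is a small allow-list keyed by architecture name. This is a sketch under assumptions: the profile fields, architecture strings, and function names are invented for illustration and do not mirror the actual adapter interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterProfile:
    graph_kind: str            # which hot/cold graph shape the model uses
    cpu_decode_routing: bool   # tiny-batch decode shortcut
    pp_reduce_merge: bool      # reduce branch outputs before the final merge


_REGISTRY = {}


def register(arch, profile):
    """Opt a model architecture in explicitly; duplicates are an error."""
    if arch in _REGISTRY:
        raise ValueError(f"adapter already registered for {arch}")
    _REGISTRY[arch] = profile


def lookup(arch):
    """Return the profile, or None so unlisted models keep the normal path."""
    return _REGISTRY.get(arch)


register("example-moe-a", AdapterProfile("hot_cold", True, True))
register("example-moe-b", AdapterProfile("hot_cold", False, True))
```

Because lookup returns None for anything unregistered, a new model cannot pick up hot-cache behavior by accident.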
Dead Ends And Why They Mattered
- Unfavorable hot lists: the first successful graph could be slower than baseline when the cache reflected the wrong workload.
- Full layer overrides: putting whole weak layers on the GPU consumed too much VRAM and still left hot/cold overhead unless a dedicated bypass existed.
- MTP on 12 GB VRAM: high draft acceptance did not compensate for the extra MTP context and compute memory. One observed MTP run reached about 25.33 t/s, below strong non-MTP hot-cache runs.
- Random fallback for missing MTP layer data: removed because it made behavior non-deterministic and hid bad profiling data.
- Quadro M1200 warm lane: a second, slower GPU added synchronization and transfer pressure back to CUDA0, so the CPU cold lane remained better on this hardware.
What Stayed
- Gated activation: without --moe-hot-cache and a non-zero budget, normal llama.cpp behavior remains the default.
- Adapter allow-list: each model must explicitly opt in to a graph kind and profile.
- Flat weighting default: spread the budget across layers first, then optimize within each layer.
- Runtime perf modes: full diagnostics when tuning, update-only counters for adaptive cache changes, off for raw throughput tests.
- Manual runtime apply: /moe-hot-cache can apply a new expert JSON while the server is running; it waits for idle slots and changes only cache deltas.
- Auto-size reserve: keep enough VRAM for KV, warmup, CUDA compute buffers, and transient allocations.
MTP Lessons
The MTP experiment was useful, but it did not become the recommended path for the local 12 GB setup.
One run reached about 25.33 t/s with 94.1% draft acceptance, while non-MTP hot-cache runs were faster on the same class of workload.
draft-mtp creates a draft context against the target model after the target context exists. A failed run needed another 800 MiB CUDA compute buffer and hit OOM.
For Qwen3.6-35B-A3B the MTP layer was layer 40: forty normal transformer layers plus one NextN/MTP layer.
Random fallback for missing MTP perf data was removed because it hid bad profiles and made benchmark runs non-reproducible.
| Question | Answer Learned From The Experiment |
|---|---|
| Why did MTP fail without the hot cache? | The coarse tensor override placed several full MoE layers plus the full MTP layer on CUDA0. Normal inference could fit, but the later MTP context and graph reserve needed additional memory and failed. |
| Why was the MTP layer initially cold? | The startup perf JSON had no layer-40 data. A fully inactive hot-cache layer has no hot slots, so the dynamic update path cannot replace entries inside it later. |
| What would a deterministic MTP retry need? | An explicit MTP priority ratio, detected from real NextN/MTP layers, not a broad random fallback. With MTP active, a conservative reserve around 1600 MiB is safer than the normal 1024 MiB. |
| Why was the local MTP code not kept as the default? | It increased VRAM pressure, reduced room for the hot cache, and did not beat the best non-MTP hot-cache throughput on the RTX 2060 setup. |
| What is the rebase lesson? | Do not put broad MTP workarounds in speculative or model core code. If MTP hot-cache is revisited, keep the model hook narrow and place selection, budget, and graph behavior inside the hot-cache package. |
Second GPU Warm-Lane Lessons
The Quadro M1200 experiment tested whether a slow secondary GPU could replace part of the CPU cold lane.
Hot experts on CUDA0, warm experts on CUDA1, cold experts on CPU, and final join/merge back on CUDA0. The warm cache used the same /moe-layer-perf data as the hot cache. No separate expert list was useful.
| Run | Slot Split | Timing | Interpretation |
|---|---|---|---|
| Broad warm lane | 84.73% hot, 7.11% warm, 8.16% cold | 1444.87 us total MoE/call, 557.67 us join wait | CUDA1 was slower than CPU cold on average once transfer, bridge, and synchronization were included. |
| Timing-gated warm lane | 83.70% hot, 2.61% warm, 13.69% cold | 1192.73 us total MoE/call, 472.26 us join wait | Removing the worst warm layers helped, but the remaining warm-enabled layers became slower than CPU cold after the distribution changed. |
| Second-stage no-warm result | 83.49% hot, 0.00% warm, 16.51% cold | 1085.95 us total MoE/call, 423.56 us join wait | The best tested state for Gemma4 on this hardware pair was CUDA0 hot cache plus CPU cold branch, with CUDA1 warm disabled. |
The final merge happened on CUDA0. CUDA1 work had to cross a bridge path and could delay the same join that the CPU cold lane already used.
Enable warm only for layers where warm is clearly faster than cold, then rerun and re-evaluate. If warm loses after redistribution, disable it.
For RTX 2060 plus Quadro M1200, the sync and transfer cost outweighed the saved CPU expert work.
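The decision rule above can be expressed as a per-layer timing gate that is re-applied after every rerun. The sketch assumes per-layer warm and cold timings are available from the perf data; the function name, the input layout, and the margin value are hypothetical.

```python
def gate_warm_layers(layer_timings_us, margin=0.9):
    """Keep warm only where it is clearly faster than the CPU cold lane.

    layer_timings_us: {layer_index: (warm_us, cold_us)} from the latest run.
    margin: warm must be below margin * cold to count as "clearly faster".
    """
    return sorted(
        layer
        for layer, (warm_us, cold_us) in layer_timings_us.items()
        if warm_us < margin * cold_us
    )


# Hypothetical timings: only layer 3 passes the gate.
enabled = gate_warm_layers({3: (80.0, 100.0), 5: (95.0, 100.0), 9: (120.0, 100.0)})
```

Since disabling a layer shifts work to the other lanes, the gate has to be fed fresh timings from the next run; an empty result means warm should be switched off entirely, which is where this hardware pair ended up.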
Key Decisions
These decisions are the practical design contract for future work.
If the scheduler cannot prove hot, cold, and join order, it falls back and reports a reason.
Dynamic update exchanges expert contents and maps inside existing slots. It does not reallocate the cache mid-run.
Most logic belongs in src/moe-hot-cache; model files should contain narrow guarded hooks.
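The dynamic-update decision (exchange contents inside existing slots, never reallocate) can be sketched as a bounded in-place swap. The slot map as a plain list, the function name, and the max_swaps parameter are illustrative assumptions; copying the expert tensor data is left out.

```python
def apply_update(slot_to_expert, wanted, max_swaps):
    """Swap experts inside existing cache slots, at most max_swaps per update.

    slot_to_expert: list where the index is the slot and the value is the expert ID.
    wanted: expert IDs in priority order. The list is edited in place and its
    length never changes, so cache capacity stays fixed.
    """
    resident = set(slot_to_expert)
    wanted_set = set(wanted)
    incoming = [expert for expert in wanted if expert not in resident]
    evictable = [slot for slot, expert in enumerate(slot_to_expert)
                 if expert not in wanted_set]
    swaps = []
    for slot, expert in list(zip(evictable, incoming))[:max_swaps]:
        swaps.append((slot, slot_to_expert[slot], expert))  # (slot, old, new)
        slot_to_expert[slot] = expert
    return swaps


slots = [1, 2, 3, 4]
first = apply_update(slots, wanted=[2, 5, 6, 4], max_swaps=1)
second = apply_update(slots, wanted=[2, 5, 6, 4], max_swaps=5)
```

Only the delta is touched: experts that are already resident and still wanted keep their slots, and the swap limit keeps a single update from stalling the server.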
If We Had To Rebuild It
The shortest implementation plan that preserves the useful lessons.
| Step | Build This | Why |
|---|---|---|
| 1 | Perf JSON with expert hits, branch slots, timings, and fallback reasons. | You cannot optimize a static expert cache without workload-specific data. |
| 2 | Parser, weighting, planner, budget, and builder as isolated components. | These pieces are testable without running a model and reduce upstream conflicts. |
| 3 | Worklist tensor with compact hot and cold prefixes. | The graph and scheduler need explicit work boundaries and counts. |
| 4 | Adapter profiles per model. | Qwen, Gemma, and Qwen3Next have different routing, activation, merge, and tiny-batch behavior. |
| 5 | Hot/cold graph with strict scheduler validation. | The speedup comes from overlap, but correctness depends on graph shape. |
| 6 | Dynamic update after requests, bounded by update rate. | It adapts to modest workload drift without changing cache capacity or restarting. |
| 7 | Separate PP optimization from decode optimization. | Large prompt batches and one-token decode have different bottlenecks. |
Migrated Notes
The old Markdown notes were folded into the HTML documentation so this folder has one canonical documentation format.
| Former Note | Where The Information Lives Now |
|---|---|
| Parallelization history | This Journey page covers the performance arc, implementation timeline, hot/cold worklist, scheduler lessons, bottlenecks, and next levers. |
| Developer guide | The Architecture Explainer covers data models, class ownership, file map, runtime flow, model adapters, and the LLM-agent runbook. The Usage Guide covers commands, arguments, weighting modes, and environment switches. |
| MTP learnings | The MTP Lessons section above preserves the memory, layer-selection, deterministic-priority, performance, and rebase-isolation conclusions. |
| Runtime switches | The Usage Guide's argument and advanced environment tables are the canonical switch reference. |
| Warm-lane analysis | The Second GPU Warm-Lane Lessons section above preserves the setup, timings, interpretation, and practical decision rule. |