From slow CPU-bound MoE decode to a gated hot/cold graph

The work started with one practical problem: large MoE models such as Qwen3.5/Qwen3.6 could run on a small GPU only when most experts stayed in RAM. The GPU was not the only bottleneck: decode also waited on CPU expert work, routing overhead, merge overhead, and scheduler synchronization. The final direction was therefore not "put everything on the GPU", but "copy the right experts into VRAM, keep the original experts on CPU, split the work, and make the split cheap enough to win".

Performance Arc

These are development observations, not a clean benchmark suite. They show the shape of the progress.

Early non-parallel reference: 19.7 t/s

Simple prompt before the hot/cold parallel path became useful.

First working but poorly matched hot list: 13.96 t/s

The graph worked, but the selected experts were wrong for the workload.

After routing, scheduler, merge, and cold-path work: 28.09 t/s

The Qwen3.6 path became faster than the original router setup.

Qwen3.6 standard llama router baseline: 22.2 t/s

Router-mode reference for the Snake coding prompt at 100k context.

Break-even hit rate: 45.89%

The hot cache needed roughly this real hit rate to beat the baseline.

Observed practical hit-rate ceiling: ~70%

That made overhead reduction as important as selecting more hot experts.
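
One way to read the break-even and ceiling figures together is a simple serial cost model. The sketch below is an illustration only, not the method behind the measured 45.89%: the function name and all microsecond values are hypothetical, and the model ignores the overlap between the hot and cold lanes.

```python
def break_even_hit_rate(t_hot_us: float, t_cold_us: float, overhead_us: float) -> float:
    """Hit rate h where overhead + h*t_hot + (1-h)*t_cold equals the all-cold cost t_cold."""
    saving_per_hit = t_cold_us - t_hot_us
    if saving_per_hit <= 0:
        raise ValueError("a hot hit must be cheaper than a cold miss")
    # Solving overhead + h*t_hot + (1-h)*t_cold = t_cold for h.
    return overhead_us / saving_per_hit


# Hypothetical per-slot costs in microseconds, chosen only for illustration.
before = break_even_hit_rate(t_hot_us=200, t_cold_us=1000, overhead_us=400)
after = break_even_hit_rate(t_hot_us=200, t_cold_us=1000, overhead_us=200)
print(before, after)  # 0.5 0.25
```

In this model, halving the split overhead halves the break-even hit rate. With the achievable hit rate capped, lowering the break-even point is the remaining way to widen the margin.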

Timeline

The path was iterative: first correctness, then visibility, then targeted overhead removal.

1. Baseline

CPU/RAM experts made decode CPU-bound

The target hardware could not keep all useful MoE expert tensors in VRAM. The useful split was GPU for hot expert copies and CPU for the unavoidable cold misses.

2. Hot cache

Selected expert slices were copied into VRAM

The original expert tensors stayed in the model. The hot cache became an additional VRAM copy, not a replacement. This kept cold fallback and dynamic updates possible.

3. Worklist

Top-k expert slots were split into hot and cold work

The graph needed a compact worklist with IDs, source slots, token IDs, weights, and counts. Without this split, the scheduler could not run the two lanes independently.
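
A minimal sketch of such a split, assuming the router returns per-token top-k expert IDs and weights. The `Lane` class, the `split_worklist` function, and the reading of "source slot" as the top-k position of an entry are assumptions for illustration, not the actual tensor layout.

```python
from dataclasses import dataclass, field


@dataclass
class Lane:
    """One lane of the worklist as parallel arrays, one entry per routed slot."""
    expert_ids: list = field(default_factory=list)
    source_slots: list = field(default_factory=list)  # assumed: top-k position of the entry
    token_ids: list = field(default_factory=list)
    weights: list = field(default_factory=list)

    @property
    def count(self) -> int:
        return len(self.expert_ids)


def split_worklist(topk_ids, topk_weights, hot_experts):
    """Send each (token, top-k slot) entry to the hot or the cold lane."""
    hot, cold = Lane(), Lane()
    for token, (ids, ws) in enumerate(zip(topk_ids, topk_weights)):
        for slot, (expert, weight) in enumerate(zip(ids, ws)):
            lane = hot if expert in hot_experts else cold
            lane.expert_ids.append(expert)
            lane.source_slots.append(slot)
            lane.token_ids.append(token)
            lane.weights.append(weight)
    return hot, cold


# Two tokens, top-2 routing, expert 7 is the only cached expert.
hot, cold = split_worklist(
    topk_ids=[[3, 7], [7, 1]],
    topk_weights=[[0.6, 0.4], [0.7, 0.3]],
    hot_experts={7},
)
print(hot.count, cold.count)  # 2 2
```

Because every entry keeps its token ID, source slot, and weight, each lane can be computed on its own and the join can still put the weighted outputs back in the right place.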

4. Scheduler

Hot branch, cold branch, join became a strict region

The scheduler only parallelizes the graph when node order and backend placement match the expected hot/cold/join shape. Bad regions fall back instead of silently producing wrong output.
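
The strict-region check can be sketched as a validator that either accepts a region or returns a fallback reason. The `Node` type, the role names, the hot-then-cold node order, and the expected backend per role are assumptions for illustration; the real scheduler inspects graph nodes, not this simplified list.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    role: str     # "hot", "cold", or "join"
    backend: str  # "gpu" or "cpu"


ORDER = ["hot", "cold", "join"]
EXPECTED_BACKEND = {"hot": "gpu", "cold": "cpu", "join": "gpu"}  # assumed placement


def fallback_reason(nodes) -> Optional[str]:
    """Return None when the region has the hot -> cold -> join shape, else a reason."""
    stage = 0
    seen = set()
    for i, node in enumerate(nodes):
        if node.role not in EXPECTED_BACKEND:
            return f"node {i}: unknown role {node.role!r}"
        if node.backend != EXPECTED_BACKEND[node.role]:
            return f"node {i}: {node.role} node placed on {node.backend}"
        position = ORDER.index(node.role)
        if position < stage:
            return f"node {i}: {node.role} node after {ORDER[stage]}"
        stage = position
        seen.add(node.role)
    missing = [role for role in ORDER if role not in seen]
    if missing:
        return "missing " + ", ".join(missing)
    return None


good = [Node("hot", "gpu"), Node("cold", "cpu"), Node("join", "gpu")]
bad = [Node("hot", "gpu"), Node("join", "gpu"), Node("cold", "cpu")]
print(fallback_reason(good))  # None
print(fallback_reason(bad))   # node 2: cold node after join
```

Returning a reason string instead of a bare boolean keeps the fallback visible in perf output, which matches the "fallback reasons" counter mentioned under Measurements.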

5. Measurements

Perf JSON changed from visualization to optimization data

Useful tuning required hit rates, branch timings, overlap, join wait, fallback reasons, and expert lists. Full counters became optional because they also cost performance.

6. Defaults

Flat weighting and conservative auto-sizing became safer defaults

Even layer coverage proved more stable than over-favoring a few layers. Auto-sizing needed a reserve to avoid CUDA OOM during warmup and from transient compute buffers.
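
The two defaults can be sketched together: subtract a reserve from the free VRAM, then spread the remaining expert slots evenly over the layers. The function names, the free-VRAM figure, and the 12 MiB expert size are hypothetical; the 1024 MiB reserve is the normal value mentioned in the MTP notes.

```python
def auto_budget_mib(free_vram_mib: int, reserve_mib: int = 1024) -> int:
    """VRAM left for the hot cache after keeping a safety reserve."""
    return max(0, free_vram_mib - reserve_mib)


def flat_slots_per_layer(budget_mib: int, n_layers: int, expert_mib: int) -> list:
    """Spread the affordable expert slots evenly; early layers take the remainder."""
    total_slots = budget_mib // expert_mib
    base, extra = divmod(total_slots, n_layers)
    return [base + (1 if layer < extra else 0) for layer in range(n_layers)]


# Hypothetical numbers: 4096 MiB free, 40 MoE layers, 12 MiB per expert slice.
budget = auto_budget_mib(free_vram_mib=4096)
slots = flat_slots_per_layer(budget, n_layers=40, expert_mib=12)
print(budget, slots[0], slots[-1], sum(slots))  # 3072 7 6 256
```

Which experts fill the slots of each layer is a separate step driven by the perf data; this sketch only covers the budget and the even spread.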

7. Separation

The feature was refactored into a hot-cache package

Parser, weighting, planner, budget, builder, worklist, adapter, perf, and updater moved behind focused components to reduce rebase conflicts and model side effects.

Architecture Lessons

The stable design rules that survived the experiments.

Memory model: The cache is additive

Hot experts are extra VRAM copies. RAM use does not drop, because cold experts and dynamic replacement still need the original tensors.

Placement: --cpu-moe is part of the model

The intended graph expects cold experts on the CPU/RAM path. Random offload layouts do not provide the same controlled hot/cold split.

Routing: Decode overhead is not small

For one-token decode, routing, worklist creation, views, and merge work take a large enough share of each step that CPU decode routing and direct merge shortcuts make a real difference.

Hit rate: A 100% hit rate is not realistic

The useful target was not perfect cache coverage. With a practical ceiling near 70%, the cold lane had to become cheaper.

Prompt processing: PP is a different workload

Prompt processing can push many tokens through the path at once. PP reduce-merge helps by reducing branch outputs before the final merge.

Model support: Adapters prevent accidental side effects

Each supported model registers its graph kind and profile. Qwen, Gemma, and Qwen3Next can tune behavior independently.
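
A minimal sketch of an allow-list registry with gated activation. All names here (`AdapterProfile`, `register_adapter`, `hot_cache_profile`, the profile fields, and the `"hot_cold_moe"` graph kind) are hypothetical and only illustrate the opt-in idea.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AdapterProfile:
    graph_kind: str
    cpu_decode_routing: bool  # hypothetical tuning knob
    pp_reduce_merge: bool     # hypothetical tuning knob


_ADAPTERS = {}


def register_adapter(arch: str, profile: AdapterProfile) -> None:
    """Explicit opt-in: a model architecture registers its own profile once."""
    if arch in _ADAPTERS:
        raise ValueError(f"adapter for {arch!r} already registered")
    _ADAPTERS[arch] = profile


def hot_cache_profile(arch: str, flag_set: bool, budget_mib: int) -> Optional[AdapterProfile]:
    """Return a profile only when the feature is requested and the model opted in."""
    if not flag_set or budget_mib <= 0:
        return None  # feature not requested: normal behavior
    return _ADAPTERS.get(arch)  # None for models that never registered


register_adapter("qwen", AdapterProfile("hot_cold_moe", True, True))
print(hot_cache_profile("qwen", flag_set=True, budget_mib=2048) is not None)   # True
print(hot_cache_profile("llama", flag_set=True, budget_mib=2048) is not None)  # False
```

A model that never registers gets `None` and therefore the unmodified graph, so adding support for one model cannot change another.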

[Diagram: Perf JSON (expert usage, timings) -> Planner (weighting, VRAM budget) -> Worklist (hot slots, cold slots) -> Hot lane (GPU cache) and Cold lane (CPU experts) -> Join (merge)]

Dead Ends And Why They Mattered

  • Unfavorable hot lists: the first successful graph could be slower than baseline when the cache reflected the wrong workload.
  • Full layer overrides: putting whole weak layers on the GPU consumed too much VRAM and still left hot/cold overhead unless a dedicated bypass existed.
  • MTP on 12 GB VRAM: high draft acceptance did not compensate for the extra MTP context and compute memory. One observed MTP run reached about 25.33 t/s, below strong non-MTP hot-cache runs.
  • Random fallback for missing MTP layer data: removed because it made behavior non-deterministic and hid bad profiling data.
  • Quadro M1200 warm lane: a second, slower GPU added synchronization and transfer pressure back to CUDA0, so the CPU cold lane remained better on this hardware.

What Stayed

  • Gated activation: unless --moe-hot-cache is set with a non-zero budget, normal llama.cpp behavior is unchanged.
  • Adapter allow-list: each model must explicitly opt in to a graph kind and profile.
  • Flat weighting default: spread the budget across layers first, then optimize within each layer.
  • Runtime perf modes: full diagnostics when tuning, update-only counters for adaptive cache changes, off for raw throughput tests.
  • Manual runtime apply: /moe-hot-cache applies a new expert JSON while the server is running. It waits for idle slots and changes only the cache deltas.
  • Auto-size reserve: keep enough VRAM for KV, warmup, CUDA compute buffers, and transient allocations.

MTP Lessons

The MTP experiment was useful, but it did not become the recommended path for the local 12 GB setup.

Observed result: High acceptance was not enough

One run reached about 25.33 t/s with 94.1% draft acceptance, while non-MTP hot-cache runs were faster on the same class of workload.

Memory pressure: MTP creates another context

draft-mtp creates a draft context against the target model after the target context exists. A failed run needed another 800 MiB CUDA compute buffer and hit OOM.

Layer behavior: The MTP block is a real extra layer

For Qwen3.6-35B-A3B the MTP layer was layer 40: forty normal transformer layers plus one NextN/MTP layer.

Selection rule: Do not randomly fill missing layers

Random fallback for missing MTP perf data was removed because it hid bad profiles and made benchmark runs non-reproducible.

| Question | Answer Learned From The Experiment |
| --- | --- |
| Why did MTP fail without the hot cache? | The coarse tensor override placed several full MoE layers plus the full MTP layer on CUDA0. Normal inference could fit, but the later MTP context and graph reserve needed additional memory and failed. |
| Why was the MTP layer initially cold? | The startup perf JSON had no layer-40 data. A fully inactive hot-cache layer has no hot slots, so the dynamic update path cannot replace entries inside it later. |
| What would a deterministic MTP retry need? | An explicit MTP priority ratio, detected from real NextN/MTP layers, not a broad random fallback. With MTP active, a conservative reserve around 1600 MiB is safer than the normal 1024 MiB. |
| Why was the local MTP code not kept as the default? | It increased VRAM pressure, reduced room for the hot cache, and did not beat the best non-MTP hot-cache throughput on the RTX 2060 setup. |
| What is the rebase lesson? | Do not put broad MTP workarounds in speculative or model core code. If MTP hot-cache is revisited, keep the model hook narrow and place selection, budget, and graph behavior inside the hot-cache package. |

Second GPU Warm-Lane Lessons

The Quadro M1200 experiment tested whether a slow secondary GPU could replace part of the CPU cold lane.

The tested data flow was: hot experts on CUDA0, warm experts on CUDA1, cold experts on CPU, and final join/merge back on CUDA0. The warm cache used the same /moe-layer-perf data as the hot cache. No separate expert list was useful.

| Run | Slot Split | Timing | Interpretation |
| --- | --- | --- | --- |
| Broad warm lane | 84.73% hot, 7.11% warm, 8.16% cold | 1444.87 us total MoE/call, 557.67 us join wait | CUDA1 was slower than CPU cold on average once transfer, bridge, and synchronization were included. |
| Timing-gated warm lane | 83.70% hot, 2.61% warm, 13.69% cold | 1192.73 us total MoE/call, 472.26 us join wait | Removing the worst warm layers helped, but the remaining warm-enabled layers became slower than CPU cold after the distribution changed. |
| Second-stage no-warm result | 83.49% hot, 0.00% warm, 16.51% cold | 1085.95 us total MoE/call, 423.56 us join wait | The best tested state for Gemma4 on this hardware pair was CUDA0 hot cache plus CPU cold branch, with CUDA1 warm disabled. |

Main caveat: A second GPU is not a free cold lane

The final merge happened on CUDA0. CUDA1 work had to cross a bridge path and could delay the same join that the CPU cold lane already used.

Decision rule: Validate twice

Enable warm only for layers where warm is clearly faster than cold, then rerun and re-evaluate. If warm loses after redistribution, disable it.
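
The two-pass rule can be sketched as a filter that is applied again to the rerun timings. The function name, the 10% margin used as the meaning of "clearly faster", and all timing values are hypothetical.

```python
def warm_enabled_layers(timings_us: dict, margin: float = 0.9) -> set:
    """Keep layers whose warm time is clearly below the cold time.

    timings_us maps layer -> (warm_us, cold_us); margin=0.9 demands a 10% win.
    """
    return {
        layer
        for layer, (warm_us, cold_us) in timings_us.items()
        if warm_us < margin * cold_us
    }


# First pass: only layer 0 is clearly faster on the warm lane.
first_run = {0: (300.0, 500.0), 1: (480.0, 500.0), 2: (700.0, 500.0)}
enabled = warm_enabled_layers(first_run)

# Second pass: rerun with only layer 0 warm; the redistributed load made it slower.
rerun = {0: (520.0, 500.0)}
still_enabled = enabled & warm_enabled_layers(rerun)
print(enabled, still_enabled)  # {0} set()
```

The second pass matters because disabling some warm layers shifts work between lanes, so timings from the first run no longer describe the layers that stayed enabled.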

Practical result: Keep the warm lane out by default

For RTX 2060 plus Quadro M1200, the sync and transfer cost outweighed the saved CPU expert work.
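
As a quick check on the three runs in the timing table above, the following snippet computes the join-wait share of total MoE time and the overall saving from disabling the warm lane. The timings are copied from the table; the variable names are illustrative.

```python
# (total MoE time per call, join wait), both in microseconds, from the table above.
runs = {
    "broad warm lane": (1444.87, 557.67),
    "timing-gated warm lane": (1192.73, 472.26),
    "no warm lane": (1085.95, 423.56),
}

join_share = {name: wait / total for name, (total, wait) in runs.items()}
saving = 1 - runs["no warm lane"][0] / runs["broad warm lane"][0]

for name, share in join_share.items():
    print(f"{name}: join wait is {share:.1%} of MoE time")
print(f"no warm lane vs broad warm lane: {saving:.1%} less MoE time per call")
```

Join wait stays at roughly 39% of MoE time in all three runs, while total MoE time per call drops by about a quarter from the broad warm lane to the no-warm configuration.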

Key Decisions

These decisions are the practical design contract for future work.

Correctness boundary: Fallback beats silent graph drift

If the scheduler cannot prove the hot, cold, and join order, it falls back and reports a reason.

Update boundary: Replacement keeps tensor shapes stable

Dynamic update exchanges expert contents and maps inside existing slots. It does not reallocate the cache mid-run.
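
A minimal sketch of an in-place update for one layer, assuming the cache keeps a slot-to-expert map. The function name, the map representation, and the `max_swaps` bound (standing in for the update rate) are hypothetical.

```python
def apply_update(slot_to_expert: list, wanted: set, max_swaps: int) -> list:
    """Swap experts inside existing slots; the number of slots never changes.

    Returns the performed swaps as (slot, old_expert, new_expert) tuples.
    """
    current = set(slot_to_expert)
    incoming = sorted(wanted - current)
    evictable = [slot for slot, expert in enumerate(slot_to_expert) if expert not in wanted]
    swaps = []
    for slot, new_expert in zip(evictable, incoming):
        if len(swaps) >= max_swaps:
            break  # bounded by the update rate
        swaps.append((slot, slot_to_expert[slot], new_expert))
        slot_to_expert[slot] = new_expert  # a real cache would also copy the weights here
    return swaps


slots = [1, 2, 3, 4]
swaps = apply_update(slots, wanted={2, 3, 5, 6}, max_swaps=1)
print(slots, swaps)  # [5, 2, 3, 4] [(0, 1, 5)]
```

Only the delta is touched: experts 2 and 3 stay where they are, one slot is exchanged, and the second wanted expert waits for a later update because of the swap bound.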

Rebase boundary: Core hooks stay small

Most logic belongs in src/moe-hot-cache; model files should contain narrow guarded hooks.

If We Had To Rebuild It

The shortest implementation plan that preserves the useful lessons.

| Step | Build This | Why |
| --- | --- | --- |
| 1 | Perf JSON with expert hits, branch slots, timings, and fallback reasons. | You cannot optimize a static expert cache without workload-specific data. |
| 2 | Parser, weighting, planner, budget, and builder as isolated components. | These pieces are testable without running a model and reduce upstream conflicts. |
| 3 | Worklist tensor with compact hot and cold prefixes. | The graph and scheduler need explicit work boundaries and counts. |
| 4 | Adapter profiles per model. | Qwen, Gemma, and Qwen3Next have different routing, activation, merge, and tiny-batch behavior. |
| 5 | Hot/cold graph with strict scheduler validation. | The speedup comes from overlap, but correctness depends on graph shape. |
| 6 | Dynamic update after requests, bounded by update rate. | It adapts to modest workload drift without changing cache capacity or restarting. |
| 7 | Separate PP optimization from decode optimization. | Large prompt batches and one-token decode have different bottlenecks. |

Migrated Notes

The old Markdown notes were folded into the HTML documentation so this folder has one canonical documentation format.

| Former Note | Where The Information Lives Now |
| --- | --- |
| Parallelization history | This Journey page covers the performance arc, implementation timeline, hot/cold worklist, scheduler lessons, bottlenecks, and next levers. |
| Developer guide | The Architecture Explainer covers data models, class ownership, file map, runtime flow, model adapters, and the LLM-agent runbook. The Usage Guide covers commands, arguments, weighting modes, and environment switches. |
| MTP learnings | The MTP Lessons section above preserves the memory, layer-selection, deterministic-priority, performance, and rebase-isolation conclusions. |
| Runtime switches | The Usage Guide's argument and advanced environment tables are the canonical switch reference. |
| Warm-lane analysis | The Second GPU Warm-Lane Lessons section above preserves the setup, timings, interpretation, and practical decision rule. |

The only Markdown note intentionally kept in this directory is the dense-model TG optimization note, because it is a separate future-work topic and not part of the current MoE hot-cache user/developer docs.