Token Video
A token moves through the router, hot lane, cold lane, merge, and output.
Hot Expert
Cold Expert
Just used
Router / Gate
The gate selects the top-k experts for the current token.
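A minimal sketch of the gating step, for illustration only. The function name `top_k_gate` and the choice to renormalize the softmax over just the selected logits are assumptions; the page does not say how the gate weights are computed.

```python
import math

def top_k_gate(logits, k):
    """Return (expert_index, weight) pairs for the k largest router logits.

    Assumption: weights are a softmax over the selected logits only.
    """
    # Indices of the k highest-scoring experts, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Shift by the largest selected logit for numerical stability.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical router scores for four experts, with k = 2.
selected = top_k_gate([0.1, 2.0, -1.0, 1.5], k=2)
```

Here experts 1 and 3 are selected, and their two weights sum to 1.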
Worklist
Hot 0
Cold 0
Selected expert slots become compact hot or cold jobs.
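The split into hot and cold jobs could look roughly like the sketch below. The names `build_worklist` and `hot_cache`, and the representation of a job as an `(expert_id, weight)` pair, are hypothetical and chosen only for illustration.

```python
def build_worklist(selected, hot_cache):
    """Partition selected experts into hot (cached) and cold (uncached) jobs.

    selected:  list of (expert_id, weight) pairs from the gate
    hot_cache: set of expert ids assumed to be resident in VRAM
    """
    hot_jobs, cold_jobs = [], []
    for expert_id, weight in selected:
        if expert_id in hot_cache:
            hot_jobs.append((expert_id, weight))   # cache hit: GPU lane
        else:
            cold_jobs.append((expert_id, weight))  # cache miss: CPU lane
    return hot_jobs, cold_jobs

# Hypothetical example: expert 1 is cached, expert 3 is not.
hot_jobs, cold_jobs = build_worklist([(1, 0.6), (3, 0.4)], hot_cache={1})
```

The lengths of the two lists correspond to the Hot and Cold counters in the Worklist panel.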
CUDA0 Hot Lane (from cache)
Hot-cache hits run directly from VRAM on the GPU.
CPU Cold Lane (original experts)
Non-cached experts run through the normal CPU/RAM path.
Merge
Both partial outputs are added into the normal MoE layer result.
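The merge can be sketched in plain Python as follows. This assumes each lane produces a gate-weighted sum of its experts' outputs and that the merge is an element-wise addition; the toy experts and the names `run_lane` and `merge` are hypothetical.

```python
def run_lane(jobs, experts, token):
    """Gate-weighted sum of expert outputs for one lane's jobs."""
    out = [0.0] * len(token)
    for expert_id, weight in jobs:
        y = experts[expert_id](token)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

def merge(hot_out, cold_out):
    """Element-wise sum of the two partial outputs."""
    if len(hot_out) != len(cold_out):
        raise ValueError("partial outputs must have the same length")
    return [h + c for h, c in zip(hot_out, cold_out)]

# Toy stand-ins for real expert networks.
experts = {
    1: lambda x: [2.0 * v for v in x],
    3: lambda x: [v + 1.0 for v in x],
}
token = [1.0, 2.0]
hot_out = run_lane([(1, 0.5)], experts, token)    # GPU lane (simulated)
cold_out = run_lane([(3, 0.5)], experts, token)   # CPU lane (simulated)
merged = merge(hot_out, cold_out)
```

Because the weighted sum is linear, the merged result matches what a single lane would produce if it ran every selected expert itself.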
Output
After the merge, the next layer receives a single token representation again.
Layer Heatmap
Each square represents an expert group. Red squares are cached (hot), blue squares remain cold, and yellow squares were active for the previous token.
Live Hit Rate
The line shows the simulated hot-cache hit rate. Below the break-even point, parallelization overhead can consume the gain.
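One way to track such a hit rate is a rolling window over recent tokens, sketched below. The class name `HitRateTracker` and the window size are assumptions, and the page does not state how its rate is averaged. The break-even value of 45.9% is taken from the Break-even readout on this page.

```python
from collections import deque

BREAK_EVEN = 0.459  # break-even hit rate shown on this page

class HitRateTracker:
    """Rolling hot-cache hit rate over the most recent tokens."""

    def __init__(self, window=64):
        # Each entry is (hot_job_count, cold_job_count) for one token.
        self.events = deque(maxlen=window)

    def record(self, hot_jobs, cold_jobs):
        self.events.append((hot_jobs, cold_jobs))

    def hit_rate(self):
        hot = sum(h for h, _ in self.events)
        total = hot + sum(c for _, c in self.events)
        return hot / total if total else 0.0

tracker = HitRateTracker()
tracker.record(hot_jobs=1, cold_jobs=1)  # token 1: one hit, one miss
tracker.record(hot_jobs=2, cold_jobs=0)  # token 2: two hits
above_break_even = tracker.hit_rate() > BREAK_EVEN
```

In this example 3 of 4 expert slots hit the cache, so the rate is 0.75 and lies above the break-even line.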
Baseline: 22.2 t/s
Hot Cache Simulation: 26.1 t/s
Break-even: 45.9%
Dynamic Update: 10%
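As a quick check on the two throughput readouts, the relative gain can be computed directly. This sketch assumes "t/s" means tokens per second and uses only the Baseline and Hot Cache Simulation values shown above.

```python
baseline_tps = 22.2    # Baseline readout
hot_cache_tps = 26.1   # Hot Cache Simulation readout

# Ratio of simulated throughput to baseline throughput.
speedup = hot_cache_tps / baseline_tps
# Relative gain in percent.
gain_percent = (speedup - 1.0) * 100.0

print(f"speedup: {speedup:.3f}x, gain: {gain_percent:.1f}%")
```

The simulated hot cache comes out at roughly 1.18 times the baseline throughput.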