Token Video

A token moves through the router, hot lane, cold lane, merge, and output.

Legend: Hot Expert, Cold Expert, Just used
Router / Gate

The gate selects the top-k experts for the current token.
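As a rough illustration of top-k selection, here is a minimal sketch. The function name `top_k_gate`, the example logits, and the choice to apply softmax over only the selected logits are assumptions made for this example; the page does not say how the simulated gate normalizes its weights.

```python
import math

def top_k_gate(logits, k):
    """Return (expert_index, weight) pairs for the k largest gate logits."""
    # Indices of the k largest logits, highest first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: normalize with a softmax over the selected logits only.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

# Hypothetical gate logits for four experts; pick the best two.
selected = top_k_gate([0.1, 2.0, -1.0, 1.5], k=2)
```

Each returned pair is one selected expert slot for the current token.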

Worklist
Hot: 0, Cold: 0

Selected expert slots become compact hot or cold jobs.
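The split into hot and cold jobs could look like the sketch below. It assumes the hot cache can be modeled as a set of expert ids; `build_worklist` and the sample data are hypothetical names and values, not taken from the simulation.

```python
def build_worklist(selected, hot_cache):
    """Partition selected (expert_id, weight) slots into hot and cold jobs."""
    hot_jobs, cold_jobs = [], []
    for expert_id, weight in selected:
        # A cache hit goes to the hot lane; everything else stays cold.
        if expert_id in hot_cache:
            hot_jobs.append((expert_id, weight))
        else:
            cold_jobs.append((expert_id, weight))
    return hot_jobs, cold_jobs

# Hypothetical example: experts 1 and 7 are cached, the gate picked 1, 3, and 7.
hot_jobs, cold_jobs = build_worklist(
    selected=[(1, 0.5), (3, 0.3), (7, 0.2)],
    hot_cache={1, 7},
)
```

The two list lengths correspond to the Hot and Cold counters shown in the Worklist panel.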

CUDA0 Hot Lane (from cache)

Hot-cache hits run directly from VRAM on the GPU.

CPU Cold Lane (original experts)

Non-cached experts run through the normal CPU/RAM path.

Merge

Both partial outputs are added into the normal MoE layer result.
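The merge step can be sketched as an element-wise sum of the two lane results. The toy experts, `run_lane`, and `merge` below are hypothetical stand-ins; real experts are feed-forward blocks, and in the actual setup the two lanes run on different devices.

```python
def run_lane(jobs, experts, x):
    """Weighted sum of expert outputs for one lane's (expert_id, weight) jobs."""
    out = [0.0] * len(x)
    for expert_id, weight in jobs:
        y = experts[expert_id](x)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

def merge(hot_partial, cold_partial):
    """Add both partial outputs into one MoE layer result."""
    return [h + c for h, c in zip(hot_partial, cold_partial)]

# Toy experts standing in for real feed-forward blocks.
experts = {
    0: lambda x: [2.0 * v for v in x],
    1: lambda x: [v + 1.0 for v in x],
}
x = [1.0, 2.0]
hot_jobs, cold_jobs = [(0, 0.5)], [(1, 0.5)]

merged = merge(run_lane(hot_jobs, experts, x), run_lane(cold_jobs, experts, x))
# Running every job in a single lane gives the same result.
reference = run_lane(hot_jobs + cold_jobs, experts, x)
```

Because the weighted sum is linear, splitting the jobs across lanes and adding the partial results leaves the layer output unchanged, up to floating-point rounding in general.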

Output

After the merge, the next layer receives a single token representation again.

Layer Heatmap

Each square represents an expert group. Red is cached, blue remains cold, and yellow was active for the previous token.
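The color rule can be written as a small lookup. This is a sketch with hypothetical names; in particular, letting yellow take priority over red when a cached group was also just used is an assumption, since the page does not say which color wins.

```python
def square_color(group, cached, just_used):
    """Return the heatmap color for one expert group."""
    # Assumption: "just used" overrides the cached/cold coloring.
    if group in just_used:
        return "yellow"
    if group in cached:
        return "red"
    return "blue"

# Hypothetical layer with four expert groups.
colors = [square_color(g, cached={0, 1}, just_used={1, 3}) for g in range(4)]
```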

Live Hit Rate

The line shows the simulated hot-cache hit rate. Below the break-even point, parallelization overhead can consume the gain.
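A hit rate of this kind can be tracked as the share of hot jobs among all expert jobs seen so far. The sketch below uses a cumulative count and the break-even value shown in the panel; the class name, the cumulative (rather than windowed) averaging, and the per-token counts are assumptions for illustration.

```python
BREAK_EVEN = 0.459  # break-even hit rate shown in the panel

class HitRateTracker:
    """Cumulative hot-cache hit rate over all jobs seen so far."""

    def __init__(self):
        self.hot = 0
        self.cold = 0

    def record(self, hot_jobs, cold_jobs):
        # Add one token's job counts to the running totals.
        self.hot += hot_jobs
        self.cold += cold_jobs

    @property
    def rate(self):
        total = self.hot + self.cold
        return self.hot / total if total else 0.0

tracker = HitRateTracker()
# Hypothetical (hot, cold) job counts for three tokens.
for hot_jobs, cold_jobs in [(1, 1), (2, 0), (0, 2)]:
    tracker.record(hot_jobs, cold_jobs)

above_break_even = tracker.rate >= BREAK_EVEN
```

With these made-up counts, half of the jobs are cache hits, which sits just above the break-even line.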

Baseline: 22.2 t/s
Hot Cache Simulation: 26.1 t/s
Break-even: 45.9%
Dynamic Update: 10%
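As a quick check on the two throughput figures, the relative gain of the simulated hot cache over the baseline can be computed directly. The variable names are illustrative only.

```python
baseline_tps = 22.2   # baseline throughput in tokens per second
hot_cache_tps = 26.1  # simulated hot-cache throughput in tokens per second

speedup = hot_cache_tps / baseline_tps
gain_percent = (speedup - 1.0) * 100.0
print(f"speedup: {speedup:.3f}x, gain: {gain_percent:.1f}%")
```

This works out to roughly a 1.18x speedup, or a gain of about 17.6% over the baseline.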