MoE Experts First Visual Explainer

Token Video

A token moves through the router, the hot and cold lanes, the merge, and the output.

Legend: Hot Expert, Cold Expert, Just used

Router / Gate

The gate selects the top-k experts for the current token.
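
A minimal sketch of top-k gating in plain Python. The function name `top_k_gate`, the example logits, and the choice to renormalize the softmax over only the selected experts are assumptions for illustration; the explainer does not say how the gate weights are computed.

```python
import math

def top_k_gate(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Rank expert indices by router logit, highest first, and keep the top k.
    ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1
    # (an assumed convention; some models normalize over all experts instead).
    peak = max(logits[i] for i in ranked)
    exps = [math.exp(logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(ranked, exps)]

# Hypothetical router logits for four experts; pick the best two.
selected = top_k_gate([0.1, 2.0, -1.0, 1.5], k=2)
```

Here `selected` holds experts 1 and 3, in that order, with weights that sum to 1.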

Worklist


Selected expert slots become compact hot or cold jobs.
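
One way to picture the worklist step is a simple partition of the selected experts by cache membership. This is a sketch under assumptions: `build_worklist`, the `(expert_id, weight)` job tuples, and the set-based `hot_cache` are hypothetical names and shapes, not the project's actual data structures.

```python
def build_worklist(selected, hot_cache):
    """Split selected (expert_id, weight) pairs into hot and cold job lists."""
    hot_jobs, cold_jobs = [], []
    for expert_id, weight in selected:
        if expert_id in hot_cache:
            # Expert weights are assumed to be resident in VRAM: GPU lane.
            hot_jobs.append((expert_id, weight))
        else:
            # Not cached: falls back to the CPU/RAM lane.
            cold_jobs.append((expert_id, weight))
    return hot_jobs, cold_jobs

# Hypothetical example: experts 1 and 7 are cached, the token selected 1 and 3.
hot_jobs, cold_jobs = build_worklist([(1, 0.6), (3, 0.4)], hot_cache={1, 7})
```

In this example expert 1 lands in the hot list and expert 3 in the cold list.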

CUDA0 Hot Lane (from cache)

Hot-cache hits run directly from VRAM on the GPU.

CPU Cold Lane (original experts)

Non-cached experts run through the normal CPU/RAM path.

Merge

Both partial outputs are added into the normal MoE layer result.
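
The merge can be sketched as an element-wise sum of the two lane outputs. The function name `merge_partials` and the plain-list vectors are assumptions for illustration; a real implementation would operate on tensors, and each partial is assumed to be already weighted by its gate weights.

```python
def merge_partials(hot_partial, cold_partial):
    """Add the hot-lane and cold-lane outputs element by element."""
    if len(hot_partial) != len(cold_partial):
        raise ValueError("lane outputs must have the same hidden size")
    return [h + c for h, c in zip(hot_partial, cold_partial)]

# Hypothetical 3-dimensional partial outputs from each lane.
merged = merge_partials([0.5, 1.0, 0.0], [0.25, -1.0, 2.0])
```

Because the sum has the same shape as either partial, the next layer sees an ordinary hidden-state vector.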

Output

After the merge, the next layer receives a single token representation again.

Each square represents an expert group. Red is cached (hot), blue remains cold, and yellow was used by the previous token.

The line shows the simulated hot-cache hit rate. Below the break-even point, parallelization overhead can consume the gain.
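
To see why a break-even hit rate exists at all, consider a deliberately simple cost model. Everything here is an assumption: the additive model, the names `token_time` and `break_even_hit_rate`, and the cost values, which are hypothetical units and are not meant to reproduce the figures shown in this explainer.

```python
def token_time(hit_rate, cold_cost, hot_cost, overhead):
    """Assumed per-token cost: a mix of cold and hot work plus fixed overhead."""
    return (1.0 - hit_rate) * cold_cost + hit_rate * hot_cost + overhead

def break_even_hit_rate(cold_cost, hot_cost, overhead):
    """Hit rate at which the two-lane path costs the same as cold-only."""
    # Solve (1 - h) * cold + h * hot + overhead == cold for h.
    return overhead / (cold_cost - hot_cost)

# Hypothetical costs in arbitrary time units.
cold_cost, hot_cost, overhead = 10.0, 2.0, 4.0
h_star = break_even_hit_rate(cold_cost, hot_cost, overhead)
```

Under this model, any hit rate below `h_star` makes the two-lane path slower than the cold-only baseline, and any hit rate above it makes it faster.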

Baseline: 22.2 t/s

Hot Cache Simulation: 26.1 t/s

Break-even: 45.9%

Dynamic Update: 10%
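
The relative gain implied by the two throughput figures above can be worked out directly. The variable names are illustrative; the two numbers are the ones shown in this explainer.

```python
baseline_tps = 22.2   # Baseline throughput, tokens per second
hot_cache_tps = 26.1  # Hot Cache Simulation throughput, tokens per second

speedup = hot_cache_tps / baseline_tps
gain_percent = (speedup - 1.0) * 100.0
print(f"speedup: {speedup:.3f}x, gain: {gain_percent:.1f}%")
```

This works out to a speedup of roughly 1.18x, or about 17.6% more tokens per second than the baseline.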