Qwen3-Coder-Next Hot-Cache Implementation Guide

Architecture: Qwen3Next

Parameters: 80B total / 3B active

Layers: 48

Experts: 512 (10 used per token)

Why the Hook Differs From Qwen35MoE

Qwen3-Coder-Next uses the Qwen3Next path. Each layer has a normal (routed) top-k MoE block and an additional shared-expert block. The hot cache may split only the routed top-k part; the shared expert stays in the normal model graph and its output is added afterward.

Confirm the Architecture

Qwen3-Coder-Next is a Qwen3Next MoE model with 80B total parameters and 3B active parameters.

src/models/qwen3next.cpp
LLM_ARCH_QWEN3NEXT
n_layer = 48
n_expert = 512
n_expert_used = 10

Split Only the Top-k MoE Part

In the Qwen3Next FFN, the normal MoE output is built first. The shared expert is applied afterward. The hot cache replaces only the MoE output.

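// Hot path: compute the router logits, then let the hot cache build the top-k MoE output.
if (llama_moe_hot_cache_layer_active_for_graph(model, il, llama_moe_hot_cache_graph_kind::logits)) {
    logits = build_lora_mm(ffn_gate_inp, cur);
    moe_out = build_layer_moe_hot(cur, logits, il);
} else {
    // Regular path: the unchanged top-k MoE graph.
    moe_out = build_moe_ffn(...);
}
// In both cases the shared expert is applied after this branch.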
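
The ordering can be illustrated with a toy sketch on plain vectors. This is not the llama.cpp graph code: ffn_forward and its three callbacks are hypothetical stand-ins, and any gating the real model applies to the shared expert is left out. It only shows that the hot cache swaps the routed part while the shared-expert contribution is added the same way on both paths.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

using Vec   = std::vector<float>;
using Block = std::function<Vec(const Vec&)>;

// Toy FFN (hypothetical): pick the hot or regular path for the routed part,
// then add the shared-expert output.
Vec ffn_forward(const Vec& x, bool hot_active,
                const Block& hot_cache_moe,
                const Block& routed_moe,
                const Block& shared_expert) {
    // Only the routed top-k part is replaced by the hot cache.
    Vec moe_out = hot_active ? hot_cache_moe(x) : routed_moe(x);

    // The shared expert runs regardless of which branch was taken.
    const Vec shared_out = shared_expert(x);
    assert(moe_out.size() == shared_out.size());

    for (std::size_t i = 0; i < moe_out.size(); ++i) {
        moe_out[i] += shared_out[i];
    }
    return moe_out;
}
```

If the hot path and the regular path compute the same routed output, ffn_forward returns the same result for hot_active true and false, which is the property the validation steps later in this guide check.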

Reuse the Generic Hot Graph

Qwen3Next uses the same hot/cold infrastructure as Gemma4: router logits in, hot/cold MoE output out.

llama_model_qwen3next::graph::build_layer_moe_hot(...)
    -> llama_moe_hot_cache_build_moe_hot_from_logits(..., LLM_FFN_SILU)
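
As a rough illustration of "router logits in, hot/cold output out", the sketch below selects the top-k experts from one token's logits and partitions them by whether they are resident in the hot cache. The names select_top_k, split_hot_cold, and HotColdSplit are hypothetical; the real helper operates on graph tensors and its internals are not shown in this guide.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <unordered_set>
#include <vector>

// Indices of the k largest router logits (ties broken by lower index).
std::vector<int> select_top_k(const std::vector<float>& logits, std::size_t k) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
        [&](int a, int b) -> bool {
            if (logits[a] != logits[b]) return logits[a] > logits[b];
            return a < b;
        });
    idx.resize(k);
    return idx;
}

// Selected experts, partitioned by residency in the hot cache.
struct HotColdSplit {
    std::vector<int> hot;   // served from the hot cache
    std::vector<int> cold;  // served by the regular (cold) path
};

HotColdSplit split_hot_cold(const std::vector<int>& selected,
                            const std::unordered_set<int>& hot_set) {
    HotColdSplit s;
    for (int e : selected) {
        if (hot_set.count(e) != 0) {
            s.hot.push_back(e);
        } else {
            s.cold.push_back(e);
        }
    }
    return s;
}
```

For Qwen3-Coder-Next the logits vector would have 512 entries and k would be 10, matching n_expert and n_expert_used above.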

Register a Dedicated Graph Profile

The model explicitly opts into hot-cache shortcuts. Other MoE models remain unchanged. Qwen3Next can allow tiny multi-token decode routing through its profile, while prompt-processing decisions still come from the explicit graph phase.

{ LLM_ARCH_QWEN3NEXT, "qwen3next", llama_moe_hot_cache_graph_kind::logits, LLM_FFN_SILU }
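
The opt-in behavior can be pictured as a lookup table in which an absent entry means "no shortcuts". The sketch below is an assumption about the shape of such a table, not the actual registry: GraphKind, FfnOp, HotCacheProfile, k_profiles, and find_profile are hypothetical names that only mirror the fields visible in the entry above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the real enums used in the profile entry.
enum class GraphKind { logits };
enum class FfnOp { silu };

struct HotCacheProfile {
    std::string arch_name;
    GraphKind   graph_kind;
    FfnOp       ffn_op;
};

// Opt-in table: an architecture missing from this list gets no hot-cache shortcuts.
static const std::vector<HotCacheProfile> k_profiles = {
    { "qwen3next", GraphKind::logits, FfnOp::silu },
};

// Returns the profile for arch_name, or nullptr when the model has not opted in.
const HotCacheProfile* find_profile(const std::string& arch_name) {
    for (const HotCacheProfile& p : k_profiles) {
        if (p.arch_name == arch_name) {
            return &p;
        }
    }
    return nullptr;
}
```

A caller would treat a nullptr result as "build the regular MoE graph", which keeps every unregistered MoE model on its existing path.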

Create a Learn Run

The first run collects expert hits and writes them to the JSON file named by --moe-layer-perf-out. Without this file, the hot cache does not know which experts to preload.

./build/bin/llama-server \
  --cpu-moe \
  --moe-layer-perf-out qwen3-coder-next-experts.json \
  <normal model args>
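
To see why the collected hits matter, the sketch below ranks the experts of one layer by hit count and estimates what share of the recorded hits a hot set of a given size would have served. The JSON schema is not shown in this guide, so the input is assumed to be an already-parsed vector of per-expert counts; pick_hot_experts and expected_hit_rate are hypothetical helpers, not part of the tool.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Pick up to max_hot experts with the highest hit counts for one layer.
// stable_sort keeps the lower index first when counts are equal.
std::vector<int> pick_hot_experts(const std::vector<std::uint64_t>& hits,
                                  std::size_t max_hot) {
    std::vector<int> idx(hits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::stable_sort(idx.begin(), idx.end(),
        [&](int a, int b) -> bool { return hits[a] > hits[b]; });
    idx.resize(std::min(max_hot, idx.size()));
    return idx;
}

// Fraction of the recorded hits that the chosen hot set would have served.
double expected_hit_rate(const std::vector<std::uint64_t>& hits,
                         const std::vector<int>& hot) {
    const std::uint64_t total =
        std::accumulate(hits.begin(), hits.end(), std::uint64_t{0});
    if (total == 0) {
        return 0.0;  // nothing recorded yet
    }
    std::uint64_t covered = 0;
    for (int e : hot) {
        covered += hits[static_cast<std::size_t>(e)];
    }
    return static_cast<double>(covered) / static_cast<double>(total);
}
```

With hit counts {5, 50, 0, 45} and max_hot set to 2, the sketch picks experts 1 and 3, which cover 95 of the 100 recorded hits.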

Test Hot-Cache Startup

Start with a conservative reserve and performance counters enabled, then inspect /moe-layer-perf for fallbacks and hit rate.

./build/bin/llama-server \
  --cpu-moe \
  --moe-hot-cache qwen3-coder-next-experts.json \
  --moe-hot-cache-max-mib -1 \
  --moe-hot-cache-auto-reserve-mib 1024 \
  --moe-hot-cache-update-rate 0.10 \
  <normal model args>

Validation

No crash during warmup.
parallel_fallbacks remains 0 or stops increasing.
Output remains semantically plausible, especially because of the shared expert.
Tiny multi-token decode batches are still reported as decode behavior, not prompt-processing (PP) behavior.
Compare hit rate and token-generation (TG) speed with and without --no-perf.
If fallbacks appear, disable profile shortcuts one by one.
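
The "remains 0 or stops increasing" check can be automated once the counter has been sampled a few times. The sketch below assumes the caller has already polled /moe-layer-perf and extracted successive parallel_fallbacks values; the response format is not shown in this guide, and fallbacks_settled is a hypothetical helper.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// True when the counter did not change across the last `window` samples.
// This covers both accepted outcomes: it stayed at 0, or it rose and then plateaued.
bool fallbacks_settled(const std::vector<std::uint64_t>& samples, std::size_t window) {
    if (window < 2 || samples.size() < window) {
        return false;  // not enough evidence yet
    }
    const std::uint64_t last = samples.back();
    for (std::size_t i = samples.size() - window; i < samples.size(); ++i) {
        if (samples[i] != last) {
            return false;
        }
    }
    return true;
}
```

A counter that is still climbing at the end of the sampling period returns false, which is the signal to start disabling profile shortcuts one by one.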

The first stable run matters more than maximum speed. Tuning the merge and cold-prefix paths pays off only once split order and output are stable.