Confirm the Architecture
Qwen3-Coder-Next is a Qwen3Next MoE model with 80B total parameters and 3B active parameters.
src/models/qwen3next.cpp
LLM_ARCH_QWEN3NEXT
n_layer = 48
n_expert = 512
n_expert_used = 10
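To see what these numbers imply for caching, the arithmetic can be written out. This is only a sketch: the constant names are illustrative, the values are the ones listed above, and it assumes every layer routes each token to n_expert_used experts.

```cpp
// Values copied from the architecture summary above; names are illustrative.
constexpr int n_layer       = 48;
constexpr int n_expert      = 512;
constexpr int n_expert_used = 10;

// Routed expert evaluations per token, assuming every layer has an MoE block.
constexpr int routed_slots_per_token = n_layer * n_expert_used;

// Share of one layer's routed experts that a single token touches, in percent.
constexpr double active_expert_pct = 100.0 * n_expert_used / n_expert;

static_assert(routed_slots_per_token == 480, "48 layers x 10 experts");
static_assert(active_expert_pct > 1.9 && active_expert_pct < 2.0, "about 1.95%");
```

Each token touches under 2% of a layer's routed experts, so a cache of frequently hit experts can serve a large share of lookups if routing is skewed toward a subset.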
Qwen3-Coder-Next uses the Qwen3Next path. Each layer has an MoE block and an additional shared-expert block. The hot cache may split only the routed top-k MoE part. The shared expert stays in the normal model graph and is added afterward.
In the Qwen3Next FFN, the normal MoE output is built first. The shared expert is applied afterward. The hot cache replaces only the MoE output.
if (llama_moe_hot_cache_layer_active_for_graph(model, il, llama_moe_hot_cache_graph_kind::logits)) {
    logits  = build_lora_mm(ffn_gate_inp, cur);
    moe_out = build_layer_moe_hot(cur, logits, il);
} else {
    moe_out = build_moe_ffn(...);
}
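The branch above swaps only the routed output. A toy scalar model of that split may make the boundary clearer. Everything in it is an assumption for illustration: expert_fn, top_k, routed_moe and ffn_out are hypothetical names, experts are plain scalar functions, the softmax-over-selected-logits weighting is a simplification, and any gating of the shared expert is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-in for one expert FFN: scalar in, scalar out.
using expert_fn = std::function<float(float)>;

// Indices of the k largest router logits; ties go to the lower index.
std::vector<std::size_t> top_k(const std::vector<float> & logits, std::size_t k) {
    std::vector<std::size_t> idx(logits.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    k = std::min(k, idx.size());
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
        [&](std::size_t a, std::size_t b) {
            return logits[a] != logits[b] ? logits[a] > logits[b] : a < b;
        });
    idx.resize(k);
    return idx;
}

// Routed part only: softmax over the selected logits, then a weighted sum.
// The weighting scheme is an assumption made for this toy.
float routed_moe(float x, const std::vector<float> & logits,
                 const std::vector<expert_fn> & experts, std::size_t k) {
    assert(logits.size() == experts.size());
    const std::vector<std::size_t> sel = top_k(logits, k);
    if (sel.empty()) return 0.0f;
    const float max_logit = logits[sel.front()];  // largest selected logit, for stability
    float denom = 0.0f;
    for (std::size_t e : sel) denom += std::exp(logits[e] - max_logit);
    float out = 0.0f;
    for (std::size_t e : sel) {
        out += std::exp(logits[e] - max_logit) / denom * experts[e](x);
    }
    return out;
}

// Full FFN: whichever routed implementation is active, plus the shared expert.
float ffn_out(float x, const std::function<float(float)> & routed,
              const expert_fn & shared) {
    return routed(x) + shared(x);
}
```

In this sketch a hot-cache variant would be passed as the routed argument of ffn_out, while the shared argument stays the same in both branches.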
Qwen3Next uses the same hot/cold infrastructure as Gemma4: router logits in, hot/cold MoE output out.
llama_model_qwen3next::graph::build_layer_moe_hot(...)
-> llama_moe_hot_cache_build_moe_hot_from_logits(..., LLM_FFN_SILU)
The model explicitly opts into hot-cache shortcuts. Other MoE models remain unchanged.
{ LLM_ARCH_QWEN3NEXT, "qwen3next", llama_moe_hot_cache_graph_kind::logits, LLM_FFN_SILU }
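The opt-in behaviour can be pictured as a table lookup: architectures with an entry get the hot-cache path, everything else falls through. The sketch below is a hypothetical mirror of that idea, not the project's actual table. All toy_ names and find_entry are invented for illustration; only the "qwen3next" row corresponds to the entry shown above.

```cpp
// Hypothetical mirrors of the identifiers in the entry above; the real enums
// live in the code base and have more members.
enum toy_arch       { TOY_ARCH_QWEN3NEXT, TOY_ARCH_OTHER_MOE };
enum toy_graph_kind { TOY_GRAPH_LOGITS };
enum toy_ffn_op     { TOY_FFN_SILU };

struct toy_hot_cache_entry {
    toy_arch       arch;
    const char *   name;
    toy_graph_kind kind;
    toy_ffn_op     act;
};

// Only architectures listed here opt in to the hot-cache path.
static const toy_hot_cache_entry k_entries[] = {
    { TOY_ARCH_QWEN3NEXT, "qwen3next", TOY_GRAPH_LOGITS, TOY_FFN_SILU },
};

// Returns the entry for an opted-in architecture, or nullptr otherwise.
const toy_hot_cache_entry * find_entry(toy_arch arch) {
    for (const toy_hot_cache_entry & e : k_entries) {
        if (e.arch == arch) return &e;
    }
    return nullptr;
}
```

A nullptr result stands for "keep the normal graph", which is how unlisted MoE models remain unchanged in this toy.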
The first run collects expert hits. Without this JSON, the hot cache does not know which experts to preload.
./build/bin/llama-server \
  --cpu-moe \
  --moe-layer-perf-out qwen3-coder-next-experts.json \
  <normal model args>
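Conceptually, the collection run counts how often each expert is selected, so that the most frequently hit experts can be preloaded later. The following is a minimal sketch of that idea for a single layer. The expert_hits type and its methods are hypothetical, and the layout of the real JSON file is not reproduced here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical per-layer hit counter; the real JSON layout is not shown here.
struct expert_hits {
    std::vector<std::uint64_t> hits;  // hits[e] = times expert e was selected

    explicit expert_hits(std::size_t n_expert) : hits(n_expert, 0) {
        assert(n_expert > 0);
    }

    // Record the experts the router selected for one token.
    void record(const std::vector<std::size_t> & selected) {
        for (std::size_t e : selected) {
            if (e < hits.size()) ++hits[e];
        }
    }

    // Ids of the n most-hit experts, most hit first; ties go to the lower id.
    std::vector<std::size_t> hottest(std::size_t n) const {
        std::vector<std::size_t> ids(hits.size());
        std::iota(ids.begin(), ids.end(), std::size_t{0});
        std::stable_sort(ids.begin(), ids.end(),
            [&](std::size_t a, std::size_t b) { return hits[a] > hits[b]; });
        ids.resize(std::min(n, ids.size()));
        return ids;
    }

    // Fraction of recorded hits that the n hottest experts would have served.
    double coverage(std::size_t n) const {
        const std::uint64_t total =
            std::accumulate(hits.begin(), hits.end(), std::uint64_t{0});
        if (total == 0) return 0.0;
        std::uint64_t served = 0;
        for (std::size_t e : hottest(n)) served += hits[e];
        return static_cast<double>(served) / static_cast<double>(total);
    }
};
```

The coverage figure gives a rough feel for how many experts would need to be resident to serve a given share of lookups, assuming later traffic resembles the collection run.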
Start with a conservative reserve and performance counters enabled, then inspect /moe-layer-perf for fallbacks and hit rate.
./build/bin/llama-server \
  --cpu-moe \
  --moe-hot-cache qwen3-coder-next-experts.json \
  --moe-hot-cache-max-mib -1 \
  --moe-hot-cache-auto-reserve-mib 1024 \
  --moe-hot-cache-update-rate 0.10 \
  <normal model args>
parallel_fallbacks remains 0 or stops increasing.
--no-perf.
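The check on parallel_fallbacks amounts to comparing two successive readings of the counters. Below is a rough sketch of that comparison. Only parallel_fallbacks is a name taken from the text; the perf_sample struct, its lookups and hot_hits fields, and both helper functions are assumptions made for illustration and do not describe the actual /moe-layer-perf output.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical snapshot of the counters; only parallel_fallbacks is a name
// taken from the text, the other fields are assumptions for this sketch.
struct perf_sample {
    std::uint64_t lookups;             // expert lookups seen so far (assumed)
    std::uint64_t hot_hits;            // lookups served from the hot cache (assumed)
    std::uint64_t parallel_fallbacks;  // fallback counter, assumed cumulative
};

// True when the fallback counter did not grow between two samples.
bool fallbacks_settled(const perf_sample & prev, const perf_sample & cur) {
    return cur.parallel_fallbacks == prev.parallel_fallbacks;
}

// Hit rate over the interval between two samples, in [0, 1].
double interval_hit_rate(const perf_sample & prev, const perf_sample & cur) {
    assert(cur.lookups >= prev.lookups && cur.hot_hits >= prev.hot_hits);
    const std::uint64_t d_lookups = cur.lookups - prev.lookups;
    if (d_lookups == 0) return 0.0;
    return static_cast<double>(cur.hot_hits - prev.hot_hits) /
           static_cast<double>(d_lookups);
}
```

Measuring over an interval rather than from process start keeps warm-up traffic from masking the steady-state hit rate.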