In this architecture, routing is algorithmic and static — it is computed during the C++ preprocessing phase rather than by a trainable gating network. This guarantees stable, predictable routing and eliminates the need for load-balancing losses during training.

To build a strong shared representation while allowing experts to specialise, training is split into three phases.

Phase 1 — Backbone Pre-Training

Goal: Build a strong shared representation before specialisation.

In this phase, only the 1×1 mixer, the global feature head, and a single shared expert body are trained. All positions in the dataset contribute equally, and the algorithmic routing labels are ignored.

This ensures that the shared layers learn a universal representation of chess positions. At the end of this phase, the weights of the single trained expert are copied to initialise all the individual experts for Phase 2.
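
The hand-off from Phase 1 to Phase 2 can be illustrated with a minimal, framework-agnostic sketch. The weight container here is a plain dictionary standing in for a trained expert body, and the parameter names and the init_experts_from_shared helper are hypothetical. If the project uses PyTorch (an assumption, not stated above), the equivalent step would typically go through state_dict() and load_state_dict(). What matters is that each expert receives its own deep copy, so later fine-tuning of one expert cannot alter the others.

```python
import copy

NUM_EXPERTS = 4  # matches the four experts listed for Phase 2


def init_experts_from_shared(shared_weights, num_experts=NUM_EXPERTS):
    """Return one independent deep copy of the shared expert weights per expert."""
    return [copy.deepcopy(shared_weights) for _ in range(num_experts)]


# Toy stand-in for the trained shared expert: parameter name -> list of floats.
# The names below are illustrative only.
shared = {"fc1.weight": [0.1, -0.2, 0.3], "fc1.bias": [0.0]}
experts = init_experts_from_shared(shared)

# Simulate a Phase 2 update on expert 0; the other copies must stay untouched.
experts[0]["fc1.bias"][0] += 1.0
```

A shallow copy, or reusing the same object for every expert, would silently tie the experts together, which is why the sketch uses copy.deepcopy.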

Parameter	Typical value
Epochs	8-12
Learning rate	1e-3 (cosine decay)
Batch size	4096
Loss	WDL cross-entropy + eval MSE
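
As a rough illustration of the loss row above, the following sketch computes a WDL cross-entropy term plus an eval MSE term for a single position. Several details are assumptions rather than facts from this document: the WDL head is taken to emit three logits ordered (win, draw, loss), the target is a probability distribution over those outcomes, the eval target is a single scalar, and the two terms are summed with a hypothetical eval_weight of 1.0. A real training loop would average this over the batch.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def wdl_cross_entropy(logits, target):
    """Cross-entropy between a target WDL distribution and softmax(logits)."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)


def eval_mse(pred, target):
    """Squared error between the predicted and target evaluation."""
    return (pred - target) ** 2


def combined_loss(wdl_logits, wdl_target, eval_pred, eval_target, eval_weight=1.0):
    """Sum of the two terms; eval_weight is an assumed knob, not from the source."""
    ce = wdl_cross_entropy(wdl_logits, wdl_target)
    return ce + eval_weight * eval_mse(eval_pred, eval_target)


# Example: a position won for the side to move, with a slightly low eval prediction.
loss = combined_loss([2.0, 0.5, -1.0], [1.0, 0.0, 0.0], eval_pred=0.6, eval_target=0.8)
```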

Phase 2 — Expert Specialisation

Goal: Fine-tune each expert body independently on its assigned subset of positions.

The shared backbone (mixer + global features) is frozen. The static routing decisions computed during preprocessing determine which positions each expert sees during training. Each expert is fine-tuned independently:

Expert	Specialisation	Training subset
Expert 0	Tactical	Positions with checks, captures, threats, hanging pieces
Expert 1	Strategic	Stable middlegame positions with structural imbalances
Expert 2	Major-piece endgame	Rook endings, queen endings, rook+pawn endings
Expert 3	Minor-piece endgame	Bishop/knight endings, opposite-colour bishop endings

The ExpertChunkedSampler in dataset.py ensures each expert only sees positions matching its algorithmic routing label.
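
The filtering idea can be sketched in a few lines. This is not the actual ExpertChunkedSampler, whose chunking behaviour is not described here. It is a simplified stand-in that assumes routing labels are available as a list of integers, one per dataset position, and the expert_batches function name is hypothetical.

```python
import random


def expert_batches(routing_labels, expert_id, batch_size, seed=0):
    """Yield batches of dataset indices whose routing label equals expert_id."""
    indices = [i for i, label in enumerate(routing_labels) if label == expert_id]
    # Shuffle with a local RNG so the order is reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


# Toy labels for eight positions; expert 0 owns positions 0, 2 and 5.
labels = [0, 1, 0, 2, 3, 0, 1, 2]
batches = list(expert_batches(labels, expert_id=0, batch_size=2))
```

Because the labels are fixed at preprocessing time, the index list for each expert could also be computed once and cached rather than rebuilt every epoch.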

Parameter	Typical value
Epochs per expert	5-8
Learning rate	1e-4
Frozen layers	Mixer, global features

Phase 3 — Joint Fine-Tuning (Top-2)

Goal: Fine-tune the entire network jointly for final convergence.

All layers (the mixer, the global feature head, and all experts) are unfrozen and trained jointly at a low learning rate. This lets the components adapt to each other and resolves any inconsistencies introduced by the independent Phase 2 training.

The algorithmic router continues to dictate forward-pass routing, but the entire network's weights are jointly optimised end-to-end.
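
A minimal sketch of a forward pass driven by a precomputed routing label is shown below. The backbone and experts are toy callables, not the real layers. The sketch also assumes that each position carries exactly one integer label, and it folds the mixer and the global feature head into a single hypothetical backbone callable. Every name in it is illustrative.

```python
def forward(position_features, routing_label, backbone, experts):
    """Run the shared backbone, then only the expert chosen by the static label."""
    hidden = backbone(position_features)
    return experts[routing_label](hidden)


def toy_backbone(features):
    """Stand-in for the shared layers: doubles every input feature."""
    return [2.0 * v for v in features]


def make_toy_expert(offset):
    """Stand-in for an expert body: sums the hidden vector and adds an offset."""
    def expert(hidden):
        return sum(hidden) + offset
    return expert


toy_experts = [make_toy_expert(k) for k in range(4)]

# The label comes from preprocessing, not from a learned gate.
output = forward([1.0, 2.0], routing_label=2, backbone=toy_backbone, experts=toy_experts)
```

Since the label is an input rather than a network output, no gradient flows through the routing decision. During joint training the loss therefore updates only the shared layers and the expert that was selected for each position.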

Parameter	Typical value
Epochs	3-5
Learning rate	1e-5 (cosine decay)
Frozen layers	None
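
For reference, the cosine decay named in the learning-rate rows can be written as a small function. The floor value min_lr and the choice to decay per optimiser step rather than per epoch are assumptions; neither is specified here.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Cosine decay from base_lr at step 0 down to min_lr at total_steps."""
    # Clamp progress so steps outside [0, total_steps] stay within the schedule.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, midpoint and end of a 1000-step joint fine-tuning run.
schedule = [cosine_lr(s, 1000, base_lr=1e-5) for s in (0, 500, 1000)]
```

The same function covers Phase 1 by passing base_lr=1e-3.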