io-chess
UCI chess engine
Multi-Phase Training Curriculum

In this architecture, routing is algorithmic and static — it is computed during the C++ preprocessing phase rather than by a trainable gating network. This guarantees stable, predictable routing and eliminates the need for load-balancing losses during training.
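The actual routing rules live in the C++ preprocessing step and are not reproduced here. As a rough illustration of what "algorithmic and static" means in practice, the following Python sketch maps a position to one of the four expert indices listed under Phase 2 using fixed rules. The `PositionFeatures` fields and the specific rules are assumptions made for this example, not the engine's real criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PositionFeatures:
    # Hypothetical summary features; the real preprocessing may use others.
    in_check: bool
    has_hanging_piece: bool
    queens: int        # total queens on the board
    rooks: int         # total rooks on the board
    minor_pieces: int  # total bishops + knights on the board


def route(pos: PositionFeatures) -> int:
    """Return a fixed expert index for a position. No learned parameters."""
    if pos.in_check or pos.has_hanging_piece:
        return 0  # tactical
    majors = pos.queens + pos.rooks
    if majors > 0 and pos.minor_pieces == 0:
        return 2  # major-piece endgame
    if majors == 0 and pos.minor_pieces > 0:
        return 3  # minor-piece endgame
    return 1  # strategic (everything else)
```

Because the function has no trainable state, the same position always receives the same label, which is why no load-balancing loss is needed.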

To build a strong shared representation while allowing experts to specialise, training is split into three phases.

Phase 1 — Backbone Pre-Training

Goal: Build a strong shared representation before specialisation.

In this phase, only the 1×1 mixer, the global feature head, and a single shared expert body are trained. All positions in the dataset contribute equally, and the algorithmic routing labels are ignored.

This ensures that the shared layers learn a universal representation of chess positions. At the end of this phase, the weights of the single trained expert are copied to initialise all the individual experts for Phase 2.
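The hand-off from Phase 1 to Phase 2 can be sketched as follows. This is a minimal, framework-free illustration in which a plain dictionary stands in for the expert's weights; the parameter names and the helper `init_experts_from_shared` are hypothetical.

```python
import copy

NUM_EXPERTS = 4  # matches the four experts listed for Phase 2


def init_experts_from_shared(shared_state: dict, num_experts: int = NUM_EXPERTS) -> list:
    """Give every expert its own deep copy of the Phase 1 expert weights."""
    return [copy.deepcopy(shared_state) for _ in range(num_experts)]


# Toy stand-in for a state dict: parameter name -> list of floats.
shared = {"conv1.weight": [0.1, -0.2], "conv1.bias": [0.0]}
experts = init_experts_from_shared(shared)

# Updating one expert must not affect the others or the original.
experts[0]["conv1.bias"][0] = 1.0
```

The deep copy matters: if the experts shared the same underlying storage, Phase 2 updates to one expert would leak into the rest. In PyTorch, the same effect is usually achieved by calling `load_state_dict` on each expert with the trained expert's `state_dict()`.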

| Parameter | Typical value |
|---|---|
| Epochs | 8-12 |
| Learning rate | 1e-3 (cosine decay) |
| Batch size | 4096 |
| Loss | WDL cross-entropy + eval MSE |
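The loss row can be read as the sum of two terms per position: a cross-entropy over the win/draw/loss distribution and a squared error on the scalar evaluation. The sketch below shows one way to compute it for a single sample. The `mse_weight` parameter and its default of 1.0 are assumptions, since the relative weighting of the two terms is not stated here.

```python
import math


def wdl_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a target W/D/L distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


def combined_loss(wdl_logits, wdl_target, eval_pred, eval_target, mse_weight=1.0):
    """WDL cross-entropy plus a weighted squared error on the scalar eval."""
    squared_error = (eval_pred - eval_target) ** 2
    return wdl_cross_entropy(wdl_logits, wdl_target) + mse_weight * squared_error
```

Over a batch, both terms would normally be averaged across positions before being summed.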

Phase 2 — Expert Specialisation

Goal: Fine-tune each expert body independently on its assigned subset of positions.

The shared backbone (mixer + global features) is frozen. The static routing decisions computed during preprocessing determine which positions each expert sees during training. Each expert is fine-tuned independently:

| Expert | Specialisation | Training subset |
|---|---|---|
| Expert 0 | Tactical | Positions with checks, captures, threats, hanging pieces |
| Expert 1 | Strategic | Stable middlegame positions with structural imbalances |
| Expert 2 | Major-piece endgame | Rook endings, queen endings, rook+pawn endings |
| Expert 3 | Minor-piece endgame | Bishop/knight endings, opposite-colour bishop endings |

The ExpertChunkedSampler in dataset.py ensures each expert only sees positions matching its algorithmic routing label.
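The implementation of `ExpertChunkedSampler` is not shown here, so the following is only a sketch of the underlying idea: filter the dataset indices by routing label, then hand them out in groups. The function name `expert_chunked_indices`, the meaning of "chunk" as a fixed-size group of indices, and the seeded shuffle are all assumptions.

```python
import random
from typing import Iterator, List, Sequence


def expert_chunked_indices(labels: Sequence[int], expert_id: int,
                           chunk_size: int, seed: int = 0) -> Iterator[List[int]]:
    """Yield shuffled chunks of dataset indices whose routing label is expert_id."""
    indices = [i for i, label in enumerate(labels) if label == expert_id]
    random.Random(seed).shuffle(indices)  # reproducible order per seed
    for start in range(0, len(indices), chunk_size):
        yield indices[start:start + chunk_size]


# Example: routing labels for eight positions, training expert 0.
labels = [0, 1, 0, 2, 0, 3, 1, 0]
chunks = list(expert_chunked_indices(labels, expert_id=0, chunk_size=2))
```

Every index yielded refers to a position labelled for expert 0, so that expert never sees positions routed elsewhere.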

| Parameter | Typical value |
|---|---|
| Epochs per expert | 5-8 |
| Learning rate | 1e-4 |
| Frozen layers | Mixer, global features |
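One way to express the Phase 2 freezing rule is as a predicate over parameter names. The sketch below assumes a hypothetical naming scheme (`mixer.*`, `global_features.*`, `experts.<id>.*`); the real module names may differ.

```python
# Hypothetical parameter-name prefixes for the shared backbone.
FROZEN_PREFIXES = ("mixer.", "global_features.")


def is_trainable(name: str, expert_id: int) -> bool:
    """Decide whether a parameter is updated while fine-tuning one expert."""
    if name.startswith(FROZEN_PREFIXES):
        return False  # shared backbone stays fixed in Phase 2
    if name.startswith("experts."):
        # Only the expert currently being fine-tuned is updated.
        return name.startswith(f"experts.{expert_id}.")
    return True  # assumption: anything else remains trainable
```

In PyTorch, the result of such a predicate would typically be applied by setting `requires_grad` to `False` on the frozen parameters before building the optimizer.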

Phase 3 — Joint Fine-Tuning (Top-2)

Goal: Joint fine-tuning of the entire network for final convergence.

All layers (the mixer, the global feature head, and every expert) are unfrozen and trained jointly at a low learning rate. This lets the components adapt to each other and resolves inconsistencies introduced by training the experts independently in Phase 2.

The algorithmic router still dictates which expert handles each position in the forward pass, but all of the network's weights are now optimised jointly, end to end.
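A forward pass driven by precomputed routing labels can be sketched as a group-then-scatter step. This example assumes exactly one label per position and treats each expert as a plain callable on a list of inputs; both are simplifications for illustration, and `routed_forward` is a hypothetical name.

```python
from collections import defaultdict


def routed_forward(batch, labels, experts):
    """Apply experts[label] to each position, grouping by label first."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)

    outputs = [None] * len(batch)
    for label, idxs in groups.items():
        sub_batch = [batch[i] for i in idxs]
        results = experts[label](sub_batch)  # one call per expert
        for i, out in zip(idxs, results):
            outputs[i] = out  # scatter back to the original order
    return outputs


# Toy experts: expert k adds 10*k to every input.
toy_experts = [lambda xs, k=k: [x + 10 * k for x in xs] for k in range(4)]
result = routed_forward([1, 2, 3, 4], [0, 3, 0, 1], toy_experts)
```

In joint fine-tuning, gradients from each expert's outputs would flow back into the shared layers as well, since nothing is frozen.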

| Parameter | Typical value |
|---|---|
| Epochs | 3-5 |
| Learning rate | 1e-5 (cosine decay) |
| Frozen layers | None |
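The cosine decay named in the learning-rate row follows the standard half-cosine curve from the base rate down to a floor. The sketch below assumes a floor of zero and no warm-up, neither of which is specified here.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `base_lr=1e-5`, the rate starts at 1e-5, passes through half that value at the midpoint of training, and approaches the floor at the final step.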