io-chess
UCI chess engine
In this architecture, routing is algorithmic and static: each position's expert is computed during the C++ preprocessing phase rather than chosen by a trainable gating network. Routing is therefore stable and predictable, and no load-balancing loss is needed during training.
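To illustrate what "algorithmic and static" means here, the following is a minimal Python sketch of a rule-based router. The actual rules live in the C++ preprocessor and are not documented in this section, so the `PositionFeatures` fields, the rule order, and the thresholds are all hypothetical. Only the four expert indices follow the specialisation table in Phase 2.

```python
from dataclasses import dataclass

# Expert indices as listed in the Phase 2 specialisation table.
TACTICAL, STRATEGIC, MAJOR_ENDGAME, MINOR_ENDGAME = 0, 1, 2, 3


@dataclass(frozen=True)
class PositionFeatures:
    """Hypothetical per-position summary; not the real preprocessor's input."""
    in_check: bool
    has_hanging_piece: bool
    queens: int  # both sides combined
    rooks: int   # both sides combined
    minors: int  # bishops + knights, both sides combined


def route(f: PositionFeatures) -> int:
    """Return a fixed expert index. Pure function with no learned parameters."""
    majors = f.queens + f.rooks
    if f.in_check or f.has_hanging_piece:
        return TACTICAL
    if majors > 0 and f.minors == 0:
        return MAJOR_ENDGAME
    if majors == 0 and f.minors > 0:
        return MINOR_ENDGAME
    return STRATEGIC


rook_ending = PositionFeatures(False, False, queens=0, rooks=2, minors=0)
label = route(rook_ending)  # same input always yields the same label
```

Because the label depends only on the position, it can be computed once, stored alongside the training data, and reused unchanged in every training phase.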
To build a strong shared representation while allowing experts to specialise, training is split into three phases.
Phase 1 goal: Build a strong shared representation before specialisation.
In this phase, only the 1×1 mixer, the global feature head, and a single shared expert body are trained. All positions in the dataset contribute equally, and the algorithmic routing labels are ignored.
This ensures that the shared layers learn a universal representation of chess positions. At the end of this phase, the weights of the single trained expert are copied to initialise all the individual experts for Phase 2.
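The hand-off from Phase 1 to Phase 2 can be sketched as follows. This is a toy illustration, not the project's code: a plain dictionary stands in for the expert's weights, and the function name and parameter names are hypothetical. The count of four experts is taken from the Phase 2 table.

```python
import copy


def init_experts_from_shared(shared_expert_state: dict, num_experts: int = 4) -> list:
    """Return num_experts independent deep copies of the Phase 1 expert weights."""
    return [copy.deepcopy(shared_expert_state) for _ in range(num_experts)]


# Toy stand-in for a weight dictionary: parameter name -> list of floats.
phase1_expert = {"conv.weight": [0.1, -0.2], "conv.bias": [0.0]}
experts = init_experts_from_shared(phase1_expert)

# The copies start identical but do not share storage, so an update to one
# expert during Phase 2 leaves the others (and the original) untouched.
experts[0]["conv.bias"][0] = 1.0
```

A deep copy matters here: if the experts shared the same underlying storage, fine-tuning one of them would silently modify all of them.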
| Parameter | Typical value |
|---|---|
| Epochs | 8-12 |
| Learning rate | 1e-3 (cosine decay) |
| Batch size | 4096 |
| Loss | WDL cross-entropy + eval MSE |
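
The combined loss in the table can be written out for a single position as below. This is a sketch under assumptions: the relative weight between the two terms (`eval_weight`) and the scale of the eval target are not given in this section, and the function names are illustrative only.

```python
import math


def wdl_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a (win, draw, loss) target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


def combined_loss(wdl_logits, wdl_target, eval_pred, eval_target, eval_weight=1.0):
    """WDL cross-entropy plus weighted squared error on the scalar eval.

    eval_weight is an assumed hyperparameter; averaging this value over a
    batch gives the cross-entropy + MSE objective.
    """
    ce = wdl_cross_entropy(wdl_logits, wdl_target)
    se = (eval_pred - eval_target) ** 2
    return ce + eval_weight * se


# Uniform logits against a certain win, with an eval error of 0.5.
loss = combined_loss([0.0, 0.0, 0.0], [1.0, 0.0, 0.0], eval_pred=0.5, eval_target=0.0)
```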
Phase 2 goal: Fine-tune each expert body independently on its assigned subset of positions.
The shared backbone (mixer + global features) is frozen. The static routing decisions computed during preprocessing determine which positions each expert sees during training. Each expert is fine-tuned independently:
| Expert | Specialisation | Training subset |
|---|---|---|
| Expert 0 | Tactical | Positions with checks, captures, threats, hanging pieces |
| Expert 1 | Strategic | Stable middlegame positions with structural imbalances |
| Expert 2 | Major-piece endgame | Rook endings, queen endings, rook+pawn endings |
| Expert 3 | Minor-piece endgame | Bishop/knight endings, opposite-colour bishop endings |
The `ExpertChunkedSampler` in `dataset.py` ensures each expert only sees positions matching its algorithmic routing label.
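The filtering idea behind such a sampler can be sketched in a few lines. This is not the `ExpertChunkedSampler` implementation, whose interface is not shown in this section. The function below is a hypothetical stand-in that ignores chunking and only demonstrates selecting dataset indices by routing label.

```python
import random


def expert_batches(routing_labels, expert_id, batch_size, seed=0):
    """Yield batches of dataset indices whose routing label equals expert_id."""
    indices = [i for i, label in enumerate(routing_labels) if label == expert_id]
    random.Random(seed).shuffle(indices)  # seeded so the order is reproducible
    for start in range(0, len(indices), batch_size):
        yield indices[start:start + batch_size]


# Toy routing labels for seven positions; expert 0 owns positions 0, 2 and 5.
labels = [0, 1, 0, 2, 3, 0, 1]
batches = list(expert_batches(labels, expert_id=0, batch_size=2))
```

Since the labels are precomputed, the per-expert index lists never change between epochs; only the shuffle order does.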
| Parameter | Typical value |
|---|---|
| Epochs per expert | 5-8 |
| Learning rate | 1e-4 |
| Frozen layers | Mixer, global features |
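
The Phase 2 freezing rule can be expressed as a mapping from parameter names to a trainable flag. The naming scheme (`mixer.`, `global_head.`, `experts.<k>.`) is an assumption made for this sketch and may not match the real model. If the training code uses PyTorch (also an assumption), the flag would typically be applied through each parameter's `requires_grad` attribute.

```python
def phase2_trainable(param_names, expert_id):
    """Map each parameter name to True only if it belongs to the expert being fine-tuned."""
    prefix = f"experts.{expert_id}."  # trailing dot avoids matching e.g. experts.10
    return {name: name.startswith(prefix) for name in param_names}


# Hypothetical parameter names for illustration.
names = [
    "mixer.weight",
    "global_head.weight",
    "experts.0.conv.weight",
    "experts.1.conv.weight",
]
flags = phase2_trainable(names, expert_id=1)
# Only experts.1.* is trainable; the mixer, global head and other experts stay frozen.
```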
Phase 3 goal: Joint fine-tuning of the entire network for final convergence.
All layers (the mixer, the global feature head, and every expert) are unfrozen and trained jointly at a low learning rate. This lets the components adapt to each other and resolves inconsistencies introduced by the independent Phase 2 training.
The algorithmic router still decides which expert handles each position in the forward pass; only the network weights are optimised end to end.
| Parameter | Typical value |
|---|---|
| Epochs | 3-5 |
| Learning rate | 1e-5 (cosine decay) |
| Frozen layers | None |
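
The "cosine decay" entries in the Phase 1 and Phase 3 tables refer to a schedule of the following shape. This sketch assumes no warmup, a per-step update, and decay to zero (`min_lr=0.0`); none of these details are specified in this section.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Learning rate at `step` under cosine decay from base_lr down to min_lr."""
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Phase 3 base rate from the table, over a hypothetical 100 steps.
start_lr = cosine_lr(0, 100, base_lr=1e-5)
mid_lr = cosine_lr(50, 100, base_lr=1e-5)
end_lr = cosine_lr(100, 100, base_lr=1e-5)
```

The rate starts at the base value, passes through half of it at the midpoint, and approaches the minimum at the final step.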