The Preprocessing module is a high-performance C++ pipeline that converts millions of raw chess positions (FEN strings with evaluation labels) into compact binary datasets that the training pipeline can consume directly. It is written in C++ rather than Python because feature extraction operates on bitboard representations, where low-level optimisations yield large speedups.
Overview
The preprocessing pipeline performs three key tasks:
- Parse FEN strings and Stockfish evaluation labels from CSV input.
- Extract spatial feature planes from each position (piece maps, distance metrics, pawn structures, slider alignments).
- Serialise the features and labels into a packed binary format optimised for random access by the PyTorch dataloader.
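The three tasks above can be illustrated with a minimal, self-contained sketch. It is not the module's actual code: the `fen,eval` column order, the `Record` layout (12 piece bitboards plus one float label, 100 bytes), and the helper names `piecePlanes`, `parseLine`, `appendRecord` and `readRecord` are all assumptions made for illustration. The real extractors also produce distance, pawn-structure and slider planes that are omitted here.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <optional>
#include <string>
#include <vector>

// Hypothetical record: 12 piece bitboards (P N B R Q K, then p n b r q k)
// followed by one float label. Not the layout used by Writers.cpp.
struct Record {
    std::array<std::uint64_t, 12> planes{};
    float eval = 0.0f;
};

// Packed size on disk: 96 bytes of planes + 4 bytes of label, no padding.
constexpr std::size_t kPlaneBytes = 12 * sizeof(std::uint64_t);
constexpr std::size_t kRecordBytes = kPlaneBytes + sizeof(float);

// Task 2 (simplified): turn the piece-placement field of a FEN into
// 12 bitboards. Square index convention: a1 = 0, h8 = 63.
inline std::optional<std::array<std::uint64_t, 12>>
piecePlanes(const std::string& fen) {
    static const std::string kPieces = "PNBRQKpnbrqk";
    std::array<std::uint64_t, 12> planes{};
    int rank = 7, file = 0;
    for (char c : fen) {
        if (c == ' ') break;                      // end of placement field
        if (c == '/') {                           // next rank down
            if (file != 8 || rank == 0) return std::nullopt;
            --rank;
            file = 0;
            continue;
        }
        if (c >= '1' && c <= '8') {               // run of empty squares
            file += c - '0';
            if (file > 8) return std::nullopt;
            continue;
        }
        const std::size_t idx = kPieces.find(c);
        if (idx == std::string::npos || file > 7) return std::nullopt;
        planes[idx] |= 1ULL << (rank * 8 + file);
        ++file;
    }
    if (rank != 0 || file != 8) return std::nullopt;
    return planes;
}

// Task 1 (simplified): parse one CSV line, assumed to be "<fen>,<eval>".
// FEN strings contain no commas, so the last comma separates the label.
inline std::optional<Record> parseLine(const std::string& line) {
    const std::size_t comma = line.rfind(',');
    if (comma == std::string::npos) return std::nullopt;
    const auto planes = piecePlanes(line.substr(0, comma));
    if (!planes) return std::nullopt;
    const std::string label = line.substr(comma + 1);
    char* end = nullptr;
    const float eval = std::strtof(label.c_str(), &end);
    if (end == label.c_str()) return std::nullopt;  // no number parsed
    Record r;
    r.planes = *planes;
    r.eval = eval;
    return r;
}

// Task 3 (simplified): append a fixed-size record in host byte order.
// Fields are copied one by one so struct padding never reaches the output.
inline void appendRecord(std::vector<unsigned char>& out, const Record& r) {
    const std::size_t base = out.size();
    out.resize(base + kRecordBytes);
    std::memcpy(out.data() + base, r.planes.data(), kPlaneBytes);
    std::memcpy(out.data() + base + kPlaneBytes, &r.eval, sizeof(float));
}

// Fixed-size records make random access a single multiplication:
// record i starts at byte offset i * kRecordBytes.
inline Record readRecord(const std::vector<unsigned char>& buf, std::size_t i) {
    Record r;
    const unsigned char* p = buf.data() + i * kRecordBytes;
    std::memcpy(r.planes.data(), p, kPlaneBytes);
    std::memcpy(&r.eval, p + kPlaneBytes, sizeof(float));
    return r;
}
```

In this sketch a caller would pass each CSV line to `parseLine`, skip lines that return `std::nullopt`, and hand the rest to `appendRecord`. The fixed record size is the design choice that matters for the dataloader: a reader can seek to any sample without scanning the file or consulting an index.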
A typical run processes 100M+ positions in under an hour on a modern workstation.
Directory Layout
preprocessing/
├── CMakeLists.txt / Makefile
├── src/
│   ├── main.cpp                         # Entry point, CSV reader
│   ├── FactorizedFeatureExtractor.cpp
│   ├── FeatureExtractor.cpp             # Legacy v1 extractor
│   └── Writers.cpp                      # Binary serialisation
├── include/
│   ├── FactorizedFeatureExtractor.hpp
│   ├── FeatureExtractor.hpp
│   ├── ExpertRouter.hpp                 # Position → expert label
│   ├── EvalNormalizer.hpp               # Centipawn normalisation
│   ├── WDLNormalizer.hpp                # WDL target computation
│   └── Writers.hpp
└── lib/chess/                           # Lightweight chess logic