The Preprocessing module is a high-performance C++ pipeline that converts millions of raw chess positions (FEN strings with evaluation labels) into compact binary datasets that the training pipeline can consume directly. It is written in C++ rather than Python because feature extraction works on bitboards (64-bit occupancy masks), and the shifts, masks, and population counts involved benefit substantially from low-level optimisation in compiled code.
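
To make the bitboard argument concrete, here is a minimal sketch of the kind of operation that is cheap on 64-bit words. It is not taken from this module: the names `whitePawnAttacks` and `popcount64`, and the square mapping (a1 = bit 0, h8 = bit 63), are assumptions chosen for illustration.

```cpp
#include <cstdint>

// Assumed square mapping: bit index = rank * 8 + file, so a1 = 0 and h8 = 63.
constexpr std::uint64_t kFileA = 0x0101010101010101ULL;
constexpr std::uint64_t kFileH = 0x8080808080808080ULL;

// Squares attacked by every white pawn at once, using two shifts and two masks.
// Pawns on the a-file cannot capture to the left, pawns on the h-file cannot
// capture to the right, so those files are masked out before shifting.
constexpr std::uint64_t whitePawnAttacks(std::uint64_t pawns) {
    return ((pawns & ~kFileA) << 7) | ((pawns & ~kFileH) << 9);
}

// Portable population count (Kernighan's method): one iteration per set bit.
inline int popcount64(std::uint64_t x) {
    int n = 0;
    while (x != 0) {
        x &= x - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}
```

A whole-board attack map costs a handful of instructions here, whereas a per-square loop in an interpreted language would pay interpreter overhead on each of the 64 squares.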

Overview

The preprocessing pipeline performs three key tasks:

Parse FEN strings and Stockfish evaluation labels from CSV input.
Extract spatial feature planes from each position (piece maps, distance metrics, pawn structures, slider alignments).
Serialise the features and labels into a packed binary format optimised for random access by the PyTorch dataloader.
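
The three tasks above can be sketched end to end in a few dozen lines. This is an illustrative sketch only, not the module's actual code or file format: the `Record` layout, the `FEN,centipawns` CSV shape, and the function names `parseLine`, `writeRecord`, `readRecord`, and `demoRoundTrip` are all assumptions. The only "feature" extracted is one occupancy bitboard per piece type, and mate scores or other non-integer labels are not handled.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical fixed-size record: 12 piece-occupancy bitboards plus a label.
struct Record {
    std::array<std::uint64_t, 12> planes{};  // P N B R Q K, then p n b r q k
    std::int32_t eval_cp = 0;                // centipawn label from the CSV
    std::uint32_t reserved = 0;              // explicit padding, keeps the size at 104 bytes
};
static_assert(sizeof(Record) == 104, "records must be fixed-size for seeking");

// Map a FEN piece letter to a plane index, or -1 if it is not a piece.
int planeIndex(char c) {
    static const std::string kPieces = "PNBRQKpnbrqk";
    const auto pos = kPieces.find(c);
    return pos == std::string::npos ? -1 : static_cast<int>(pos);
}

// Tasks 1 and 2: parse one "FEN,centipawns" line and fill the piece planes.
// Returns false on malformed input. Assumed mapping: a1 = bit 0, h8 = bit 63.
bool parseLine(const std::string& line, Record& out) {
    const auto comma = line.rfind(',');
    if (comma == std::string::npos) return false;
    out = Record{};
    try {
        out.eval_cp = std::stoi(line.substr(comma + 1));
    } catch (const std::logic_error&) {  // invalid_argument or out_of_range
        return false;
    }
    int rank = 7, file = 0;  // the FEN board field starts at a8
    for (char c : line.substr(0, comma)) {
        if (c == ' ') break;  // end of the board field
        if (c == '/') { --rank; file = 0; continue; }
        if (std::isdigit(static_cast<unsigned char>(c))) { file += c - '0'; continue; }
        const int plane = planeIndex(c);
        if (plane < 0 || rank < 0 || file > 7) return false;
        out.planes[plane] |= std::uint64_t{1} << (rank * 8 + file);
        ++file;
    }
    return rank == 0 && file == 8;
}

// Task 3: append a record as raw bytes.
void writeRecord(std::ostream& os, const Record& r) {
    os.write(reinterpret_cast<const char*>(&r), sizeof(Record));
}

// Random access: record i lives at byte offset i * sizeof(Record).
bool readRecord(std::istream& in, std::size_t index, Record& out) {
    in.seekg(static_cast<std::streamoff>(index * sizeof(Record)));
    in.read(reinterpret_cast<char*>(&out), sizeof(Record));
    return static_cast<bool>(in);
}

// All three tasks on in-memory data: parse two lines, write both, read the second back.
void demoRoundTrip() {
    const std::string start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1,15";
    const std::string kings = "4k3/8/8/8/8/8/8/4K3 w - - 0 1,0";
    std::stringstream file(std::ios::in | std::ios::out | std::ios::binary);
    Record a, b, back;
    const bool parsed = parseLine(start, a) && parseLine(kings, b);
    assert(parsed);
    writeRecord(file, a);
    writeRecord(file, b);
    const bool read = readRecord(file, 1, back);
    assert(read && back.planes[5] == (std::uint64_t{1} << 4));  // white king on e1
}
```

Fixed-size records are what make random access cheap in this sketch: a reader can jump straight to any sample by multiplying its index by the record size, with no index file and no scanning. Writing the struct as raw bytes ties the file to the writer's endianness, which a real format would need to pin down.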

A typical run processes 100M+ positions in under an hour on a modern workstation, which works out to a sustained rate of roughly 28,000 positions per second or more.

Detailed documentation for each stage:

Feature Extraction — Spatial feature planes computed per position
Binary Output and Pipeline — Binary serialisation and pipeline usage

Directory Layout

preprocessing/
├── CMakeLists.txt / Makefile
├── src/
│   ├── main.cpp                  # Entry point, CSV reader
│   ├── FactorizedFeatureExtractor.cpp
│   ├── FeatureExtractor.cpp      # Legacy v1 extractor
│   └── Writers.cpp               # Binary serialisation
├── include/
│   ├── FactorizedFeatureExtractor.hpp
│   ├── FeatureExtractor.hpp
│   ├── ExpertRouter.hpp          # Position → expert label
│   ├── EvalNormalizer.hpp        # Centipawn normalisation
│   ├── WDLNormalizer.hpp         # WDL target computation
│   └── Writers.hpp
└── lib/chess/                    # Lightweight chess logic