io-chess
UCI chess engine
Binary Output and Pipeline

Binary File Format

The Writers module serialises extracted features and labels into a custom binary .bin format optimised for fast random access by the PyTorch dataloader. Each record in the file contains the following fields:

Field              Type        Size                               Description
Feature planes     float32     N × 8 × 8 values (N × 256 bytes)   Dense spatial features
Evaluation target  int16       2 bytes                            Stockfish centipawn evaluation
WDL target         float32[3]  12 bytes                           Win/Draw/Loss probabilities
Expert label       uint8       1 byte                             Routing category (0-3)

Records are fixed-size, so the dataloader can reach any record in O(1) through memory-mapped I/O: record i starts at byte offset i × record_size. No per-record headers or delimiters are used.
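
The following is a minimal sketch of how a reader might memory-map such a file with NumPy. It assumes the fields are stored in the order listed above, packed without padding, in little-endian byte order, and that N is known in advance. None of these details is confirmed on this page. NUM_PLANES = 12 is a placeholder, and record_dtype and open_records are hypothetical names, not part of the Writers module. Under these assumptions each record occupies N × 256 + 15 bytes.

```python
import numpy as np

NUM_PLANES = 12  # hypothetical N; the real value depends on the feature format


def record_dtype(num_planes: int) -> np.dtype:
    """Packed record layout (ASSUMED field order, no padding, little-endian)."""
    return np.dtype([
        ("planes", "<f4", (num_planes, 8, 8)),  # N x 8 x 8 float32 feature planes
        ("eval_cp", "<i2"),                     # int16 centipawn evaluation
        ("wdl", "<f4", (3,)),                   # float32[3] win/draw/loss
        ("expert", "u1"),                       # uint8 routing category (0-3)
    ])


def open_records(path: str, num_planes: int = NUM_PLANES) -> np.memmap:
    """Memory-map a .bin file as a read-only array of fixed-size records."""
    # np.memmap raises ValueError if the file size is not a whole
    # multiple of the record size, which catches a wrong num_planes early.
    return np.memmap(path, dtype=record_dtype(num_planes), mode="r")
```

With this layout, open_records("train.bin")[i]["planes"] would read only the bytes of record i, which is what makes shuffled access cheap in a dataloader.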

Target Normalisation

Raw Stockfish centipawn evaluations span a wide range (beyond ±10,000) and contain outliers such as forced mates and tablebase scores. The preprocessing pipeline applies two normalisation steps:

  • EvalNormalizer: Clamps evaluations to a configurable centipawn range and scales them to [-1, +1].
  • WDLNormalizer: Converts centipawn scores to Win/Draw/Loss probability vectors using a logistic model calibrated to match real game outcomes at different evaluation ranges.
See also
EvalNormalizer, WDLNormalizer
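
The two steps above can be sketched as follows. This is an illustration only, not the EvalNormalizer or WDLNormalizer implementation: the clamp range (2000 cp), the logistic offset and scale (100 cp and 150 cp), and the function names are all hypothetical, and the logistic parameters are not calibrated against real game outcomes.

```python
import math


def normalize_eval(cp: float, clamp_cp: float = 2000.0) -> float:
    """Clamp a centipawn score to [-clamp_cp, +clamp_cp], then scale to [-1, +1]."""
    clamped = max(-clamp_cp, min(clamp_cp, cp))
    return clamped / clamp_cp


def _sigmoid(x: float) -> float:
    # tanh form of the logistic function; avoids overflow for large |x|
    return 0.5 * (1.0 + math.tanh(0.5 * x))


def cp_to_wdl(cp: float, offset_cp: float = 100.0, scale_cp: float = 150.0):
    """Map a centipawn score to (win, draw, loss) with a two-sided logistic model.

    offset_cp and scale_cp are ASSUMED placeholders, not calibrated values.
    """
    win = _sigmoid((cp - offset_cp) / scale_cp)
    loss = _sigmoid((-cp - offset_cp) / scale_cp)
    draw = max(0.0, 1.0 - win - loss)  # guard against tiny negative rounding error
    return win, draw, loss
```

In this model a positive offset guarantees that win + loss never exceeds 1, so the draw probability is largest near 0 cp and shrinks as the evaluation becomes decisive for either side.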

Running the Pipeline

The preprocessing binary reads CSV lines from stdin, each containing a FEN and its evaluation, and writes packed .bin files to disk:

# Basic usage
zstdcat lichess_evals.csv.zst | ./preprocess --output train.bin --format factorized
# Multi-drive parallel generation
./split_preprocess_two_drives.sh input.csv /mnt/ssd1/out /mnt/ssd2/out

For very large datasets (100M+ positions), the included split_preprocess_two_drives.sh script splits the input across two output drives to maximise I/O throughput.
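
How the script divides the input is not described here. As one plausible illustration, the sketch below deals input lines round-robin to several output streams, so that each drive receives an equal share of the work. The function name and the round-robin strategy are assumptions made for this example, not the script's actual logic.

```python
from typing import Iterable, List, Sequence, TextIO


def split_round_robin(lines: Iterable[str], outputs: Sequence[TextIO]) -> List[int]:
    """Deal lines to the output streams in turn; return lines written per stream."""
    counts = [0] * len(outputs)
    for i, line in enumerate(lines):
        k = i % len(outputs)      # index of the stream that receives this line
        outputs[k].write(line)
        counts[k] += 1
    return counts
```

Each resulting shard could then be piped into its own preprocess instance writing to a different drive.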

See also
Writers