Binary File Format

The Writers module serialises extracted features and labels into a custom binary .bin format optimised for fast random access by the PyTorch dataloader. Each record in the file contains:

Field	Type	Size	Description
Feature planes	float32	N × 8 × 8 values (4 bytes each)	Dense spatial features
Evaluation target	int16	2 bytes	Stockfish centipawn evaluation
WDL target	float32[3]	12 bytes	Win/Draw/Loss probabilities
Expert label	uint8	1 byte	Routing category (0-3)

Records are fixed-size, enabling O(1) random access via memory-mapped I/O in the dataloader. No per-record headers or delimiters are used.
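
The fixed-size layout can be illustrated with a minimal Python sketch that uses only the standard library. It assumes the fields are stored in the order of the table above, little-endian and without padding, and it uses a hypothetical plane count of N = 12; none of these details are confirmed by this page, so check them against the Writers module before relying on them. The names RECORD and read_record are illustrative and are not part of the project.

```python
import mmap
import struct
import tempfile

N_PLANES = 12  # hypothetical plane count; the real N depends on the feature format

# Assumed layout: "<" means little-endian with no padding, followed by the
# fields in table order: N*64 float32 features, int16 evaluation,
# three float32 WDL values, uint8 expert label.
RECORD = struct.Struct(f"<{N_PLANES * 64}fh3fB")


def read_record(buf, index):
    """Return (features, eval_cp, wdl, expert) for the record at `index`."""
    # Fixed-size records: the byte offset is simply index * record size.
    values = RECORD.unpack_from(buf, index * RECORD.size)
    n = N_PLANES * 64
    return values[:n], values[n], values[n + 1:n + 4], values[n + 4]


# Round-trip demo: write three synthetic records, then memory-map the file.
with tempfile.TemporaryFile() as f:
    for i in range(3):
        feats = [float(i)] * (N_PLANES * 64)
        f.write(RECORD.pack(*feats, 100 * i, 0.5, 0.25, 0.25, i))
    f.flush()
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        num_records = len(mm) // RECORD.size
        features, eval_cp, wdl, expert = read_record(mm, 2)
```

Under these assumptions each record occupies 256 × N + 15 bytes, and the number of records is the file size divided by that value, which is why no index or per-record header is needed.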

Target Normalisation

Raw Stockfish centipawn evaluations span a wide range (beyond ±10000) and contain outliers such as forced-mate and tablebase scores. The preprocessing pipeline applies two normalisation steps:

EvalNormalizer: Clamps evaluations to a configurable centipawn range and scales them to [-1, +1].
WDLNormalizer: Converts centipawn scores to Win/Draw/Loss probability vectors using a logistic model calibrated to match real game outcomes at different evaluation ranges.
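
The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the project's implementation: the function names, the clamp range of 2000 centipawns, and the logistic parameters (draw_margin_cp, scale_cp) are all hypothetical placeholders, and the real WDLNormalizer is described as calibrated against game outcomes, which this sketch does not attempt.

```python
import math


def normalize_eval(cp, clamp_cp=2000):
    """Clamp a centipawn score to [-clamp_cp, clamp_cp] and scale to [-1, 1].

    clamp_cp is a hypothetical default; the pipeline's range is configurable.
    """
    clamped = max(-clamp_cp, min(clamp_cp, cp))
    return clamped / clamp_cp


def cp_to_wdl(cp, draw_margin_cp=100.0, scale_cp=200.0, clamp_cp=10000):
    """Map a centipawn score to (win, draw, loss) using two logistic curves.

    The margin and scale are placeholder values, not calibrated constants.
    """
    # Clamp first so extreme scores (mates, tablebase hits) cannot overflow exp().
    cp = max(-clamp_cp, min(clamp_cp, cp))
    win = 1.0 / (1.0 + math.exp(-(cp - draw_margin_cp) / scale_cp))
    loss = 1.0 / (1.0 + math.exp((cp + draw_margin_cp) / scale_cp))
    # Draw takes the remaining mass; it is non-negative when draw_margin_cp >= 0.
    draw = 1.0 - win - loss
    return win, draw, loss


win, draw, loss = cp_to_wdl(300)
```

Separating the win and loss curves by a draw margin is one common way to obtain three probabilities from a single score; a single sigmoid would yield only an expected score, not a draw probability.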

See also: EvalNormalizer, WDLNormalizer

Running the Pipeline

The preprocessing binary reads CSV lines, each pairing a FEN with its evaluation, from stdin and writes packed .bin files to disk:

# Basic usage
zstdcat lichess_evals.csv.zst | ./preprocess --output train.bin --format factorized
 
# Multi-drive parallel generation
./split_preprocess_two_drives.sh input.csv /mnt/ssd1/out /mnt/ssd2/out

For very large datasets (100M+ positions), the included split_preprocess_two_drives.sh script splits the input across two output drives to maximise I/O throughput.

See also: Writers