io-chess
UCI chess engine
prepare_dataset.py is a Python script that downloads and filters the chess evaluation dataset in a single pass. It fetches the dataset from HuggingFace, iterates over it in batches, applies quality filters, and writes the result to a clean CSV file ready for the preprocessing pipeline.
The script downloads the mateuszgrzyb/lichess-stockfish-normalized dataset from HuggingFace using the datasets library:
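A minimal sketch of that call, assuming the dataset exposes a single `train` split (the split name and the helper name `load_evaluations` are assumptions, not taken from the script):

```python
DATASET_ID = "mateuszgrzyb/lichess-stockfish-normalized"


def load_evaluations(split="train"):
    """Download the dataset on first use, then reuse the local cache."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the split is called "train"; check the dataset card if this fails.
    return load_dataset(DATASET_ID, split=split)


if __name__ == "__main__":
    dataset = load_evaluations()
    print(dataset)  # shows the column names and the number of rows
```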
The HuggingFace library caches the dataset locally after the first download (~50 GB compressed), so subsequent runs skip the download step entirely.
The script applies the following quality filters during processing:
| Filter | Condition | Rationale |
|---|---|---|
| Minimum depth | depth > 10 | Shallow evaluations are unreliable and add noise to training |
Positions with a mate score (mate field is not null) are retained and formatted as #N (e.g. #3 for mate-in-3). All other positions use the raw centipawn value.
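The depth filter and the score formatting described above can be sketched as two small functions. This is an illustration rather than the script's actual code: the function names and the `cp` parameter name are assumptions, and the source does not say how a negative mate distance is written, so the sketch simply prefixes whatever value it receives with `#`.

```python
MIN_DEPTH = 10  # a row is kept only when depth > MIN_DEPTH


def keep_row(depth):
    """Return True when the evaluation is deep enough to keep."""
    return depth is not None and depth > MIN_DEPTH


def format_score(cp, mate):
    """Format a mate score as '#N'; otherwise return the raw centipawn value."""
    if mate is not None:
        return f"#{mate}"
    return str(cp)
```

For example, `keep_row(10)` is false because the comparison is strict, while `format_score(None, 3)` yields `#3`.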
The script writes a two-column CSV file:
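As an illustration of what such a file could look like, the sketch below writes a FEN column and a score column with Python's `csv` module. The header names `fen` and `score`, the presence of a header row, and the two sample evaluations are assumptions made for the example; they are not taken from the script or the dataset.

```python
import csv
import io

# Hypothetical header names; the source only says the file has two columns.
HEADER = ("fen", "score")


def write_rows(out_file, rows):
    """Write (fen, score) pairs as CSV, preceded by a header row."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


# Made-up sample rows: one centipawn score and one mate score.
buffer = io.StringIO()
write_rows(buffer, [
    ("rnbqkbnr/pppppppp/8/8/4P3/8/PPPPPPPP/RNBQKBNR b KQkq - 0 1", "35"),
    ("6k1/5ppp/8/8/8/8/5PPP/3R2K1 w - - 0 1", "#1"),
])
csv_text = buffer.getvalue()
print(csv_text)
```

FEN strings contain no commas, so the writer emits them unquoted.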
This CSV is the input for the C++ preprocessing binary, which converts the FEN strings into packed binary feature datasets.
After processing, the script generates a histogram (depth_histogram.png) showing the distribution of evaluation depths across the filtered dataset. This helps verify that shallow evaluations were correctly removed and that the remaining data has sufficient depth.
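One way to build such a histogram without keeping every depth value in memory is to tally depths in a `Counter` while processing and plot the totals at the end. The sketch below assumes matplotlib is used for plotting; the source does not name the plotting library, and both function names are hypothetical.

```python
from collections import Counter


def count_depths(depths):
    """Tally how many kept positions were evaluated at each depth."""
    return Counter(depths)


def save_depth_histogram(counts, path="depth_histogram.png"):
    """Draw one bar per depth value and save the figure as a PNG."""
    # Imported lazily; matplotlib is only needed when plotting.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    depths = sorted(counts)
    fig, ax = plt.subplots()
    ax.bar(depths, [counts[d] for d in depths])
    ax.set_xlabel("Evaluation depth")
    ax.set_ylabel("Positions")
    fig.savefig(path)
    plt.close(fig)
```

If the filter worked, the smallest key in the counter is greater than 10, which is a quick check to run before looking at the image.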
The script processes the dataset in batches of 100,000 rows using dataset.iter(batch_size=100_000), which iterates over the memory-mapped file without loading the entire dataset into RAM. Each batch is filtered, formatted, and flushed to disk immediately, so memory usage stays roughly constant at about 200-400 MB regardless of dataset size.
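The batch loop can be sketched as below. `Dataset.iter` yields each batch as a dict mapping column names to lists, so the function accepts any iterable of such dicts. The column names `fen` and `cp` are assumptions (`depth` and `mate` come from the description above), and `process_batches` is a hypothetical name rather than the script's own function.

```python
import csv
import io


def process_batches(batches, out_file, min_depth=10):
    """Filter each batch and flush it to out_file; return the rows written."""
    writer = csv.writer(out_file, lineterminator="\n")
    written = 0
    for batch in batches:
        # Each batch is a dict mapping column name -> list of values.
        columns = zip(batch["fen"], batch["depth"], batch["cp"], batch["mate"])
        rows = []
        for fen, depth, cp, mate in columns:
            if depth <= min_depth:
                continue  # drop shallow evaluations
            rows.append((fen, f"#{mate}" if mate is not None else cp))
        writer.writerows(rows)
        out_file.flush()  # only one batch is ever held in memory
        written += len(rows)
    return written


# In-memory stand-in for dataset.iter(batch_size=100_000); FENs are placeholders.
demo_batches = [
    {"fen": ["a", "b"], "depth": [8, 22], "cp": [15, -40], "mate": [None, None]},
    {"fen": ["c"], "depth": [30], "cp": [None], "mate": [3]},
]
out = io.StringIO()
rows_written = process_batches(demo_batches, out)
```

With the real dataset, the first argument would be `dataset.iter(batch_size=100_000)` and `out_file` an open CSV file.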