Overview

prepare_dataset.py is a Python script that downloads and filters the chess evaluation dataset in a single pass. It loads the dataset from HuggingFace, iterates over it in batches, applies quality filters, and writes the result to a CSV file ready for the preprocessing pipeline.

Data Source

The script downloads the mateuszgrzyb/lichess-stockfish-normalized dataset from HuggingFace using the datasets library:

from datasets import load_dataset
dataset = load_dataset("mateuszgrzyb/lichess-stockfish-normalized", split="train")

The datasets library caches the dataset locally after the first download (~50 GB compressed), so subsequent runs reuse the cached copy and skip the download step entirely.

Filtering

The script applies the following quality filters during processing:

Filter	Condition	Rationale
Minimum depth	depth > 10	Shallow evaluations are unreliable and add noise to training

Positions with a mate score (mate field is not null) are retained and formatted as #N (e.g. #3 for mate-in-3). All other positions use the raw centipawn value.
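The depth filter and the eval formatting described above can be sketched as two small functions. This is a minimal illustration, not the script's actual code: the parameter names cp, mate, and depth are assumed to mirror the dataset's columns, and rendering a negative mate value as "#-N" is an assumption, since the text only shows the positive case.

```python
from typing import Optional

MIN_DEPTH = 10  # threshold from the filter table: keep rows with depth > 10


def keep_row(depth: int) -> bool:
    """Return True when the evaluation is deep enough to keep."""
    return depth > MIN_DEPTH


def format_eval(cp: Optional[int], mate: Optional[int]) -> str:
    """Format the eval column: '#N' for mate scores, raw centipawns otherwise."""
    if mate is not None:
        # Assumption: the sign of the mate value is kept as-is (e.g. '#-3').
        return f"#{mate}"
    return str(cp)


print(keep_row(10), keep_row(11))
print(format_eval(35, None))
print(format_eval(None, 3))
```

Note that the comparison is strict, so a position evaluated at exactly depth 10 is dropped.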

Output Format

The script writes a two-column CSV file:

fen,eval
rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1,35
r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3,52
...

This CSV is the input for the C++ preprocessing binary, which converts the FEN strings into packed binary feature datasets.
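Before handing the file to the C++ binary, it can be useful to check that every row has the expected shape. The sketch below is a hypothetical helper, not part of prepare_dataset.py; it assumes centipawn values are written as integers and that mate scores use the #N form.

```python
import csv
import io
import re

# Assumption: eval is either an integer centipawn value or '#N' for mate.
EVAL_RE = re.compile(r"^(#-?\d+|-?\d+)$")


def invalid_rows(lines):
    """Yield (line_number, row) for rows that do not look like 'fen,eval'."""
    reader = csv.reader(lines)
    header = next(reader, None)
    if header != ["fen", "eval"]:
        yield 1, header
    for lineno, row in enumerate(reader, start=2):
        ok = (
            len(row) == 2
            and len(row[0].split(" ")) == 6  # a full FEN has six fields
            and EVAL_RE.match(row[1]) is not None
        )
        if not ok:
            yield lineno, row


sample = io.StringIO(
    "fen,eval\n"
    "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1,35\n"
    "r1bqkb1r/pppppppp/2n2n2/8/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3,52\n"
    "not a fen,abc\n"
)
bad = list(invalid_rows(sample))
print(bad)
```

For the real file, passing an open file handle (opened with newline="") in place of the StringIO sample works the same way, since csv.reader accepts any iterable of lines.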

Depth Distribution

After processing, the script generates a histogram (depth_histogram.png) showing the distribution of evaluation depths across the filtered dataset. This helps verify that shallow evaluations were correctly removed and that the remaining data has sufficient depth.
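One way to produce such a histogram is to count depths while rows are written and plot the counts at the end. The following is a sketch under assumptions: the function names are hypothetical, and the bar-chart styling is illustrative rather than what the script actually draws.

```python
from collections import Counter


def depth_counts(depths):
    """Count how many kept positions were evaluated at each depth."""
    return Counter(depths)


def save_histogram(counts, path="depth_histogram.png"):
    """Plot the depth counts as a bar chart and save it as a PNG."""
    import matplotlib
    matplotlib.use("Agg")  # headless backend, no display needed
    import matplotlib.pyplot as plt

    depths = sorted(counts)
    fig, ax = plt.subplots(figsize=(8, 4))
    ax.bar(depths, [counts[d] for d in depths])
    ax.set_xlabel("Evaluation depth")
    ax.set_ylabel("Positions")
    ax.set_title("Depth distribution after filtering")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)


counts = depth_counts([12, 12, 18, 22, 22, 22])
# Nothing at or below the cutoff of 10 should survive the filter.
print(min(counts) > 10)
```

Calling save_histogram(counts) then writes the PNG; it requires matplotlib, which is why the import sits inside the function. Keeping a Counter instead of a list of all depths means the memory cost depends on the number of distinct depths, not the number of rows.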

Usage

# Install dependencies
pip install datasets tqdm matplotlib
 
# Run the script (downloads on first run, ~50 GB)
cd data
python prepare_dataset.py
 
# Output: chess_dataset.csv + depth_histogram.png

Memory Efficiency

The script processes the dataset in batches of 100,000 rows using dataset.iter(batch_size=100_000), which iterates over the memory-mapped file without loading the entire dataset into RAM. Each batch is filtered, formatted, and flushed to disk immediately, keeping memory usage constant at approximately 200-400 MB regardless of dataset size.
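The batch loop described above might look like the sketch below. It is an illustration only: the column names fen, depth, cp, and mate are assumptions, and the function takes any iterable of dict-of-lists batches, which is the shape dataset.iter() yields, so it can be tried without downloading the dataset.

```python
import csv
import io


def write_filtered(batches, out, min_depth=10):
    """Filter dict-of-lists batches and write 'fen,eval' rows to out."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["fen", "eval"])
    kept = 0
    for batch in batches:
        rows = [
            (fen, f"#{mate}" if mate is not None else str(cp))
            for fen, depth, cp, mate in zip(
                batch["fen"], batch["depth"], batch["cp"], batch["mate"]
            )
            if depth > min_depth
        ]
        writer.writerows(rows)
        out.flush()  # push this batch out before loading the next one
        kept += len(rows)
    return kept


# Stand-in for dataset.iter(batch_size=100_000); the FEN strings are
# placeholders, not real positions.
fake_batches = [
    {"fen": ["fen-a", "fen-b"], "depth": [8, 20], "cp": [10, 35], "mate": [None, None]},
    {"fen": ["fen-c"], "depth": [30], "cp": [None], "mate": [3]},
]
buf = io.StringIO()
kept = write_filtered(fake_batches, buf)
print(kept)
print(buf.getvalue())
```

With the real dataset, the call would be along the lines of write_filtered(dataset.iter(batch_size=100_000), open("chess_dataset.csv", "w", newline="")). Only one batch of rows is held in memory at a time, which is what keeps the footprint flat.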