The Data Pipeline provides utilities for acquiring, filtering, and shuffling the chess evaluation dataset used by the Preprocessing and Training modules. All tools are designed for out-of-core operation and can handle datasets that exceed available RAM.

Overview

The data flow proceeds in two stages:

┌────────────────────────┐     ┌──────────────────────┐
│  Download & Filter     │────▶│       Shuffle        │
│  prepare_dataset.py    │     │  shuffle_dataset.sh  │
│  (HuggingFace → CSV)   │     │  (parallel, seeded)  │
└────────────────────────┘     └──────────────────────┘

Downloading and Filtering: fetch the dataset from HuggingFace, filter it, and export it to CSV (prepare_dataset.py)
Shuffling: parallel, seeded out-of-core shuffling of the CSV for training (shuffle_dataset.sh)
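
To make "out-of-core" concrete, here is a minimal Python sketch of one common approach, a two-pass bucket shuffle. It is not the implementation in shuffle_dataset.sh, which this page does not show. The function name shuffle_csv, the bucket count, and the assumption that the CSV has a single header line and no embedded newlines are illustrative only. The sketch is also single-process, whereas the real script is described as parallel.

```python
import os
import random
import tempfile


def shuffle_csv(src: str, dst: str, seed: int = 0, n_buckets: int = 16) -> int:
    """Shuffle the data rows of a CSV without loading the whole file into RAM.

    Hypothetical helper, not part of the project. Pass 1 scatters rows into
    randomly chosen bucket files; pass 2 shuffles each bucket in memory and
    appends it to `dst`. Only one bucket has to fit in memory at a time.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)  # seeded, so the output order is reproducible
    written = 0
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bucket_{i}.csv") for i in range(n_buckets)]
        buckets = [open(p, "w", newline="") for p in paths]
        try:
            with open(src, newline="") as f:
                header = f.readline()  # assumes exactly one header line
                for line in f:
                    if not line.endswith("\n"):
                        line += "\n"  # last row may lack a trailing newline
                    buckets[rng.randrange(n_buckets)].write(line)
        finally:
            for b in buckets:
                b.close()

        with open(dst, "w", newline="") as out:
            out.write(header)
            for p in paths:
                with open(p, newline="") as f:
                    rows = f.readlines()  # one bucket in memory at a time
                rng.shuffle(rows)
                out.writelines(rows)
                written += len(rows)
    return written
```

Peak memory is roughly the size of the largest bucket, so a larger n_buckets trades more temporary files for a smaller memory footprint. Reusing the same seed and bucket count reproduces the same row order.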

Directory Layout

data/
├── prepare_dataset.py     # Download, filter, and export to CSV
├── shuffle_dataset.sh     # Parallel out-of-core shuffling
├── depth_histogram.png    # Evaluation depth distribution (generated)
└── processed/             # Cached/processed dataset artifacts

Dataset

The project uses the mateuszgrzyb/lichess-stockfish-normalized dataset hosted on HuggingFace. This dataset contains millions of chess positions from Lichess games, each annotated with a Stockfish evaluation at varying search depths.

Field   Type         Description
fen     string       Board position in Forsyth-Edwards Notation
cp      int / null   Stockfish centipawn evaluation (null if mate)
mate    int / null   Mate-in-N score (null if not a forced mate)
depth   int          Stockfish search depth used for the evaluation
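
The field table can be mirrored in code as a small typed record. The sketch below is illustrative: the Position class and parse_row helper are hypothetical names, and the check that exactly one of cp and mate is set is an assumption inferred from the null rules in the table, not a documented guarantee of the dataset.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class Position:
    """One dataset row (hypothetical wrapper around the four fields)."""
    fen: str             # board position in Forsyth-Edwards Notation
    cp: Optional[int]    # centipawn evaluation, None for mate scores
    mate: Optional[int]  # mate-in-N score, None for centipawn scores
    depth: int           # Stockfish search depth

    @property
    def is_mate(self) -> bool:
        return self.mate is not None


def parse_row(row: Mapping[str, Any]) -> Position:
    """Convert a raw row mapping into a Position.

    Assumption: exactly one of cp / mate is non-null in every row.
    """
    cp, mate = row["cp"], row["mate"]
    if (cp is None) == (mate is None):
        raise ValueError("expected exactly one of cp / mate to be set")
    return Position(
        fen=row["fen"],
        cp=None if cp is None else int(cp),
        mate=None if mate is None else int(mate),
        depth=int(row["depth"]),
    )
```

If real rows turn out to violate the exclusivity assumption, the ValueError makes that visible early rather than letting ambiguous labels reach training.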

The dataset is loaded through the HuggingFace datasets library, which memory-maps its on-disk Arrow cache, so the full dataset is never held in RAM at once.
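
A minimal sketch of how loading and streaming rows to CSV might look with the datasets library. The dataset ID comes from this page; the split name "train", the depth-based filter with its min_depth threshold, and the export_csv helper are assumptions for illustration, not the contents of prepare_dataset.py.

```python
import csv
from typing import Any, Iterable, Mapping

FIELDS = ["fen", "cp", "mate", "depth"]


def load_positions(split: str = "train"):
    """Open the dataset. The split name "train" is an assumption."""
    # Imported lazily so the rest of the module works without the package.
    from datasets import load_dataset
    return load_dataset("mateuszgrzyb/lichess-stockfish-normalized", split=split)


def export_csv(rows: Iterable[Mapping[str, Any]], path: str, min_depth: int = 0) -> int:
    """Stream rows to a CSV file, keeping those with depth >= min_depth.

    Hypothetical helper. Rows are written one at a time, so memory use does
    not grow with dataset size. Returns the number of rows written.
    """
    kept = 0
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            if row["depth"] >= min_depth:
                writer.writerow(row)  # None values become empty cells
                kept += 1
    return kept
```

Under these assumptions, a call such as export_csv(load_positions(), "positions.csv", min_depth=20) would iterate the memory-mapped dataset row by row; the output path and threshold here are placeholders.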