io-chess
UCI chess engine
Data Pipeline

The Data Pipeline provides utilities for acquiring, filtering, and shuffling the chess evaluation dataset used by the Preprocessing and Training modules. All tools are designed for out-of-core operation and can handle datasets that exceed available RAM.

Overview

The data flow proceeds in two stages:

┌────────────────────────┐     ┌─────────────────────┐
│   Download & Filter    │────▶│       Shuffle       │
│   prepare_dataset.py   │     │ shuffle_dataset.sh  │
│  (HuggingFace → CSV)   │     │ (parallel, seeded)  │
└────────────────────────┘     └─────────────────────┘

Directory Layout

data/
├── prepare_dataset.py      # Download, filter, and export to CSV
├── shuffle_dataset.sh      # Parallel out-of-core shuffling
├── depth_histogram.png     # Evaluation depth distribution (generated)
└── processed/              # Cached/processed dataset artifacts

Dataset

The project uses the mateuszgrzyb/lichess-stockfish-normalized dataset hosted on HuggingFace. This dataset contains millions of chess positions from Lichess games, each annotated with a Stockfish evaluation at varying search depths.

Field   Type         Description
fen     string       Board position in Forsyth-Edwards Notation
cp      int / null   Stockfish centipawn evaluation (null if mate)
mate    int / null   Mate-in-N score (null if not a forced mate)
depth   int          Stockfish search depth used for the evaluation
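The schema above implies that each row carries exactly one score: either cp or mate is populated, never both. The following is a minimal sketch of a row check built on that reading. The names EvalRecord and is_usable and the min_depth threshold are hypothetical and are not taken from prepare_dataset.py, whose actual filter criteria are not described here.

```python
from typing import Optional, TypedDict


class EvalRecord(TypedDict):
    """One dataset row, mirroring the field table (hypothetical type name)."""
    fen: str
    cp: Optional[int]
    mate: Optional[int]
    depth: int


def is_usable(record: EvalRecord, min_depth: int = 0) -> bool:
    """Return True if the row has exactly one score and sufficient depth.

    min_depth is an illustrative assumption, not a documented parameter.
    """
    has_cp = record["cp"] is not None
    has_mate = record["mate"] is not None
    # Per the table, cp is null for mate positions and mate is null otherwise,
    # so exactly one of the two fields should be populated.
    return (has_cp != has_mate) and record["depth"] >= min_depth
```

A row with both scores missing, or both present, is rejected regardless of depth.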

The dataset is loaded via the HuggingFace datasets library using memory-mapped access, meaning the full dataset is never loaded into RAM at once.
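As a rough illustration of this access pattern, the sketch below opens the dataset with load_dataset and tallies rows per search depth by iterating one row at a time. The split name "train" and the helper names open_dataset and depth_counts are assumptions made for the example. They are not confirmed by the project, and this is not necessarily how depth_histogram.png is produced.

```python
from collections import Counter
from typing import Iterable, Mapping

DATASET_NAME = "mateuszgrzyb/lichess-stockfish-normalized"


def open_dataset(split: str = "train"):
    """Open the dataset; its Arrow files are memory-mapped from the local cache."""
    # Imported lazily so depth_counts() can be used without the library installed.
    from datasets import load_dataset

    # Assumption: the split is named "train".
    return load_dataset(DATASET_NAME, split=split)


def depth_counts(rows: Iterable[Mapping]) -> Counter:
    """Count rows per search depth, touching one row at a time."""
    counts: Counter = Counter()
    for row in rows:
        counts[row["depth"]] += 1
    return counts
```

Calling depth_counts(open_dataset()) would walk the whole dataset row by row. Only the counter grows with the data, and it holds one entry per distinct depth.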