io-chess
UCI chess engine
Shuffling

Overview

shuffle_dataset.sh is a Bash script that performs high-performance, seeded shuffling of multi-gigabyte CSV files. Training on sequentially ordered data, where positions from the same game are grouped together, leads to poor convergence, so the data must be thoroughly shuffled before it is fed to the preprocessing pipeline.

Algorithm

The script uses a pipeline approach that keeps memory usage bounded:

  1. Extract the CSV header and write it to the output file.
  2. Prepend a random key to each row using awk with a fixed seed for reproducibility.
  3. Sort by the random key using GNU sort with multi-threaded parallel merge (--parallel) and bounded RAM (-S).
  4. Strip the random key prefix using cut.
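
The four steps above can be sketched as a single Bash function. This is a minimal illustration rather than the actual contents of shuffle_dataset.sh: the function name shuffle_csv, the tab separator between key and row, and the %.17f key format are assumptions made for the example. It relies on GNU sort for --parallel, -S and -T.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not taken from the real script):
#   shuffle_csv INPUT OUTPUT SEED
# Honors CORES, RAM and TMP_DIR from the environment, with the documented defaults.
shuffle_csv() {
  local input=$1 output=$2 seed=$3
  local cores=${CORES:-4} ram=${RAM:-4G} tmp=${TMP_DIR:-.}

  # Step 1: write the CSV header to the output file.
  head -n 1 "$input" > "$output"

  # Step 2: prefix each data row with a seeded random key and a tab.
  # Step 3: sort numerically on the key with bounded RAM and parallel threads.
  # Step 4: drop the key, keeping everything after the first tab.
  tail -n +2 "$input" \
    | awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.17f\t%s\n", rand(), $0 }' \
    | LC_ALL=C sort -t $'\t' -k1,1n --parallel="$cores" -S "$ram" -T "$tmp" \
    | cut -f2- \
    >> "$output"
}
```

Calling shuffle_csv chess_dataset.csv shuffled.csv 42 twice with the same awk implementation should produce byte-identical output, since both the keys and the sort order are deterministic. Different awk implementations may generate different random sequences for the same seed.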

Progress is tracked via pv (pipe viewer), which displays throughput and a progress bar based on the total line count.
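
As a rough sketch of how line-based progress tracking with pv might look, the helpers below count the data rows and pass a stream through pv in line mode. The function names progress and count_rows are hypothetical, and the fallback to cat when pv is not installed is an addition for this example, not something the script is known to do.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: copy stdin to stdout, showing a progress bar
# sized to TOTAL lines when pv is available.
progress() {
  local total=$1
  if command -v pv > /dev/null 2>&1; then
    pv -l -s "$total"   # -l counts lines instead of bytes, -s sets the expected total
  else
    cat                 # assumption: keep the pipeline working without pv
  fi
}

# Hypothetical helper: number of data rows in a CSV (all lines minus the header).
count_rows() {
  local lines
  lines=$(wc -l < "$1")
  echo $(( lines - 1 ))
}
```

A pipeline would then read something like tail -n +2 "$INPUT" | progress "$(count_rows "$INPUT")" | ... with the keying, sort and cut stages following. pv draws its bar on stderr, so the data on stdout is unaffected.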

Configuration

The following parameters can be edited at the top of the script:

Parameter  Default            Description
INPUT      chess_dataset.csv  Path to the input CSV file
OUTPUT     shuffled.csv       Path to the output shuffled CSV
SEED       42                 Random seed for reproducible shuffling
CORES      4                  Number of parallel sort threads
RAM        4G                 Maximum RAM for the sort buffer
TMP_DIR    .                  Directory for temporary sort files
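
For orientation, the configuration block at the top of the script might look roughly like the following. The variable names and defaults come from the table above; the SORT_OPTS array and the mapping of TMP_DIR to sort's -T flag are assumptions made for this sketch.

```shell
#!/usr/bin/env bash

# Defaults as documented; edit these before running.
INPUT="chess_dataset.csv"   # input CSV
OUTPUT="shuffled.csv"       # shuffled output CSV
SEED=42                     # seed for reproducible shuffling
CORES=4                     # parallel sort threads
RAM="4G"                    # sort buffer size
TMP_DIR="."                 # where sort spills temporary files

# Assumed mapping of the parameters onto GNU sort options.
SORT_OPTS=(--parallel="$CORES" -S "$RAM" -T "$TMP_DIR")
```

Because sort spills to TMP_DIR once its buffer is full, that directory needs enough free space to hold temporary files on the order of the input size.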

Usage

cd data
bash shuffle_dataset.sh
# Output: shuffled.csv

Note

The progress bar will reach 100% and then appear to "pause" while GNU sort performs its final merge pass. This is normal: sort cannot write any output until it has read all of its input, so do not cancel the script at this point.