Overview

shuffle_dataset.sh is a Bash script that performs seeded, memory-bounded shuffling of multi-gigabyte CSV files. Training on sequentially ordered data, where positions from the same game are grouped together, leads to poor convergence, so the dataset should be shuffled thoroughly before it is fed to the preprocessing pipeline.

Algorithm

The script processes the file in four steps, streaming rows through a pipeline so that memory usage stays bounded:

1. Extract the CSV header and write it to the output file.
2. Prepend a random key to each data row using awk, seeded with a fixed value for reproducibility.
3. Sort by the random key using GNU sort, with multiple threads (--parallel) and a bounded memory buffer (-S).
4. Strip the random key prefix using cut.
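
The steps above can be sketched roughly as follows. This is a minimal illustration, not the script's actual source: the function name shuffle_csv, the tab delimiter, the "%.12f" key format, the LC_ALL=C setting, and passing TMP_DIR through -T are all assumptions. It relies on GNU sort for --parallel.

```shell
# Hypothetical helper: shuffle_csv INPUT OUTPUT [SEED]
# CORES, RAM and TMP_DIR are read from the environment, falling back to the
# documented defaults.
shuffle_csv() {
    local input="$1" output="$2" seed="${3:-42}"
    local cores="${CORES:-4}" ram="${RAM:-4G}" tmp_dir="${TMP_DIR:-.}"
    local tab
    tab=$(printf '\t')

    # Step 1: copy the header line to the output file.
    head -n 1 "$input" > "$output"

    # Steps 2-4: key every data row, sort by the key, then drop the key.
    tail -n +2 "$input" \
        | awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.12f\t%s\n", rand(), $0 }' \
        | LC_ALL=C sort -t "$tab" -k1,1 --parallel="$cores" -S "$ram" -T "$tmp_dir" \
        | cut -f2- >> "$output"
}
```

With the same seed and the same awk implementation, two runs produce the same order. Different awk implementations (gawk, mawk) generally produce different random sequences, so reproducibility holds per machine setup rather than universally.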

Progress is tracked via pv (pipe viewer), which displays throughput and a progress bar based on the total line count.
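
As an illustration of how a line-based progress bar might be wired up, the sketch below counts the input lines first and passes the total to pv. The helper name with_progress and the fallback to cat when pv is not installed are assumptions added for this example; the script itself may place pv differently.

```shell
# Hypothetical helper: stream a file to stdout, showing progress on stderr.
with_progress() {
    local input="$1" total
    # Total line count, used by pv to compute the percentage.
    total=$(wc -l < "$input")
    if command -v pv >/dev/null 2>&1; then
        # -l counts lines instead of bytes; -s sets the expected total.
        pv -l -s "$total" "$input"
    else
        # Assumed fallback: no progress display without pv.
        cat "$input"
    fi
}
```

The output of with_progress can be piped into the keying and sorting stages. Because pv only sees the rows being read, it has no visibility into the work sort does after the last row has passed through.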

Configuration

The following parameters can be edited at the top of the script:

Parameter	Default	Description
INPUT	chess_dataset.csv	Path to the input CSV file
OUTPUT	shuffled.csv	Path to the output shuffled CSV
SEED	42	Random seed for reproducible shuffling
CORES	4	Number of parallel sort threads
RAM	4G	Maximum RAM for the sort buffer
TMP_DIR	.	Directory for temporary sort files
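
For orientation, the configuration block at the top of the script might look like the sketch below, using the defaults from the table. The exact layout is an assumption, and check_config is a hypothetical pre-flight helper that is not confirmed to exist in the script.

```shell
# Assumed layout; values are the documented defaults.
INPUT="chess_dataset.csv"   # path to the input CSV file
OUTPUT="shuffled.csv"       # path to the shuffled output CSV
SEED=42                     # random seed for reproducible shuffling
CORES=4                     # number of parallel sort threads
RAM="4G"                    # maximum RAM for the sort buffer
TMP_DIR="."                 # directory for temporary sort files

# Hypothetical check: fail early if the input or temp directory is unusable.
check_config() {
    [ -r "$INPUT" ] || { echo "cannot read input: $INPUT" >&2; return 1; }
    [ -d "$TMP_DIR" ] && [ -w "$TMP_DIR" ] \
        || { echo "temp directory not writable: $TMP_DIR" >&2; return 1; }
}
```

Since sort spills to temporary files when the data exceeds the RAM buffer, TMP_DIR should point at a volume with free space on the order of the input file size.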

Usage

cd data
bash shuffle_dataset.sh
 
# Output: shuffled.csv

Note: The progress bar reaches 100% once all input rows have been read, and the script will then appear to "pause" while GNU sort performs its final merge pass and writes the output. This is normal; do not cancel the script at this point.