io-chess
UCI chess engine
shuffle_dataset.sh is a Bash script that performs high-performance, seeded shuffling of multi-gigabyte CSV files. Training on sequentially ordered data (positions from the same game grouped together) leads to poor convergence, so thorough shuffling is essential before feeding data to the preprocessing pipeline.
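The script uses a pipeline approach that keeps memory usage bounded: the sort buffer is capped at RAM, and data that does not fit in it spills to temporary files in TMP_DIR.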
Progress is tracked via pv (pipe viewer), which displays throughput and a progress bar based on the total line count.
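The stages of the pipeline are not listed here, so the following is only a minimal sketch of one common way to build a seeded, memory-bounded shuffle, not the actual contents of shuffle_dataset.sh. It assumes GNU coreutils sort (for `--parallel`, `-S`, and `-T`), an awk whose `rand()` is deterministic for a fixed `srand()` seed, and an input with no header row and no tab-free multi-line records. The function name `seeded_shuffle` is hypothetical. Each line is prefixed with a seeded random key, sorted by that key with a capped buffer, and then stripped of the key.

```shell
#!/usr/bin/env bash
# Sketch only: decorate-sort-undecorate shuffle, not the real shuffle_dataset.sh.
set -eu

# seeded_shuffle INPUT OUTPUT SEED CORES RAM TMP_DIR  (hypothetical helper)
seeded_shuffle() {
    local input=$1 output=$2 seed=$3 cores=$4 ram=$5 tmp_dir=$6
    local total tab
    total=$(wc -l < "$input")   # total line count, used to size the pv progress bar
    tab=$(printf '\t')

    # Use pv in line mode when it is installed; otherwise pass data through unchanged.
    if command -v pv >/dev/null 2>&1; then
        progress() { pv -l -s "$total"; }
    else
        progress() { cat; }
    fi

    # 1. Prefix every line with a fixed-width random key drawn from a seeded generator.
    # 2. Sort by that key; sort spills to tmp_dir once the buffer (ram) is full.
    # 3. Drop the key, leaving the original lines in shuffled order.
    progress < "$input" \
        | awk -v seed="$seed" 'BEGIN { srand(seed) } { printf "%.15f\t%s\n", rand(), $0 }' \
        | LC_ALL=C sort -t "$tab" -k1,1 --parallel="$cores" -S "$ram" -T "$tmp_dir" \
        | cut -f2- > "$output"
}
```

A call such as `seeded_shuffle chess_dataset.csv shuffled.csv 42 4 4G .` would mirror the documented defaults. The fixed-width key lets a plain byte-wise sort stand in for a numeric one, and running the same awk implementation with the same seed should reproduce the same order.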
The following parameters can be edited at the top of the script:
| Parameter | Default | Description |
|---|---|---|
| INPUT | chess_dataset.csv | Path to the input CSV file |
| OUTPUT | shuffled.csv | Path to the output shuffled CSV |
| SEED | 42 | Random seed for reproducible shuffling |
| CORES | 4 | Number of parallel sort threads |
| RAM | 4G | Maximum RAM for the sort buffer |
| TMP_DIR | . | Directory for temporary sort files |
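
As an illustration of the table above, the editable block at the top of the script might look like the sketch below. The variable names and defaults come from the table; the layout and the `check_params` helper are assumptions added for illustration and are not taken from the script itself.

```shell
#!/usr/bin/env bash
# Hypothetical layout of the editable block; values are the documented defaults.
INPUT="chess_dataset.csv"   # path to the input CSV file
OUTPUT="shuffled.csv"       # path to the output shuffled CSV
SEED=42                     # random seed for reproducible shuffling
CORES=4                     # number of parallel sort threads
RAM="4G"                    # maximum RAM for the sort buffer
TMP_DIR="."                 # directory for temporary sort files

# check_params is a hypothetical pre-flight check, not part of the documented script.
check_params() {
    [ -r "$INPUT" ] || { echo "cannot read INPUT: $INPUT" >&2; return 1; }
    [ -d "$TMP_DIR" ] && [ -w "$TMP_DIR" ] || {
        echo "TMP_DIR is not a writable directory: $TMP_DIR" >&2; return 1; }
    case "$CORES" in
        ''|*[!0-9]*|0) echo "CORES must be a positive integer" >&2; return 1 ;;
    esac
}
```

Because an external sort writes its overflow to disk, TMP_DIR likely needs free space on the order of the input file size, so pointing it at a fast, roomy disk is a reasonable choice for multi-gigabyte inputs.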