How to randomly sample a subset of a file
The shuf command (part of coreutils) can do this:
shuf -n 1000 file
At least for non-ancient versions (the feature was added in a commit from 2013), shuf will use reservoir sampling when appropriate, meaning it shouldn't run out of memory and uses a fast algorithm.
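If you want to see what reservoir sampling looks like, here is a minimal sketch of the classic Algorithm R in awk. This is only an illustration of the idea, not shuf's actual implementation; the file name and the sample size n=1000 are placeholders:

awk -v n=1000 '
  BEGIN { srand() }
  NR <= n { pool[NR] = $0; next }                          # fill the reservoir with the first n lines
  { i = int(rand() * NR) + 1; if (i <= n) pool[i] = $0 }   # afterwards, replace a random slot with probability n/NR
  END { for (j = 1; j <= n && j <= NR; j++) print pool[j] }
' file

Memory use stays proportional to n (the sample size), not to the size of the file, which is why this approach scales to very large inputs.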
If you have a very large file (which is a common reason to take a sample) you will find that:
- shuf exhausts memory
- Using $RANDOM won't work correctly if the file exceeds 32767 lines
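To see why the second point matters, consider a naive one-liner that uses $RANDOM to pick a random line number (a hypothetical sketch, assuming bash and a file named file):

# $RANDOM only yields values from 0 to 32767, so for a longer file
# the lines past 32767 can never be selected.
sed -n "$(( RANDOM % $(wc -l < file) + 1 ))p" file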
If you don't need "exactly" n sampled lines you can sample a ratio like this:
cat input.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > sample.txt
This uses constant memory, samples roughly 1% of the file (if you know the number of lines in the file you can adjust this factor to sample close to a target number of lines), and works with files of any size, but it will not return a precise number of lines, just a statistical ratio.
Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix
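To turn a target line count into a ratio, you can divide the target by the file's line count first. A sketch assuming input.txt and a target of roughly 1000 lines (the resulting sample size will still only be approximate):

n=1000
total=$(wc -l < input.txt)
# keep each line with probability n/total; awk's default action is to print the line
awk -v n="$n" -v total="$total" 'BEGIN { srand() } rand() < n / total' input.txt > sample.txt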
Similar to @Txangel's probabilistic solution above, but approaching 100x faster:
perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv
If you need high performance, an exact sample size, and can live with a sample gap at the end of the file, you can do something like the following (samples 1000 lines from a 1M-line file):
perl -ne 'print if (rand() < .0012)' huge_file.csv | head -1000 > sample.csv
... or indeed chain a second sampling method instead of head.
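For example, shuf -n can stand in for head; assuming the oversampled stream still contains at least 1000 lines, this avoids the end-of-file gap because the final 1000 lines are picked uniformly from matches spread across the whole file:

perl -ne 'print if (rand() < .0012)' huge_file.csv | shuf -n 1000 > sample.csv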