How to randomly sample a subset of a file
The shuf command (part of coreutils) can do this:
shuf -n 1000 file
At least for non-ancient versions (the feature was added in a commit from 2013), shuf will use reservoir sampling when appropriate, meaning it shouldn't run out of memory and uses a fast algorithm.
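If you want to see what reservoir sampling looks like, here is a minimal sketch of the classic Algorithm R in awk. This is only an illustration of the idea, not shuf's actual implementation; the file name and the sample size n=1000 are placeholders:

awk -v n=1000 '
  BEGIN { srand() }
  NR <= n { pool[NR] = $0; next }                          # fill the reservoir with the first n lines
  { i = int(rand() * NR) + 1; if (i <= n) pool[i] = $0 }   # afterwards, replace a random slot with probability n/NR
  END { for (j = 1; j <= n && j <= NR; j++) print pool[j] }
' file

Memory use stays proportional to n (the sample size), not to the size of the file, which is why this approach scales to very large inputs.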
If you have a very large file (which is a common reason to take a sample) you will find that:
- shuf exhausts memory
- Using $RANDOM won't work correctly if the file exceeds 32767 lines
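To see why the second point matters, consider a naive one-liner that uses $RANDOM to pick a random line number (a hypothetical sketch, assuming bash and a file named file):

# $RANDOM only yields values from 0 to 32767, so for a longer file
# the lines past 32767 can never be selected.
sed -n "$(( RANDOM % $(wc -l < file) + 1 ))p" file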
If you don't need "exactly" n sampled lines you can sample a ratio like this:
cat input.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .01) print $0}' > sample.txt
This uses constant memory, samples roughly 1% of the file (if you know the number of lines in the file you can adjust this factor to sample close to a target number of lines), and works with files of any size, but it will not return a precise number of lines, just a statistical ratio.
Note: The code comes from: https://stackoverflow.com/questions/692312/randomly-pick-lines-from-a-file-without-slurping-it-with-unix
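To turn a target line count into a ratio, you can divide the target by the file's line count first. A sketch assuming input.txt and a target of roughly 1000 lines (the resulting sample size will still only be approximate):

n=1000
total=$(wc -l < input.txt)
# keep each line with probability n/total; awk's default action is to print the line
awk -v n="$n" -v total="$total" 'BEGIN { srand() } rand() < n / total' input.txt > sample.txt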
Similar to @Txangel's probabilistic solution above, but approaching 100x faster:
perl -ne 'print if (rand() < .01)' huge_file.csv > sample.csv
If you need high performance, an exact sample size, and can live with a sample gap at the end of the file, you can do something like the following (samples 1000 lines from a 1M-line file):
perl -ne 'print if (rand() < .0012)' huge_file.csv | head -1000 > sample.csv
... or indeed chain a second sampling method instead of head.
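For example, shuf -n can stand in for head; assuming the oversampled stream still contains at least 1000 lines, this avoids the end-of-file gap because the final 1000 lines are picked uniformly from matches spread across the whole file:

perl -ne 'print if (rand() < .0012)' huge_file.csv | shuf -n 1000 > sample.csv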