Randomly draw a certain number of lines from a data file
This might not be the most efficient way but it works:
shuf file > tmp
head -n "$m" tmp > out1
tail -n +$(( m + 1 )) tmp > out2
where $m contains the number of lines to draw.
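If the intermediate file is unwanted, the same split can be done in a single pass. This is a minimal sketch, assuming GNU `shuf` and any POSIX `awk`; the helper name `split_random` and the scratch file names are made up for illustration and are not part of the answer above.

```shell
#!/bin/sh
# split_random M INPUT OUT1 OUT2 (hypothetical helper)
# Shuffle INPUT once; the first M shuffled lines go to OUT1, the rest to OUT2.
split_random() {
    : > "$3"; : > "$4"   # create both outputs even if one ends up empty
    shuf "$2" | awk -v m="$1" -v out1="$3" -v out2="$4" '
        NR <= m { print > out1; next }
                { print > out2 }'
}

# Demo on throwaway data: 10 numbered lines, 4 drawn at random.
workdir=$(mktemp -d)
seq 1 10 > "$workdir/data"
split_random 4 "$workdir/data" "$workdir/out1" "$workdir/out2"
```

As with the `head`/`tail` version, the lines come out in shuffled order, not in their original order.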
This bash/awk script chooses lines at random, and maintains the original sequence in both output files.
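awk -v m=4 -v N="$(wc -l <file)" -v out1=/tmp/out1 -v out2=/tmp/out2 \
'BEGIN{ srand()
  if (m > N) m = N              # cannot draw more lines than the file has
  while (ct < m) {              # pick m distinct line numbers
    lnb = 1 + int(rand()*N)
    if ( !(lnb in R) ) {
      R[lnb] = 1
      ct++ }
  }
} { if (NR in R) print > out1   # NR in R does not create empty entries
    else print > out2
}' file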
cat /tmp/out1
echo ========
cat /tmp/out2
Output, based on the data in the question:
12345
23456
200
600
========
67891
-20000
20
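A variation that avoids the retry loop is selection sampling: walk the file once and keep line i with probability (lines still needed) / (lines left). This is a swapped-in technique, not the script above, and the sketch assumes m is at most N. The scratch directory and the seven-line demo file are placeholders.

```shell
#!/bin/sh
# Demo data: 7 numbered lines (placeholder for the real file).
workdir=$(mktemp -d)
seq 1 7 > "$workdir/data"

m=4
N=$(wc -l < "$workdir/data")

# Selection sampling: keep the current line with probability
# (m - picked) / (N - NR + 1). Exactly m lines reach out1 when m <= N,
# and both outputs keep the original line order.
awk -v m="$m" -v N="$N" -v out1="$workdir/out1" -v out2="$workdir/out2" '
    BEGIN { srand() }
    {
        if (rand() * (N - NR + 1) < m - picked) { picked++; print > out1 }
        else print > out2
    }' "$workdir/data"
```

Because the decision is made line by line, no array of chosen line numbers is needed and the loop cannot spin on repeated random values.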
As with all things Unix, There's a Utility for That™.
Program of the day: split
split will split a file in many different ways: -b for bytes, -l for lines, -n for the number of output files. We will be using the -l option. Since you want to pick random lines and not just the first m, we'll sort the file randomly first. If you want to read about sort, refer to my answer here.
Now, the actual code. It's quite simple, really:
sort -R input_file | split -l "$m" - output_prefix
This will make two files, one with m lines and one with N-m lines, named output_prefixaa and output_prefixab. Make sure m is the size of the larger file you want, or you'll get several files of m lines each (and one with N % m lines).
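The effect of choosing m too small is easy to see on a ten-line file. A small sketch, assuming GNU `sort` (for `-R`) and `split`; the scratch directory and the `big_`/`small_` prefixes are made up for the demo. The `-` tells split to read standard input, so the next argument is taken as the prefix.

```shell
#!/bin/sh
# Demo data: 10 numbered lines in a scratch directory (names are placeholders).
workdir=$(mktemp -d)
seq 1 10 > "$workdir/input_file"

# m larger than N/2: exactly two files, 7 lines and 3 lines.
sort -R "$workdir/input_file" | split -l 7 - "$workdir/big_"
wc -l "$workdir"/big_*

# m smaller than N/2: four files of 3, 3, 3 and 1 (= 10 % 3) lines.
sort -R "$workdir/input_file" | split -l 3 - "$workdir/small_"
wc -l "$workdir"/small_*
```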
If you want to ensure that you use the correct size, here's a little code to do that:
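m=10 # size you want one file to be
N=$(wc -l < input_file) # redirect so wc prints only the count, not the file name
m=$(( m > N/2 ? m : N - m )) # always split off the larger part first
sort -R input_file | split -l "$m" - output_prefix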
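To see what the arithmetic does, here it is wrapped in a tiny function and run on a few values. The name `larger_part` is hypothetical and used only for this illustration.

```shell
#!/bin/sh
# larger_part N M (hypothetical helper): print the bigger of M and N - M,
# i.e. the line count to hand to split -l so that only two files result.
larger_part() {
    echo $(( $2 > $1 / 2 ? $2 : $1 - $2 ))
}

larger_part 10 3   # prints 7: chunks of 7 and 3 lines
larger_part 10 7   # prints 7: same split
larger_part 7 3    # prints 4: chunks of 4 and 3 lines
```

Note that when the requested m is the smaller part, the m-line sample is the second file (the one ending in ab), not the first.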
Edit: It has come to my attention that some sort implementations don't have a -R flag. If you have perl, you can substitute perl -e 'use List::Util qw/shuffle/; print shuffle <>;' for it.