How can I shuffle the lines of a text file on the Unix command line or in a shell script?
You can use shuf, at least on some systems (it doesn't appear to be in POSIX).
As jleedev pointed out, sort -R might also be an option, again only on some systems. It has been pointed out that sort -R doesn't really shuffle but instead sorts items according to their hash value.
[Editor's note: sort -R almost shuffles, except that duplicate lines / sort keys always end up next to each other. In other words, it is a true shuffle only when the input lines / keys are unique. While it's true that the output order is determined by hash values, the randomness comes from choosing a random hash function; see the manual.]
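To illustrate the editor's note, here is a minimal Python sketch of the idea of ordering lines by a randomly salted hash. It is not how sort -R is implemented; the helper name sort_by_random_hash and the choice of SHA-256 with a random salt are assumptions made purely for illustration.

```python
import hashlib
import os


def sort_by_random_hash(lines, salt=None):
    """Order lines by a salted hash (hypothetical helper, for illustration only)."""
    # Assumption: a fresh random salt stands in for "choosing a random hash function".
    if salt is None:
        salt = os.urandom(16)

    def key(line):
        return hashlib.sha256(salt + line.encode("utf-8")).digest()

    # Equal lines always hash to the same value, so they sort next to each other.
    return sorted(lines, key=key)


lines = ["a", "b", "a", "c", "b", "a"]
# The order of the groups varies from run to run, but equal lines stay adjacent.
print(sort_by_random_hash(lines))
```

With unique input lines the result behaves like a shuffle; with duplicates, each group of equal lines moves as a block.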
A Perl one-liner would be a simple version of Maxim's solution:
perl -MList::Util=shuffle -e 'print shuffle(<STDIN>);' < myfile
This answer complements the many great existing answers in the following ways:
- The existing answers are packaged into flexible shell functions:
  - The functions take not only stdin input, but alternatively also filename arguments.
  - The functions take extra steps to handle SIGPIPE in the usual way (quiet termination with exit code 141), as opposed to breaking noisily. This is important when piping the function output to a pipe that is closed early, such as when piping to head.
- A performance comparison is made.
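As a rough illustration of where exit code 141 comes from, the following POSIX-only sketch starts a child process that restores default SIGPIPE handling (as the Python-based function in this answer does), reads one line from it, and then closes the pipe early, much like head would. The names CHILD and run_and_close_early are made up for this example.

```python
import signal
import subprocess
import sys

# Child script: restore default SIGPIPE handling, then write lines forever.
CHILD = (
    "import signal, sys\n"
    "signal.signal(signal.SIGPIPE, signal.SIG_DFL)\n"
    "while True:\n"
    "    sys.stdout.write('line\\n')\n"
)


def run_and_close_early():
    """Read one line from the child, close the pipe, and return (line, return code)."""
    proc = subprocess.Popen([sys.executable, "-c", CHILD], stdout=subprocess.PIPE)
    first = proc.stdout.readline()
    proc.stdout.close()  # like head exiting once it has read enough
    return first, proc.wait()


first, returncode = run_and_close_early()
# subprocess reports termination by signal N as -N; shells report it as 128 + N.
shell_style_code = 128 - returncode if returncode < 0 else returncode
print(first, returncode, shell_style_code)
```

The child is terminated quietly by SIGPIPE rather than raising an error, which is the behavior the shell functions aim for.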
- POSIX-compliant function based on awk, sort, and cut, adapted from the OP's own answer:
shuf() { awk 'BEGIN {srand(); OFMT="%.17f"} {print rand(), $0}' "$@" |
sort -k1,1n | cut -d ' ' -f2-; }
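The awk + sort + cut pipeline above follows a decorate, sort, undecorate pattern: prefix each line with a random number, sort on that number, then strip it again. A minimal Python sketch of the same pattern follows; the function name shuffle_by_random_keys is hypothetical and not part of any answer.

```python
import random


def shuffle_by_random_keys(lines, rng=random):
    """Shuffle by tagging each line with a random key, sorting on it, then dropping it."""
    decorated = [(rng.random(), line) for line in lines]  # awk: print rand(), $0
    decorated.sort(key=lambda pair: pair[0])              # sort -k1,1n
    return [line for _, line in decorated]                # cut -d ' ' -f2-


print(shuffle_by_random_keys(["one", "two", "three", "four"]))
```

Because every line gets its own random key, duplicate lines are not forced next to each other, unlike with a hash-based ordering.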
- Perl-based function - adapted from Moonyoung Kang's answer:
shuf() { perl -MList::Util=shuffle -e 'print shuffle(<>);' "$@"; }
- Python-based function, adapted from scai's answer:
shuf() { python -c '
import sys, random, fileinput; from signal import signal, SIGPIPE, SIG_DFL;
signal(SIGPIPE, SIG_DFL); lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write("".join(lines))
' "$@"; }
See the bottom section for a Windows version of this function.
- Ruby-based function, adapted from hoffmanc's answer:
shuf() { ruby -e 'Signal.trap("SIGPIPE", "SYSTEM_DEFAULT");
puts ARGF.readlines.shuffle' "$@"; }
Performance comparison:
Note: These numbers were obtained on a late-2012 iMac with 3.2 GHz Intel Core i5 and a Fusion Drive, running OS X 10.10.3. While timings will vary with OS used, machine specs, awk implementation used (e.g., the BSD awk version used on OS X is usually slower than GNU awk and especially mawk), this should provide a general sense of relative performance.
The input file is a 1-million-line file produced with seq -f 'line %.0f' 1000000.
Times are listed in ascending order (fastest first):
- shuf: 0.090s
- Ruby 2.0.0: 0.289s
- Perl 5.18.2: 0.589s
- Python: 1.342s with Python 2.7.6; 2.407s (!) with Python 3.4.2
- awk + sort + cut: 3.003s with BSD awk; 2.388s with GNU awk (4.1.1); 1.811s with mawk (1.3.4)
For further comparison, the solutions not packaged as functions above:
- sort -R (not a true shuffle if there are duplicate input lines): 10.661s
  - allocating more memory doesn't seem to make a difference
- Scala: 24.229s
- bash loops + sort: 32.593s
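If you want a rough feel for such numbers on your own machine, a sketch along the following lines builds the same kind of input in memory and times only the shuffle step. It ignores process startup and file I/O, so its results are not comparable to the timings listed above; the helper names make_lines and time_shuffle are assumptions for illustration.

```python
import random
import time


def make_lines(n):
    """Build the same kind of input as seq -f 'line %.0f' n, one string per line."""
    return ["line %d\n" % i for i in range(1, n + 1)]


def time_shuffle(n):
    """Shuffle n generated lines in place and return (lines, elapsed seconds)."""
    lines = make_lines(n)
    start = time.perf_counter()
    random.shuffle(lines)
    elapsed = time.perf_counter() - start
    return lines, elapsed


shuffled, seconds = time_shuffle(100_000)
print("shuffled %d lines in %.3fs" % (len(shuffled), seconds))
```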
Conclusions:
- Use shuf, if you can; it's the fastest by far.
- Ruby does well, followed by Perl.
- Python is noticeably slower than Ruby and Perl, and, comparing Python versions, 2.7.6 is quite a bit faster than 3.4.2.
- Use the POSIX-compliant awk + sort + cut combo as a last resort; which awk implementation you use matters (mawk is faster than GNU awk, BSD awk is slowest).
- Stay away from sort -R, bash loops, and Scala.
Windows versions of the Python solution (the Python code is identical, except for variations in quoting and the removal of the signal-related statements, which aren't supported on Windows):
- For PowerShell (in Windows PowerShell, you'll have to adjust $OutputEncoding if you want to send non-ASCII characters via the pipeline):
# Call as `shuf someFile.txt` or `Get-Content someFile.txt | shuf`
function shuf {
$Input | python -c @'
import sys, random, fileinput;
lines=[line for line in fileinput.input()];
random.shuffle(lines); sys.stdout.write(''.join(lines))
'@ $args
}
Note that PowerShell can natively shuffle via its Get-Random cmdlet (though performance may be a problem); e.g.:
Get-Content someFile.txt | Get-Random -Count ([int]::MaxValue)
- For cmd.exe (a batch file):
Save to file shuf.cmd, for instance:
@echo off
python -c "import sys, random, fileinput; lines=[line for line in fileinput.input()]; random.shuffle(lines); sys.stdout.write(''.join(lines))" %*