Read n random lines from a potentially huge file
Ruby, 104 94 92 90 bytes
File name and number of lines are passed on the command line. For example, if the program is shuffle.rb and the file name is a.txt, run ruby shuffle.rb a.txt 3 for three random lines.
-4 bytes from discovering the open syntax in Ruby instead of File.new
f=open$*[0]
puts [*0..f.size/n=f.gets.size+1].sample($*[1].to_i).map{|e|f.seek n*e;f.gets}
Also, here's an 85-byte anonymous function solution that takes a string and a number as its arguments.
->f,l{f=open f;puts [*0..f.size/n=f.gets.size+1].sample(l).map{|e|f.seek n*e;f.gets}}
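For readers who want to follow the golfed code, here is an ungolfed sketch of the same seek-based idea. The method name random_lines is made up for illustration, and the sketch assumes every line, including the last, has the same byte length and ends with a newline.

```ruby
# Ungolfed sketch of the seek-based approach (random_lines is a
# hypothetical name, not part of the answer above).
# Assumption: every line has the same byte length and ends with "\n".
def random_lines(path, count)
  File.open(path, "rb") do |file|
    line_bytes = file.gets.bytesize        # bytes per line, newline included
    total_lines = file.size / line_bytes   # number of fixed-length lines
    # Choose `count` distinct line numbers, then jump straight to each one
    # instead of reading the file from the start.
    (0...total_lines).to_a.sample(count).map do |index|
      file.seek(line_bytes * index)
      file.gets
    end
  end
end
```

Calling puts random_lines("a.txt", 3) would then print three distinct lines. Only the first line and the chosen lines are read, although the list of line numbers is still built in memory.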
Dyalog APL, 63 bytes
⎕NREAD¨t 82l∘,¨lׯ1+⎕?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍞⎕NTIE 0
Prompts for file name, then for how many random lines are desired.
Explanation
⍞
Prompt for text input (file name)
⎕NTIE 0
Tie the file using next available tie number (-1 on a clean system)
t←
Store the chosen tie number as t
83 80,⍨
Append [83,80] yielding [-1,83,80]
⎕NREAD
Read the first 80 bytes of file -1 as 8-bit integers (conversion code 83)
10⍳⍨
Find the index of the first number 10 (LF)
l←
Store the line length as l
(⎕NSIZE t)÷
Divide the size of file -1 by the line length
⎕
Prompt for numeric input (desired number of lines)
?
X random selections (without replacement) out of the first Y natural numbers
¯1+
Add -1 to get 0-origin line numbers*
l×
Multiply by the line length to get the start bytes
t 82l∘,¨
Prepend [-1,82,LineLength] to each start byte (creates list of arguments for ⎕NREAD)
⎕NREAD¨
Read each line as 8-bit character (conversion code 82)
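If the APL is hard to follow, the offset arithmetic in the steps above can be sketched in Ruby. This is only an illustration: line_starts and its probe parameter are hypothetical names, and like the APL it assumes fixed-length lines whose first newline falls within the first 80 bytes.

```ruby
# Hypothetical helper mirroring the APL steps; not a translation of the
# APL primitives themselves.
# Assumptions: fixed-length lines, first LF within the first `probe` bytes.
def line_starts(path, count, probe: 80)
  File.open(path, "rb") do |file|
    head = file.read(probe).bytes          # first 80 bytes as 8-bit integers
    length = head.index(10) + 1            # 1-origin position of the first LF
    lines = file.size / length             # file size divided by line length
    picks = (1..lines).to_a.sample(count)  # distinct 1-origin line numbers
    picks.map { |k| (k - 1) * length }     # 0-origin start byte of each line
  end
end
```

For a file of five 5-character lines that each end in a newline, the line length is 6 and the possible start bytes are 0, 6, 12, 18 and 24.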
Practical example
File /tmp/records.txt contains:
Hello
Think
12345
Klaus
Nilad
Make the program RandLines contain the above code verbatim by entering the following into the APL session:
∇RandLines
⎕NREAD¨t 82l∘,¨lׯ1+⎕?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍞⎕NTIE 0
∇
In the APL session type RandLines and press Enter.
The system moves the cursor to the next line, which is a 0-length prompt for character data; enter /tmp/records.txt.
The system now outputs ⎕: and awaits numeric input; enter 4.
The system outputs four random lines.
Real life
In reality, you may want to give filename and count as arguments and receive the result as a table. This can be done by entering:
RandLs←{↑⎕NREAD¨t 82l∘,¨lׯ1+⍺?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍵⎕NTIE 0}
Now you can make MyLines contain three random lines with:
MyLines←3 RandLs'/tmp/records.txt'
How about returning just a single random line if count is not specified:
RandL←{⍺←1 ⋄ ↑⎕NREAD¨t 82l∘,¨lׯ1+⍺?(⎕NSIZE t)÷l←10⍳⍨⎕NREAD 83 80,⍨t←⍵⎕NTIE 0}
Now you can do both:
MyLines←2 RandL'/tmp/records.txt'
and (notice absence of left argument):
MyLine←RandL'/tmp/records.txt'
Making code readable
Golfed APL one-liners are a bad idea. Here is how I would write it in a production system:
RandL←{ ⍝ Read X random lines from file Y without reading entire file
⍺←1 ⍝ default count
tie←⍵⎕NTIE 0 ⍝ tie file
length←10⍳⍨⎕NREAD 83 80,⍨tie ⍝ find first NL
size←⎕NSIZE tie ⍝ total file length
starts←lengthׯ1+⍺?size÷length ⍝ beginning of each line
↑⎕NREAD¨tie 82length∘,¨starts ⍝ read each line as character and convert list to table
}
*I could save a byte by running in 0-origin mode, which is standard on some APL systems: remove ¯1+ and insert 1+ before 10.
Haskell, 240 224 236 bytes
import Test.QuickCheck
import System.IO
g=hGetLine
main=do;f<-getLine;n<-readLn;h<-openFile f ReadMode;l<-(\x->1+sum[1|_<-x])<$>g h;s<-hFileSize h;generate(shuffle[0..div s l-1])>>=mapM(\p->hSeek h(toEnum 0)(l*p)>>g h>>=putStrLn).take n
Reads filename and n from stdin.
How it works:
main=do
f<-getLine -- read file name from stdin
n<-readLn -- read n from stdin
h<-openFile f ReadMode -- open the file
l<-(\x->1+sum[1|_<-x])<$>g h -- read first line and bind l to its length +1
-- sum[1|_<-x] is a custom length function
-- because of type restrictions, otherwise I'd have
-- to use "toInteger.length"
s<-hFileSize h -- get file size
generate(shuffle[0..div s l-1])>>=
-- shuffle all possible line numbers
mapM (\p-> ... ).take n -- for each of the first n shuffled line numbers
hSeek h(toEnum 0)(l*p)>> -- jump to that line ("toEnum 0" is short for "AbsoluteSeek")
g h>>= -- read a line from current position
putStrLn -- and print
It takes a lot of time and memory to run this program for files with many lines, because of a horribly inefficient shuffle function.
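One way around shuffling every line number is to pick the n indices directly. The sketch below uses Robert Floyd's sampling algorithm, which is a different technique from the one in this answer, and is written in Ruby rather than Haskell; sample_indices is a hypothetical name. It needs memory proportional to the number of lines requested, not to the number of lines in the file.

```ruby
require "set"

# Floyd's sampling algorithm: k distinct integers from 0...n in O(k) memory.
# sample_indices is an illustrative name, not part of the Haskell answer.
def sample_indices(n, k, rng: Random.new)
  raise ArgumentError, "k must be between 0 and n" unless k.between?(0, n)
  chosen = Set.new
  ((n - k)...n).each do |j|
    t = rng.rand(j + 1)                  # uniform integer in 0..j
    # If t was already picked, take j instead; j cannot be in the set yet.
    chosen.add?(t) || chosen.add(j)
  end
  chosen.to_a.shuffle(random: rng)       # randomize the order as well
end
```

Each returned index can then be multiplied by the line length and used as a seek position, as in the program above.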
Edit: I missed the "random without replacement" part (thanks @feersum for noticing!).