Read a very large text file into a list in Clojure
Andrew's solution worked well for me, but nested defns are not so idiomatic, and you don't need to call lazy-seq twice. Here is an updated version without the extra prints, using letfn:
(defn lazy-file-lines [file]
  (letfn [(helper [rdr]
            (lazy-seq
              (if-let [line (.readLine rdr)]
                (cons line (helper rdr))
                (do (.close rdr) nil))))]
    (helper (clojure.java.io/reader file))))
(count (lazy-file-lines "/tmp/massive-file.txt"))
;=> <a large integer>
There are various ways of doing this, depending on exactly what you want.
If you have a function that you want to apply to each line in a file, you can use code similar to Abhinav's answer:
(with-open [rdr ...]
  (doall (map function (line-seq rdr))))
This has the advantage that the file is opened, processed, and closed as quickly as possible, but forces the entire file to be consumed at once.
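As a rough sketch of that eager pattern, the two helpers below wrap it up. The names map-lines, reduce-lines and demo-file are made up for illustration and are not part of any library. The first returns all results at once; the second folds over the lines, which should keep memory use flat when only an aggregate is needed.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical helper: apply f to every line and return a vector of results.
;; mapv is eager, so all reading happens before with-open closes the reader.
(defn map-lines [f file]
  (with-open [rdr (io/reader file)]
    (mapv f (line-seq rdr))))

;; Hypothetical helper: fold over the lines without keeping them around.
(defn reduce-lines [f init file]
  (with-open [rdr (io/reader file)]
    (reduce f init (line-seq rdr))))

;; Demo against a small temporary file.
(def demo-file (java.io.File/createTempFile "lines" ".txt"))
(spit demo-file "alpha\nbeta\ngamma\n")

(map-lines str/upper-case demo-file)           ;=> ["ALPHA" "BETA" "GAMMA"]
(reduce-lines (fn [n _] (inc n)) 0 demo-file)  ;=> 3
```

Either way the reader is closed as soon as the form returns, even if the function throws.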
If you want to delay processing of the file you might be tempted to return the lines, but this won't work:
(map function ; broken!!!
     (with-open [rdr ...]
       (line-seq rdr)))
because the file is closed when with-open returns, which is before you lazily process the file.
One way around this is to pull the entire file into memory with slurp and split the resulting string into lines:
(map function (clojure.string/split-lines (slurp filename)))
That has an obvious disadvantage - memory use - but guarantees that you don't leave the file open.
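One detail worth spelling out: slurp returns a single string, so mapping over it directly visits characters rather than lines. A small sketch of the difference, using a temporary file (slurp-demo-file and contents are illustrative names):

```clojure
(require '[clojure.string :as str])

;; Small temporary file to experiment with.
(def slurp-demo-file (java.io.File/createTempFile "slurp-demo" ".txt"))
(spit slurp-demo-file "one\ntwo\nthree\n")

;; slurp reads the whole file into one string and closes it immediately.
(def contents (slurp slurp-demo-file))

;; Mapping over the raw string walks it character by character.
(take 3 (map str contents))             ;=> ("o" "n" "e")

;; Splitting first gives one element per line.
(map count (str/split-lines contents))  ;=> (3 3 5)
```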
An alternative is to leave the file open until you get to the end of the read, while generating a lazy sequence:
(ns ...
  (:use clojure.test))

(defn stream-consumer [stream]
  (println "read" (count stream) "lines"))

(defn broken-open [file]
  (with-open [rdr (clojure.java.io/reader file)]
    (line-seq rdr)))

(defn lazy-open [file]
  (defn helper [rdr]
    (lazy-seq
      (if-let [line (.readLine rdr)]
        (cons line (helper rdr))
        (do (.close rdr) (println "closed") nil))))
  (lazy-seq
    (do (println "opening")
        (helper (clojure.java.io/reader file)))))

(deftest test-open
  (try
    (stream-consumer (broken-open "/etc/passwd"))
    (catch RuntimeException e
      (println "caught " e)))
  (let [stream (lazy-open "/etc/passwd")]
    (println "have stream")
    (stream-consumer stream)))

(run-tests)
Which prints:
caught #<RuntimeException java.lang.RuntimeException: java.io.IOException: Stream closed>
have stream
opening
closed
read 29 lines
This shows that the file wasn't even opened until it was needed.
This last approach has the advantage that you can process the stream of data "elsewhere" without keeping everything in memory, but it also has an important disadvantage - the file is not closed until the end of the stream is read. If you are not careful you may open many files in parallel, or even forget to close them (by not reading the stream completely).
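If the risk of leaving files open is a concern, one possible middle ground is to invert control: pass the consuming function in, so the lines stay lazy while the reader is still closed deterministically. This is only a sketch under that assumption, not something from the answers above; call-with-lines and early-exit-file are hypothetical names.

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical helper: hand the lazy line sequence to f while the reader is
;; open, and close the reader when f returns, even if f stops early or throws.
;; f must finish all of its reading before it returns.
(defn call-with-lines [file f]
  (with-open [rdr (io/reader file)]
    (f (line-seq rdr))))

(def early-exit-file (java.io.File/createTempFile "early-exit" ".txt"))
(spit early-exit-file "first\nsecond\nthird\n")

;; Only the first two lines are consumed, yet the reader is still closed.
(call-with-lines early-exit-file #(vec (take 2 %)))  ;=> ["first" "second"]
```

The cost is that the consumer can no longer live "elsewhere": returning an unrealized sequence from f brings back the closed-stream problem.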
The best choice depends on the circumstances - it's a trade-off between lazy evaluation and limited system resources.
PS: Is lazy-open defined somewhere in the libraries? I arrived at this question trying to find such a function and ended up writing my own, as above.