Parsing large log files in Haskell
Both `readFile` and `hGetContents` should be lazy. Try running your program with `+RTS -s` and see how much memory is actually used. What makes you think the entire file is read into memory?
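For example (a sketch; the program and log file names are placeholders), compile with `-rtsopts` so the runtime accepts the flag, then pass `+RTS -s` when you run it to get a summary of allocation and maximum residency:

```
$ ghc -O2 -rtsopts Parse.hs
$ ./Parse big.log +RTS -s
```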
As for the second part of your question, lazy IO is sometimes at the root of unexpected space leaks or resource leaks. That's not really the fault of lazy IO in and of itself, but determining whether it's leaky requires analyzing how it's used.
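A minimal sketch of the classic pitfall (the `lineCount` functions are hypothetical, only `base` is needed): `withFile` closes the handle as soon as the inner action returns, but `hGetContents` hasn't actually read anything yet at that point, so the lazily-read contents come back truncated.

```haskell
import System.IO

-- Broken: the handle is closed by withFile before the lazy
-- contents are ever forced, so the count typically comes out as 0.
brokenLineCount :: FilePath -> IO Int
brokenLineCount path = do
    contents <- withFile path ReadMode hGetContents
    return (length (lines contents))  -- forced only after the close

-- Fixed: force the result while the handle is still open.
fixedLineCount :: FilePath -> IO Int
fixedLineCount path = withFile path ReadMode $ \h -> do
    contents <- hGetContents h
    let n = length (lines contents)
    n `seq` return n  -- demand the count before withFile closes h
```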
Please don't use `String`s (especially when processing >100 MB files). Just replace them with `ByteString`s (or `Data.Text`):
```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Monad
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main :: IO ()
main = do
    filename <- liftM head getArgs        -- first command-line argument
    contents <- liftM B.lines $ B.readFile filename  -- lazy, chunked read
    B.putStr . B.unlines . filter (B.isPrefixOf "import") $ contents
```
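Note that `B.readFile` from `Data.ByteString.Lazy.Char8` is still lazy: it reads the file in chunks on demand, so memory use stays roughly constant regardless of file size, while avoiding the per-character overhead of `String`.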
And I bet this will be several times faster.
UPD: regarding your follow-up question.
The amount of allocated memory is strongly connected to the magic speedup you see when switching to bytestrings.
Since `String` is just a generic list, it requires extra memory for each `Char`: a pointer to the next element, an object header, etc. All of this memory needs to be allocated and then collected back, which costs a lot of computational power.
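For a rough sense of scale (assuming 64-bit GHC, where a machine word is 8 bytes): each `(:)` cell is 3 words, i.e. 24 bytes, and each boxed `Char` is another 2 words unless it is shared from GHC's static character table. That is roughly 24-40 bytes per character, versus 1 byte of payload per character in a `ByteString`, so a 100 MB file read as a `String` can easily occupy several gigabytes of heap.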
On the other hand, a lazy `ByteString` is a list of chunks, i.e. contiguous blocks of memory (I believe not less than 64 bytes each). This greatly reduces the number of allocations and collections, and also improves cache locality.
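You can observe the chunking directly (a small sketch; `chunkSizes` is a hypothetical helper and `"big.log"` a placeholder): lazy `readFile` produces strict chunks of `defaultChunkSize`, which is a bit under 32 KB.

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- Report the size of each strict chunk backing a lazy ByteString.
chunkSizes :: L.ByteString -> [Int]
chunkSizes = map S.length . L.toChunks

main :: IO ()
main = do
    contents <- L.readFile "big.log"
    print (take 5 (chunkSizes contents))  -- e.g. chunks of ~32 KB each
```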