Parsing large log files in Haskell

Both readFile and hGetContents should be lazy. Try running your program with +RTS -s and see how much memory is actually used. What makes you think the entire file is read into memory?
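For instance (assuming GHC; Main.hs and big.log are hypothetical names, and -rtsopts is needed so the runtime accepts +RTS options):

$ ghc -O2 -rtsopts Main.hs
$ ./Main big.log +RTS -s

The statistics printed at exit include "maximum residency", which is the figure that tells you how much live data the program actually held at once.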

As for the second part of your question, lazy IO is sometimes at the root of unexpected space leaks or resource leaks. That's not really the fault of lazy IO in and of itself, but determining whether a given use is leaky requires analyzing how the data is consumed.
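A minimal sketch of the kind of resource bug meant here (the function name is made up):

import System.IO

-- withFile closes the handle as soon as the action returns, but
-- hGetContents reads lazily, so the returned value is still a thunk
-- at that point. When it is finally forced, the lazy string has been
-- silently truncated by the close, and this typically fails with
-- "Prelude.head: empty list" (or yields partial data).
firstLine :: FilePath -> IO String
firstLine path = withFile path ReadMode $ \h -> do
  contents <- hGetContents h
  return (head (lines contents))

Forcing the value you need before the handle closes (e.g. with Control.Exception.evaluate) avoids this, which is exactly the kind of usage analysis meant above.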


Please don't use String (especially when processing >100 MB files). Just replace it with ByteString (or Data.Text):

{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import System.Environment
import qualified Data.ByteString.Lazy.Char8 as B

main = do
  [filename] <- getArgs  -- expects the log file as the single argument
  contents <- liftM B.lines $ B.readFile filename
  B.putStr . B.unlines . filter (B.isPrefixOf "import") $ contents

And I bet this will be several times faster.
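You can check that claim yourself by compiling an otherwise identical String version and comparing wall-clock time and +RTS -s output (module and file names here are made up):

import Data.List (isPrefixOf)
import System.Environment

-- StringVersion.hs: the same filter as above, but over plain String.
main :: IO ()
main = do
  [filename] <- getArgs
  contents <- readFile filename
  putStr . unlines . filter ("import" `isPrefixOf`) . lines $ contents

$ ghc -O2 StringVersion.hs
$ time ./StringVersion big.log > /dev/null
$ time ./Imports big.log > /dev/null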

UPD: regarding your follow-up question.
The amount of memory allocated is strongly connected to the magic speedup you see when switching to ByteString.
Since String is just a plain linked list, it requires extra memory for each Char: a pointer to the next element, an object header, etc. All of this memory has to be allocated and then collected back, which costs a lot of computational power.
A ByteString, on the other hand, is a list of chunks, i.e. contiguous blocks of memory (I think not less than 64 bytes each). This greatly reduces the number of allocations and collections, and also improves cache locality.
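To put rough numbers on that (approximate figures for a 64-bit GHC; exact sizes depend on the GHC version):

-- Per character of a String, approximately:
--   (:) cons cell: 1-word header + 2 pointer fields  ~ 24 bytes
--   boxed Char:    1-word header + 1-word payload    ~ 16 bytes
-- i.e. up to ~40 bytes per character (less when small Chars are shared
-- from a static table), versus 1 byte per byte inside a ByteString chunk.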

Tags:

Haskell