Stream CSV or TSV files

Here is a function which may help:

Clear[readRows];
readRows[stream_, n_] :=
   With[{str = ReadList[stream, "String", n]},
      ImportString[StringJoin[Riffle[str, "\n"]], "Table"] /; str =!= {}];
readRows[__] := $Failed;

I tested it on your file and it works all right (note that it reads rows in batches, which is much faster than reading one row at a time):

n=0;
str = OpenRead["C:\\Temp\\EUR_USD_Week1.csv"];
While[readRows[str, 1000] =!= $Failed, n++];
Close[str];
n

(*  82   *)
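In a real application you would process each batch inside the loop rather than just count them. A minimal sketch (processBatch is a hypothetical placeholder for your own per-batch code):

```mathematica
str = OpenRead["C:\\Temp\\EUR_USD_Week1.csv"];
While[(rows = readRows[str, 1000]) =!= $Failed,
  (* rows is a list of up to 1000 parsed rows; hand it to your own code *)
  processBatch[rows]
];
Close[str];
```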

By the way, regarding the practicality of Import - I agree, but do read this answer: it is based on the same idea as the code above, and makes importing whole files quite practical IMO. In particular, using the readTable function from there, I get your file read in its entirety in under 3 seconds on my machine:

In[64]:= readTable["C:\\Temp\\EUR_USD_Week1.csv",2000]//Short//AbsoluteTiming
Out[64]= {2.4775391,{{lTid,cDealable,CurrencyPair,RateDateTime,RateBid,RateAsk},
  <<81172>>,{1385715072,D\[Ellipsis] SD,2011-01-07,\[Ellipsis] }}}

with a very decent memory usage:

MaxMemoryUsed[]

(* 71652808 *)

meaning about 50 MB of net usage (after subtracting the startup memory usage) for this particular example. You can tweak the second parameter to trade run time for memory efficiency. See the linked post for more details.


Summary

This is a generalized piece of code for fast reading of newline-terminated rows from a CSV or TSV text stream with UTF-8 encoding.

You may specify options for the file type, the starting and ending line numbers, and a string to search for in order to position within the file (find). By default the result includes the header, which can be omitted.

The import option (True/False) controls whether the rows read from the file are post-processed with the ImportString function. This is slower than plain ReadList with user-specified types, i.e. import->False.

If you want to read only the header of the file, specify start->0, end->0.
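For example, a header-only read looks like this (using the file from above; the result is the header fields as a list of strings):

```mathematica
readRows["C:\\Temp\\EUR_USD_Week1.csv", start -> 0, end -> 0]
```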

Code

(* Fast line counter: each record is read as a String, wrapped in null,
   and immediately rewritten to Null, so almost no memory is retained *)
null[_String] := Null
countLines = Length@ReadList[#, null@String, NullRecords -> True] &;

ClearAll[readRows];
Options[readRows]={filetype->"CSV",start->0,end->-1, types->{}, import->True, find->"", header->True};
readRows[fname_String?FileExistsQ,OptionsPattern[]]:=Module[
{txtstrm, ftype,typesval,impmethod,
 startLine,endLine,line1,str, 
 hdr, hdrflag, body, bodyImport, blnk},

(* Blank function for skipping word separators, e.g. commas *)
blnk[]=blnk[Character];
blnk[_String]=Sequence[];

(* Get Option Values *)
startLine=OptionValue[start];endLine=OptionValue[end];
ftype=OptionValue[filetype];typesval=OptionValue[types];
impmethod=OptionValue[import];str=OptionValue[find];
hdrflag=OptionValue[header] ;

(* Open Stream *)
txtstrm=OpenRead[fname];

(* Process Header, i.e. first line *)
line1 = ReadList[txtstrm,Record,1][[1]];
Which[
 ftype=="CSV", hdr = StringSplit[line1,","],
 ftype=="TSV", hdr = StringSplit[line1,"\t"],
 True, Abort[] ];

(* If user has not specified any types for ReadList, i.e. types->{} *)
If[Length@typesval==0,typesval= Riffle[Table[Word,Length@hdr],blnk[]]];

(* If user has not specified the line to end reading *)
If[endLine==-1,endLine=countLines@fname];

(* If user specified to start reading the text stream from a specific line *)
If[startLine!=0,Skip[txtstrm,Record,startLine]];

(* If user specified a string to find in a file *)
If[str!="",Find[txtstrm,str]];

(* Check the import option for post-processing each row with ImportString *)
If[impmethod,
 body =ReadList[txtstrm,Record,endLine-startLine]; 
 bodyImport= ImportString[StringJoin[Riffle[body,"\n"]],ftype,CharacterEncoding->"UTF-8"],
(* else *)
  Which[
    ftype=="CSV", body =ReadList[txtstrm,typesval,endLine-startLine,WordSeparators->{"\n",","}],
    ftype=="TSV", body =ReadList[txtstrm,typesval,endLine-startLine,WordSeparators->{"\n","\t"}] ];    
]; (* End of If impmethod *)

(* Close the stream *)
Close[txtstrm];

(* Optionally add the header at the beginning, return the result *)
Which[
    startLine==0 && endLine==0, hdr,
    impmethod && hdrflag,   Prepend[bodyImport, hdr],
    impmethod,              bodyImport,
    hdrflag,                Prepend[body, hdr],
    True,                   body]

] (* End of readRows *)

Test

fname = "ExampleData/financialtimeseries.csv";

Take[Import[fname], 10] // TableForm

(* With ImportString *)
(res = readRows[fname, end->10]) // TableForm
Head /@ res[[2]]

(* With default ReadList types *)
(res = readRows[fname, end->10, import->False]) // TableForm
Head /@ res[[2]]

(* With user-specified ReadList types *)
sk[] = sk[Character];
sk[_String] = Sequence[];
mytypes = {Word, sk[], Number};

(res = readRows[fname, end->10, import->False, types->mytypes]) // TableForm
Head /@ res[[2]]

Acknowledgements

  • The countLines function is taken from this answer by Mr. Wizard.

  • The combination of ReadList with ImportString and Riffle is taken from an answer by Leonid Shifrin in the same post here.

  • The best part of the code, namely the fast reading with ReadList based on user-specified types, is taken from this answer by Mr. Wizard. It would be great if someone could analyze exactly how these blank, character-skipping functions work and whether I am using them appropriately here.

Open Source DBAPI Project

I am planning to include this in my DBAPI project, so if you come up with any improvements to the code above, or ideas for further enhancements, please let me know.

PS1: Is it possible to improve the code so that it can parse quoted strings with embedded commas, as in the file "ExampleData/TreesOwnedByTheCityOfChampaign.csv" for example? What would be the types or the blanks to use in ReadList?
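For what it's worth, the import->True path should already cope with such files, since Mathematica's "CSV" importer understands quoted fields with embedded commas; it is only the raw ReadList path, which splits blindly on commas, that breaks. A quick illustration with ImportString:

```mathematica
ImportString["name,notes\n\"Oak, red\",tall", "CSV"]
(* {{"name", "notes"}, {"Oak, red", "tall"}} *)
```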

PS2: I have just fixed the fast reading of TSV. You can now specify types for ReadList with these files as well. But I have a little problem with UTF-8 encoding: characters are not displayed correctly! Try, for example, parsing this stream:

 tsvList = {{"Small", "Capital", "IntNum", "RealNum"},
    {"αβ", "AB", 5, 2.3}, {"γδ", "CD", 1, 4.5}, {"γδ", "CD", 6, 1.6},
    {"χψ", "FG", 1, 1.9}, {"αβ", "AB", 5, 4.5}};

 tsvString = ExportString[tsvList, "TSV", CharacterEncoding -> "Unicode"]

You can turn the string into a stream, and Import works:

 tsvStream = StringToStream@tsvString;

 Import[tsvStream, "TSV"] // TableForm

But I cannot find how to save this string or list expression into a file while preserving both the TSV format and the Unicode encoding when reading it back with e.g. Import.
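One thing that may work here (a sketch, not tested in all configurations): Export accepts a CharacterEncoding option for text formats, so writing and reading the file explicitly as UTF-8 should round-trip the characters:

```mathematica
(* tsvFilename is assumed to hold a writable path; tsvList as defined above *)
Export[tsvFilename, tsvList, "TSV", CharacterEncoding -> "UTF8"];
Import[tsvFilename, "TSV", CharacterEncoding -> "UTF8"] // TableForm
```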

For testing purposes I have just created the file manually and assigned the file name to the tsvFilename variable.

 (res = readRows[tsvFilename, filetype -> "TSV", end -> 3, 
    import -> False]) // TableForm

Or ReadList with user-specified types:

 (res = readRows[tsvFilename, filetype -> "TSV", end -> 3, 
    import -> False, types -> {Word, Word, Number, Number}]) // TableForm

You get output such as:

Small               Capital IntNum  RealNum
Î\[PlusMinus]Î\.b2  AB      5       2.3
Î\.b3Î\.b4          CD      1       4.5
Î\.b3Î\.b4          CD      6       1.6