Speeding up Import and Export in CSV format

Here's a purely Mathematica way to import your data that is much faster than using Import.

UPDATE

As Leonid mentioned, the previous code doesn't exactly replicate Import. The truth is that I was only trying to retrieve the numerical part. Here's an updated version that tries to replicate the output of Import.

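readYourCSV2[file_String?FileExistsQ, n_Integer] := Module[{str = OpenRead[file], data}, 
  (* read n comma- or newline-separated fields per row, as strings *)
  data = ReadList[str, Table[Record, {n}], RecordSeparators -> {",", "\n"}]; 
  Close[str]; 
  (* parse held, so that e.g. 1.5e-3, which parses as 1.5*e - 3, is rewritten as 1.5*10^-3 *)
  ReleaseHold[ToExpression[data, InputForm, Hold] /. {Plus[Times[x_, E | e], y_] :> x * 10 ^ y}] 
 ]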

Here, n is the number of columns.

UPDATE 2

Now for Export: here's a fast and, again, purely Mathematica way to export in CSV format.

writeYourCSV[file_String, list_List?MatrixQ] := 
 With[{str = OpenWrite[file, PageWidth -> Infinity], len = Length[ list[[1]] ]},
  Scan[
   Write[str, 
     Sequence @@ (Flatten[Table[{FortranForm[ #[[i]] ], OutputForm[","]}, {i, len - 1}]]) ~ Join ~ 
       { FortranForm[ #[[len]] ] }] &, 
   list]; 
  Close[str];
 ]

This takes less than 10 seconds to write your large data back to CSV format:

writeYourCSV["testcsv.csv", databig] // AbsoluteTiming

{9.921969, Null}

Here is a Java-based solution. It is pretty fast, but valid only when all your columns hold numerical (double) values.

First, grab and run the code for the Java reloader. The linked version should work on Windows and probably Linux, but was reported to have issues on OS X, so Mac users may try this one instead: Import["https://gist.github.com/lshifr/7307845/raw/SimpleJavaReloader.m"] (I have not yet tested this version on other platforms). Then, compile this class:

JCompileLoad["public class DoubleParser{
   public static double[] parseDouble(String[] strdub){
      double[] res = new double[strdub.length];
      int i = 0;
      for(;i < strdub.length;i++){
         try{
            res[i]= Double.parseDouble(strdub[i]);
         } catch (NumberFormatException e){
            res[i] = 0;
         }
      }
      return res;
   }
}"]

Then, here is the Mathematica counterpart:

ClearAll[importDoubleCSV];
Options[importDoubleCSV]={"Headers"->True};
importDoubleCSV[file_String?FileExistsQ, opts : OptionsPattern[]]:=
  With[{fn=If[TrueQ[OptionValue["Headers"]],Rest,Identity]},
    Transpose[
      DoubleParser`parseDouble/@
        Transpose[
          DeleteCases[                
            StringSplit[fn[StringSplit[FromCharacterCode[BinaryReadList[file]],"\n"]],","],
            {s_String/;StringMatchQ[s,Whitespace]}
          ]
        ]
    ]
  ]

For your small file, the result agrees with what you get by using Import, after you remove empty rows:

res1=Rest[DeleteCases[Import["~/Downloads/returns_out_small.csv"],{""..}]];
res2=importDoubleCSV["~/Downloads/returns_out_small.csv"];
res1==res2

(* True *)

Your large file gets processed on my machine in under 5 seconds:

(resLrg=importDoubleCSV["~/Downloads/returns_out.csv"])//Short//AbsoluteTiming

(* 
    {4.789668,
     {{0.000449449,0.000418204,<<1415>>,0.000064701,0.000045972},
        <<1417>>,{<<1>>}}
    }
*)

which doesn't look bad to me. I wasn't patient enough to wait for Import["~/Downloads/returns_out.csv"] to finish, so I did not compare the results in this case, but the reader is most welcome to do that (and to compare the timings too).

An added advantage here is that we get the results packed:

Developer`PackedArrayQ @ resLrg

(* True *)

Note that the Java parsing code adopts the convention of replacing all non-parsable strings with zeros. It is possible to improve on this by also returning the positions of the non-parsable strings separately. Note also that UTF-8 encoding is implicitly assumed.
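One possible shape for the improvement mentioned above is sketched here in plain Java. This is only an illustration under my own assumptions: the class name DoubleParserWithPositions and its fields values and badPositions are hypothetical and not part of the code above, and I have not wired it into the Mathematica side.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical variant of DoubleParser (names are illustrative only):
// besides the parsed values, it records which inputs failed to parse.
class DoubleParserWithPositions {
    // Parsed values; entries that failed to parse are set to 0, as before.
    final double[] values;
    // Zero-based indices of the input strings that could not be parsed.
    final int[] badPositions;

    private DoubleParserWithPositions(double[] values, int[] badPositions) {
        this.values = values;
        this.badPositions = badPositions;
    }

    static DoubleParserWithPositions parse(String[] strdub) {
        double[] res = new double[strdub.length];
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < strdub.length; i++) {
            try {
                res[i] = Double.parseDouble(strdub[i]);
            } catch (NumberFormatException e) {
                res[i] = 0;   // same zero convention as the original class
                bad.add(i);   // but remember where it happened
            }
        }
        // Copy the collected indices into a primitive array.
        int[] pos = new int[bad.size()];
        for (int k = 0; k < pos.length; k++) {
            pos[k] = bad.get(k);
        }
        return new DoubleParserWithPositions(res, pos);
    }
}
```

For the input {"1.5", "abc", "2e3", ""} this gives values {1.5, 0.0, 2000.0, 0.0} and badPositions {1, 3}. The positions are zero-based, so they would need to be shifted by one before being used as Mathematica part indices.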
