Split a large CSV text file based on column value
C++ is fine if you know it best. Why would you try to load the entire file into memory anyway?
Since the output depends on the value of the column being read, you can keep an output file stream per value and write each record to the appropriate file as you process it, flushing as you go to keep the memory footprint small.
I do this (albeit in Java) when I need to take massive extracts from a database. The records are pushed into a buffered file stream and anything held in memory is cleaned up, so the program's footprint never grows beyond what it starts out at.
Fly-by-the-seat-of-my-pants pseudo-code:
- Create a list to hold your output file streams
- Open a stream on the input file and begin reading its contents one line at a time
- Have we already opened a file stream for this record's content type?
  - Yes:
    - Get the stored file stream
    - Write the record to that file
    - Flush the stream
  - No:
    - Create a stream and save it in our list of streams
    - Write the record to that stream
    - Flush the stream
- Rinse and repeat...
Continue this processing until you reach the end of the input file. Since we never store more than pointers to the streams, and we flush as soon as we write to them, the application never holds anything in memory beyond the single record currently being read from the input file, so the footprint stays manageable.
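In C++ terms, a minimal sketch of that approach might look like the code below (the column index 2 and the comma delimiter are assumptions; adjust them to your data, and note that a huge number of distinct column values will eventually run into the OS limit on open file handles):

#include <fstream>
#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main(int argc, char* argv[]) {
    if (argc < 2) { std::cerr << "usage: split <input.csv>\n"; return 1; }
    std::ifstream in(argv[1]);
    std::map<std::string, std::ofstream> outputs;  // one open stream per column value
    std::string line;
    while (std::getline(in, line)) {
        // Pull out the 3rd comma-separated field to use as the output file key
        std::stringstream ss(line);
        std::string field, key;
        for (int i = 0; i < 3 && std::getline(ss, field, ','); ++i) key = field;
        std::ofstream& out = outputs[key];  // default-constructed on first sight of this value
        if (!out.is_open()) out.open(key + ".csv");
        out << line << '\n';
        out.flush();  // nothing lingers in the program's memory
    }
    return 0;
}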
Here's an old-school one-liner for you (just replace the >> with > to truncate the output files on each run):
awk -F, '{print >> ($3".csv")}' input.csv
Due to popular demand (and an itch I just had), I've also written a version that will duplicate the header lines to all files:
awk -F, 'NR==1 {h=$0; next} {f=$3".csv"} !($3 in p) {p[$3]; print h > f} {print >> f}' input.csv
Alternatively, you could create each output file with just the header first, and then finish with the first awk one-liner:
HDR=$(head -1 input.csv); for fn in $(tail -n+2 input.csv | cut -f3 -d, | sort -u); do echo "$HDR" > "$fn.csv"; done
Most modern systems include an awk binary; if yours doesn't, you can find an exe at Gawk for Windows.
perl -F, -lane '`echo $_ >> $F[2].csv`' < file
These command-line options are used:
- -n : loop around every line of the input file
- -l : removes newlines before processing, and adds them back in afterwards
- -a : autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
- -e : execute the perl code
- -F : autosplit modifier, in this case splits on ,
@F is the array of words in each line, indexed starting with $F[0]
If you want to retain the header, then a more complicated approach is required. Run the script below as:
perl splitintofiles.pl file
Contents of splitintofiles.pl:
use strict;
use warnings;

# Read the input once, remembering the header and accumulating
# the remaining lines per output file name in the hash %c.
open my $fh, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
my ($header, %c);
while (my $line = <$fh>) {
    print $line;
    if ($. == 1) {
        $header = $line;
    } else {
        # $fields[2] is the 3rd column
        my @fields = split /,/, $line;
        # save line into hash %c, keyed by output file name
        $c{"$fields[2].csv"} .= $line;
    }
}
close $fh;

# Write each output file: header first, then its accumulated lines.
for my $file (keys %c) {
    print "$file\n";
    open my $out, '>', $file or die "Cannot open $file: $!";
    print $out $header;
    print $out $c{$file};
    close $out;
}
input:
a,b,c,d,e,f,g,h,i,j,k,l
19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1
output PCP.csv
a,b,c,d,e,f,g,h,i,j,k,l
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1
output PLXS.csv
a,b,c,d,e,f,g,h,i,j,k,l
19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2