Unix/Perl/Python: substitute list on big data set
Note: See the second part for a version that uses the Text::CSV module to parse files.
Load the mappings into a hash (dictionary), then go through your files and test each field for whether it is a key in the hash; if it is, replace the field with the corresponding value. Write each line out to a temporary file, and when done move it into a new file (or overwrite the processed file). Any tool has to do that, more or less.
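Since the title also mentions Python, here is a minimal sketch of those same steps in that language. It is only an illustration under assumptions: the function names load_mapping and substitute_file are made up for this example, fields are assumed to be plain comma-separated text with no quoting, and (like the Perl code) it trims whitespace around fields.

```python
import os
import tempfile


def load_mapping(path):
    """Read 'key,value' lines into a dict, trimming spaces around each part."""
    mapping = {}
    with open(path) as fh:
        for line in fh:
            parts = [p.strip() for p in line.split(",")]
            if len(parts) >= 2 and parts[0]:
                mapping[parts[0]] = parts[1]
    return mapping


def substitute_file(path, mapping, overwrite=False):
    """Replace whole fields found in mapping; write through a temporary file."""
    # Hypothetical naming choice: write to new_<name> unless asked to overwrite.
    target = path if overwrite else os.path.join(
        os.path.dirname(path), "new_" + os.path.basename(path))
    # Create the temp file next to the target so the final rename
    # stays on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(target) or ".")
    with os.fdopen(fd, "w") as out, open(path) as fh:
        for line in fh:
            fields = [f.strip() for f in line.split(",")]
            # Plain dictionary lookup per field, no regex involved.
            out.write(",".join(mapping.get(f, f) for f in fields) + "\n")
    os.replace(tmp, target)
    return target
```

Calling substitute_file("data.txt", load_mapping("map.txt")) would leave data.txt alone and write new_data.txt beside it; both file names are placeholders.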
With Perl, tested with a few small made-up files
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

# Load the mappings: key => replacement value
my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;  # see Notes
    $map{$key} = $val;
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (<$fh>) {
        s/^\s+|\s+$//g;                          # remove leading/trailing whitespace
        my @fields = split /\s*,\s*/, $_, -1;    # -1 keeps trailing empty fields
        exists($map{$_}) && ($_=$map{$_}) for @fields;  # see Notes
        say $fh_out join ',', @fields;
    }
    close $fh_out;

    # Change to commented out line once thoroughly tested
    #move($outfile, $file) or die "can't move $outfile to $file: $!";
    move($outfile, 'new_'.$file) or die "can't move $outfile: $!";
}
Notes.
The check of data against mappings is written for efficiency: We must look at each field, there's no escaping that, but then we only check for the field as a key (no regex). For this all leading/trailing spaces need be stripped. Thus this code may change whitespace in output data files; in case this is important for some reason it can of course be modified to preserve original spaces.
It came up in comments that a field in the data can in fact differ from its key by having extra quotes. In that case, extract the would-be key first:
for (@fields) { $_ = $map{$1} if /"?([^"]*)/ and exists $map{$1}; }
This starts the regex engine on every check, which affects efficiency. It would help to clean the quotes out of the input CSV data instead, and run the code as it is above, with no regex. This can be done by reading the files with a CSV-parsing module; see the comment at the end.
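For comparison, the quote handling can also be done with plain string checks instead of a regex. The Python sketch below is only an illustration: the name replace_field is hypothetical, and it assumes a field carries at most one pair of surrounding double quotes.

```python
def replace_field(field, mapping):
    """Look up a field, ignoring one pair of surrounding double quotes."""
    key = field
    if len(field) >= 2 and field[0] == '"' and field[-1] == '"':
        key = field[1:-1]
    # As with the Perl line above, a hit replaces the whole field,
    # quotes included; a miss returns the field unchanged.
    return mapping.get(key, field)
```

Whether the replacement should keep the original quotes depends on the data; this sketch drops them on a match.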
For Perls earlier than 5.14 replace
my ($key, $val) = map { s/^\s+|\s+$//gr } split /\s*,\s*/;
with
my ($key, $val) = map { s/^\s+|\s+$//g; $_ } split /\s*,\s*/;
since the "non-destructive" /r modifier was introduced only in v5.14.
If you'd rather that your whole operation doesn't die for one bad file, replace
or die ...
with
or do {
    # print warning for whatever failed (warn "Can't open $file: $!";)
    # take care of filehandles and such if/as needed
    next;
};
and make sure to (perhaps log and) review output.
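The same warn-and-skip idea, sketched in Python for illustration. The names process_all and process_one are hypothetical; process_one stands for whatever per-file routine is used, and is assumed to raise OSError when a file cannot be opened or moved.

```python
import sys


def process_all(paths, process_one):
    """Run process_one on each path; warn and carry on when a file fails."""
    failed = []
    for path in paths:
        try:
            process_one(path)
        except OSError as err:
            # Counterpart of: or do { warn ...; next; }
            print(f"Can't process {path}: {err}", file=sys.stderr)
            failed.append(path)
    return failed
```

Returning the list of failed paths makes it easy to log and review what was skipped.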
This leaves room for some efficiency improvements, but nothing dramatic.
The data, with commas separating fields, may (or may not) be valid CSV. Since the question doesn't at all address this, and doesn't report problems, it is unlikely that any properties of the CSV data format are used in data files (delimiters embedded in data, protected quotes).
However, it's still a good idea to read these files using a module that honors full CSV, like Text::CSV. That also makes things easier, by taking care of extra spaces and quotes and handing us cleaned-up fields. So here's that -- the same as above, but using the module to parse files
use warnings;
use strict;
use feature 'say';
use File::Copy qw(move);
use Text::CSV;

my $file = shift;
die "Usage: $0 mapping-file data-files\n" if not $file or not @ARGV;

my $csv = Text::CSV->new ( { binary => 1, allow_whitespace => 1 } )
    or die "Cannot use CSV: " . Text::CSV->error_diag ();

my %map;
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = $csv->getline($fh)) {
    $map{ $line->[0] } = $line->[1];
}

my $outfile = "tmp.outfile.txt.$$";  # use File::Temp

foreach my $file (@ARGV) {
    open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = $csv->getline($fh)) {
        exists($map{$_}) && ($_=$map{$_}) for @$line;
        say $fh_out join ',', @$line;
    }
    close $fh_out;

    move($outfile, 'new_'.$file) or die "Can't move $outfile: $!";
}
Now we don't have to worry about spaces or overall quotes at all, which simplifies things a bit.
While it is difficult to reliably compare these two approaches without realistic data files, I benchmarked them for (made-up) large data files that involve "similar" processing. The code using Text::CSV for parsing runs either around the same, or (up to) 50% faster.
The constructor option allow_whitespace makes it remove extra spaces around fields (perhaps contrary to what the name may imply), as I do by hand above. (Also see allow_loose_quotes and related options.) There is far more; see the docs. Text::CSV uses Text::CSV_XS if it is installed.
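For readers coming from the Python side of the title, a rough counterpart using the standard csv module might look like the sketch below. It is an illustration, not a port of the code above: the function name substitute_csv is made up, and unlike the join-based output in the Perl version it writes through csv.writer, so a field containing a comma is quoted again on output.

```python
import csv


def substitute_csv(in_path, out_path, mapping):
    """Parse with the csv module, replace whole fields, write valid CSV."""
    with open(in_path, newline="") as src, \
         open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, skipinitialspace=True)
        writer = csv.writer(dst, lineterminator="\n")
        for row in reader:
            # skipinitialspace drops spaces right after a delimiter; strip()
            # also removes trailing ones, similar in effect to allow_whitespace.
            cleaned = [field.strip() for field in row]
            writer.writerow([mapping.get(f, f) for f in cleaned])
```

The parser hands back fields with their surrounding quotes already removed, so the dictionary lookup needs no special cases for quoted keys.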
You're doing 13,491 gsub()s on every one of your 500,000 input lines - that's almost 7 billion full-line regexp search/replaces in total. So yes, that would take some time, and it's almost certainly corrupting your data in ways you just haven't noticed, as the result of one gsub() gets changed by the next gsub() and/or you get partial replacements!
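To make both failure modes concrete, here is a small Python illustration with made-up data (the mapping and the line are invented for this example, and str.replace stands in for gsub()). It contrasts replacing over the whole line rule by rule with looking up each whole field once.

```python
def chained_replace(line, mapping):
    """What repeated gsub() calls amount to: every rule rewrites the whole line."""
    for old, new in mapping.items():
        line = line.replace(old, new)
    return line


def field_lookup(line, mapping):
    """One pass: each whole field is looked up once and never revisited."""
    return ",".join(mapping.get(f, f) for f in line.split(","))


mapping = {"a1": "b2", "b2": "c3", "cat": "dog"}  # made-up rules
line = "a1,b2,catalog"

print(chained_replace(line, mapping))  # c3,c3,dogalog
print(field_lookup(line, mapping))     # b2,c3,catalog
print(13_491 * 500_000)                # 6745500000 replace calls in total
```

In the chained version the b2 produced by the first rule is rewritten again by the second, and catalog is partially replaced; the per-field lookup does neither.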
I saw in a comment that some of your fields can be surrounded by double quotes. If those fields can't contain commas or newlines and assuming you want full string matches then this is how to write it:
$ cat tst.awk
BEGIN { FS=OFS="," }
NR==FNR {
    map[$1] = $2
    map["\""$1"\""] = "\""$2"\""
    next
}
{
    for (i=1; i<=NF; i++) {
        if ($i in map) {
            $i = map[$i]
        }
    }
    print
}
I tested the above on a mapping file with 13,500 entries and an input file of 500,000 lines with multiple matches on most lines in cygwin on my underpowered laptop and it completed in about 1 second:
$ wc -l mapping.txt
13500 mapping.txt
$ wc -l file500k
500000 file500k
$ time awk -f tst.awk mapping.txt file500k > /dev/null
real 0m1.138s
user 0m1.109s
sys 0m0.015s
If that doesn't do exactly what you want efficiently then please edit your question to provide an MCVE and clearer requirements; see my comment under your question.