Compare two files on specific columns only line by line
Basically, you are doing a line-by-line comparison of two files on specific columns, excluding some columns. You can do all of that with GNU awk, which supports the word-boundary anchors \< and \>:
awk -F, -v skip='2,4,7' 'BEGIN{ filetwo=ARGV[1]; ARGV[1]=""; };{
getline lf2 <filetwo; split(lf2, arr, ",");
for (i=1; i<=NF; i++) {
if ( (skip !~ "\\<"i"\\>") && $i!=arr[i] ) {
print "Line#"FNR, "Column#" i " is different in two files."; mismatch=1; };
};
}; mismatch { print $0; print lf2; mismatch=0; };' file2 file1
Or, in any awk version:
awk -F, -v skip_cols='2,4,7' '
BEGIN{ filetwo=ARGV[1]; ARGV[1]=""; n=split(skip_cols, tmp, ","); for (k=1; k<=n; k++) skip[tmp[k]]=1; };{
getline lf2 <filetwo; split(lf2, arr, ",");
for (i=1; i<=NF; i++) {
if ( !(i in skip) && $i!=arr[i] ) {
print "Line#"FNR, "Column#" i " is different in two files."; mismatch=1; };
};
}; mismatch { print $0; print lf2; mismatch=0; };' file2 file1
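As a quick check, here is a sketch of a run on two made-up files. The file contents, the temporary directory and the choice of skipping only column 2 are assumptions for illustration. It uses the portable variant, with the column numbers to skip stored as array indexes:

```shell
# Hypothetical sample data: two 4-column CSV files in a scratch directory.
tmp=$(mktemp -d)
printf '%s\n' 'a,b,c,d' '1,2,3,4' > "$tmp/file1"
printf '%s\n' 'a,X,c,d' '1,2,9,4' > "$tmp/file2"

out=$(awk -F, -v skip_cols='2' '
BEGIN {
    filetwo = ARGV[1]; ARGV[1] = ""             # remember file2, then hide it from the main input loop
    n = split(skip_cols, tmp, ",")
    for (k = 1; k <= n; k++) skip[tmp[k]] = 1   # the column numbers become array indexes
}
{
    getline lf2 < filetwo                       # read the matching line of file2
    split(lf2, arr, ",")
    for (i = 1; i <= NF; i++) {
        if (!(i in skip) && $i != arr[i]) {
            print "Line#" FNR, "Column#" i " is different in two files."
            mismatch = 1
        }
    }
}
mismatch { print $0; print lf2; mismatch = 0 }' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$out"
```

With this sample data the difference in column 2 of the first line is ignored, and the output is:

Line#2 Column#3 is different in two files.
1,2,3,4
1,2,9,4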
Explaining the code:
- The BEGIN { ... } block: this is executed once, at the very start, before awk reads any input.
  - filetwo=ARGV[1]; : using ARGV, read the second element of the argument list (that is, file2) and save it in the filetwo variable; the first element, ARGV[0], is awk itself and the third one, ARGV[2], is file1.
  - ARGV[1]="" : after reading the argument's value we blank it, so awk will not find that argument (file2) among the files to read.
  - skip='2,4,7' : we define a variable skip (see Assignment Expressions; here it is set with -v on the command line) holding the numbers of the columns we want to ignore later.
- The getline command (see Using getline into a Variable from a File): we read a line from file2 and assign it to the variable lf2 (note that the filetwo variable now contains the file name we took from ARGV[1]).
- The split() function: we split the line read from file2, held in lf2, on the comma character and store the pieces in an array called arr; every field of that line can now be addressed as arr[1] (first field), arr[2] (second field), arr[3] (third), etc.
- Within the for loop we check two things:
  - That the value of the variable i, which holds the column number, is not found (!~) in the value of the skip variable: skip !~ "\\<"i"\\>". \< and \> are word-boundary anchors, specific to GNU awk, so i=2 will not match 22.
  - That the value of the column from file1 is not equal to the column with the same index from file2: $i!=arr[i]. If both conditions hold, we print the mismatched line number FNR and the differing column index i, and set a control variable mismatch=1.
- mismatch { print ... }: print the line from file1 followed by the line from file2 (held in lf2) only if a mismatch was detected, that is, if the mismatch variable was set within the if statement; then reset mismatch=0 for the next line.
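To illustrate why the test on i has to match whole numbers, here is a small sketch. The skip list 22,4,7 is made up for the demonstration. It should run in any awk, so it contrasts an unanchored match with the array lookup from the second command instead of using the GNU-only \< and \>:

```shell
out=$(awk 'BEGIN {
    skip_cols = "22,4,7"                          # hypothetical skip list
    n = split(skip_cols, tmp, ",")
    for (k = 1; k <= n; k++) skip[tmp[k]] = 1     # indexes 22, 4 and 7

    i = 2
    # Unanchored match: "2" is found inside "22", so column 2 would be skipped by mistake.
    print "unanchored:", ((skip_cols ~ i) ? "skipped" : "compared")
    # Exact index lookup: only 22, 4 and 7 are members, so column 2 is compared.
    print "array:", ((i in skip) ? "skipped" : "compared")
}')
printf '%s\n' "$out"
```

The first line reports column 2 as skipped and the second as compared; the word-boundary anchors in the GNU awk version are there to avoid the first behaviour.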
If I understand correctly:
- you want to do a for loop over all fields: for(i=1;i<=NF;i++) { ... }
- and inside it, you want to SKIP when i is one of 4 values (in awk, "continue" bypasses the rest of the current loop iteration and goes on to the next one)
A simple way to skip fields is to use the following technique:
BEGIN { skip[2]++; skip[3]++; skip[22]++; skip[23]++ }
....
for(i=1;i<=NF;i++) {
if (i in skip) { continue }  # skip the columns whose numbers are indexes of the skip array
...
Instead of defining "skip" in a BEGIN section, you could also keep the 4 indexes to be skipped in a file (one per line) and read that file under the NR==FNR condition, populating the skip array from it; then, when NR!=FNR (while reading the source file), you use the above method to skip those fields.
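A minimal sketch of that file-based variant follows. The file names skip.txt and data.csv, their contents, and the choice to simply print the remaining fields are all assumptions for illustration; put your own processing in place of the print:

```shell
tmp=$(mktemp -d)
printf '%s\n' 2 4 > "$tmp/skip.txt"                        # hypothetical: one column number per line
printf '%s\n' 'a,b,c,d,e' '1,2,3,4,5' > "$tmp/data.csv"    # hypothetical source file

out=$(awk -F, '
NR == FNR { skip[$1] = 1; next }      # first file: collect the column numbers to skip
{
    line = sep = ""
    for (i = 1; i <= NF; i++) {
        if (i in skip) continue       # go straight to the next column
        line = line sep $i            # placeholder processing: keep the field
        sep = ","
    }
    print line
}' "$tmp/skip.txt" "$tmp/data.csv")
printf '%s\n' "$out"
```

With these sample files, columns 2 and 4 are dropped and the output is a,c,e followed by 1,3,5.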