Match 4 columns and replace 1 in 2 files
The join
command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.
awk '{printf $2" "$3" "$4" "$5"%"$1"%"; $1="";print $0 "%" NR }' < file1 | sort > 1.tmp
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp
join -a 1 -t % -o 1.4 2.2 1.2 1.3 1.tmp 2.tmp | sort -t % -n | awk -F % '!$2{$2=$3}{print $2" "$4}'
Step by step
Preprocessing the first file:
awk '{printf $2" "$3" "$4" "$5"%"$1"%"; $1="";print $0 "%" NR }''
Example output:
1 118630 C T%1:118630% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4
Those 4 fields, separated by %
, are:
- the "key" that has to be matched (input fields 2-5)
- the original first column (needed in case there's no match)
- the remainder of the original line
- the original line number (so we can restore the file order after
sort
)
This output is piped through sort
and into a temporary file, because join
requires its inputs to have been sorted.
For the second file:
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'
Example output:
1 118630 C T%1:118630_C_T
1 118630 T C%1:118630_C_T
As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %
-separated fields here are
- the "key" to be matched
- column 2
Again, the output is piped through sort
and into another temporary file.
Then comes the main "join" step:
join -a 1 -t % -o 1.4 2.2 1.2 1.3 1.tmp 2.tmp
The -a 1
instructs join
to keep lines from the first set when there's no match in the second. -t %
sets the separator to %
(rather than whitespace). The -o
argument produces the following four fields of output:
- file 1, column 4: the line number
- file 2, column 2: replacement from
file2
(when there's no match, this will be empty) - file 1, column 2: the original column1 from
file1
- file 1, column 3: the rest of the line from
file1
Example output line:
4%1:118630_C_T%1:118630% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
Then sort
can restore the original file order (sort numerically, field separator %
)
sort -t % -n
The final awk
checks whether the "replacement" field is empty (because no match was found) and if so, uses the original column1 instead. It also discards the line number and all those %
s.
awk -F % '!$2{$2=$3}{print $2" "$4}'
Final output line:
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
I'd do this in Perl because it has a sort
function that lets us treat A T
and T A
as the same thing easily. For example (showing the output on all of your example files combined):
$ perl -lane 'if(!$k){$name{join("","chr".$F[0],$F[3],sort($F[4],$F[5]))}=$F[1]; }else{$var=join("", "chr".$F[1],$F[2],sort($F[3],$F[4])); $F[0]=$name{$var} if $name{$var};print join "\t", @F; } $k++ if eof' file2 file1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
7:10100610_G_A 7 10100610 A G 0.0002 0.13 0.58 8.2e-01 120658
7:10100610_G_C 7 10100610 C G 0.0013 0.1 0.13 4.4e-01 139170
10:1006107_C_G 10 1006107 C G 1 -0.11 0.42 7.9e-01 152016
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
1:118630_C_T 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311
1:533179_A_G 1 533179 A G 1 -0.098 0.19 6.1e-01 185906
Or, slightly more legibly:
$ perl -lane 'if(!$k){
$name{join("","chr".$F[0],$F[3],sort($F[4],$F[5]))}=$F[1];
}
else{
$var=join("", "chr".$F[1],$F[2],sort($F[3],$F[4]));
$F[0]=$name{$var} if $name{$var};
print join "\t", @F;
}
$k++ if eof' file2 file1
SNP Chr Pos EA NEA EAF Beta SE Pvalue Neff
7:10100610_G_A 7 10100610 A G 0.0002 0.13 0.58 8.2e-01 120658
7:10100610_G_C 7 10100610 C G 0.0013 0.1 0.13 4.4e-01 139170
10:1006107_C_G 10 1006107 C G 1 -0.11 0.42 7.9e-01 152016
1:79137_A_T 1 79137 A T 0.25 -0.026 0.0073 4.0e-04 231420
1:79033_A_G 1 79033 A G 0.0047 -0.038 0.056 4.9e-01 225429
Explanation
perl -lane
: the-a
makes perl act like awk, automatically splitting its input into the array@F
on whitespace. Since perl arrays start at0
,$F[0]
will be the first field,$F[1]
will be the second etc. Field N is$F[N-1]
. The-n
makes perl read its arguments as text files and apply the script given by-e
to each line of them. The-l
just removes trailing newlines from each input line and adds a newline to eachprint
call.$k++ if eof
: this increments the the variable$k
by 1 if we've reached the end of a file (eof
). We can then useif(!$k)
(if $k is not defined) as an equivalent toNR==FNR
in awk.if(!$k){$name{join("","chr".$F[0],$F[3],sort($F[4],$F[5]))}=$F[1];}
: if this is the first file,
file2, join fields 1, 4, and the sorted fields 5 and 6, into a string and use that string as the key in the hash (associative array)
name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treat
A Tand
T Aas equivalent. I use
"chr".$F[0]to deal with cases like
1 123and
11 23`, where the concatenation of the chromosome and position give the same number even though the chromosome is actually different.else{
: if we're now reading the second file,file1
.$var=join("", $F[1],$F[2],sort($F[3],$F[4]));
: build the key. This time using fields 2, 3, and sorting 4 and 5.$F[0]=$name{$var} if $name{$var};
: set the 1st field to the value stored in thename
hash, if there is a value for this key. Theif
is needed to make sure we don't change the header or any other variants that might be present infile1
but not infile2
.print join "\t", @F;
: print the fields, including the change just made above.