Match 4 columns and replace 1 in 2 files

The join command will do the work of joining up matching lines from multiple files. But it has some requirements on its input files, so you'll need to make some temporary files along the way, with a few extra fields.

awk '{printf $2" "$3" "$4" "$5"%"$1"%"; $1="";print $0 "%" NR }' < file1 | sort > 1.tmp
awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}' < file2 | sort > 2.tmp

join -a 1 -t % -o 1.4 2.2 1.2 1.3  1.tmp 2.tmp | sort -t % -n | awk -F %  '!$2{$2=$3}{print $2" "$4}'

Step by step

Preprocessing the first file:

awk '{printf $2" "$3" "$4" "$5"%"$1"%"; $1="";print $0 "%" NR }''

Example output:

1 118630 C T%1:118630% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311%4

Those 4 fields, separated by %, are:

the "key" that has to be matched (input fields 2-5)
the original first column (needed in case there's no match)
the remainder of the original line
the original line number (so we can restore the file order after sort)

This output is piped through sort and into a temporary file, because join requires its inputs to have been sorted.

For the second file:

awk '{print $1" "$4" "$5" "$6"%"$2} $5 != $6 {print $1" "$4" "$6" "$5"%"$2}'

Example output:

1 118630 C T%1:118630_C_T
1 118630 T C%1:118630_C_T

As you specified that fields 5 and 6 should match either way round, a second line is printed with them swapped (provided that they aren't identical). The %-separated fields here are

the "key" to be matched
column 2

Again, the output is piped through sort and into another temporary file.

Then comes the main "join" step:

join -a 1 -t % -o 1.4 2.2 1.2 1.3  1.tmp 2.tmp

The -a 1 instructs join to keep lines from the first set when there's no match in the second. -t % sets the separator to % (rather than whitespace). The -o argument produces the following four fields of output:

file 1, column 4: the line number
file 2, column 2: replacement from file2 (when there's no match, this will be empty)
file 1, column 2: the original column1 from file1
file 1, column 3: the rest of the line from file1

Example output line:

4%1:118630_C_T%1:118630% 1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311

Then sort can restore the original file order (sort numerically, field separator %)

sort -t % -n

The final awk checks whether the "replacement" field is empty (because no match was found) and if so, uses the original column1 instead. It also discards the line number and all those %s.

awk -F % '!$2{$2=$3}{print $2" "$4}'

Final output line:

1:118630_C_T  1 118630 C T 0.99 -0.033 0.055 5.5e-01 226311

I'd do this in Perl because it has a sort function that lets us treat A T and T A as the same thing easily. For example (showing the output on all of your example files combined):

$ perl -lane 'if(!$k){$name{join("","chr".$F[0],$F[3],sort($F[4],$F[5]))}=$F[1]; }else{$var=join("", "chr".$F[1],$F[2],sort($F[3],$F[4])); $F[0]=$name{$var} if $name{$var};print join "\t", @F; } $k++ if eof' file2 file1
SNP Chr Pos EA  NEA EAF Beta    SE  Pvalue  Neff
7:10100610_G_A  7   10100610    A   G   0.0002  0.13    0.58    8.2e-01 120658
7:10100610_G_C  7   10100610    C   G   0.0013  0.1 0.13    4.4e-01 139170
10:1006107_C_G  10  1006107 C   G   1   -0.11   0.42    7.9e-01 152016
1:79137_A_T 1   79137   A   T   0.25    -0.026  0.0073  4.0e-04 231420
1:79033_A_G 1   79033   A   G   0.0047  -0.038  0.056   4.9e-01 225429
1:118630_C_T    1   118630  C   T   0.99    -0.033  0.055   5.5e-01 226311
1:533179_A_G    1   533179  A   G   1   -0.098  0.19    6.1e-01 185906

Or, slightly more legibly:

$ perl -lane 'if(!$k){
                $name{join("","chr".$F[0],$F[3],sort($F[4],$F[5]))}=$F[1]; 
              }
              else{
                $var=join("", "chr".$F[1],$F[2],sort($F[3],$F[4])); 
                $F[0]=$name{$var} if $name{$var};
                print join "\t", @F; 
             } 
            $k++ if eof' file2 file1
SNP Chr Pos EA  NEA EAF Beta    SE  Pvalue  Neff
7:10100610_G_A  7   10100610    A   G   0.0002  0.13    0.58    8.2e-01 120658
7:10100610_G_C  7   10100610    C   G   0.0013  0.1 0.13    4.4e-01 139170
10:1006107_C_G  10  1006107 C   G   1   -0.11   0.42    7.9e-01 152016
1:79137_A_T 1   79137   A   T   0.25    -0.026  0.0073  4.0e-04 231420
1:79033_A_G 1   79033   A   G   0.0047  -0.038  0.056   4.9e-01 225429

Explanation

perl -lane: the -a makes perl act like awk, automatically splitting its input into the array @F on whitespace. Since perl arrays start at 0, $F[0] will be the first field, $F[1] will be the second etc. Field N is $F[N-1]. The -n makes perl read its arguments as text files and apply the script given by -e to each line of them. The -l just removes trailing newlines from each input line and adds a newline to each print call.
$k++ if eof : this increments the the variable $k by 1 if we've reached the end of a file (eof). We can then use if(!$k) (if $k is not defined) as an equivalent to NR==FNR in awk.
if(!$k){$name{join("","chr".$F[0],$F[3],sort($F[4],$F[5]))}=$F[1];}: if this is the first file,file2, join fields 1, 4, and the sorted fields 5 and 6, into a string and use that string as the key in the hash (associative array)name. Then, save the variant's name from file2 as the value associated with that key. The sorting lets us treatA TandT Aas equivalent. I use"chr".$F[0]to deal with cases like1 123and11 23`, where the concatenation of the chromosome and position give the same number even though the chromosome is actually different.
else{ : if we're now reading the second file, file1.
$var=join("", $F[1],$F[2],sort($F[3],$F[4])); : build the key. This time using fields 2, 3, and sorting 4 and 5.
$F[0]=$name{$var} if $name{$var}; : set the 1st field to the value stored in the name hash, if there is a value for this key. The if is needed to make sure we don't change the header or any other variants that might be present in file1 but not in file2.
print join "\t", @F; : print the fields, including the change just made above.

Match 4 columns and replace 1 in 2 files

Step by step

Explanation

Tags:

Awk

Text Processing

Related

Recent Posts