Extract subsequence corresponding to n:th pattern from a file
Using awk:
$ awk -f script.awk file
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC
Where script.awk is the following awk program:
/^Type:/ {
type = $2
anticodon = $4
split($6, pos, "-")
}
/^Seq:/ {
seq = substr($2, pos[1]-2, length(anticodon) + 4)
# or: seq = substr($2, pos[1]-2, pos[2]-pos[1]+5)
printf "Sequence: %s, Anticodon: %s, Type: %s\n", seq, anticodon, type
}
The first block is triggered by any line starting with the string Type:. It picks out the type and the anticodon from the 2nd and 4th whitespace-delimited fields, and splits the 6th field on - to get the start and end coordinates of the anticodon in the sequence.
The second block is triggered by a line starting with the string Seq:. It extracts the wanted subsequence from the 2nd whitespace-delimited field with substr(), using the anticodon's start position and length from the latest Type: line and widening the window by two bases on either side. The output line is then printed.
The following sed script uses the 8th "pattern" from the Str: line to extract the wanted sequence, rather than the numerical positions for the anticodon given on the Type: line.
/^Type:[[:blank:]]*/ {
s/.*Type: \([^[:blank:]]*\)[[:blank:]]*Anticodon: \([^[:blank:]]*\).*/ Anticodon: \2, Type: \1/
h
}
/^Seq:[[:blank:]]*/ {
s//Sequence: /
G
y/\n/,/
w data.tmp
}
/^Str:[[:blank:]]*/ {
s///
s,\(\(\([<>.]\)\3*\)\{7\}\)\(\([<>.]\)\5*\).*,s/: \1\\(\4\\)[^\,]*/: \\1/;n,
y/<>/../
w pass2.sed
}
d
(The trailing d is not a typo; it deletes the pattern space so that nothing is printed to the terminal.)
The script works in two passes. In the first pass, two new files are created, data.tmp and pass2.sed:
$ sed -f script.sed file
(there is no terminal output from this)
For the given data, data.tmp will look like
Sequence: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA, Anticodon: CAC, Type: Val
Sequence: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA, Anticodon: GAA, Type: Phe
Sequence: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA, Anticodon: GCC, Type: Gly
Sequence: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA, Anticodon: TAC, Type: Val
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG, Anticodon: TCA, Type: SeC
while pass2.sed is a sed script that post-processes this:
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ..............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: .................................\(.......\)[^,]*/: \1/;n
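To see how one line of pass2.sed comes about, the commands from the Str: block can be run by hand on a single structure line. This is only a sketch: it assumes GNU sed, spells out the `^Str:` regex instead of reusing it with `s///`, and the structure string is assumed to belong to the first Gly record.

```shell
#!/bin/bash
# Structure line; assumed to belong to the first Gly record.
line='Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'

generated=$(printf '%s\n' "$line" | sed \
    -e 's/^Str:[[:blank:]]*//' \
    -e 's,\(\(\([<>.]\)\3*\)\{7\}\)\(\([<>.]\)\5*\).*,s/: \1\\(\4\\)[^\,]*/: \\1/;n,' \
    -e 'y/<>/../')

# The result is itself a sed command: skip the first seven runs (now written
# as dots), capture the 8th run, and drop the rest of the sequence.
printf '%s\n' "$generated"
```

The first seven runs of this string add up to 30 characters, so the generated command starts with 30 dots before the captured group of 7.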
Applying pass2.sed to data.tmp gives you the final result:
$ sed -f pass2.sed data.tmp
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC
Note: I'm not sure how the second step performs on very large datasets.
Given that we can extract the starting index together with the anticodon:
len=7
prior=2
while IFS= read -r line; do
if [[ $line =~ Anticodon:" "([[:alpha:]]+)" at "([0-9]+) ]]; then
anticodon=${BASH_REMATCH[1]}
start=$(( BASH_REMATCH[2] - 1)) # string indexing is zero-based
elif [[ $line == "Seq: "* ]]; then
seq=${line#Seq: }
printf "Seq: %s, Anticodon: %s\n" "${seq:start-prior:len}" "$anticodon"
fi
done < file
A more complex solution that parses the "Str:" line each time, but does not hardcode the length as 7 (it does hardcode the "nth" pattern):
8thSeq() {
local seq=$1 str=$2
local last=${str:0:1}
local nth=8 n=1 start
for (( i=1; i < ${#str}; i++)); do
if [[ "${str:i:1}" != "$last" ]]; then
((n++))
if ((n == nth)); then
start=$i
elif ((n == nth+1)); then
echo "${seq:start:i-start}"
break
fi
fi
last=${str:i:1}
done
}
while IFS= read -r line; do
if [[ $line =~ Anticodon:" "([[:alpha:]]+) ]]; then
anticodon=${BASH_REMATCH[1]}
elif [[ $line == "Seq: "* ]]; then
seq=${line#Seq: }
elif [[ $line == "Str: "* ]]; then
str=${line#Str: }
printf "Seq: %s, Anticodon: %s\n" "$(8thSeq "$seq" "$str")" "$anticodon"
fi
done < file
Using the "more" data, both solutions output
Seq: CTCACAC, Anticodon: CAC
Seq: CTGAAGA, Anticodon: GAA
Seq: CTGCCAC, Anticodon: GCC
Seq: TTTACAC, Anticodon: TAC
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTTCAAA, Anticodon: TCA
Assuming that you need to parse the repetitions of the Str string:
start and end
Since the sequence of patterns could change for each block, we need a way to find the 8th pattern.
It is possible to extract each repeated "pattern" (from your description, anything that starts with a character and stops with the same character) from the str with (GNU) grep:
$ str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'
$ grep -Eo '(.)\1+' <<<"$str"
>>>>>>>
..
>>>>
.......
<<<<
>>>>>
.......
<<<<<
....
>>>>>
.......
<<<<<<<<<<<<
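Note that `(.)\1+` only reports runs of two or more characters, so the single `.` between `<<<<` and `>>>>>` and the final `.` are missing from the list above. A minimal sketch, assuming GNU grep, that uses `(.)\1*` to count single-character runs as well (the variable names are illustrative):

```shell
#!/bin/bash
str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'

# One run per line; '(.)\1*' also matches runs of length 1.
runs=$(printf '%s\n' "$str" | grep -Eo '(.)\1*')

count=$(printf '%s\n' "$runs" | grep -c .)    # number of runs
eighth=$(printf '%s\n' "$runs" | sed -n '8p') # the 8th run

printf 'runs: %s, 8th run: %s\n' "$count" "$eighth"
```

Counted this way the string has 14 runs, and the 8th one is the run of seven dots.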
So, the start and length of the 8th pattern (using the shell) are:
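pattern=8
# '(.)\1*' also matches single-character runs, which '(.)\1+' would skip
splitstr=( $(grep -Eo '(.)\1*' <<<"$str") )
start=0
for ((i=0; i<pattern-1; i++)); do
    start=$((start+${#splitstr[i]}))   # sum the lengths of the runs before the 8th
done
len=${#splitstr[pattern-1]}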
This works for any str that has 8 or more patterns.
Or, shorter, start and end:
start=$(echo "$str" | grep -Eo '^((.)\2+|.){7}'); start=${#start}
end=$(echo "$str" | grep -Eo '^((.)\2+|.){8}'); end=${#end}
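A short sketch of how these two numbers can be used to cut the 8th pattern out of a sequence. It assumes GNU grep; pairing this structure string with the first Gly sequence is also an assumption (their lengths match). Since start is a count of preceding characters, the pattern covers the 1-based positions start+1 through end.

```shell
#!/bin/bash
str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'
# Assumed to be the sequence that goes with the structure string above.
seq='GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA'

# Combined length of the first 7 and the first 8 runs.
start=$(printf '%s\n' "$str" | grep -Eo '^((.)\2+|.){7}'); start=${#start}
end=$(printf '%s\n' "$str" | grep -Eo '^((.)\2+|.){8}'); end=${#end}

# cut counts characters from 1, so the 8th run is characters start+1 .. end.
window=$(printf '%s\n' "$seq" | cut -c "$((start + 1))-$end")
printf 'start=%s end=%s window=%s\n' "$start" "$end" "$window"
```

In bash, `${seq:start:end-start}` gives the same substring, because string indexing there is zero-based.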
blocks
In AWK: It is possible (and simple) to break the file into blocks (lines separated by an empty line) by setting RS to the empty string "".
fields
If RS is "", each block is further divided into fields automatically by awk, with newlines acting as field separators too. The last field ($NF in awk parlance) is the str that contains the repeated characters.
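A tiny illustration of this paragraph mode on made-up input (the data here is hypothetical and only serves to show the record and field splitting):

```shell
#!/bin/bash
# Two blocks separated by an empty line; with RS set to the empty string,
# each block is one record and fields are split on blanks and newlines.
out=$(printf 'a b\nc\n\nd e f\n' | awk -v RS= '{ print NF, $NF }')
printf '%s\n' "$out"
```

Each block is reported as having three fields, and the last field is `c` for the first block and `f` for the second, even though `c` sits on its own line.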
So, in awk:
$ awk -vRS="" '{str=$NF; pat=8
cmd1="echo \"" str "\" | grep -Eo '\''^((.)\\2+|.){" pat-1 "}'\''";
cmd2="echo \"" str "\" | grep -Eo '\''^((.)\\2+|.){" pat "}'\''";
cmd1 | getline start ; close(cmd1) ; start=length(start)
cmd2 | getline end ; close(cmd2) ; end=length(end)
print "Start:",start,"End:",end,"Sequence:",substr($(NF-2),start,end-start),"Anticodon:",$9,"Type:",$7
}' biopattern.txt
Start: 30 End: 37 Sequence: CCTCCCA Anticodon: CCC Type: Gly
Start: 31 End: 38 Sequence: CCTCACA Anticodon: CAC Type: Val
Start: 31 End: 38 Sequence: ACTGAAG Anticodon: GAA Type: Phe
Start: 30 End: 37 Sequence: CCTGCCA Anticodon: GCC Type: Gly
Start: 31 End: 38 Sequence: CTTTACA Anticodon: TAC Type: Val
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 33 End: 40 Sequence: GCTTCAA Anticodon: TCA Type: SeC
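These are not the same results as in the other answers, which are based on the number after at. Each sequence here is shifted one base to the left: start holds the combined length of the first seven patterns, while awk's substr() counts characters from 1, so the 8th pattern actually begins at start+1. With substr($(NF-2), start+1, end-start), the first Val block gives CTCACAC, as in the other answers.
Is this what you meant?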