Extract subsequence corresponding to n:th pattern from a file

Using awk:

$ awk -f script.awk file
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC

Where script.awk is the following awk program:

/^Type:/ {
        type = $2
        anticodon = $4
        split($6, pos, "-")
}

/^Seq:/ {
        seq = substr($2, pos[1]-2, length(anticodon) + 4)
        # or: seq = substr($2, pos[1]-2, pos[2]-pos[1]+5)
        printf "Sequence: %s, Anticodon: %s, Type: %s\n", seq, anticodon, type
}

The first block is triggered by any line starting with the string Type:. It picks out the type and the anticodon from the 2nd and 4th whitespace-delimited fields, and splits the 6th field on - to get the start and end coordinates of the anticodon in the sequence.

The second block is triggered by any line starting with the string Seq:. It extracts the wanted subsequence from the 2nd whitespace-delimited field, using the anticodon's start position and length from the latest Type: line and including two bases on either side.

The output is then produced.
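
As a quick check of the substr() arithmetic, here is a minimal sketch that runs the same two blocks on a single record. The input layout (a Type: line whose 6th field is the anticodon range, followed by a Seq: line) is an assumption inferred from the fields the program reads, the range 34-36 is assumed for this Val sequence, and extract_window is just a name made up for the example.

```shell
# Hypothetical helper: the same two awk blocks, printing "window anticodon type".
extract_window() {
    awk '
    /^Type:/ { type = $2; anticodon = $4; split($6, pos, "-") }
    /^Seq:/  { print substr($2, pos[1] - 2, length(anticodon) + 4), anticodon, type }
    '
}

# One record in the assumed layout (anticodon CAC at positions 34-36, 1-based).
record='Type: Val Anticodon: CAC at 34-36
Seq: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA'

result=$(printf '%s\n' "$record" | extract_window)
echo "$result"    # CTCACAC CAC Val
```

With pos[1] = 34, substr() starts at character 32 and takes 3 + 4 = 7 characters, which is the anticodon plus two bases on each side.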

The following sed script uses the 8th "pattern" from the Str: line to extract the wanted sequence rather than the numerical positions for the anticodon given on the Type: line.

/^Type:[[:blank:]]*/ {
        s/.*Type: \([^[:blank:]]*\)[[:blank:]]*Anticodon: \([^[:blank:]]*\).*/ Anticodon: \2, Type: \1/
        h
}

/^Seq:[[:blank:]]*/ {
        s//Sequence: /
        G
        y/\n/,/
        w data.tmp
}

/^Str:[[:blank:]]*/ {
        s///
        s,\(\(\([<>.]\)\3*\)\{7\}\)\(\([<>.]\)\5*\).*,s/: \1\\(\4\\)[^\,]*/: \\1/;n,
        y/<>/../
        w pass2.sed
}

d

(the trailing d is not a typo).

It does so in two passes.

In the first pass, two new files are created, data.tmp and pass2.sed.

$ sed -f script.sed file

(there is no terminal output from this)

For the given data, data.tmp will look like

Sequence: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA, Anticodon: CAC, Type: Val
Sequence: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA, Anticodon: GAA, Type: Phe
Sequence: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA, Anticodon: GCC, Type: Gly
Sequence: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA, Anticodon: TAC, Type: Val
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG, Anticodon: TCA, Type: SeC

while pass2.sed is a sed script that post-processes this:

s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ..............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: .................................\(.......\)[^,]*/: \1/;n

Applying pass2.sed to data.tmp gives you the final result:

$ sed -f pass2.sed data.tmp
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC
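
To see what a single line of pass2.sed does, the sketch below applies the first generated substitution to the first line of data.tmp. The counts 31 and 7 are read off that first pass2.sed line; the variable names skip and keep are only for illustration.

```shell
# First line of data.tmp, copied from the listing above.
line='Sequence: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA, Anticodon: CAC, Type: Val'

# 31 dots skip the bases before the wanted run, 7 dots capture the run itself.
skip=$(printf '%31s' '' | tr ' ' '.')
keep=$(printf '%7s' '' | tr ' ' '.')

# Same shape as a pass2.sed line: keep the captured bases, drop the rest up to the comma.
result=$(printf '%s\n' "$line" | sed "s/: $skip\\($keep\\)[^,]*/: \\1/")
echo "$result"    # Sequence: CTCACAC, Anticodon: CAC, Type: Val
```

In pass2.sed itself each substitution is followed by n, so that the next generated command is applied to the next line of data.tmp.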

Note: I'm not sure how the second step performs on very large datasets.

Given that we can extract the starting index together with the anticodon:

len=7
prior=2

while IFS= read  -r line; do
    if [[ $line =~ Anticodon:" "([[:alpha:]]+)" at "([0-9]+) ]]; then
        anticodon=${BASH_REMATCH[1]}
        start=$(( BASH_REMATCH[2] - 1))  # string indexing is zero-based
    elif [[ $line == "Seq: "* ]]; then
        seq=${line#Seq: }
        printf "Seq: %s, Anticodon: %s\n" "${seq:start-prior:len}" "$anticodon"
    fi
done < file

A more complex solution that parses the "Str:" line each time, but does not hardcode the length as 7 (it does hardcode the "nth" pattern):

8thSeq() {
    local seq=$1 str=$2
    local last=${str:0:1}
    local nth=8 n=1 start

    for (( i=1; i < ${#str}; i++)); do
        if [[ "${str:i:1}" != "$last" ]]; then
            ((n++))
            if ((n == nth)); then
                start=$i
            elif ((n == nth+1)); then
                echo "${seq:start:i-start}"
                break
            fi
        fi
        last=${str:i:1}
    done
}

while IFS= read  -r line; do
    if [[ $line =~ Anticodon:" "([[:alpha:]]+) ]]; then
        anticodon=${BASH_REMATCH[1]}
    elif [[ $line == "Seq: "* ]]; then
        seq=${line#Seq: }
    elif [[ $line == "Str: "* ]]; then
        str=${line#Str: }
        printf "Seq: %s, Anticodon: %s\n" "$(8thSeq "$seq" "$str")" "$anticodon"
    fi
done < file

Using the "more" data, both solutions output

Seq: CTCACAC, Anticodon: CAC
Seq: CTGAAGA, Anticodon: GAA
Seq: CTGCCAC, Anticodon: GCC
Seq: TTTACAC, Anticodon: TAC
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTTCAAA, Anticodon: TCA

Assuming that you need to parse the repetitions of the Str string:

start and end

Since the sequence of patterns could change for each block, we need a way to find the 8th pattern.

It is possible to extract each repeated "pattern" (from your description, any run that starts and ends with the same character) from the str with (GNU) grep:

$ str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'

$ grep -Eo '(.)\1+' <<<"$str"
>>>>>>>
..
>>>>
.......
<<<<
>>>>>
.......
<<<<<
....
>>>>>
.......
<<<<<<<<<<<<

So, the start and length of the 8th pattern (using the shell) are:

pattern=8
splitstr=( $(grep -Eo '(.)\1+' <<<"$str") )
start=0
for ((i=0; i<pattern-1; i++)); do
    start=$((start+${#splitstr[i]}))    # add up the lengths of the runs before the wanted one
done
len=${#splitstr[pattern-1]}             # length of the wanted run

This works for any value of pattern, provided the string contains at least that many runs.
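
Note that (.)\1+ only matches runs of two or more characters, so the single . between <<<< and >>>>> in the example str is missing from the grep output above. A minimal sketch that also counts one-character runs, written with awk so that it does not depend on GNU grep, could look like this. The function name nth_run is made up for the example, and the offset it prints is zero-based.

```shell
# Hypothetical helper: print "offset length" of the n-th run of identical
# characters in a string (offset is zero-based); exit status 1 if there is no such run.
nth_run() {
    awk -v s="$1" -v nth="$2" 'BEGIN {
        n = 0; start = 1
        # i runs one past the end so that the last run is closed as well
        for (i = 2; i <= length(s) + 1; i++) {
            if (substr(s, i, 1) != substr(s, start, 1)) {   # the run ended at i-1
                if (++n == nth) { print start - 1, i - start; exit }
                start = i
            }
        }
        exit 1
    }'
}

str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'
result=$(nth_run "$str" 8)
echo "$result"    # 30 7
```

Counting the single . as a run makes the 8th run the second ....... group, 30 characters into the string.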

Or, shorter, start and end:

start=$(echo "$str" | grep -Eo '^((.)\2+|.){7}'); start=${#start}
  end=$(echo "$str" | grep -Eo '^((.)\2+|.){8}');   end=${#end}

blocks

In awk, it is possible (and simple) to break the file into blocks (groups of lines separated by an empty line) by setting RS to the empty string "".

fields

If RS is "", each block is further divided into fields automatically by awk, with newlines acting as field separators too. The last field ($NF in awk parlance) is then the str that contains the repeated characters.
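
A small sketch of this paragraph mode, using two made-up records. The layout (a five-word header line, then Type:, Seq: and Str: lines) and every value in it are placeholders chosen so that the type lands in $7 and the anticodon in $9; they are not real data.

```shell
# Two hypothetical records separated by an empty line.
records='r1 (1-4) Length: 4 bp
Type: Gly Anticodon: GCC at 1-3
Seq: ACGT
Str: >..<

r2 (1-5) Length: 5 bp
Type: Val Anticodon: CAC at 2-4
Seq: ACGTA
Str: >...<'

# With RS="" each block is one record; fields are split on blanks and newlines.
summary=$(printf '%s\n' "$records" |
    awk -v RS="" '{ print NR, $7, $9, $(NF-2), $NF }')
echo "$summary"
# 1 Gly GCC ACGT >..<
# 2 Val CAC ACGTA >...<
```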

So, in awk:

$ awk -vRS="" '{str=$NF; pat=8
cmd1="echo \"" str "\" | grep -Eo '\''^((.)\\2+|.){" pat-1 "}'\''";
cmd2="echo \"" str "\" | grep -Eo '\''^((.)\\2+|.){" pat   "}'\''";
cmd1 | getline start ; close(cmd1) ; start=length(start)
cmd2 | getline end   ; close(cmd2) ;   end=length(end)
print "Start:",start,"End:",end,"Sequence:",substr($(NF-2),start,end-start),"Anticodon:",$9,"Type:",$7
}' biopattern.txt


Start: 30 End: 37 Sequence: CCTCCCA Anticodon: CCC Type: Gly
Start: 31 End: 38 Sequence: CCTCACA Anticodon: CAC Type: Val
Start: 31 End: 38 Sequence: ACTGAAG Anticodon: GAA Type: Phe
Start: 30 End: 37 Sequence: CCTGCCA Anticodon: GCC Type: Gly
Start: 31 End: 38 Sequence: CTTTACA Anticodon: TAC Type: Val
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 33 End: 40 Sequence: GCTTCAA Anticodon: TCA Type: SeC

These results differ from those of the other answers, which are based on the number after "at": each sequence here starts one base earlier.
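
The one-base shift can be checked on a single record. Here start is the length of the first seven runs, that is, a count of the characters before the wanted run, while awk's substr() counts positions from 1, so substr(seq, start, ...) begins on the last character before the run. The sketch below uses the Gly/GCC sequence from the data.tmp listing earlier on this page together with the Start and End values printed for it above; pairing them this way is an assumption.

```shell
# Gly (GCC) sequence, with the Start/End values reported for it above.
seq=GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA
start=30
end=37

# As in the command above: substr() begins at position "start".
as_is=$(awk -v s="$seq" -v a="$start" -v e="$end" 'BEGIN { print substr(s, a, e - a) }')

# Beginning one position later gives the run itself.
shifted=$(awk -v s="$seq" -v a="$start" -v e="$end" 'BEGIN { print substr(s, a + 1, e - a) }')

echo "$as_is $shifted"    # CCTGCCA CTGCCAC
```

The second value is the one the other answers print for this record.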

Maybe: Is this what you meant?

