How to sort by odd lines then remove repeated values?

Pure gawk solution:

awk -F_ 'NR%2{i=$2;next}{a[i]=a[i]"\n"$0}
         END{PROCINFO["sorted_in"]="@ind_num_asc";
             for(i in a) printf "%s","transcr_"i""a[i]"\n"}' file

The trick is to traverse the indexes of array a in ascending numeric order, which gawk's special PROCINFO array makes possible through PROCINFO["sorted_in"].

transcr_7135
YBL029C-A -
YBL029W +
transcr_11317
YBL067C -
transcr_20649
YBL100C -
transcr_25793
YAL039C -
YAL037C-B -
YAL038W +
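
To see what PROCINFO["sorted_in"] changes, here is a minimal sketch that traverses the same array in numeric and in string order. The helper name sorted_keys and the key list are only illustrative (the keys are the transcript numbers from the output above), and it assumes gawk is installed, since PROCINFO["sorted_in"] is a gawk extension.

```shell
# Hypothetical helper: print four sample keys in the traversal order named by $1.
# Requires GNU awk; PROCINFO["sorted_in"] is a gawk extension.
sorted_keys() {
  gawk -v order="$1" 'BEGIN {
    n = split("25793 7135 20649 11317", keys, " ")
    for (k = 1; k <= n; k++) seen[keys[k]] = 1   # the sample keys become array indexes
    PROCINFO["sorted_in"] = order                # controls the order of "for (i in seen)"
    for (i in seen) out = out (out == "" ? "" : " ") i
    print out
  }'
}

if command -v gawk >/dev/null 2>&1; then
  sorted_keys "@ind_num_asc"   # prints: 7135 11317 20649 25793
  sorted_keys "@ind_str_asc"   # prints: 11317 20649 25793 7135
fi
```

With "@ind_str_asc" the indexes compare as strings, so 7135 lands last; "@ind_num_asc" gives the numeric order the answer relies on.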

BTW, it's a pity awk doesn't offer an option for natural sorting, a.k.a. version sort (ordering text with embedded numbers by their numeric value).

Not exactly the sorting order you showed, but maybe right as well?

$ cat input.txt|paste - -| sort -k1,1V -k2,2| tr "\t" "\n" | awk '{if($0 in line == 0) {line[$0]; print}}'
    transcr_7135 +
    YBL029C-A -
    YBL029W +
    transcr_11317 +
    YBL067C -
    transcr_20649 +
    YBL100C -
    transcr_25793 +
    YAL037C-B -
    YAL038W +
    YAL039C -

EDIT:

Inserting the line number and using it as a secondary sort key should produce exactly the output you want:

$ cat input.txt | paste - - | nl | sort -k2,2V -k1,1g | cut -f2- | tr "\t" "\n" | awk '{if($0 in line == 0) {line[$0]; print}}'
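
A sketch of how that decorate-sort-undecorate pipeline behaves on a tiny input. The three pairs are made up for illustration (names borrowed from the output above), the dedup step uses the shorter awk idiom '!seen[$0]++' in place of the explicit if, and GNU sort is assumed for the V key.

```shell
# Hypothetical sample: odd lines are transcript IDs, even lines are genes.
sample='transcr_25793
YAL039C -
transcr_7135
YBL029W +
transcr_25793
YAL037C-B -'

# Stage 1: join each pair with a TAB, then prefix every pair with its line number.
numbered=$(printf '%s\n' "$sample" | paste - - | nl)

# Stage 2: version-sort on the ID (field 2), break ties by the line number (field 1),
# drop the number again, split pairs back into lines, keep the first copy of each line.
result=$(printf '%s\n' "$numbered" |
  sort -k2,2V -k1,1g |
  cut -f2- |
  tr '\t' '\n' |
  awk '!seen[$0]++')

printf '%s\n' "$result"
```

Because the line number breaks ties, YAL039C stays ahead of YAL037C-B under transcr_25793, just as in the input.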

With GNU sort and assuming the lines don't contain TAB characters:

paste - - < file | sort -V | tr '\t' '\n' | awk '!seen[$0]++'

Or use sort -t$'\t' -sk1,1V in place of sort -V to preserve the original order of entries with identical odd lines, as in your expected output.
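
For illustration, a run of the stable variant on a made-up input. The sample below is an assumption (a guess at what the input might look like, built from the names in the expected output), and the TAB is produced with printf so the sketch also works in shells without $'\t'. GNU sort is assumed.

```shell
# Hypothetical sample input: odd lines are transcript IDs, even lines are genes.
sample='transcr_25793
YAL039C -
transcr_20649
YBL100C -
transcr_7135
YBL029C-A -
transcr_11317
YBL067C -
transcr_25793
YAL037C-B -
transcr_7135
YBL029W +
transcr_25793
YAL038W +'

tab=$(printf '\t')   # portable stand-in for $'\t'

result=$(printf '%s\n' "$sample" |
  paste - - |                  # join each odd/even pair with a TAB
  sort -t "$tab" -s -k1,1V |   # version-sort on the ID only; -s keeps input order on ties
  tr '\t' '\n' |               # split the pairs back into two lines
  awk '!seen[$0]++')           # print each distinct line once (drops repeated IDs)

printf '%s\n' "$result"
```

On this sample the result is each transcript ID once, in ascending numeric order, followed by its genes in their original input order.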

If you don't have GNU sort, and assuming the odd lines always follow that pattern, you can replace sort -V with sort -k1.9n.
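
A small sketch of the non-GNU variant, again on made-up sample data. The key -k1.9n starts at the 9th character of the line, right after the 8-character prefix "transcr_", and compares numerically from there. LC_ALL=C is set here only to make the tie-breaking order predictable.

```shell
# Hypothetical sample: odd lines are transcript IDs, even lines are genes.
sample='transcr_25793
YAL039C -
transcr_7135
YBL029W +
transcr_25793
YAL037C-B -'

result=$(printf '%s\n' "$sample" |
  paste - - |
  LC_ALL=C sort -k1.9n |   # numeric key starting at character 9 of field 1
  tr '\t' '\n' |
  awk '!seen[$0]++')

printf '%s\n' "$result"
```

Without a stable sort, pairs with the same ID fall back to a whole-line comparison, so here YAL037C-B comes out before YAL039C rather than in input order.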
