How to sort by odd lines then remove repeated values?
Pure gawk solution:
awk -F_ 'NR%2{i=$2;next}{a[i]=a[i]"\n"$0}
END{PROCINFO["sorted_in"]="@ind_num_asc";
for(i in a) printf "%s","transcr_"i""a[i]"\n"}' file
The trick is to sort the indexes of array a numerically, with a little help from gawk's PROCINFO special array.
transcr_7135
YBL029C-A -
YBL029W +
transcr_11317
YBL067C -
transcr_20649
YBL100C -
transcr_25793
YAL039C -
YAL037C-B -
YAL038W +
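To see the PROCINFO trick in isolation, here is a minimal sketch. The three keys are made-up sample values, and it assumes gawk is installed (PROCINFO["sorted_in"] is a gawk extension, not available in other awks). Without that setting, the traversal order of `for (k in a)` is unspecified.

```shell
# Hypothetical keys, deliberately out of numeric order.
sorted=$(printf '%s\n' 20649 7135 11317 | gawk '
  { a[$1] }                                  # store each key as an array index
  END {
    PROCINFO["sorted_in"] = "@ind_num_asc"   # traverse indexes in ascending numeric order
    for (k in a) print k
  }')
printf '%s\n' "$sorted"
```

This should print 7135, 11317 and 20649, one per line, which is the same ordering the solution relies on for the transcr_ numbers.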
BTW, it's a pity awk doesn't offer an option to sort naturally, a.k.a. version sort (ordering text with embedded numbers).
Not exactly the sort order you showed, but maybe acceptable as well?
$ cat input.txt|paste - -| sort -k1,1V -k2,2| tr "\t" "\n" | awk '{if($0 in line == 0) {line[$0]; print}}'
transcr_7135 +
YBL029C-A -
YBL029W +
transcr_11317 +
YBL067C -
transcr_20649 +
YBL100C -
transcr_25793 +
YAL037C-B -
YAL038W +
YAL039C -
EDIT:
Inserting the line number and using it as a secondary sort key should produce exactly the output you want:
$ cat input.txt | paste - - | nl | sort -k2,2V -k1,1g | cut -f2- | tr "\t" "\n" | awk '{if($0 in line == 0) {line[$0]; print}}'
With GNU sort, and assuming the lines don't contain TAB characters:
paste - - < file | sort -V | tr '\t' '\n' | awk '!seen[$0]++'
Or sort -t$'\t' -sk1,1V to preserve the original order for entries with identical odd lines, as in your expected output.
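As a rough illustration of the stable variant, the sketch below runs the whole pipeline on a small made-up sample. It assumes GNU sort (for the V key modifier) and builds the TAB delimiter with printf instead of $'\t' so that it also works in shells without that quoting form; the variable names are only for illustration.

```shell
# Hypothetical sample: odd lines are headers, even lines are the values that follow them.
sample=$(printf '%s\n' \
  'transcr_25793' 'YAL039C -' \
  'transcr_7135'  'YBL029C-A -' \
  'transcr_25793' 'YAL037C-B -' \
  'transcr_7135'  'YBL029W +' \
  'transcr_25793' 'YAL038W +')

tab=$(printf '\t')   # literal TAB, used as the field separator for sort

# Join pairs, version-sort on the header only (stable), split again, drop repeated lines.
stable=$(printf '%s\n' "$sample" | paste - - | sort -t "$tab" -s -k1,1V \
  | tr '\t' '\n' | awk '!seen[$0]++')
printf '%s\n' "$stable"
```

Because of -s, the values under transcr_25793 should stay in their input order (YAL039C, YAL037C-B, YAL038W) rather than being sorted alphabetically.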
If you don't have GNU sort, and assuming the odd lines always follow that pattern, you can replace sort -V with sort -k1.9n.
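A minimal sketch of that portable variant, on made-up sample data: "transcr_" is 8 characters long, so -k1.9n starts the numeric key at the 9th character of the first field, which is the first digit. This only holds if every odd line really has the transcr_<number> form.

```shell
# Hypothetical sample: odd lines are transcr_<number> headers, even lines are values.
sample=$(printf '%s\n' \
  'transcr_25793' 'YAL039C -' \
  'transcr_7135'  'YBL029C-A -' \
  'transcr_11317' 'YBL067C -' \
  'transcr_7135'  'YBL029W +')

# Same pipeline as above, but with a numeric key starting at character 9 instead of -V.
result=$(printf '%s\n' "$sample" | paste - - | sort -k1.9n \
  | tr '\t' '\n' | awk '!seen[$0]++')
printf '%s\n' "$result"
```

Note that without -s, lines with the same number are ordered by sort's fallback whole-line comparison, not by their input order.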