Extract the start and end position of a common identifier

With GNU datamash:

datamash -H -W -g 1,2 min 3 max 4 <input
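As a rough illustration of what the flags do: -H treats the first line as a header, -W splits fields on whitespace, and -g 1,2 groups on the Id and Chr columns before taking the minimum of column 3 and the maximum of column 4. The sketch below assumes datamash is installed and uses a made-up file name, sample.txt, holding only the Prom_1 rows that appear further down this page. Note that datamash groups consecutive lines, so unsorted input would also need -s.

```bash
# Hypothetical sample file; the rows are the Prom_1 rows quoted later in this thread.
cat > sample.txt <<'EOF'
Id Chr Start End
Prom_1 chr1 3978952 3978953
Prom_1 chr1 3979165 3979166
Prom_1 chr1 3979192 3979193
EOF

if command -v datamash >/dev/null 2>&1; then
  # -H: first line is a header; -W: split on whitespace; -g 1,2: group by Id and Chr
  out=$(datamash -H -W -g 1,2 min 3 max 4 < sample.txt)
  printf '%s\n' "$out"
else
  echo "datamash is not installed" >&2
fi
```

The header line that datamash prints is generated from the operations, so it will not be identical to the header of the input file.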

This can be done with a classic loop that reads the file, or with other tools such as awk, but I don't know awk well enough to offer a pure awk solution. The solution below works in bash and uses only simple awk, grep, and arrays.

With a known id (by parameter or by user input)

id="Prom_1" # or, for user input: read -p "Give Id: " id
header=$(head -1 a.txt) # get the first line and store it as the header
data=($(grep -w -- "$id" a.txt)) # grep the file for the given id and fill an array; -w stops Prom_1 from also matching Prom_10
echo "$header"
echo -e "${data[0]}\t${data[1]}\t${data[2]}\t${data[-1]}" # data[-1] refers to the last element of the data array
#Output:
Id       Chr     Start   End  
Prom_1  chr1    3978952 3979193

The trick is that the array receives all of grep's output split on whitespace (the default IFS), so the first Start is always element 2 and the last End is always the last element. The array looks like this:

root@debi64:# id="Prom_1";data=($(grep $id a.txt));declare -p data
declare -a data=([0]="Prom_1" [1]="chr1" [2]="3978952" [3]="3978953" [4]=$'\nProm_1' [5]="chr1" [6]="3979165" [7]="3979166" [8]=$'\nProm_1' [9]="chr1" [10]="3979192" [11]="3979193")
#declare -p command just prints out all the data of the array (keys and values)
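The word splitting behind this can be seen in isolation. This is a minimal sketch, assuming the three Prom_1 rows from the dump above are held in a variable instead of coming from grep; the variable name rows is made up for the illustration.

```bash
# The three Prom_1 rows, as listed in the array dump above.
rows='Prom_1 chr1 3978952 3978953
Prom_1 chr1 3979165 3979166
Prom_1 chr1 3979192 3979193'

# Unquoted $rows is split on IFS (space, tab, newline): one word per array element.
data=($rows)

echo "${#data[@]}"               # number of elements: 3 rows x 4 fields
echo "${data[2]}"                # Start of the first row
echo "${data[${#data[@]}-1]}"    # End of the last row; same element as ${data[-1]} on newer bash
```

Indexing with ${#data[@]}-1 is a portable way to reach the last element when negative subscripts are not available.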

To scan the file for ids automatically, you can use uniq like this:

readarray -t ids < <(awk -F" " '{print $1}' a.txt | uniq | tail -n+2)
# Print the first field (the id) of every line, collapse repeated ids with uniq, drop the header line, and store the result in an array.
# readarray is the better choice here because it splits the data on newlines.
declare -p ids
#Output: declare -a ids=([0]="Prom_1" [1]="Prom_2" [2]="Prom_3")
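One caveat: uniq only collapses adjacent duplicates, so the pipeline above relies on the rows being grouped by id. If that is not guaranteed, the ids could be collected in awk instead. This is a sketch under that assumption; sample.txt and its Prom_2 row are invented for the illustration.

```bash
# Hypothetical sample file in which the Prom_1 rows are not adjacent.
cat > sample.txt <<'EOF'
Id Chr Start End
Prom_1 chr1 3978952 3978953
Prom_2 chr1 4379047 4379048
Prom_1 chr1 3979192 3979193
EOF

# Skip the header (NR > 1) and print each id only the first time it is seen.
readarray -t ids < <(awk 'NR > 1 && !seen[$1]++ { print $1 }' sample.txt)
declare -p ids
```

Because the header is skipped inside awk, the tail -n+2 step is no longer needed.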

Putting it all together:

header=$(head -1 a.txt) # get the first line and store it as the header
readarray -t ids < <(awk -F" " '{print $1}' a.txt | uniq | tail -n+2)
echo "$header"
for id in "${ids[@]}"
do
  data=($(grep -w -- "$id" a.txt)) # -w stops Prom_1 from also matching Prom_10
  echo -e "${data[0]}\t${data[1]}\t${data[2]}\t${data[-1]}"
done

#Output 
Id       Chr     Start   End  
Prom_1  chr1    3978952 3979193
Prom_2  chr1    4379047 4379622
Prom_3  chr1    5184469 5184496
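The loop above runs grep once per id, so the file is read many times. The "classic loop" mentioned at the start could instead make a single pass. The following is only a sketch of that idea, assuming bash 4 or later for associative arrays; the file name sample.txt and the two Prom_2 rows are invented so that they agree with the output shown above.

```bash
# Hypothetical sample input; only the Prom_1 rows are taken from the array dump earlier.
cat > sample.txt <<'EOF'
Id Chr Start End
Prom_1 chr1 3978952 3978953
Prom_1 chr1 3979165 3979166
Prom_1 chr1 3979192 3979193
Prom_2 chr1 4379047 4379048
Prom_2 chr1 4379621 4379622
EOF

declare -A chr start end   # per-id lookups (associative arrays, bash 4+)
order=()                   # ids in order of first appearance

{
  read -r header           # the first line is the header
  while read -r id c s e; do
    if [[ -z ${start[$id]+set} ]]; then   # first row seen for this id
      order+=("$id"); chr[$id]=$c; start[$id]=$s
    fi
    end[$id]=$e            # overwritten on every row, so the last row wins
  done
} < sample.txt

printf '%s\n' "$header"
for id in "${order[@]}"; do
  printf '%s\t%s\t%s\t%s\n' "$id" "${chr[$id]}" "${start[$id]}" "${end[$id]}"
done
```

Like the grep version, this takes the Start of the first row and the End of the last row for each id rather than a true minimum and maximum.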

You can try this awk:

$ awk 'NR==1{print; next}NR!=1{if(!($1 in Arr)){printf("\t%s\n%s\t%s\t%s",a,$1,$2,$3);Arr[$1]++}else{a=$NF}}END{printf("\t%s\n",a)}' input.txt
Id       Chr     Start   End

Prom_1  chr1    3978952 3979193
Prom_2  chr1    4379047 4379622
Prom_3  chr1    5184469 5184496

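The same script, expanded for readability:

awk '
NR==1{print; next}                 # print the header unchanged
NR!=1{
    if(!($1 in Arr))               # first row of a new id
    {
        # finish the previous output line with the saved End, then start a new one
        printf("\t%s\n%s\t%s\t%s",a,$1,$2,$3);Arr[$1]++;
    }
    else
    {
        a=$NF                      # remember the End of the latest row for this id
    }
}
END{
    printf("\t%s\n",a)             # finish the last output line
}' input.txt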
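Two side effects of the script above are worth knowing. The first data row triggers a printf while a is still empty, which produces the empty-looking line after the header. And because a is only updated in the else branch, an id that has a single row would be printed with the End of the previous id. A possible variant that avoids both is sketched below; sample.txt and the single-row id Prom_9 are invented for the illustration.

```bash
# Hypothetical sample input; Prom_9 is a made-up id with only one row.
cat > sample.txt <<'EOF'
Id Chr Start End
Prom_1 chr1 3978952 3978953
Prom_1 chr1 3979165 3979166
Prom_1 chr1 3979192 3979193
Prom_9 chr1 100 200
EOF

out=$(awk '
NR == 1 { print; next }                 # pass the header through
!($1 in start) {                        # first row of an id
    order[++n] = $1; chr[$1] = $2; start[$1] = $3
}
{ end[$1] = $NF }                       # every data row updates End
END {
    for (i = 1; i <= n; i++) {
        id = order[i]
        printf "%s\t%s\t%s\t%s\n", id, chr[id], start[id], end[id]
    }
}' sample.txt)
printf '%s\n' "$out"
```

The order array keeps the ids in the order they first appear, since awk does not guarantee any order when iterating with for (id in start).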


