Create an "index" for each element of a group with data.table

First, I'll load your sample data into R (you can't currently use dput() with data.table):

df <- read.table(header = TRUE, stringsAsFactors = FALSE, text = "
V1      V2      V3 V4 V5                 V6
1  chr1 3205901 3207317  .  - ENSMUSG00000051951
2  chr1 3206523 3207317  .  - ENSMUSG00000051951
3  chr1 3213439 3215632  .  - ENSMUSG00000051951
4  chr1 3213609 3216344  .  - ENSMUSG00000051951
5  chr1 3214482 3216968  .  - ENSMUSG00000051951
6  chr1 3421702 3421901  .  - ENSMUSG00000051951
7  chr1 3102016 3102125  .  + ENSMUSG00000064842
8  chr1 3466587 3466687  .  + ENSMUSG00000089699
9  chr1 3513405 3513553  .  + ENSMUSG00000089699
10 chr1 3054233 3054733  .  + ENSMUSG00000090025")

dplyr gets you most of the way to an elegant solution:

library(dplyr)

df %>% 
  group_by(V6, V5) %>%
  mutate(index = row_number(V2))

(I've assumed V2 is the variable you want to index by. I think it's better to be explicit than to rely on the order of the rows.)
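To make the "explicit" point concrete, here is a minimal sketch of how row_number() ranks by the values it is given rather than by row position. The vector starts is a hypothetical stand-in for V2, not part of the question's data.

```r
library(dplyr)

# Hypothetical start positions, deliberately out of order
starts <- c(30, 10, 20)

# Rank by value: the smallest start gets index 1
asc_index <- row_number(starts)         # 3 1 2

# Wrapping in desc() reverses the ranking: the largest start gets index 1
desc_index <- row_number(desc(starts))  # 1 3 2

print(asc_index)
print(desc_index)
```

Because the ranking comes from the values, the result does not change if the rows are shuffled first.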

But you want a different summary for different subsets, which isn't currently easy in dplyr. One approach would be to split and then re-combine:

bind_rows(
  df %>% filter(V5 == "+") %>% mutate(index = row_number(V2)),
  df %>% filter(V5 == "-") %>% mutate(index = row_number(desc(V2)))
)

But this is going to be relatively slow since you have to make two copies of the data.

Another approach would be to use an if inside the mutate:

df %>% 
  group_by(V6, V5) %>%
  mutate(index = row_number(if (V5[1] == "+") V2 else desc(V2)))

As a fellow bioinformatician, I come across this operation quite frequently. This is where I adore data.table's ability to modify a subset of rows by reference!

I'd do it like this:

library(data.table)
dt <- as.data.table(df)
dt[V5 == "+", index := 1:.N, by = V6]
dt[V5 == "-", index := .N:1, by = V6]

No helper functions are required. This has a small advantage: it avoids checking whether V5 is "+" or "-" once for every group. Instead, you subset all the "+" rows once, then group by V6 and modify just those rows in place.

Then you do the same once more for "-". Hope that helps.

Note: .N is a special variable that contains the number of observations per group.
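Since 1:.N and .N:1 number rows in the order they appear within each group, the result depends on how the table is sorted. Below is a minimal sketch, assuming you want the index to follow V2: it sorts first with setorder() and then applies the same two subset-and-assign steps. The table toy and its values are hypothetical, and seq_len(.N) is used in place of 1:.N (the two are equivalent here, since every group has at least one row).

```r
library(data.table)

# Hypothetical toy data: one "+" gene and one "-" gene, rows out of order
toy <- data.table(
  V2 = c(300L, 100L, 200L, 50L, 70L),
  V5 = c("+", "+", "+", "-", "-"),
  V6 = c("geneA", "geneA", "geneA", "geneB", "geneB")
)

# Sort by gene, then start position, so row order within a group follows V2
setorder(toy, V6, V2)

# Subset each strand once, then group by gene and assign by reference
toy[V5 == "+", index := seq_len(.N), by = V6]
toy[V5 == "-", index := rev(seq_len(.N)), by = V6]

print(toy)
```

On the "+" strand the smallest V2 in each gene gets index 1; on the "-" strand the largest V2 does.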
