Sum by distinct column value in R
I think the neatest way to do this is with dplyr:
library(dplyr)

shop %>%
  group_by(shop_id, shop_name, city) %>%
  summarise_all(sum)
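In newer dplyr (1.0+), summarise_all() is superseded by across(); a minimal sketch of the same aggregation with that idiom, assuming shop has numeric sale and profit columns:

library(dplyr)

shop %>%
  group_by(shop_id, shop_name, city) %>%
  # sum only the named numeric columns, then drop the grouping
  summarise(across(c(sale, profit), sum), .groups = "drop")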
**Obligatory data.table answer**
> library(data.table)
data.table 1.8.0 For help type: help("data.table")
> shop.dt <- data.table(shop)
> shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id']
shop_id sale profit
[1,] 1 26 7
[2,] 2 15 6
[3,] 3 28 14
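The same query in current data.table syntax, where .() is shorthand for list():

library(data.table)

shop.dt <- as.data.table(shop)
# .(...) is an alias for list(...) inside a data.table call
shop.dt[, .(sale = sum(sale), profit = sum(profit)), by = shop_id]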
Which is all well and good until things get bigger...
library(plyr)  # for ddply() below

shop <- data.frame(shop_id = letters[1:10], profit = rnorm(1e7), sale = rnorm(1e7))
shop.dt <- data.table(shop)
> system.time(ddply(shop, .(shop_id), summarise, sale=sum(sale), profit=sum(profit)))
user system elapsed
4.156 1.324 5.514
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.728 0.108 0.840
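For comparison, the dplyr version from the top of the answer can be timed the same way (a sketch; the numbers will of course vary by machine and package version):

library(dplyr)

system.time(
  shop %>%
    group_by(shop_id) %>%
    summarise(sale = sum(sale), profit = sum(profit))
)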
You get additional speed increases if you create the data.table with a key:
shop.dt <- data.table(shop, key='shop_id')
> system.time(shop.dt[,list(sale=sum(sale), profit=sum(profit)), by='shop_id'])
user system elapsed
0.252 0.084 0.336
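If the data.table already exists, setting the key in place with setkey() has the same effect as supplying key= at creation time, for example:

# sort shop.dt by shop_id and mark it as the key
setkey(shop.dt, shop_id)
system.time(shop.dt[, .(sale = sum(sale), profit = sum(profit)), by = shop_id])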
Here's how to use base R to speed up operations like this:
# split row indices by shop_id, then sum each group's columns
idx <- split(seq_len(nrow(shop)), shop$shop_id)
a2 <- data.frame(shop_id = names(idx),
                 sale = sapply(idx, function(i) sum(shop$sale[i])),
                 profit = sapply(idx, function(i) sum(shop$profit[i])))
On my system, the time drops to about 0.75 seconds, versus 5.70 seconds for the ddply summarise version.
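Base R also has rowsum(), which computes grouped column sums in compiled code; a minimal sketch on the same shop data frame (sums and a3 are just illustrative names):

# one row per shop_id, numeric columns summed within each group
sums <- rowsum(shop[, c("sale", "profit")], group = shop$shop_id)
a3 <- data.frame(shop_id = rownames(sums), sums, row.names = NULL)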