Minus operation of data frames

I ran into this exact issue a few months back and dug the answer out of my Evernote one-liners.

Note: This is not my solution. Credit goes to whoever wrote it (whom I can't seem to find at the moment).

If you don't care about row names, you can do:

df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]
#   c1 c2
# 1  a  1
# 2  b  2
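
To see why the one-liner works, here is the same expression unrolled step by step. This is only a sketch: the example data is assumed to match the c1/c2 frames used elsewhere in this thread, and the intermediate names (stacked, dup, dup_df1, result) are made up for illustration.

```r
# Example data, assumed to match the c1/c2 columns used in this thread
df1 <- data.frame(c1 = c("a", "b", "c", "d"), c2 = c(1, 2, 3, 4))
df2 <- data.frame(c1 = c("c", "d", "e", "f"), c2 = c(3, 4, 5, 6))

# Stack df2 on top of df1, so any df1 row that also occurs in df2
# appears after its df2 copy
stacked <- rbind(df2, df1)

# duplicated() flags a row as TRUE when an identical row occurred earlier
dup <- duplicated(stacked)

# Drop the flags that belong to the df2 rows, leaving one flag per df1 row
dup_df1 <- dup[-seq_len(nrow(df2))]

# Keep only the df1 rows that were not flagged
result <- df1[!dup_df1, ]
result
```

Two things to watch for: a row that appears twice within df1 is also dropped on its second occurrence, and if df2 has zero rows the negative index becomes empty, so the expression returns no rows instead of all of df1.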

Edit: A data.table solution:

library(data.table)
dt1 <- data.table(df1, key="c1")  # key on the column to match
dt2 <- data.table(df2)
dt1[!dt2]                         # anti-join: rows of dt1 with no match in dt2

Or, better, a one-liner (data.table v1.9.6+):

setDT(df1)[!df2, on="c1"]

This returns all rows of df1 whose c1 value has no match in df2$c1. Note that setDT() converts df1 to a data.table by reference.
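
Because the join above matches on c1 only, it can give a different answer from a whole-row difference. The base R sketch below shows the distinction without data.table; the df2 here is hypothetical, chosen so that one row agrees on c1 but not on c2, and by_key / by_row are illustrative names.

```r
df1 <- data.frame(c1 = c("a", "b", "c", "d"), c2 = c(1, 2, 3, 4))
# Hypothetical df2: "c" matches on c1 but carries a different c2
df2 <- data.frame(c1 = c("c", "d"), c2 = c(99, 4))

# Key-only anti-join: drop df1 rows whose c1 appears anywhere in df2$c1
by_key <- df1[!df1$c1 %in% df2$c1, ]

# Whole-row difference: drop df1 rows only when every column matches
by_row <- df1[!duplicated(rbind(df2, df1))[-seq_len(nrow(df2))], ]

by_key$c1  # "a" "b"
by_row$c1  # "a" "b" "c"
```

Which one you want depends on whether c1 alone identifies a row in your data.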

I prefer the sqldf package:

require(sqldf)
sqldf("select * from df1 except select * from df2")

##   c1 c2
## 1  a  1
## 2  b  2

You can create identifier columns and then subset, e.g.:

df1 <- data.frame(c1=c("a","b","c","d"),c2=c(1,2,3,4), indf1 = rep("Y",4) )
df2 <- data.frame(c1=c("c","d","e","f"),c2=c(3,4,5,6),indf2 = rep("Y",4) )
merge(df1,df2)
#  c1 c2 indf1 indf2
#1  c  3     Y     Y
#2  d  4     Y     Y

bigdf <- merge(df1,df2,all=TRUE)
bigdf
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>
#3  c  3     Y     Y
#4  d  4     Y     Y
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

Then subset how you wish:

 bigdf[is.na(bigdf$indf1) ,]
#  c1 c2 indf1 indf2
#5  e  5  <NA>     Y
#6  f  6  <NA>     Y

 bigdf[is.na(bigdf$indf2) ,]  # <- the output you requested: rows of df1 not in df2
#  c1 c2 indf1 indf2
#1  a  1     Y  <NA>
#2  b  2     Y  <NA>
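
If you would rather not add indicator columns to the original data frames, the same idea can be wrapped in a small function. This is a sketch only: df_minus and the .in_y marker column are hypothetical names, and it assumes both data frames share the columns to compare (here c1 and c2) and that neither already has a column called .in_y.

```r
df1 <- data.frame(c1 = c("a", "b", "c", "d"), c2 = c(1, 2, 3, 4))
df2 <- data.frame(c1 = c("c", "d", "e", "f"), c2 = c(3, 4, 5, 6))

# Hypothetical helper: rows of x that have no identical row in y
df_minus <- function(x, y) {
  # Tag a local copy of y; the caller's data frame is not modified
  y$.in_y <- TRUE
  # Left join on the shared columns; unique() stops repeated y rows
  # from multiplying rows of x
  merged <- merge(x, unique(y), all.x = TRUE)
  # Unmatched rows of x have NA in the marker column
  merged[is.na(merged$.in_y), names(x), drop = FALSE]
}

res <- df_minus(df1, df2)
res
```

Keep in mind that merge() sorts its result by the join columns, so the rows may come back in a different order than in x.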
