How to merge two data.table by different column names?
You can also merge using multiple columns having different names. see example below
# create data frame authors
authors <- data.frame(
FirstName=c("Lorne", "Loren", "Robin",
"Robin", "Billy"),
LastName=c("Green", "Jaye", "Green",
"Howe", "Jaye"),
Age=c(82, 40, 45, 2, 40),
Income=c(1200000, 40000, 25000, 0, 27500),
Home=c("California", "Washington", "Washington",
"Alberta", "Washington"))
# create data frame books Note First name in authors is same as AuthorFirstname same thing with lastname.
books <- data.frame(
AuthorFirstName=c("Lorne", "Loren", "Loren",
"Loren", "Robin", "Rich"),
AuthorLastName=c("Green", "Jaye", "Jaye", "Jaye",
"Green", "Calaway"),
Book=c("Bonanza", "Midwifery", "Gardening",
"Perennials", "Who_dun_it?", "Support"))
merge(authors, books, by.x=c("FirstName", "LastName"),
by.y=c("AuthorFirstName", "AuthorLastName"),
all.x=TRUE)
As of data.table
version 1.9.6 (on CRAN on sep 2015) you can specify the by.x
and by.y
arguments in data.table::merge
merge(x=X, y=Y, by.x="id", by.y="ID")[]
# id area value price sales
#1: c001 US 100 500 20
#2: c002 UK 200 200 30
#3: c003 EU 300 400 15
However, in data.table 1.9.6 you can also specfy the on
argument in the X[Y]
notation
X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.
X[Y, on=c(id = "ID")]
# area id value price sales
#1: US c001 100 500 20
#2: UK c002 200 200 30
#3: EU c003 300 400 15
this answer by the data.table
author has more details
Merge fails when you use by.x
and by.y
with data.table
. Taking your data:
> merge(X,Y, by.x='id', by.y='ID')
Error in merge.data.table(X, Y, by.x = "id", by.y = "ID")
You can use data.table
with merge , but you need to use by
argument for joining (so rename the columns to have the same colnames
)
Y = setNames(Y,c('id','price','sales'))
This will still not work:
merge(X,Y, by.x='id', by.y='id')
Error in merge.data.table(X, Y, by.x = "id", by.y = "id") :
But this will work:
> merge(X,Y, by='id')
# id area value price sales
#1: c001 US 100 500 20
#2: c002 UK 200 200 30
#3: c003 EU 300 400 15
Alternatively, you would need to convert data.table
to data.frame
in order to use merge
with by.x
and by.y
arguments:
merge(data.frame(X), data.frame(Y), by.x='id', by.y='ID')
OUTDATED
Use this operation:
X[Y]
# area id value price sales
# 1: US c001 100 500 20
# 2: UK c002 200 200 30
# 3: EU c003 300 400 15
or this operation:
Y[X]
# ID price sales area value
# 1: c001 500 20 US 100
# 2: c002 200 30 UK 200
# 3: c003 400 15 EU 300
Edit after you edited your question, I read Section 1.12 of the FAQ: "What is the didifference between X[Y] and merge(X,Y)?", which led me to checkout ?merge
and I discovered there are two different merge functions depending upon which package you are using. The default is merge.data.frame
but data.table uses merge.data.table
. Compare
merge(X, Y, by.x = "id", by.y = "ID") # which is merge.data.table
# Error in merge.data.table(X, Y, by.x = "id", by.y = "ID") :
# A non-empty vector of column names for `by` is required.
with
merge.data.frame(X, Y, by.x = "id", by.y = "ID")
# id area value price sales
# 1 c001 US 100 500 20
# 2 c002 UK 200 200 30
# 3 c003 EU 300 400 15
Edit for completeness based upon a comment by @Michael Bernsteiner, it looks like the data.table
team is planning on implementing by.x
and by.y
into the merge.data.table
function, but hasn't done so yet.