Select only the first row when merging data frames with multiple matches
Here is another method using dplyr::distinct. It is useful if you want to keep all rows from 'data' even if there is no match.
data = data.frame(id=c(1,2,3,4,5),
state=c("KS","MN","AL","FL","CA"))
scores = data.frame(id=c(1,1,1,2,2,3,3,3),
score=c(66,75,78,86,85,76,75,90))
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T))
# Joining, by = "id"
# id state score
# 1 1 KS 66
# 2 2 MN 86
# 3 3 AL 76
# 4 4 FL NA
# 5 5 CA NA
Moreover, if you want to replace the NAs in the new data.frame, try the tidyr::replace_na() function. Example:
data %>% dplyr::left_join(dplyr::distinct(scores, id, .keep_all = T)) %>% tidyr::replace_na(replace = list("score"=0L))
# Joining, by = "id"
# id state score
# 1 1 KS 66
# 2 2 MN 86
# 3 3 AL 76
# 4 4 FL 0
# 5 5 CA 0
Here is a base R method using aggregate
and head
:
merge(data, aggregate(score ~ id, data=scores, head, 1), by="id")
The aggregate
function breaks up the scores dataframe by id, then head
is applied to get the first observation from each id. Since aggregate
returns a data.frame, this is directly merged onto the data.frame data.
Probably more efficient is to subset the scores data.frame using duplicated
which will achieve the same result as aggregate
, but will reduce the computational overhead.
merge(data, scores[!duplicated(scores$id),], by="id")
Using data.table
along with mult = "first"
and nomatch = 0L
:
require(data.table)
setDT(scores); setDT(data) # convert to data.tables by reference
scores[data, mult = "first", on = "id", nomatch=0L]
# id score state
# 1: 1 66 KS
# 2: 2 86 MN
# 3: 3 76 AL
For each row on data
's id
column, the matching rows in scores
' id
column are found, and the first one alone is retained (because mult = "first"
). If there are no matches, they're removed (because of nomatch = 0L
).