Is there a _merge indicator available after a merge?

Stata similarly creates a new variable _merge when doing any type of merge or join. I too find it helpful to have as an option in order to diagnose a merge quickly after performing it.

For the last few months I've been using basic functions I wrote that simply embellish the dplyr joins. There are probably more efficient ways of doing this, but here is an example of one that embellishes full_join. If you set the option .merge = T you'll get a variable, called .merge that is similar to _merge in Stata or Pandas. (This also just prints off a diagnostic message about how many cases matched and didn't match each time you use it.) I know you already have an answer to the question, but if you want a function you can use repeatedly that works identically to full_join in dplyr, here is a start. You obviously need dplyr loaded to make this work...

full_join_track <- function(x, y, by = NULL, suffix = c(".x", ".y"),
                        .merge = FALSE, ...){

# Checking to make sure used variable names are not already in use
if(".x_tracker" %in% names(x)){
    message("Warning: variable .x_tracker in left data was dropped")
}
if(".y_tracker" %in% names(y)){
    message("Warning: variable .y_tracker in right data was dropped")
}
if(.merge & (".merge" %in% names(x) | ".merge" %in% names(y))){
    stop("Variable .merge already exists; change name before proceeding")
}

# Adding simple merge tracker variables to data frames
x[, ".x_tracker"] <- 1
y[, ".y_tracker"] <- 1

# Doing full join
joined <- full_join(x, y, by = by, suffix = suffix,  ...)

# Calculating merge diagnoses 
matched <- joined %>%
    filter(!is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()
unmatched_x <- joined %>%
    filter(!is.na(.x_tracker) & is.na(.y_tracker)) %>%
    NROW()
unmatched_y <- joined %>%
    filter(is.na(.x_tracker) & !is.na(.y_tracker)) %>%
    NROW()

# Print merge diagnoses
message(
    unmatched_x, " Rows ONLY from left data frame", "\n",
    unmatched_y, " Rows ONLY from right data frame", "\n",
    matched, " Rows matched"
)

# Create .merge variable if specified
if(.merge){
    joined <- joined %>%
        mutate(.merge = 
                   case_when(
                       !is.na(.$.x_tracker) & is.na(.$.y_tracker) ~ "left_only",
                       is.na(.$.x_tracker) & !is.na(.$.y_tracker) ~ "right_only",
                       TRUE ~ "matched"
                       )
               )
}

# Dropping tracker variables and returning data frame
joined <- joined %>%
    select(-.x_tracker, -.y_tracker)
return(joined)
}

As an example:

data1 <- data.frame(x = 1:10, y = rnorm(10))
data2 <- data.frame(x = 4:20, z = rnorm(17))
full_join_track(data1, data2, .merge = T)

We create the 'merge' column based on inner_join, anti_join and then bind the rows with bind_rows

d1 <- inner_join(df1, df2, by = c('key1' = 'key2')) %>%
                    mutate(merge = "both")  
bind_rows(d1, anti_join(df1, df2, by = c('key1' = 'key2')) %>% 
             mutate(merge = 'left_only'))

Is there a _merge indicator available after a merge?

Tags:

R

Dplyr

Related

Recent Posts