How to do vlookup and fill down (like in Excel) in R?

If I understand your question correctly, here are four methods to do the equivalent of Excel's VLOOKUP and fill down using R:

# load sample data from Q
hous <- read.table(header = TRUE, 
                   stringsAsFactors = FALSE, 
text="HouseType HouseTypeNo
Semi            1
Single          2
Row             3
Single          2
Apartment       4
Apartment       4
Row             3")

# create a toy large table with a 'HouseType' column 
# but no 'HouseTypeNo' column (yet)
largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 1000, replace = TRUE)), stringsAsFactors = FALSE)

# create a lookup table to get the numbers to fill
# the large table
lookup <- unique(hous)
  HouseType HouseTypeNo
1      Semi           1
2    Single           2
3       Row           3
5 Apartment           4

Here are four methods to fill the HouseTypeNo in the largetable using the values in the lookup table:

First with merge in base:

# 1. using base 
base1 <- (merge(lookup, largetable, by = 'HouseType'))

A second method with named vectors in base:

# 2. using base and a named vector
housenames <- as.numeric(1:length(unique(hous$HouseType)))
names(housenames) <- unique(hous$HouseType)

base2 <- data.frame(HouseType = largetable$HouseType,
                    HouseTypeNo = (housenames[largetable$HouseType]))

Third, using the plyr package:

# 3. using the plyr package
library(plyr)
plyr1 <- join(largetable, lookup, by = "HouseType")

Fourth, using the sqldf package

# 4. using the sqldf package
library(sqldf)
sqldf1 <- sqldf("SELECT largetable.HouseType, lookup.HouseTypeNo
FROM largetable
INNER JOIN lookup
ON largetable.HouseType = lookup.HouseType")

If it's possible that some house types in largetable do not exist in lookup then a left join would be used:

sqldf("select * from largetable left join lookup using (HouseType)")

Corresponding changes to the other solutions would be needed too.

Is that what you wanted to do? Let me know which method you like and I'll add commentary.


I also like using qdapTools::lookup or shorthand binary operator %l%. It works identically to an Excel vlookup, but it accepts name arguments opposed to column numbers

## Replicate Ben's data:
hous <- structure(list(HouseType = c("Semi", "Single", "Row", "Single", 
    "Apartment", "Apartment", "Row"), HouseTypeNo = c(1L, 2L, 3L, 
    2L, 4L, 4L, 3L)), .Names = c("HouseType", "HouseTypeNo"), 
    class = "data.frame", row.names = c(NA, -7L))


largetable <- data.frame(HouseType = as.character(sample(unique(hous$HouseType), 
    1000, replace = TRUE)), stringsAsFactors = FALSE)


## It's this simple:
library(qdapTools)
largetable[, 1] %l% hous

I think you can also use match():

largetable$HouseTypeNo <- with(lookup,
                     HouseTypeNo[match(largetable$HouseType,
                                       HouseType)])

This still works if I scramble the order of lookup.


The poster didn't ask about looking up values if exact=FALSE, but I'm adding this as an answer for my own reference and possibly others.

If you're looking up categorical values, use the other answers.

Excel's vlookup also allows you to match match approximately for numeric values with the 4th argument(1) match=TRUE. I think of match=TRUE like looking up values on a thermometer. The default value is FALSE, which is perfect for categorical values.

If you want to match approximately (perform a lookup), R has a function called findInterval, which (as the name implies) will find the interval / bin that contains your continuous numeric value.

However, let's say that you want to findInterval for several values. You could write a loop or use an apply function. However, I've found it more efficient to take a DIY vectorized approach.

Let's say that you have a grid of values indexed by x and y:

grid <- list(x = c(-87.727, -87.723, -87.719, -87.715, -87.711), 
             y = c(41.836, 41.839, 41.843, 41.847, 41.851), 
             z = (matrix(data = c(-3.428, -3.722, -3.061, -2.554, -2.362, 
                                  -3.034, -3.925, -3.639, -3.357, -3.283, 
                                  -0.152, -1.688, -2.765, -3.084, -2.742, 
                                   1.973,  1.193, -0.354, -1.682, -1.803, 
                                   0.998,  2.863,  3.224,  1.541, -0.044), 
                         nrow = 5, ncol = 5)))

and you have some values you want to look up by x and y:

df <- data.frame(x = c(-87.723, -87.712, -87.726, -87.719, -87.722, -87.722), 
                 y = c(41.84, 41.842, 41.844, 41.849, 41.838, 41.842), 
                 id = c("a", "b", "c", "d", "e", "f")

Here is the example visualized:

contour(grid)
points(df$x, df$y, pch=df$id, col="blue", cex=1.2)

Contour Plot

You can find the x intervals and y intervals with this type of formula:

xrng <- range(grid$x)
xbins <- length(grid$x) -1
yrng <- range(grid$y)
ybins <- length(grid$y) -1
df$ix <- trunc( (df$x - min(xrng)) / diff(xrng) * (xbins)) + 1
df$iy <- trunc( (df$y - min(yrng)) / diff(yrng) * (ybins)) + 1

You could take it one step further and perform a (simplistic) interpolation on the z values in grid like this:

df$z <- with(df, (grid$z[cbind(ix, iy)] + 
                      grid$z[cbind(ix + 1, iy)] +
                      grid$z[cbind(ix, iy + 1)] + 
                      grid$z[cbind(ix + 1, iy + 1)]) / 4)

Which gives you these values:

contour(grid, xlim = range(c(grid$x, df$x)), ylim = range(c(grid$y, df$y)))
points(df$x, df$y, pch=df$id, col="blue", cex=1.2)
text(df$x + .001, df$y, lab=round(df$z, 2), col="blue", cex=1)

Contour plot with values

df
#         x      y id ix iy        z
# 1 -87.723 41.840  a  2  2 -3.00425
# 2 -87.712 41.842  b  4  2 -3.11650
# 3 -87.726 41.844  c  1  3  0.33150
# 4 -87.719 41.849  d  3  4  0.68225
# 6 -87.722 41.838  e  2  1 -3.58675
# 7 -87.722 41.842  f  2  2 -3.00425

Note that ix, and iy could have also been found with a loop using findInterval, e.g. here's one example for the second row

findInterval(df$x[2], grid$x)
# 4
findInterval(df$y[2], grid$y)
# 2

Which matches ix and iy in df[2]

Footnote: (1) The fourth argument of vlookup was previously called "match", but after they introduced the ribbon it was renamed to "[range_lookup]".

Tags:

Lookup

R