Purrr-Fection: In Search of An Elegant Solution to Conditional Data Frame Operations Leveraging Purrr

Really, you want to avoid calling geocode any more than necessary because it's slow and if you're using Google, you only have 2500 queries per day. Thus, it's best to make both columns from the same call, which can be done with a list column, making a new version of the data.frame with do, or a self-join.


1. With a list column

With a list column, you make a new version of lon and lat with ifelse, geocoding if there are NAs, else just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:

library(dplyr)
library(ggmap)
library(tidyr)    # For `unnest`

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Add a list column. If lon or lat are NA,
    mutate(data = ifelse(any(is.na(c(lon, lat))), 
                         # return a data.frame of the geocoded results,
                         list(geocode(paste(Street, City, State, Zip))), 
                         # else return a data.frame of existing columns.
                         list(data_frame(lon = lon, lat = lat)))) %>% 
    # Remove old columns
    select(-lon, -lat) %>% 
    # Unnest newly created ones from list column
    unnest(data)

## # A tibble: 6 × 6
##                 Street       City    State   Zip       lon      lat
##                  <chr>      <chr>    <chr> <dbl>     <dbl>    <dbl>
## 1        226 W 46th St   New York New York 10036 -73.98670 40.75902
## 2              5th Ave   New York New York 10022 -73.97491 40.76167
## 3          75 Broadway   New York New York 10006 -74.01205 40.70814
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915

2. With do

do, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $ notation, with . to represent the grouped data.frame piped in. Using if and else instead of ifelse lets you avoid nesting results in lists (which they had to be above, anyway).

       # Evaluate each row separately
df %>% rowwise() %>% 
    # Make a new data.frame from the first four columns and the geocode results or existing lon/lat
    do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
        geocode(paste(.[1:4], collapse = ' '))
    } else {
        .[5:6]
    }))

which returns exactly the same thing as the first version.


3. On a subset, recombining with a self-join

If the ifelse is overly confusing, you can just geocode a subset and then recombine by binding the rows to the anti_join, i.e. all the rows that are in df but not the subset .:

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>% 
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

which returns the same thing, but with the newly geocoded rows at the top. The same approach works with a list column or do, but since there's no need to combine two sets of columns, just bind_cols will do the trick.


4. On a subset with mutate_geocode

ggmap actually includes a mutate_geocode function that will add lon and lat columns when passed a data.frame and a column of addresses. It has an issue: it can't accept more than a column name for the address, and thus requires a single column with the entire address. Thus, while this version could be quite nice, it requires creating and deleting an extra column with the whole address, making it inconcise:

df %>% filter(is.na(lon) | is.na(lat)) %>% 
    select(1:4) %>% 
    mutate(address = paste(Street, City, State, Zip)) %>%    # make an address column
    mutate_geocode(address) %>% 
    select(-address) %>%    # get rid of address column
    bind_rows(anti_join(df, ., by = c('Street', 'Zip')))

##                 Street       City    State   Zip       lon      lat
## 1              5th Ave   New York New York 10022 -73.97491 40.76167
## 2  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza   New York New York 10111 -73.97771 40.75915
## 4          350 5th Ave   New York New York 10118 -73.98566 40.74871
## 5          75 Broadway   New York New York 10006 -74.01205 40.70814
## 6        226 W 46th St   New York New York 10036 -73.98670 40.75902

5. Base R

Base R can assign to a subset directly, which makes the idiom here much simpler, even if it requires a lot of subsetting:

df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <- geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])

Results are the same as the first version.


All versions only call geocode twice.

Note that while you could use purrr for the job, it's not particularly better suited than regular dplyr. purrr excels at dealing with lists, and while a list column is one option, it doesn't really have to be manipulated.


I'm not sure abut purrr but here's the following using the pipe:

df <- frame_data(
  ~Street,                ~City,        ~State,     ~Zip,  ~lon,      ~lat,
  "226 W 46th St",        "New York",   "New York", 10036, -73.9867,  40.75902,
  "5th Ave",              "New York",   "New York", 10022, NA,        NA,
  "75 Broadway",          "New York",   "New York", 10006, -74.01205, 40.70814,
  "350 5th Ave",          "New York",   "New York", 10118, -73.98566, 40.74871,
  "20 Sagamore Hill Rd",  "Oyster Bay", "New York", 11771, NA,        NA,
  "45 Rockefeller Plaza", "New York",   "New York", 10111, -73.97771, 40.75915
)

df2<-df %>%
  filter(is.na(lon) | is.na(lat)) %>%
  group_by(Street, City, State) %>% #not really necessary but it suppresses a warning
  mutate(lon=ifelse(is.na(lon) | is.na(lat), 
    geocode(paste(Street, City,State, sep=" ")), 0)) %>%
  mutate(lat=ifelse(is.na(lon) | is.na(lat), 
    rev(geocode(paste(Street, City,State, sep=" "))), 0))

If you want the partial output like in your example code above:

as.data.frame(df2)[,5:6]
       lon       lat
1 40.77505 -73.96515
2 40.88259 -73.50538

Or include all columns:

as.data.frame(df2)
              Street       City    State   Zip      lon       lat
1             5th Ave   New York New York 10022 40.77505 -73.96515
2 20 Sagamore Hill Rd Oyster Bay New York 11771 40.88259 -73.50538

And if you want to combine your original data with the new data you can do the following:

as.data.frame(rbind(filter(df, !is.na(lon) | !is.na(lat)),df2 ))
                Street       City    State   Zip       lon       lat
1        226 W 46th St   New York New York 10036 -73.98670  40.75902
2          75 Broadway   New York New York 10006 -74.01205  40.70814
3          350 5th Ave   New York New York 10118 -73.98566  40.74871
4 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915
5              5th Ave   New York New York 10022  40.77505 -73.96515
6  20 Sagamore Hill Rd Oyster Bay New York 11771 -73.96515  40.77505

...Or you can streamline it all in one like in the following (keeps original order):

df2<-df %>%
  #group_by(Street, City, State) %>% # unescape if you want to suppress warning
  mutate(lon=ifelse(is.na(lon) | is.na(lat), 
    geocode(paste(Street, City,State, sep=" ")), lon)) %>%
  mutate(lat=ifelse(is.na(lon) | is.na(lat), 
    rev(geocode(paste(Street, City,State, sep=" "))), lat))

as.data.frame(df2)
                Street       City    State   Zip       lon       lat
1        226 W 46th St   New York New York 10036 -73.98670  40.75902
2              5th Ave   New York New York 10022 -73.98670  40.75902
3          75 Broadway   New York New York 10006 -74.01205  40.70814
4          350 5th Ave   New York New York 10118 -73.98566  40.74871
5  20 Sagamore Hill Rd Oyster Bay New York 11771  40.75902 -73.98670
6 45 Rockefeller Plaza   New York New York 10111 -73.97771  40.75915