Purrr-Fection: In Search of An Elegant Solution to Conditional Data Frame Operations Leveraging Purrr
Really, you want to avoid calling geocode
any more than necessary because it's slow and if you're using Google, you only have 2500 queries per day. Thus, it's best to make both columns from the same call, which can be done with a list column, making a new version of the data.frame with do
, or a self-join.
1. With a list column
With a list column, you make a new version of lon
and lat
with ifelse
, geocoding if there are NA
s, else just copying the existing values. Afterwards, get rid of the old versions of the columns and unnest the new ones:
library(dplyr)
library(ggmap)
library(tidyr) # For `unnest`
# Evaluate each row separately
df %>% rowwise() %>%
# Add a list column. If lon or lat are NA,
mutate(data = ifelse(any(is.na(c(lon, lat))),
# return a data.frame of the geocoded results,
list(geocode(paste(Street, City, State, Zip))),
# else return a data.frame of existing columns.
list(data_frame(lon = lon, lat = lat)))) %>%
# Remove old columns
select(-lon, -lat) %>%
# Unnest newly created ones from list column
unnest(data)
## # A tibble: 6 × 6
## Street City State Zip lon lat
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 226 W 46th St New York New York 10036 -73.98670 40.75902
## 2 5th Ave New York New York 10022 -73.97491 40.76167
## 3 75 Broadway New York New York 10006 -74.01205 40.70814
## 4 350 5th Ave New York New York 10118 -73.98566 40.74871
## 5 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 6 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
2. With do
do
, on the other hand, creates a wholly new data.frame from pieces of the old one. It requires slightly clunky $
notation, with .
to represent the grouped data.frame piped in. Using if
and else
instead of ifelse
lets you avoid nesting results in lists (which they had to be above, anyway).
# Evaluate each row separately
df %>% rowwise() %>%
# Make a new data.frame from the first four columns and the geocode results or existing lon/lat
do(bind_cols(.[1:4], if(any(is.na(c(.$lon, .$lat)))){
geocode(paste(.[1:4], collapse = ' '))
} else {
.[5:6]
}))
which returns exactly the same thing as the first version.
3. On a subset, recombining with a self-join
If the ifelse
is overly confusing, you can just geocode a subset and then recombine by binding the rows to the anti_join
, i.e. all the rows that are in df
but not the subset .
:
df %>% filter(is.na(lon) | is.na(lat)) %>%
select(1:4) %>%
bind_cols(geocode(paste(.$Street, .$City, .$State, .$Zip))) %>%
bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
which returns the same thing, but with the newly geocoded rows at the top. The same approach works with a list column or do
, but since there's no need to combine two sets of columns, just bind_cols
will do the trick.
4. On a subset with mutate_geocode
ggmap
actually includes a mutate_geocode
function that will add lon and lat columns when passed a data.frame and a column of addresses. It has an issue: it can't accept more than a column name for the address, and thus requires a single column with the entire address. Thus, while this version could be quite nice, it requires creating and deleting an extra column with the whole address, making it inconcise:
df %>% filter(is.na(lon) | is.na(lat)) %>%
select(1:4) %>%
mutate(address = paste(Street, City, State, Zip)) %>% # make an address column
mutate_geocode(address) %>%
select(-address) %>% # get rid of address column
bind_rows(anti_join(df, ., by = c('Street', 'Zip')))
## Street City State Zip lon lat
## 1 5th Ave New York New York 10022 -73.97491 40.76167
## 2 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.50538 40.88259
## 3 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
## 4 350 5th Ave New York New York 10118 -73.98566 40.74871
## 5 75 Broadway New York New York 10006 -74.01205 40.70814
## 6 226 W 46th St New York New York 10036 -73.98670 40.75902
5. Base R
Base R can assign to a subset directly, which makes the idiom here much simpler, even if it requires a lot of subsetting:
df[is.na(df$lon) | is.na(df$lat), c('lon', 'lat')] <- geocode(paste(df$Street, df$City, df$State, df$Zip)[is.na(df$lon) | is.na(df$lat)])
Results are the same as the first version.
All versions only call geocode
twice.
Note that while you could use purrr
for the job, it's not particularly better suited than regular dplyr
. purrr
excels at dealing with lists, and while a list column is one option, it doesn't really have to be manipulated.
I'm not sure abut purrr
but here's the following using the pipe:
df <- frame_data(
~Street, ~City, ~State, ~Zip, ~lon, ~lat,
"226 W 46th St", "New York", "New York", 10036, -73.9867, 40.75902,
"5th Ave", "New York", "New York", 10022, NA, NA,
"75 Broadway", "New York", "New York", 10006, -74.01205, 40.70814,
"350 5th Ave", "New York", "New York", 10118, -73.98566, 40.74871,
"20 Sagamore Hill Rd", "Oyster Bay", "New York", 11771, NA, NA,
"45 Rockefeller Plaza", "New York", "New York", 10111, -73.97771, 40.75915
)
df2<-df %>%
filter(is.na(lon) | is.na(lat)) %>%
group_by(Street, City, State) %>% #not really necessary but it suppresses a warning
mutate(lon=ifelse(is.na(lon) | is.na(lat),
geocode(paste(Street, City,State, sep=" ")), 0)) %>%
mutate(lat=ifelse(is.na(lon) | is.na(lat),
rev(geocode(paste(Street, City,State, sep=" "))), 0))
If you want the partial output like in your example code above:
as.data.frame(df2)[,5:6]
lon lat
1 40.77505 -73.96515
2 40.88259 -73.50538
Or include all columns:
as.data.frame(df2)
Street City State Zip lon lat
1 5th Ave New York New York 10022 40.77505 -73.96515
2 20 Sagamore Hill Rd Oyster Bay New York 11771 40.88259 -73.50538
And if you want to combine your original data with the new data you can do the following:
as.data.frame(rbind(filter(df, !is.na(lon) | !is.na(lat)),df2 ))
Street City State Zip lon lat
1 226 W 46th St New York New York 10036 -73.98670 40.75902
2 75 Broadway New York New York 10006 -74.01205 40.70814
3 350 5th Ave New York New York 10118 -73.98566 40.74871
4 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915
5 5th Ave New York New York 10022 40.77505 -73.96515
6 20 Sagamore Hill Rd Oyster Bay New York 11771 -73.96515 40.77505
...Or you can streamline it all in one like in the following (keeps original order):
df2<-df %>%
#group_by(Street, City, State) %>% # unescape if you want to suppress warning
mutate(lon=ifelse(is.na(lon) | is.na(lat),
geocode(paste(Street, City,State, sep=" ")), lon)) %>%
mutate(lat=ifelse(is.na(lon) | is.na(lat),
rev(geocode(paste(Street, City,State, sep=" "))), lat))
as.data.frame(df2)
Street City State Zip lon lat
1 226 W 46th St New York New York 10036 -73.98670 40.75902
2 5th Ave New York New York 10022 -73.98670 40.75902
3 75 Broadway New York New York 10006 -74.01205 40.70814
4 350 5th Ave New York New York 10118 -73.98566 40.74871
5 20 Sagamore Hill Rd Oyster Bay New York 11771 40.75902 -73.98670
6 45 Rockefeller Plaza New York New York 10111 -73.97771 40.75915