Efficient spatial joining for large datasets in R
First of all, making a function more efficient can mean speeding up how quickly the computer carries out that action (algorithmic efficiency). And...
Efficient R programming is the implementation of efficient programming practices in R. All languages are different, so efficient R code does not look like efficient code in another language. Many packages have been optimised for performance so, for some operations, achieving maximum computational efficiency may simply be a case of selecting the appropriate package and using it correctly. There are many ways to get the same result in R, and some are very slow. Therefore not writing slow code should be prioritized over writing fast code. (Taken from the book Efficient R Programming)
So, performing more efficient spatial joins on large data sets could simply mean writing faster code. In this case, I assume that the over function from the sp package has already been optimized for performance, so I won't write another function myself or look for another R package.
Instead, I will show how to go parallel in R and speed up the process by using all the CPU cores in your computer. Please try the commented, reproducible code example below:
Load data
# Load libraries ('rgdal' for reading shapefiles, 'sp' for over() and the Spatial* classes)
library(rgdal)
library(sp)
# Load data
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")
# Read the ecoregions shapefile
ices_eco <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land",
                           verbose = FALSE)
# Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
df <- expand.grid(lon = lon, lat = lat)
df$area <- NA # Add an empty attribute called "area" to hold the ecoregion assigned later
# Create a SpatialPointsDataFrame object from the data.frame
coordinates(df) <- c("lon", "lat")
# Add projection to SpatialPointsDataFrame object
proj4string(df) <- ices_eco@proj4string
Aim: get the Ecoregion from the ices_eco SpatialPolygonsDataFrame object for each position in the df SpatialPointsDataFrame object.
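For reference, the straightforward single-core approach is a single over() call. Timing it on a subset gives a rough baseline to compare the parallel version against (a minimal sketch; the full 361,722-point join on one core can take a long time, so scale up with care):
# Single-core baseline (sketch): time the plain sp::over() call on a subset
# of the points to get a rough idea of the sequential speed
system.time(
  baselineChk <- over(df[1:10000, ], ices_eco)  # first 10,000 points only
)
head(as.character(baselineChk$Ecoregion))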
# Parallel process: using multiple CPU cores
# Load the 'parallel' package, which supports parallel computation in R
library('parallel')
# Calculate the number of cores (leave one core free for your operating system)
no_cores <- detectCores() - 1
# Cut df in "n" parts
# Higher "n": less memory required but slower
# Lower "n": more memory required but faster
n <- 1000
parts <- split(x = 1:length(df), f = cut(1:length(df), n))
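A quick optional check that the split covers every point exactly once:
# Optional sanity check on the split: "n" chunks that together cover all points once
length(parts)                      # should equal n
sum(lengths(parts)) == length(df)  # should be TRUE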
Initiate the cluster like this if you are on macOS or Linux:
# Initiate cluster of CPU cores
# Note: all objects used in the parallel process (e.g. ices_eco, df, n, parts)
# must be defined before making the cluster
cl <- makeCluster(no_cores, type = "FORK")
print(cl) # summary of the cluster
Initiate the cluster like this if you are on Windows:
# Initiate cluster of CPU cores
# Note: all objects used in the parallel process (e.g. ices_eco, df, n, parts)
# must be defined before making the cluster
cl <- makeCluster(no_cores, type = "PSOCK")
# Load the required libraries on the cluster workers
clusterEvalQ(cl = cl, expr = library('sp'))
# Export all the objects required by the function to the workers
clusterExport(cl = cl, varlist = c("ices_eco", "df", "parts", "n"))
print(cl) # summary of the cluster
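Either way, before launching the heavy job you can optionally confirm that the sp package and the required objects are actually available on every worker:
# Optional check: each worker should return TRUE
clusterEvalQ(cl = cl, expr = exists("ices_eco") && "sp" %in% loadedNamespaces())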
Continue running the parallel function
# Parallelize the sp::over function
# Returns a list with the overlays
system.time(
  overParts <- parLapply(cl = cl, X = 1:n, fun = function(x) {
    ovr <- over(df[parts[[x]], ], ices_eco)
    gc() # release memory
    return(ovr)
  })
)
# user system elapsed
# 1.050 1.150 627.111
# Stop Cluster of CPU cores
stopCluster(cl)
# Merge df with ecoregions
for (i in 1:n) {
  message(paste("Merging part", i, "of", n))
  df$area[parts[[i]]] <- as.character(overParts[[i]]$Ecoregion)
}
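Because parts was built with cut() over 1:length(df), the chunks are contiguous and keep the original order, so an equivalent loop-free merge is to bind all the overlays at once; a sketch:
# Alternative merge (sketch): rbind the overlays in order and assign in one step
df$area <- as.character(do.call(rbind, overParts)$Ecoregion)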
Control check
# Control check by random sampling of 20 elements
randomSampling <- sample(x = 1:length(df), size = 20)
chkA <- as.character(over(df[randomSampling,], ices_eco, returnList = FALSE)$Ecoregion) # direct method
chkB <- df$area[randomSampling] # sample df
# chkA should be equal to chkB
print(cbind(chkA, chkB))
# chkA chkB
# [1,] NA NA
# [2,] "Baltic Sea" "Baltic Sea"
# [3,] NA NA
# [4,] "Baltic Sea" "Baltic Sea"
# [5,] "Baltic Sea" "Baltic Sea"
# [6,] NA NA
# [7,] NA NA
# [8,] "Celtic Seas" "Celtic Seas"
# [9,] NA NA
# [10,] NA NA
# [11,] "Baltic Sea" "Baltic Sea"
# [12,] "Oceanic Northeast Atlantic" "Oceanic Northeast Atlantic"
# [13,] "Greater North Sea" "Greater North Sea"
# [14,] NA NA
# [15,] NA NA
# [16,] "Faroes" "Faroes"
# [17,] NA NA
# [18,] NA NA
# [19,] NA NA
# [20,] "Baltic Sea" "Baltic Sea"
Note: if you can access more than one computer, you can use a cluster of machines, all working in parallel (a minimal sketch follows the links below).
More information here:
- how-to-go-parallel-in-r-basics-tips
- parallel-computing-in-r
- running-r-jobs-quickly-on-many-machines
- Parallelizing and Clustering example in R
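As a minimal sketch of such a multi-machine setup (the host names below are placeholders; each machine needs R plus the required packages installed and must be reachable via passwordless SSH from the master):
# Hypothetical multi-machine PSOCK cluster: "node1" and "node2" are placeholder host names
library(parallel)
cl <- makeCluster(c("localhost", "node1", "node2"), type = "PSOCK")
print(cl)
# ...then clusterEvalQ(), clusterExport() and parLapply() exactly as above...
stopCluster(cl)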
If you can live with some slight "inaccuracies" on the borders of the polygons, you can achieve very fast processing by rasterizing the shapefile and then extracting the value of the raster cell under each point using raster::extract.
Something like this would work, and it outputs a SpatialPointsDataFrame.
library(rgdal)
library(gdalUtils)
library(sf)
library(raster)
library(dplyr)    # provides the pipe and first(), used below
library(lazyeval)
library(tidyr)
download.file("http://gis.ices.dk/shapefiles/ICES_ecoregions.zip",
              destfile = "ICES_ecoregions.zip")
unzip("ICES_ecoregions.zip")
# Read the ecoregions shapefile
ices_eco <- sf::read_sf("ICES_ecoregions_20150113_no_land.shp")
## Make a large data.frame (361,722 rows) with positions in the North Sea:
lon <- seq(-18.025, 32.025, by = 0.05)
lat <- seq(48.025, 66.025, by = 0.05)
grd <- expand.grid(lon = lon, lat = lat)  # avoid "c" as a name: it masks base::c()
# Get the Ecoregion for each position
pings <- SpatialPointsDataFrame(grd,
                                data = data.frame(id = 1:dim(grd)[1]),
                                proj4string = CRS(st_crs(ices_eco)$proj4string))
extr_data = list()
temprast <- tempfile(fileext = ".tif")
tempshape <- tempfile(fileext = ".shp")
counter = 1
for (poly_n in seq_along(ices_eco$Ecoregion)) {
  message("Working on polygon: ", poly_n)
  # Extract one polygon and rasterize it at 0.01-degree resolution
  subshape <- ices_eco[poly_n, ]
  st_write(subshape, tempshape, update = TRUE, quiet = TRUE)
  # Crop the points to the bounding box of the polygon to save time
  subpoints <- raster::crop(pings, extent(as.numeric(st_bbox(subshape))[c(1, 3, 2, 4)]))
  # If there are points left after cropping, extract the value of the
  # rasterized polygon for each point
  if (!is.null(subpoints)) {
    rast <- gdal_rasterize(tempshape, temprast, burn = poly_n,
                           tr = c(0.01, 0.01), a_nodata = -999,
                           output_Raster = TRUE)
    extr_points <- raster::extract(rast, subpoints, sp = TRUE)
    extr_points <- extr_points[which(extr_points@data[, 2] == poly_n), ] %>%
      as("sf")
    # If any points fell inside the polygon, store them in the list;
    # otherwise move on to the next polygon
    if (dim(extr_points)[1] != 0) {
      extr_data[[counter]] <- extr_points
      counter <- counter + 1
    }
  }
  unlink(temprast)
  unlink(tempshape)
}
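After the loop, a quick way to see how many polygons actually contained any of the grid points:
# Number of polygons for which at least one point was extracted
length(extr_data)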
# Do some juggling to remove duplicates and go back to a SpatialPointsDataFrame
out <- data.table::rbindlist(extr_data) %>%
  tibble::as_tibble() %>%
  # The following two lines are needed to remove a few "duplicated" points. I think you get
  # those for points falling on the boundary between two cells; increasing the resolution
  # of the rasterization could avoid this
  dplyr::group_by(id) %>%
  dplyr::summarize_(ecoregion = interp(~first(var), var = as.name(names(extr_data[[1]])[2])),
                    geometry = interp(~vargeo[1], vargeo = as.name("geometry"))) %>%
  # Transform back to a SpatialPointsDataFrame
  sf::st_as_sf() %>%
  as("Spatial")
# Add the extracted info to the original SpatialPointsDataFrame
pings@data <- dplyr::left_join(pings@data, out@data)
head(pings@data)
summary(pings)
> head(pings@data)
id ecoregion
1 1 17
2 2 17
3 3 17
4 4 17
5 5 17
6 6 17
> summary(pings)
Object of class SpatialPointsDataFrame
Coordinates:
min max
lon -18.025 32.025
lat 48.025 66.025
Is projected: FALSE
proj4string :
[+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0]
Number of points: 361722
Data attributes:
id ecoregion
Min. : 1 Min. : 9.00
1st Qu.: 90431 1st Qu.: 9.00
Median :180862 Median :11.00
Mean :180862 Mean :11.86
3rd Qu.:271292 3rd Qu.:15.00
Max. :361722 Max. :17.00
NA's :180720
On my PC, this completes in about one minute. Note that in the code above I also process one polygon at a time, to further increase speed and avoid creating a huge temporary raster file. Additionally, I use the sf functions for reading and writing shapefiles, which are much faster than the rgdal ones.
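If you want to verify that difference on your own machine, a quick optional timing of the two readers (results will vary with your GDAL version and disk) looks like this:
# Optional comparison of shapefile read times: sf vs rgdal
system.time(eco_sf <- sf::read_sf("ICES_ecoregions_20150113_no_land.shp"))
system.time(eco_sp <- rgdal::readOGR(".", "ICES_ecoregions_20150113_no_land", verbose = FALSE))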
Also, I'm rasterizing the shapefile at a 0.01 x 0.01 degree resolution. You can reduce the likelihood of boundary "errors" by increasing the resolution of the rasterization, i.e. changing the tr parameter in
rast <- gdal_rasterize(tempshape, temprast, burn = poly_n, tr = c(0.01, 0.01), a_nodata = -999, output_Raster = TRUE)
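For example, a hypothetical run at twice the resolution (0.005-degree cells, so roughly four times as many cells, more memory and more time) would be:
# Hypothetical higher-resolution variant: smaller cells reduce boundary "errors"
# at the cost of a larger temporary raster and longer processing time
rast <- gdal_rasterize(tempshape, temprast, burn = poly_n,
                       tr = c(0.005, 0.005), a_nodata = -999,
                       output_Raster = TRUE)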
HTH !