how to overlay a polygon over SpatialPointsDataFrame and preserving the SPDF data?
From the sp::over
help:
x = "SpatialPoints", y = "SpatialPolygons" returns a numeric
vector of length equal to the number of points; the number is
the index (number) of the polygon of ‘y’ in which a point
falls; NA denotes the point does not fall in a polygon; if a
point falls in multiple polygons, the last polygon is
recorded.
So if you convert your SpatialPolygonsDataFrame
to SpatialPolygons
you get back a vector of indexes and you can subset your points on NA
:
> over(pts,as(ply,"SpatialPolygons"))
[1] NA 1 1 NA 1 1 NA NA 1 1 1 NA NA 1 1 1 1 1 NA NA NA 1 NA 1 NA
[26] 1 1 1 NA NA NA NA NA 1 1 NA NA NA 1 1 1 NA 1 1 1 NA NA NA 1 1
[51] 1 NA NA NA 1 NA 1 NA 1 NA NA 1 NA 1 1 NA 1 1 NA 1 NA 1 1 1 1
[76] 1 1 1 1 1 NA NA NA 1 NA 1 NA NA NA NA 1 1 NA 1 NA NA 1 1 1 NA
> nrow(pts)
[1] 100
> pts = pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),]
> nrow(pts)
[1] 54
> head(pts@data)
var1 var2
2 0.04001092 v
3 0.58108350 v
5 0.85682609 q
6 0.13683264 y
9 0.13968804 m
10 0.97144627 o
>
For the doubters, here's evidence that the conversion overhead is not a problem:
Two functions - first Jeffrey Evans' method, then my original, then my hacked conversion, then a version based on gIntersects
based on Josh O'Brien's answer:
evans <- function(pts,ply){
prid <- over(pts,ply)
ptid <- na.omit(prid)
pt.poly <- pts[as.numeric(as.character(row.names(ptid))),]
return(pt.poly)
}
rowlings <- function(pts,ply){
return(pts[!is.na(over(pts,as(ply,"SpatialPolygons"))),])
}
rowlings2 <- function(pts,ply){
class(ply) <- "SpatialPolygons"
return(pts[!is.na(over(pts,ply)),])
}
obrien <- function(pts,ply){
pts[apply(gIntersects(columbus,pts,byid=TRUE),1,sum)==1,]
}
Now for a real-world example, I've scattered some random points over the columbus
data set:
require(spdep)
example(columbus)
pts=data.frame(
x=runif(100,5,12),
y=runif(100,10,15),
z=sample(letters,100,TRUE))
coordinates(pts)=~x+y
Looks good
plot(columbus)
points(pts)
Check the functions are doing the same thing:
> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] TRUE
And run 500 times for benchmarking:
> system.time({for(i in 1:500){evans(pts,columbus)}})
user system elapsed
7.661 0.600 8.474
> system.time({for(i in 1:500){rowlings(pts,columbus)}})
user system elapsed
6.528 0.284 6.933
> system.time({for(i in 1:500){rowlings2(pts,columbus)}})
user system elapsed
5.952 0.600 7.222
> system.time({for(i in 1:500){obrien(pts,columbus)}})
user system elapsed
4.752 0.004 4.781
As per my intuition, its not a great overhead, in fact it might be less of an overhead than converting all the row indexes to character and back, or running na.omit to get missing values. Which incidentally leads to another failure mode of the evans
function...
If a row of the polygons data frame is all NA
(which is perfectly valid), then the overlay with SpatialPolygonsDataFrame
for points in that polygon will produce an output data frame with all NA
s, which evans()
will then drop:
> columbus@data[1,]=rep(NA,20)
> columbus@data[5,]=rep(NA,20)
> columbus@data[17,]=rep(NA,20)
> columbus@data[15,]=rep(NA,20)
> set.seed(123)
> pts=data.frame(x=runif(100,5,12),y=runif(100,10,15),z=sample(letters,100,TRUE))
> coordinates(pts)=~x+y
> identical(evans(pts,columbus),rowlings(pts,columbus))
[1] FALSE
> dim(evans(pts,columbus))
[1] 27 1
> dim(rowlings(pts,columbus))
[1] 28 1
>
BUT gIntersects
is faster, even with having to sweep the matrix to check intersections in R rather than in C code. I suspect its the prepared geometry
skills of GEOS, creating spatial indexes - yeah, with prepared=FALSE
it takes a bit longer, about 5.5 seconds.
I'm surprised there isn't a function to either straight return the indices or the points. When I wrote splancs
20 years ago the point-in-polygon functions had both...
sp
provides a shorter form to select features based on spatial intersection, following the OP example:
pts[ply,]
as of:
points(pts[ply,], col = 'red')
Behind the scenes this is short for
pts[!is.na(over(pts, geometry(ply))),]
The thing to note is that there is a geometry
method that drops attributes: over
changes behaviour if its second argument has attributes or not (this was OP's confusion). This works accross all Spatial* classes in sp
, although some over
methods require rgeos
, see this vignette for details, e.g. the case of multiple matches for overlapping polygons.
You were on the right track with over. The rownames of the returned object correspond to the row index of the points. You can implement your exact approach with just a few addition lines of code.
library(sp)
set.seed(357)
pts <- data.frame(x=rnorm(100), y=rnorm(100), var1=runif(100),
var2=sample(letters, 100, replace=TRUE))
coordinates(pts) <- ~ x + y
ply <- matrix(c(-1,-1, 1,-1, 1,1, -1,1, -1,-1), ncol=2, byrow=TRUE)
ply <- SpatialPolygons(list(Polygons(list(Polygon(ply)), ID=1)))
ply <- SpatialPolygonsDataFrame(Sr=ply, data=data.frame(polyvar=357))
# Subset points intersecting polygon
prid <- over(pts,ply)
ptid <- na.omit(prid)
pt.poly <- pts[as.numeric(as.character(row.names(ptid))),]
plot(pts)
axis(1); axis(2)
plot(ply, add=TRUE, border="red")
plot(pt.poly,pch=19,add=TRUE)