How to perform Random Forest land cover classification?
I am not sure that I understand what you mean by "collect" data. If you are referring to heads-up digitizing and assignment of classes, this is best done in a GIS. There are many free options that would be suitable (i..e, QGIS, GRASS). Ideally you would have field data to train your classification.
The procedure for classification using Random Forests is fairly straight forward. You can read in your training data (i.e., a point shapefile) using "rgdal" or "maptools", read in your spectral data using raster::stack
, assign the raster values to your training points using raster:extract
and then pass this to randomForest
. You will need to coerce your "class" column into a factor to have RF recognize the model as a classification instance. Once you have a fit model you can use the predict function, passing it you raster stack. You will need to pass the standard arguments to predict in addition to ones specific to the raster predict function. The raster package has the ability to handle rasters "out of memory" and as such is memory safe, even with very large rasters. One of the arguments in the raster predict function is "filename" allowing for a raster to written to disk. For a multiclass problem you will need to set type="response" and index=1 which will output an integer raster of your classes.
There are a few caveats that should be noted:
- You cannot have more than 32 levels in your response variable (y) or any factor on the right side of the equation (x)
- Your classes must be balanced. A 30% rule is a good one to follow, that is if you have more than 30% more observations on one class than any other your problem becomes imbalanced and the results can be biased
- It is a misnomer that RF cannot overfit. If you over correlate your ensemble you can overfit the model. A good way to avoid this is to run a preliminary model and plot the error stabilization. As a rule of thumb, I choose 2X the number of bootstraps required to stabilize the error for the ntree parameter. This is because variable interaction stabilizes at a slower rate than error. If you are not including many variables in the model you can be much more conservative with this parameter.
- Do not use node purity as a measure of variable importance. It is not permuted like the mean decrease in accuracy.
I have functions for model selection, class imbalance and validation in the rfUtilities package available on CRAN.
Here is some simple code to get you started.
require(sp)
require(rgdal)
require(raster)
require(randomForest)
# CREATE LIST OF RASTERS
rlist=list.files(getwd(), pattern="img$", full.names=TRUE)
# CREATE RASTER STACK
xvars <- stack(rlist)
# READ POINT SHAPEFILE TRAINING DATA
sdata <- readOGR(dsn=getwd() layer=inshape)
# ASSIGN RASTER VALUES TO TRAINING DATA
v <- as.data.frame(extract(xvars, sdata))
sdata@data = data.frame(sdata@data, v[match(rownames(sdata@data), rownames(v)),])
# RUN RF MODEL
rf.mdl <- randomForest(x=sdata@data[,3:ncol(sdata@data)], y=as.factor(sdata@data[,"train"]),
ntree=501, importance=TRUE)
# CHECK ERROR CONVERGENCE
plot(rf.mdl)
# PLOT mean decrease in accuracy VARIABLE IMPORTANCE
varImpPlot(rf.mdl, type=1)
# PREDICT MODEL
predict(xvars, rf.mdl, filename="RfClassPred.img", type="response",
index=1, na.rm=TRUE, progress="window", overwrite=TRUE)
I know that this thread is a little old, but for anyone wanting to try classification of remote sensing data in R
, a very promising new package has been released.
install.packages("RSToolbox")
It comes with functions for both unsupervised and supervised classification (using random forests). More information can be found here - http://bleutner.github.io/RStoolbox/
Here and here are tutorials on supervised classification / regression with R, which includes a RandomForest examples.