Random forest modelling

CODE:

Random Forest modelling

Introduction
Data preparation
Model fitting
Model goodness of fit
Applying the model spatially

Introduction

Here we will take another look at the Random Forest, which you may be familiar with now as this model type was examined during the continuous variable prediction methods section.

This model type can also be used for categorical variables. Some useful extractor functions like print and importance give some useful information about the model performance.

Data Preparation

As described earlier, the data to be used for the following modelling exercises are Terron classes as sampled from the map presented in Malone et al. 2014.The sample data contains 1000 cases of which there are 12 different Terron classes. Before getting into the modeling, we first load in the data and then perform the covariate layer intersection using the suite of environmental variables sourced from the ithir package.

The small selection of covariates cover an area of approximately 220km² at a spatial resolution of 25m. All these variables are derived primarily from an available digital elevation model.

## Load R libraries
library(ithir)
library(sf)
library(terra)
library(randomForest)

# point data of terron classifications
data(hvTerronDat)

# environmental covariates grids (covariate rasters from ithir package)
hv.rasters <- list.files(path = system.file("extdata/", package = "ithir"), pattern = "hunterCovariates_A", full.names = T)
hv.rasters

# read them into R as SpatRaster objects
hv.rasters <- terra::rast(hv.rasters)

Transform the hvTerronDat data to a sf object or simple feature collection which will permit spatial analysis to take place.

## coerce point data to spatial object
hvTerronDat <- sf::st_as_sf(x = hvTerronDat, coords = c("x", "y"))

As these data are of the same spatial projection as the raster data, there is no need to perform a coordinate transformation. So we can perform the intersection immediately.

## Covariate data extraction
DSM_data <- terra::extract(x = hv.rasters, y = hvTerronDat, bind = T, method = "simple")
DSM_data <- as.data.frame(DSM_data)
str(DSM_data)

## 'data.frame':    1000 obs. of  9 variables:

It is always good practice to check to see if any of the observational data returned any NA values for any one of the covariates. If there is NA values, it indicates that the observational data is outside the extent of the covariate layers. It is best to remove these observations from the data set.

## remove missing data
which(!complete.cases(DSM_data))

## integer(0)

DSM_data <- DSM_data[complete.cases(DSM_data), ]

We will also subset the data for an out-of-bag model evaluation using the random hold back strategy. Here 70% of the data for model calibration, leaving the other 30% for validation.

## Random holdback
set.seed(655)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))

Model fitting

hv.RF <- randomForest(terron ~ hunterCovariates_A_AACN + hunterCovariates_A_elevation + hunterCovariates_A_Hillshading + hunterCovariates_A_light_insolation + hunterCovariates_A_MRVBF + hunterCovariates_A_Slope + hunterCovariates_A_TRI + hunterCovariates_A_TWI, data = DSM_data[training, ], ntree = 500, mtry = 5)

Some model summary diagnostics

# Output random forest model diagnostics
print(hv.RF)

## 
## Call:

##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 44.86%
## Confusion matrix:
##     1 2  3  4  5  6  7  8  9 10 11 12 class.error
## 1  11 1  0  3  0  0  1  0  3  0  0  0   0.4210526
## 2   1 5  3  0  0  0  0  0  0  0  0  0   0.4444444
## 3   0 1 50  4  0  0 11  1  0  0  3  2   0.3055556
## 4   4 1  8 15  0  0  6  2  7  0  0  0   0.6511628
## 5   0 0  0  0 66 10  1  4  4  7  0  0   0.2826087
## 6   0 0  0  0  9 50  0  6  0  2  0  0   0.2537313
## 7   0 0 11  1  5  0 68 21  3  6  2  5   0.4426230
## 8   0 0  2  2  2  5 16 74  1  2  5  0   0.3211009
## 9   3 0  0  3  5  0  4  3 28  0  0  0   0.3913043
## 10  1 0  0  1 16  3 11 12  0  9  0  1   0.8333333
## 11  0 0 14  0  1  0  4 11  0  2  1  0   0.9696970
## 12  0 0  7  0  1  0 14  1  0  2  0  9   0.7352941

# output relative importance of each covariate
importance(hv.RF)

##                                     MeanDecreaseGini
## hunterCovariates_A_AACN                     91.14576
## hunterCovariates_A_elevation               123.87784
## hunterCovariates_A_Hillshading              39.86235
## hunterCovariates_A_light_insolation         48.61020
## hunterCovariates_A_MRVBF                   101.80381
## hunterCovariates_A_Slope                    38.30739
## hunterCovariates_A_TRI                      49.58451
## hunterCovariates_A_TWI                     128.08997

Three types of prediction outputs can be generated from Random Forest models, and are specified via the type parameter of the predict extractor functions. The different types are the response (predicted class), prob (class probabilities) or vote (vote count, which really just appears to return the probabilities).

# Prediction of classes [first 10 cases]
predict(hv.RF, type = "response", newdata = DSM_data[training, ])[1:10]

## 366  79 946 148 710 729 848  20 923 397 
##  12   8   7   5   3   6   5   6  11   5 
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12

# Class probabilities [first 10 cases]
predict(hv.RF, type = "prob", newdata = DSM_data[training, ])[1:10, ]

##     1     2     3     4     5     6     7     8     9    10    11    12
## 366 0 0.000 0.102 0.000 0.000 0.000 0.042 0.004 0.000 0.006 0.022 0.824
## 79  0 0.000 0.000 0.028 0.002 0.000 0.086 0.786 0.002 0.078 0.014 0.004
## 946 0 0.000 0.000 0.014 0.052 0.000 0.704 0.168 0.004 0.044 0.008 0.006
## 148 0 0.000 0.000 0.000 0.800 0.012 0.002 0.002 0.176 0.006 0.000 0.002
## 710 0 0.002 0.842 0.042 0.000 0.000 0.070 0.002 0.002 0.000 0.024 0.016
## 729 0 0.000 0.000 0.000 0.290 0.686 0.000 0.004 0.002 0.018 0.000 0.000
## 848 0 0.000 0.000 0.000 0.946 0.002 0.000 0.002 0.002 0.048 0.000 0.000
## 20  0 0.000 0.000 0.000 0.098 0.894 0.000 0.000 0.000 0.004 0.004 0.000
## 923 0 0.000 0.004 0.022 0.006 0.002 0.052 0.190 0.000 0.020 0.700 0.004
## 397 0 0.000 0.026 0.020 0.664 0.002 0.066 0.020 0.000 0.156 0.026 0.020

Model goodness of fit evaluation

From the diagnostics output of the hv.RF model the confusion matrix is automatically generated, except it was a different orientation to what we have been looking for previous examples. This confusion matrix was performed on what is called the OOB or out-of-bag data i.e. it validates the model/s dynamically with observations withheld from the model fit. So lets just evaluate the model as we have done for the previous models. For calibration:

# Goodness of fit calibration
C.pred.hv.RF <- predict(hv.RF, newdata = DSM_data[training, ])
goofcat(observed = DSM_data$terron[training], predicted = C.pred.hv.RF)

## $confusion_matrix
##     1 2  3  4  5  6   7   8  9 10 11 12
## 1  19 0  0  0  0  0   0   0  0  0  0  0
## 2   0 9  0  0  0  0   0   0  0  0  0  0
## 3   0 0 72  0  0  0   0   0  0  0  0  0
## 4   0 0  0 43  0  0   0   0  0  0  0  0
## 5   0 0  0  0 92  0   0   0  0  0  0  0
## 6   0 0  0  0  0 67   0   0  0  0  0  0
## 7   0 0  0  0  0  0 122   0  0  0  0  0
## 8   0 0  0  0  0  0   0 109  0  0  0  0
## 9   0 0  0  0  0  0   0   0 46  0  0  0
## 10  0 0  0  0  0  0   0   0  0 54  0  0
## 11  0 0  0  0  0  0   0   0  0  0 33  0
## 12  0 0  0  0  0  0   0   0  0  0  0 34
## 
## $overall_accuracy
## [1] 100
## 
## $producers_accuracy
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 100 100 100 100 100 100 100 100 100 100 100 100 
## 
## $users_accuracy
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 100 100 100 100 100 100 100 100 100 100 100 100 
## 
## $kappa
## [1] 1

It seems quite incredible that this particular model is indicating a 100% accuracy. Here it pays to look at the out-of-bag error of the hv.RF model for a better indication of the model goodness of fit. For the random holdback evaluation:

## Goodness of fit out-of-bag
V.pred.hv.RF <- predict(hv.RF, newdata = DSM_data[-training, ])
goofcat(observed = DSM_data$terron[-training], predicted = V.pred.hv.RF)

## $confusion_matrix
##    1 2  3 4  5  6  7  8  9 10 11 12
## 1  5 0  0 1  0  0  0  0  1  0  0  0
## 2  1 0  0 0  0  0  0  0  0  0  0  0
## 3  0 4 22 6  0  0  4  4  0  1  1  1
## 4  2 0  2 7  0  1  2  0  1  0  0  0
## 5  0 0  0 2 29  2  1  0  1  9  2  0
## 6  0 0  0 0  5 20  0  0  1  1  0  0
## 7  0 0  2 8  0  0 38  8  1  7  8  5
## 8  0 0  0 2  2  2  3 29  1  4  2  2
## 9  0 0  1 5  1  0  3  0 14  0  0  0
## 10 0 0  0 0  3  0  2  0  0  1  0  0
## 11 0 0  0 1  0  0  0  0  0  0  2  0
## 12 0 0  0 0  0  0  0  0  0  1  0  3
## 
## $overall_accuracy
## [1] 57
## 
## $producers_accuracy
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 63  0 82 22 73 80 72 71 70  5 14 28 
## 
## $users_accuracy
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 72  0 52 47 64 75 50 62 59 17 67 75 
## 
## $kappa
## [1] 0.5067225

So based on the model validation, the Random Forest performs quite similarly to the other models that were used before, despite a perfect performance based on the diagnostics of the calibration model.

Apply model spatially

And finally the map that results from applying the hv.RF model to the covariate rasters.

## Apply model spatially class prediction
map.RF.c <- terra::predict(object = hv.rasters, model = hv.RF)

## plotting
map.RF.c <- as.factor(map.RF.c)

## HEX colors
area_colors <- c("#FF0000", "#38A800", "#73DFFF", "#FFEBAF", "#A8A800", "#0070FF", "#FFA77F", "#7AF5CA", "#D7B09E", "#CCCCCC", "#B4D79E", "#FFFF00")

terra::plot(map.RF.c, col = area_colors, plg = list(legend = c("HVT_001", "HVT_002", "HVT_003", "HVT_004", "HVT_005", "HVT_006", "HVT_007", "HVT_008", "HVT_009", "HVT_010", "HVT_011", "HVT_012"), cex = 0.75, title = "Terron Class"))

rconsole — Hunter Valley Terron class map created using random forest model.

References

Malone, B P, P Hughes, A B McBratney, and B Minsany. 2014. “A Model for the Identification of Terrons in the Lower Hunter Valley, Australia.” Geoderma Regional 1: 31–47.