A very brief introduction to the caret package for digital soil mapping
- Introduction
- Data preparation
- Perform the covariate intersection
- Model fitting: Multiple linear regression
- Model fitting: Other model interfaces
- Spatial prediction
Introduction
It quickly becomes apparent that there are many prediction functions that could be used for DSM. As observed in previous examples, each model has its relative advantages and disadvantages, as well as its own specific parameterisation and quirks for fitting.
Often the hyper-parameters used for model training are chosen without any optimisation, or even due consideration. Usually we are just happy with the defaults and do not think much about how sensitive the model outcomes might be to using those defaults or to a blindly chosen hyper-parameter set.
We may also be confronted with many possible model structures. It is often difficult to choose among them, so we default to a model we know well or have used often, without considering alternatives.
This is where the caret R package comes into its own in terms of efficiency, streamlining the workflow for fitting models and optimising some of those hyper-parameters. As the dedicated website indicates, the caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. As we have seen, there are many different modelling functions in R, and some have different syntax for model training and/or prediction. The caret package provides a uniform interface to the various functions themselves, as well as a way to standardise common tasks (such as parameter tuning and variable importance).
Data preparation
So let's first get the data organised. Recall from the data preparation exercises that we were working with the soil point data and environmental covariates for the Hunter Valley area. These data are stored in the HV_subsoilpH object, together with associated rasters supplied with the ithir package.
For the succession of models examined in these various pages, we will concentrate on modelling and mapping soil pH for the 60-100 cm depth interval. To refresh, let's load the data in, then intersect the points with the available covariates.
# libraries
library(ithir)
library(terra)
library(sf)
library(caret)
# point data
data(HV_subsoilpH)
# Start afresh round pH data to 2 decimal places
HV_subsoilpH$pH60_100cm <- round(HV_subsoilpH$pH60_100cm, 2)
# remove already intersected data
HV_subsoilpH <- HV_subsoilpH[, 1:3]
# add an id column
HV_subsoilpH$id <- seq(1, nrow(HV_subsoilpH), by = 1)
# re-arrange order of columns
HV_subsoilpH <- HV_subsoilpH[, c(4, 1, 2, 3)]
# Change names of coordinate columns
names(HV_subsoilpH)[2:3] <- c("x", "y")
# save a copy of coordinates
HV_subsoilpH$x2 <- HV_subsoilpH$x
HV_subsoilpH$y2 <- HV_subsoilpH$y
# convert data to sf object
HV_subsoilpH <- sf::st_as_sf(x = HV_subsoilpH, coords = c("x", "y"))
# grids (covariate rasters from ithir package)
hv.sub.rasters <- list.files(path = system.file("extdata/", package = "ithir"), pattern = "hunterCovariates_sub", full.names = TRUE)
# read them into R as SpatRaster objects
hv.sub.rasters <- terra::rast(hv.sub.rasters)
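Before performing the intersection it can be worth quickly confirming what has been loaded. The short check below is an addition to the original workflow, just to verify the covariate stack:
# quick check of the covariate stack: layer names and number of layers
names(hv.sub.rasters)
terra::nlyr(hv.sub.rasters)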
Perform the covariate intersection
# extract covariate data
DSM_data <- terra::extract(x = hv.sub.rasters, y = HV_subsoilpH, bind = TRUE, method = "simple")
DSM_data <- as.data.frame(DSM_data)
Often it is handy to check whether there are missing values in both the target variable and the covariates. It is possible that a point location does not fall within the extent of the available covariates; in these cases the data should be excluded. A quick way to assess whether there are missing or NA values in the data is to use the complete.cases function.
which(!complete.cases(DSM_data))
## integer(0)
DSM_data <- DSM_data[complete.cases(DSM_data), ]
There do not appear to be any missing data, as indicated by the integer(0) output above, i.e. there are zero rows with missing information.
Model fitting: Multiple linear regression
The workhorse of the caret package is the train function. We can specify the model to be fitted in two ways:
# 1.
fit <- train(form = pH60_100cm ~ AACN + Landsat_Band1 + Elevation + Hillshading +
Mid_Slope_Positon + MRVBF + NDVI + TWI, data = DSM_data, method = "lm")
# or 2.
fit <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "lm")
Using the summary(fit) command brings up the model parameter estimates, while the object fit also contains a summary of useful goodness of fit diagnostics such as the RMSE and R2 statistics.
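For example, the fitted object can be interrogated directly. A brief sketch, using the fit object created above:
# model parameter estimates from the final linear model
summary(fit)
# resampling-based goodness of fit statistics (RMSE, R-squared, MAE)
fit$results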
You can control how model evaluation is done; options include simple goodness of fit calibration, k-fold cross-validation, and leave-one-out cross-validation. As you might recall, these sorts of processes were examined on the cross-validation page.
This option is controlled using the trControl parameter in the train function. The example below illustrates a 5-fold cross-validation of a linear regression model with 10 repetitions.
fit <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "lm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10))
fit
## Linear Regression
##
## 506 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 406, 405, 405, 403, 405, 405, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.189996 0.2233692 0.9439133
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
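If leave-one-out cross-validation were preferred instead, it can be requested through the same trControl argument. A minimal sketch (note this refits the model once per observation, so it is slower):
# leave-one-out cross-validation of the same linear model
fit_loo <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
    method = "lm", trControl = trainControl(method = "LOOCV"))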
Model fitting: Other model interfaces
There are many other potential models that you could consider for DSM. Check out the list of available models on the caret website, or print them as below:
list_of_models <- modelLookup()
head(list_of_models)
## model parameter label forReg forClass probModel
## 1 ada iter #Trees FALSE TRUE TRUE
## 2 ada maxdepth Max Tree Depth FALSE TRUE TRUE
## 3 ada nu Learning Rate FALSE TRUE TRUE
## 4 AdaBag mfinal #Trees FALSE TRUE TRUE
## 5 AdaBag maxdepth Max Tree Depth FALSE TRUE TRUE
## 9 adaboost nIter #Trees FALSE TRUE TRUE
# The number of models caret interfaces with
nrow(list_of_models)
## [1] 511
You can choose which model to use in the train function with the method option. You will note that the fitting of the Cubist and Random Forest models below automatically attempts to optimise some of the fitting parameters, for example the mtry parameter for Random Forest.
To look at what parameters can be optimised for each model in caret we can use the modelLookup function. By default a tuning grid is used to find the optimal model hyper-parameters, but this can be controlled manually by creating user-defined tuning grids for customised grid search workflows (a sketch of this is shown after the Cubist output below). Below are examples using the default grid search.
# Cubist model
modelLookup(model = "cubist")
## model parameter label forReg forClass probModel
## 1 cubist committees #Committees TRUE FALSE FALSE
## 2 cubist neighbors #Instances TRUE FALSE FALSE
fit_cubist <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "cubist", trControl = trainControl(method = "cv", number = 5))
# model summary
fit_cubist
## Cubist
##
## 506 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 403, 404, 406, 405, 406
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.196051 0.2210315 0.9396362
## 1 5 1.199069 0.2336880 0.9322461
## 1 9 1.179916 0.2420817 0.9291179
## 10 0 1.184225 0.2345799 0.9314920
## 10 5 1.193676 0.2386040 0.9300559
## 10 9 1.172219 0.2498648 0.9236565
## 20 0 1.185989 0.2324439 0.9321528
## 20 5 1.193865 0.2381857 0.9304084
## 20 9 1.172989 0.2488805 0.9247550
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 9.
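As flagged above, the default grid search can be replaced by a user-defined tuning grid supplied through the tuneGrid argument. A minimal sketch for the Cubist model, where the candidate values are illustrative only:
# user-defined tuning grid for the Cubist hyper-parameters
cubist_grid <- expand.grid(committees = c(1, 5, 10, 20), neighbors = c(0, 5, 9))
fit_cubist_grid <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
    method = "cubist", trControl = trainControl(method = "cv", number = 5),
    tuneGrid = cubist_grid)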
# random forest model
modelLookup(model = "rf")
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
fit_rf <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "rf", trControl = trainControl(method = "cv", number = 5))
# model summary
fit_rf
## Random Forest
##
## 506 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 405, 406, 404, 404, 405
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.217696 0.1905925 0.9606279
## 5 1.212227 0.1992767 0.9517521
## 8 1.211337 0.2013908 0.9501383
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 8.
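caret also standardises the assessment of variable importance that was mentioned in the introduction. A brief sketch using the fitted Random Forest model:
# rank the covariates by their importance in the fitted model
varImp(fit_rf)
plot(varImp(fit_rf))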
Spatial prediction
Using the fitted model, predictions can be achieved with the native predict or terra::predict functions, depending on whether the prediction is applied to a data table or directly to raster data.
## Extend model to data
pred_cubist <- predict(fit_cubist, DSM_data)
## Extend model to raster data
pred_cubistMap <- terra::predict(object = hv.sub.rasters, model = fit_cubist)
plot(pred_cubistMap)
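If the prediction map is needed later, it can be written to file with terra::writeRaster. A minimal sketch, where the file name is an assumption:
# write the Cubist prediction map to a GeoTIFF (example file name)
terra::writeRaster(x = pred_cubistMap, filename = "cubist_pH_60_100cm.tif", overwrite = TRUE)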
There is plenty of other functionality in the caret package. In addition to the detailed resources mentioned above, it always pays to look over the help files associated with each function.
It is a matter of personal choice whether you use the functionality of caret or take more control of the underlying processes by working with the model functions directly. This also applies to handling cross-validation and external validation, where you may wish to take greater control over the workflow rather than using the built-in functionality provided in caret.