A very brief introduction to the caret package for digital soil mapping
- Introduction
- Data preparation
- Perform the covariate intersection
- Model fitting: Multiple linear regression
- Model fitting: Other model interfaces
- Spatial prediction
Introduction
It quickly becomes apparent that there are many prediction functions that could be used for DSM. As observed in previous examples, each model has its relative advantages and disadvantages, as well as its own specific parameterisation and quirks for fitting.
Often the hyper-parameters used for model training are chosen without any optimisation, or even due consideration. Usually we are just happy with the defaults and do not think much about how sensitive the model outcomes might be to using those defaults or to a blindly chosen hyper-parameter set.
We may also be confronted with many possible model structures. It is often difficult to choose among them, so we default to a model we know well or have used often, without considering alternatives.
This is where the caret R package comes into its own in terms of efficiency, streamlining the workflow for fitting models and optimising some of those hyper-parameters. As the dedicated website indicates, the caret package (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models. As we have seen, there are many different modelling functions in R, and some have different syntax for model training and/or prediction. The caret package provides a uniform interface to the various functions themselves, as well as a way to standardise common tasks (such as parameter tuning and variable importance).
Data preparation
So let's first get the data organised. Recall from the data preparation exercises that we were working with the soil point data and environmental covariates for the Hunter Valley area. These data are stored in the HV_subsoilpH object, together with associated rasters supplied with the ithir package.
For the succession of models examined in these various pages, we will concentrate on modelling and mapping soil pH for the 60-100 cm depth interval. To refresh, let's load the data in, then intersect the points with the available covariates.
# libraries
library(ithir)
library(terra)
library(sf)
library(caret)
# point data
data(HV_subsoilpH)
# Start afresh round pH data to 2 decimal places
HV_subsoilpH$pH60_100cm <- round(HV_subsoilpH$pH60_100cm, 2)
# remove already intersected data
HV_subsoilpH <- HV_subsoilpH[, 1:3]
# add an id column
HV_subsoilpH$id <- seq(1, nrow(HV_subsoilpH), by = 1)
# re-arrange order of columns
HV_subsoilpH <- HV_subsoilpH[, c(4, 1, 2, 3)]
# Change names of coordinate columns
names(HV_subsoilpH)[2:3] <- c("x", "y")
# save a copy of coordinates
HV_subsoilpH$x2 <- HV_subsoilpH$x
HV_subsoilpH$y2 <- HV_subsoilpH$y
# convert data to sf object
HV_subsoilpH <- sf::st_as_sf(x = HV_subsoilpH, coords = c("x", "y"))
# grids (covariate rasters from ithir package)
hv.sub.rasters <- list.files(path = system.file("extdata/", package = "ithir"), pattern = "hunterCovariates_sub", full.names = TRUE)
# read them into R as SpatRaster objects
hv.sub.rasters <- terra::rast(hv.sub.rasters)
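Before performing the intersection it can be worth quickly confirming what has been loaded. The short check below is an addition to the original workflow, just to verify the covariate stack:
# quick check of the covariate stack: layer names and number of layers
names(hv.sub.rasters)
terra::nlyr(hv.sub.rasters)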
Perform the covariate intersection
# extract covariate data
DSM_data <- terra::extract(x = hv.sub.rasters, y = HV_subsoilpH, bind = TRUE, method = "simple")
DSM_data <- as.data.frame(DSM_data)
Often it is handy to check whether there are missing values in both the target variable and the covariates. It is possible that a point location does not fall within the extent of the available covariates; in these cases the data should be excluded. A quick way to assess whether there are missing or NA values in the data is to use the complete.cases function.
which(!complete.cases(DSM_data))
## integer(0)
DSM_data <- DSM_data[complete.cases(DSM_data), ]
There do not appear to be any missing data, as indicated by the integer(0) output above, i.e. there are zero rows with missing information.
Model fitting: Multiple linear regression
The workhorse of the caret package is the train function. We can specify the model to be fitted in two ways:
# 1.
fit <- train(form = pH60_100cm ~ AACN + Landsat_Band1 + Elevation + Hillshading +
Mid_Slope_Positon + MRVBF + NDVI + TWI, data = DSM_data, method = "lm")
# or 2.
fit <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "lm")
Using the summary(fit) command brings up the model parameter estimates, while the object fit also contains a summary of useful goodness of fit diagnostics such as the RMSE and R2 statistics.
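For example, the fitted object can be interrogated directly. A brief sketch, using the fit object created above:
# model parameter estimates from the final linear model
summary(fit)
# resampling-based goodness of fit statistics (RMSE, R-squared, MAE)
fit$results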
You can control how model evaluation is done; options include simple goodness of fit calibration, k-fold cross-validation, and leave-one-out cross-validation. As you might recall, these sorts of processes were examined on the cross-validation page.
This option is controlled using the trControl parameter in the train function. The example below illustrates a 5-fold cross-validation of a linear regression model with 10 repetitions.
fit <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "lm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 10))
fit
## Linear Regression
##
## 506 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 10 times)
## Summary of sample sizes: 406, 405, 405, 403, 405, 405, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 1.189996 0.2233692 0.9439133
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
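If leave-one-out cross-validation were preferred instead, it can be requested through the same trControl argument. A minimal sketch (note this refits the model once per observation, so it is slower):
# leave-one-out cross-validation of the same linear model
fit_loo <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
    method = "lm", trControl = trainControl(method = "LOOCV"))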
Model fitting: Other model interfaces
There are many other potential models that you could consider for DSM. Check out the list of available models on the caret website, or print them as below:
list_of_models <- modelLookup()
head(list_of_models)
## model parameter label forReg forClass probModel
## 1 ada iter #Trees FALSE TRUE TRUE
## 2 ada maxdepth Max Tree Depth FALSE TRUE TRUE
## 3 ada nu Learning Rate FALSE TRUE TRUE
## 4 AdaBag mfinal #Trees FALSE TRUE TRUE
## 5 AdaBag maxdepth Max Tree Depth FALSE TRUE TRUE
## 9 adaboost nIter #Trees FALSE TRUE TRUE
# The number of models caret interfaces with
nrow(list_of_models)
## [1] 511
You can choose which model to use in the train function with the method option. You will note that the fitting of the Cubist and Random Forest models below automatically attempts to optimise some of the fitting parameters, for example the mtry parameter for Random Forest.
To look at what parameters can be optimised for each model in caret we can use the modelLookup function. By default a tuning grid is used to find the optimal model hyper-parameters, but this can be controlled manually by creating user-defined tuning grids for customised grid search workflows (a sketch of this is shown after the Cubist output below). Below are examples using the default grid search.
# Cubist model
modelLookup(model = "cubist")
## model parameter label forReg forClass probModel
## 1 cubist committees #Committees TRUE FALSE FALSE
## 2 cubist neighbors #Instances TRUE FALSE FALSE
fit_cubist <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "cubist", trControl = trainControl(method = "cv", number = 5))
# model summary
fit_cubist
## Cubist
##
## 506 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 403, 404, 406, 405, 406
## Resampling results across tuning parameters:
##
## committees neighbors RMSE Rsquared MAE
## 1 0 1.196051 0.2210315 0.9396362
## 1 5 1.199069 0.2336880 0.9322461
## 1 9 1.179916 0.2420817 0.9291179
## 10 0 1.184225 0.2345799 0.9314920
## 10 5 1.193676 0.2386040 0.9300559
## 10 9 1.172219 0.2498648 0.9236565
## 20 0 1.185989 0.2324439 0.9321528
## 20 5 1.193865 0.2381857 0.9304084
## 20 9 1.172989 0.2488805 0.9247550
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were committees = 10 and neighbors = 9.
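As flagged above, the default grid search can be replaced by a user-defined tuning grid supplied through the tuneGrid argument. A minimal sketch for the Cubist model, where the candidate values are illustrative only:
# user-defined tuning grid for the Cubist hyper-parameters
cubist_grid <- expand.grid(committees = c(1, 5, 10, 20), neighbors = c(0, 5, 9))
fit_cubist_grid <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
    method = "cubist", trControl = trainControl(method = "cv", number = 5),
    tuneGrid = cubist_grid)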
# random forest model
modelLookup(model = "rf")
## model parameter label forReg forClass probModel
## 1 rf mtry #Randomly Selected Predictors TRUE TRUE TRUE
fit_rf <- train(x = DSM_data[, c(6, 7, 8, 9, 11, 12, 13, 14)], y = DSM_data$pH60_100cm,
method = "rf", trControl = trainControl(method = "cv", number = 5))
# model summary
fit_rf
## Random Forest
##
## 506 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 405, 406, 404, 404, 405
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 1.217696 0.1905925 0.9606279
## 5 1.212227 0.1992767 0.9517521
## 8 1.211337 0.2013908 0.9501383
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 8.
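caret also standardises the assessment of variable importance that was mentioned in the introduction. A brief sketch using the fitted Random Forest model:
# rank the covariates by their importance in the fitted model
varImp(fit_rf)
plot(varImp(fit_rf))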
Spatial prediction
Using the fitted model, predictions can be achieved with the native predict or terra::predict functions, depending on whether the prediction is applied to a data table or directly to raster data.
## Extend model to data
pred_cubist <- predict(fit_cubist, DSM_data)
## Extend model to raster data
pred_cubistMap <- terra::predict(object = hv.sub.rasters, model = fit_cubist)
plot(pred_cubistMap)
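If the prediction map is needed later, it can be written to file with terra::writeRaster. A minimal sketch, where the file name is an assumption:
# write the Cubist prediction map to a GeoTIFF (example file name)
terra::writeRaster(x = pred_cubistMap, filename = "cubist_pH_60_100cm.tif", overwrite = TRUE)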
There is plenty of other functionality in the caret package. In addition to the detailed resources mentioned above, it always pays to look over the help files associated with each function.
It is a matter of personal choice whether you use the functionality of caret or take more control of the underlying processes by working with the model functions directly. This also applies to handling cross-validation and external validation, where you may wish to take greater control over the workflow rather than using the built-in functionality provided in caret.