C5 decision trees

CODE:

C5 decision trees

Introduction
Data preparation
Model fitting
Model goodness of fit
Applying the model spatially

Introduction

The C5 decision tree model is available in the C50 package. The function C5.0 fits classification tree models or rule-based models using Quinlans’s C5.0 algorithm (Quinlan 1993).

Essentially we will go through the same process as we did for the multinomial logistic regression.

The C5.0 function and its internal parameters are similar in nature to that of the Cubist function for predicting continuous variables. The trials parameter lets you implement a boosted classification tree process, with the results aggregated at the end. There are also many other useful model tuning parameters in the C5.0Control parameter set too that are worth a look. Just see the help files for more information.

Data Preparation

The data to be used for the following modelling exercises are Terron classes as sampled from the map presented in Malone et al. (2014). The sample data contains 1000 cases of which there are 12 different Terron classes. Before getting into the modeling, we first load in the data and then perform the covariate layer intersection using the suite of environmental variables sourced from the ithir package.

The small selection of covariates cover an area of approximately 220km² at a spatial resolution of 25m. All these variables are derived primarily from an available digital elevation model.

## load libraries
library(ithir)
library(sf)
library(terra)
library(C50)

# point data of terron classifications
data(hvTerronDat)

# environmental covariates grids (covariate rasters from ithir package)
hv.rasters <- list.files(path = system.file("extdata/", package = "ithir"), pattern = "hunterCovariates_A", full.names = T)
hv.rasters

# read them into R as SpatRaster objects
hv.rasters <- terra::rast(hv.rasters)

Transform the hvTerronDat data to a sf object or simple feature collection which will permit spatial analysis to take place.

## coerce point data to spatial object
hvTerronDat <- sf::st_as_sf(x = hvTerronDat, coords = c("x", "y"))

As these data are of the same spatial projection as the raster data, there is no need to perform a coordinate transformation. So we can perform the intersection immediately.

## Covariate data extraction
DSM_data <- terra::extract(x = hv.rasters, y = hvTerronDat, bind = T, method = "simple")
DSM_data <- as.data.frame(DSM_data)
str(DSM_data)

## 'data.frame':    1000 obs. of  9 variables:

It is always good practice to check to see if any of the observational data returned any NA values for any one of the covariates. If there is NA values, it indicates that the observational data is outside the extent of the covariate layers. It is best to remove these observations from the data set.

## remove missing data
which(!complete.cases(DSM_data))

## integer(0)

DSM_data <- DSM_data[complete.cases(DSM_data), ]

We will also subset the data for an external validation i.e. random hold back validation. Here 70% of the data for model calibration, leaving the other 30% for validation.

## Random holdback
set.seed(655)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))

Model fitting

With the calibration data prepared, we will fit the C5 model as in the script below.

hv.C5 <- C5.0(x = DSM_data[training, 2:9], y = DSM_data$terron[training], trials = 1, rules = FALSE, control = C5.0Control(CF = 0.95, minCases = 20, earlyStopping = FALSE))

By calling the summary function, some useful model fit diagnostics are given. These include the tree structure, together with the covariates used and mis-classification error of the model, as well as a rudimentary confusion matrix. A useful feature of the C5 model is that it can omit in an automated fashion, unnecessary covariates.

The predict function can either return the predicted class or the confidence of the predictions (which is controlled using the type= prob parameter. The probabilities are quantified such that if an observation is classified by a single leaf of a decision tree, the confidence value is the proportion of training cases at that leaf that belong to the predicted class. If more than one leaf is involved (i.e. one or more of the attributes on the path has missing values), the value is a weighted sum of the individual leaves’ confidences.

For rule-classifiers, each applicable rule votes for a class with the voting weight being the rule’s confidence value. If the sum of the votes for class C is W(C), then the predicted class P is chosen so that W(P) is maximal, and the confidence is the greater of (1), the voting weight of the most specific applicable rule for predicted class P; or (2) the average vote for class P (so, W(P) divided by the number of applicable rules for class P).Boosted classifiers are similar, but individual classifiers vote for a class with weight equal to their confidence value. Overall, the confidence associated with each class for every observation are made to sum to 1.

# return the class predictions
predict(hv.C5, newdata = DSM_data[training, ])[1:10]

##  [1] 3 8 8 5 7 5 5 6 8 5
## Levels: 1 2 3 4 5 6 7 8 9 10 11 12

# return the class probabilities (Terron classes 1 to 3, first 10 cases)
predict(hv.C5, newdata = DSM_data[training, ], type = "prob")[1:10, 1:3]

##                1            2            3
## 366 0.0002918587 1.382488e-04 0.4742242759
## 79  0.0001675485 7.936508e-05 0.0006349206
## 946 0.0001675485 7.936508e-05 0.0006349206
## 148 0.0007983193 3.781512e-04 0.0030252100
## 710 0.0006031746 2.857143e-04 0.3133968224
## 729 0.0011309523 5.357143e-04 0.0042857142
## 848 0.0003270224 1.549053e-04 0.0012392427
## 20  0.0010052910 4.761905e-04 0.0038095239
## 923 0.0001675485 7.936508e-05 0.0006349206
## 397 0.0003270224 1.549053e-04 0.0012392427

Model goodness of fit

So lets look at the calibration and out-of-bag model goodness of fit evaluations. First, the calibration statistics:

## Goodness of fit calibration data
C.pred.hv.C5 <- predict(hv.C5, newdata = DSM_data[training, ])
goofcat(observed = DSM_data$terron[training], predicted = C.pred.hv.C5)

## $confusion_matrix
##     1 2  3  4  5  6  7  8  9 10 11 12
## 1  16 3  0  5  0  0  0  0  2  1  0  0
## 2   0 0  0  0  0  0  0  0  0  0  0  0
## 3   0 6 54  1  0  0 16  2  0  2 14 17
## 4   0 0  1 12  0  0  3  1  3  1  0  0
## 5   0 0  0  2 82  9  9  1 12 18  3  2
## 6   0 0  0  0  4 46  0  0  0  1  0  0
## 7   0 0 17 10  0  0 70 12  6 10  3 14
## 8   0 0  0  4  4  9 24 89  8 12 11  0
## 9   3 0  0  9  0  0  0  0 15  0  0  0
## 10  0 0  0  0  2  3  0  4  0  9  2  1
## 11  0 0  0  0  0  0  0  0  0  0  0  0
## 12  0 0  0  0  0  0  0  0  0  0  0  0
## 
## $overall_accuracy
## [1] 57
## 
## $producers_accuracy
##  1  2  3  4  5  6  7  8  9 10 11 12 
## 85  0 75 28 90 69 58 82 33 17  0  0 
## 
## $users_accuracy
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  60 NaN  49  58  60  91  50  56  56  43 NaN NaN 
## 
## $kappa
## [1] 0.4969075

It will be noticed that some of the Terron classes failed to be predicted by the fitted model. For example Terron classes 2, 11, and 12 were all predicted as being a different class. All observations of Terron class 2 were predicted as Terron class 1. Doing the out-of-bag model evaluation we return the following:

## Goodness of fit out-of-bag data
V.pred.hv.C5 <- predict(hv.C5, newdata = DSM_data[-training, ]) goofcat(observed = DSM_data$terron[-training], predicted = V.pred.hv.C5)

## $confusion_matrix
##    1 2  3 4  5  6  7  8 9 10 11 12
## 1  8 0  0 0  0  0  0  0 1  0  0  0
## 2  0 0  0 0  0  0  0  0 0  0  0  0
## 3  0 3 18 4  0  0  5  6 0  2  7  4
## 4  0 0  1 7  0  1  0  1 0  0  0  1
## 5  0 0  0 7 32  2  3  0 5 11  1  2
## 6  0 0  0 0  3 16  0  1 1  0  0  0
## 7  0 1  8 8  1  0 31  4 4  5  2  2
## 8  0 0  0 3  2  4 13 27 4  5  4  2
## 9  0 0  0 3  0  0  1  0 5  0  0  0
## 10 0 0  0 0  2  2  0  2 0  1  1  0
## 11 0 0  0 0  0  0  0  0 0  0  0  0
## 12 0 0  0 0  0  0  0  0 0  0  0  0
## 
## $overall_accuracy
## [1] 49
## 
## $producers_accuracy
##   1   2   3   4   5   6   7   8   9  10  11  12 
## 100   0  67  22  80  64  59  66  25   5   0   0 
## 
## $users_accuracy
##   1   2   3   4   5   6   7   8   9  10  11  12 
##  89 NaN  37  64  51  77  47  43  56  13 NaN NaN 
## 
## $kappa
## [1] 0.4092538

Apply model spatially

And finally, creating the map derived from the hv.C5 model using the terra::predict function. Note that the C5 model returned 0% producer’s accuracy for Terron classes 2, 11 and 12. These data account for only a small proportion of the data set, and/or, they may be similar to other existing Terron classes (based on the available predictive covariates). Consequently, they did not feature in the hv.C5 model and ultimately, the final map.

Another curiosity is that the model predictions also extends beyond the normal mapping boundary which does not mean much but one would want to remove these predictions as it causes the map product to look strange and potentially misleading. First we create a mask, msk and then use the mask function to blot out the areas where we do not need to have model predictions.

## Applying model spatially class prediction
map.C5.c <- terra::predict(object = hv.rasters, model = hv.C5, type = "class")

# mask [the C5 model extends to outside the mapping area]
msk <- terra::ifel(is.na(hv.rasters[[1]]), NA, 1)
map.C5.c.masked <- terra::mask(map.C5.c, msk, inverse = F)

# plot map
map.C5.c.masked <- as.factor(map.C5.c.masked)
# some classes were dropped out !!
levels(map.C5.c.masked)[[1]]

## HEX colors
area_colors <- c("#FF0000", "#73DFFF", "#FFEBAF", "#A8A800",  "#0070FF", "#FFA77F", "#7AF5CA", "#D7B09E", "#CCCCCC")

terra::plot(map.C5.c.masked, col = area_colors, plg = list(legend = c("HVT_001", "HVT_003", "HVT_004", "HVT_005", "HVT_006", "HVT_007", "HVT_008", "HVT_009", "HVT_010"), cex = 0.75, title = "Terron Class"))

rconsole — Hunter Valley Terron class map created using C5 decision tree model

References

Malone, B P, P Hughes, A B McBratney, and B Minsany. 2014. “A Model for the Identification of Terrons in the Lower Hunter Valley, Australia.” Geoderma Regional 1: 31–47.

Quinlan, J R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.