CODE:

Get the code used for this section here

Digital soil mapping using Decision Tree algorithm

Introduction

Linear regression is a global model, where there is a single predictive formula holding over the entire data space. With a linear model we therefore make some assumptions about how our target variable relates to the covariates.

These may often hold. However, it is models that allow one the flexibility of modelling non-linearity that are increasingly popular in the DSM community. One of these model structures are classification and regression trees (CART).

These models are a non-parametric decision tree learning algorithm that produces either classification or regression trees.

In this page we will concentrate on regression trees because our target variable is numeric i.e. a continuous variable. On another page we will look at classification trees for categorical variables. Decision trees (either regression or classification) are formed by a collection of rules based on variables in the modeling data set:

  • Rules based on variables’ values are selected to get the best split to differentiate observations based on the dependent variable.

  • Once a rule is selected and splits a node into two, the same process is applied to each subsequent node (i.e. it is a recursive procedure).

  • Splitting stops when CART detects no further gain can be made, or some pre-set stopping rules are met. Alternatively, the data are split as much as possible and then the tree is later pruned.

Each branch of the tree ends in a terminal node. Each observation falls into one and exactly one terminal node, and each terminal node is uniquely defined by a set of rules.

For a regression tree, the terminal node is a single value, or could be a regression model (which is the case for Cubist models which we will look at later).

Implementation of regression trees in R is provided both through the rpart and party packages. We will use the rpart package and its rpart function. However, the party package through the ctree function offers more functionality, and implements the partitioning in a more statistical robust fashion. Both functions however can handle both continuous and categorical predictor variables.

Back to top

Data preparation

So lets firstly get the data organized. Recall from before in the data preparatory exercises that we were working with the soil point data and environmental covariates for the Hunter Valley area. These data are stored in the HV_subsoilpH object together with associated rasters supplied with the ithir package.

For the succession of models examined in these various pages, we will concentrate on modelling and mapping the soil pH for the 60-100cm depth interval. To refresh, lets load the data in, then intersect the data with the available covariates.

# libraries
library(ithir)
library(terra)
library(sf)
library(rpart)

# point data
data(HV_subsoilpH)

# Start afresh round pH data to 2 decimal places
HV_subsoilpH$pH60_100cm <- round(HV_subsoilpH$pH60_100cm, 2)

# remove already intersected data
HV_subsoilpH <- HV_subsoilpH[, 1:3]

# add an id column
HV_subsoilpH$id <- seq(1, nrow(HV_subsoilpH), by = 1)

# re-arrange order of columns
HV_subsoilpH <- HV_subsoilpH[, c(4, 1, 2, 3)]

# Change names of coordinate columns
names(HV_subsoilpH)[2:3] <- c("x", "y")
# save a copy of coordinates
HV_subsoilpH$x2 <- HV_subsoilpH$x
HV_subsoilpH$y2 <- HV_subsoilpH$y

# convert data to sf object
HV_subsoilpH <- sf::st_as_sf(x = HV_subsoilpH, coords = c("x", "y"))

# grids (covariate rasters from ithir package)
hv.sub.rasters <- list.files(path = system.file("extdata/", package = "ithir"), pattern = "hunterCovariates_sub", full.names = TRUE)

# read them into R as SpatRaster objects
hv.sub.rasters <- terra::rast(hv.sub.rasters)

Back to top

Perform the covariate intersection

DSM_data <- terra::extract(x = hv.sub.rasters, y = HV_subsoilpH, bind = T, method = "simple")
DSM_data <- as.data.frame(DSM_data)

Often it is handy to check to see whether there are missing values both in the target variable and of the covariates. It is possible that a point location does not fit within the extent of the available covariates. In these cases the data should be excluded. A quick way to assess whether there are missing or NA values in the data is to use the complete.cases function.

# check for NA values
which(!complete.cases(DSM_data))

## integer(0)

DSM_data <- DSM_data[complete.cases(DSM_data), ]

There do not appear to be any missing data as indicated by the integer(0) output above i.e there are zero rows with missing information.

Back to top

Model fitting

Fitting a decision tree in R is quite similar to that for linear models. Note that below we are also performing a random hold-back of some data for an external validation.

## Fitting a decision tree model
set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))  

hv.RT.Exp <- rpart(pH60_100cm ~ AACN + Landsat_Band1 + Elevation + Hillshading +
Mid_Slope_Positon + MRVBF + NDVI + TWI, data = DSM_data[training, ], control = rpart.control(minsplit = 50))

It is worthwhile to look at the help file for rpart particularly those aspects regarding the rpart.control parameters which control the rpart fit. Often it is helpful to just play around with the parameters to get a sense of what does what. Here for the minsplit parameter within rpart.control we are specifying that we want at least 50 observations in a node in order for a split to be attempted.

Detailed results of the model fit can be provided via the summary and printcp functions.

summary(hv.RT.Exp)

## Call:
## rpart(formula = pH60_100cm ~ AACN + Landsat_Band1 + Elevation + 
##     Hillshading + Mid_Slope_Positon + MRVBF + NDVI + TWI, data = DSM_data[training, 
##     ], control = rpart.control(minsplit = 50))
##   n= 354 
## 
##           CP nsplit rel error    xerror       xstd
## 1 0.06794766      0 1.0000000 1.0050406 0.06360731
## 2 0.03317748      2 0.8641047 1.0228037 0.06761617
## 3 0.03220587      3 0.8309272 0.9935433 0.06698589
## 4 0.02195220      6 0.7343096 0.9645777 0.06626158
## 5 0.02155232      7 0.7123574 0.9680294 0.07002215
## 6 0.01287867      8 0.6908051 0.9768363 0.07142329
## 7 0.01234328      9 0.6779264 0.9753883 0.07151362
## 8 0.01000000     10 0.6655831 0.9754643 0.07164849
## 
## Variable importance
##              AACN               TWI         Elevation             MRVBF 
##                22                18                14                13 
## Mid_Slope_Positon              NDVI       Hillshading     Landsat_Band1 
##                11                10                 9                 4 
## 
## Node number 1: 354 observations,    complexity param=0.06794766
##   mean=6.55096, MSE=1.753635 
##   left son=2 (272 obs) right son=3 (82 obs)
##   Primary splits:
##       AACN              < 43.59965   to the left,  improve=0.06379266, (0 missing)
##       Mid_Slope_Positon < 0.783947   to the left,  improve=0.06097348, (0 missing)
##       MRVBF             < 0.0166785  to the right, improve=0.05881503, (0 missing)
##       TWI               < 9.806141   to the right, improve=0.05306808, (0 missing)
##       Elevation         < 182.9988   to the left,  improve=0.03736796, (0 missing)
##   Surrogate splits:
##       Elevation         < 154.1056   to the left,  agree=0.958, adj=0.817, (0 split)
##       TWI               < 9.704844   to the right, agree=0.915, adj=0.634, (0 split)
##       MRVBF             < 0.025861   to the right, agree=0.879, adj=0.476, (0 split)
##       Hillshading       < 14.0666    to the left,  agree=0.802, adj=0.146, (0 split)
##       Mid_Slope_Positon < 0.011807   to the right, agree=0.771, adj=0.012, (0 split)
## 
## Node number 2: 272 observations,    complexity param=0.06794766
##   mean=6.367316, MSE=1.691953 
##   left son=4 (179 obs) right son=5 (93 obs)
##   Primary splits:
##       Mid_Slope_Positon < 0.7870535  to the left,  improve=0.09726051, (0 missing)
##       MRVBF             < 0.62053    to the left,  improve=0.08383424, (0 missing)
##       TWI               < 17.16774   to the left,  improve=0.07594164, (0 missing)
##       AACN              < 3.510582   to the right, improve=0.06521629, (0 missing)
##       Elevation         < 99.58687   to the right, improve=0.05214677, (0 missing)
##   Surrogate splits:
##       TWI         < 16.32056   to the left,  agree=0.897, adj=0.699, (0 split)
##       AACN        < 6.921867   to the right, agree=0.868, adj=0.613, (0 split)
##       MRVBF       < 1.858052   to the left,  agree=0.860, adj=0.591, (0 split)
##       Hillshading < 2.116992   to the right, agree=0.794, adj=0.398, (0 split)
##       Elevation   < 101.4914   to the right, agree=0.746, adj=0.258, (0 split)
## 
## Node number 3: 82 observations,    complexity param=0.02155232
##   mean=7.160122, MSE=1.475291 
##   left son=6 (26 obs) right son=7 (56 obs)
##   Primary splits:
##       NDVI        < -0.2270705 to the left,  improve=0.11059740, (0 missing)
##       MRVBF       < 0.0181425  to the right, improve=0.10021720, (0 missing)
##       Hillshading < 7.197221   to the left,  improve=0.07748013, (0 missing)
##       Elevation   < 161.3855   to the right, improve=0.06811931, (0 missing)
##       AACN        < 75.9357    to the left,  improve=0.06218024, (0 missing)
##   Surrogate splits:
##       Landsat_Band1     < 42.5       to the left,  agree=0.793, adj=0.346, (0 split)
##       Mid_Slope_Positon < 0.1866335  to the left,  agree=0.744, adj=0.192, (0 split)
##       Hillshading       < 11.70925   to the right, agree=0.720, adj=0.115, (0 split)
##       MRVBF             < 0.303978   to the right, agree=0.695, adj=0.038, (0 split)
##       TWI               < 10.23516   to the right, agree=0.695, adj=0.038, (0 split)
## 
## Node number 4: 179 observations,    complexity param=0.03317748
##   mean=6.074916, MSE=1.625561 
##   left son=8 (34 obs) right son=9 (145 obs)
##   Primary splits:
##       AACN          < 13.55327   to the left,  improve=0.07078306, (0 missing)
##       NDVI          < -0.1660675 to the left,  improve=0.04227382, (0 missing)
##       Hillshading   < 3.38458    to the left,  improve=0.03978021, (0 missing)
##       Landsat_Band1 < 52.5       to the right, improve=0.03485309, (0 missing)
##       TWI           < 15.09922   to the right, improve=0.02528861, (0 missing)
##   Surrogate splits:
##       TWI               < 14.49579   to the right, agree=0.888, adj=0.412, (0 split)
##       Elevation         < 101.5807   to the left,  agree=0.849, adj=0.206, (0 split)
##       MRVBF             < 2.628778   to the right, agree=0.849, adj=0.206, (0 split)
##       Mid_Slope_Positon < 0.7812145  to the right, agree=0.821, adj=0.059, (0 split)
##       Hillshading       < 0.9076225  to the left,  agree=0.816, adj=0.029, (0 split)
## 
## Node number 5: 93 observations,    complexity param=0.01234328
##   mean=6.930108, MSE=1.338444 
##   left son=10 (73 obs) right son=11 (20 obs)
##   Primary splits:
##       NDVI      < -0.1538785 to the left,  improve=0.06155876, (0 missing)
##       MRVBF     < 2.748072   to the left,  improve=0.04850123, (0 missing)
##       Elevation < 99.90028   to the right, improve=0.03367224, (0 missing)
##       TWI       < 16.79354   to the left,  improve=0.03088351, (0 missing)
##       AACN      < 3.648288   to the right, improve=0.02936984, (0 missing)
##   Surrogate splits:
##       Landsat_Band1 < 63.5       to the left,  agree=0.796, adj=0.05, (0 split)
## 
## Node number 6: 26 observations
##   mean=6.567308, MSE=1.515658 
## 
## Node number 7: 56 observations,    complexity param=0.01287867
##   mean=7.435357, MSE=1.217632 
##   left son=14 (37 obs) right son=15 (19 obs)
##   Primary splits:
##       Elevation     < 165.3835   to the right, improve=0.11724900, (0 missing)
##       TWI           < 9.119365   to the left,  improve=0.06946874, (0 missing)
##       MRVBF         < 0.016642   to the right, improve=0.05460568, (0 missing)
##       AACN          < 59.41751   to the right, improve=0.04961327, (0 missing)
##       Landsat_Band1 < 51         to the right, improve=0.03827830, (0 missing)
##   Surrogate splits:
##       AACN        < 59.41751   to the right, agree=0.946, adj=0.842, (0 split)
##       TWI         < 9.690168   to the left,  agree=0.768, adj=0.316, (0 split)
##       Hillshading < 3.093062   to the right, agree=0.750, adj=0.263, (0 split)
##       MRVBF       < 0.145624   to the left,  agree=0.732, adj=0.211, (0 split)
##       NDVI        < -0.1460055 to the left,  agree=0.714, adj=0.158, (0 split)
## 
## Node number 8: 34 observations
##   mean=5.374412, MSE=1.01446 
## 
## Node number 9: 145 observations,    complexity param=0.03220587
##   mean=6.239172, MSE=1.626812 
##   left son=18 (120 obs) right son=19 (25 obs)
##   Primary splits:
##       TWI       < 13.928     to the left,  improve=0.07823863, (0 missing)
##       MRVBF     < 0.62053    to the left,  improve=0.06765783, (0 missing)
##       NDVI      < -0.161967  to the left,  improve=0.04928153, (0 missing)
##       Elevation < 142.9022   to the right, improve=0.03381342, (0 missing)
##       AACN      < 21.32139   to the right, improve=0.02651395, (0 missing)
##   Surrogate splits:
##       AACN        < 17.50792   to the right, agree=0.883, adj=0.32, (0 split)
##       MRVBF       < 1.790583   to the left,  agree=0.876, adj=0.28, (0 split)
##       Elevation   < 107.4445   to the right, agree=0.855, adj=0.16, (0 split)
##       Hillshading < 2.147379   to the right, agree=0.834, adj=0.04, (0 split)
## 
## Node number 10: 73 observations
##   mean=6.779863, MSE=1.317859 
## 
## Node number 11: 20 observations
##   mean=7.4785, MSE=1.030453 
## 
## Node number 14: 37 observations
##   mean=7.164595, MSE=1.108441 
## 
## Node number 15: 19 observations
##   mean=7.962632, MSE=1.009483 
## 
## Node number 18: 120 observations,    complexity param=0.03220587
##   mean=6.076333, MSE=1.468697 
##   left son=36 (47 obs) right son=37 (73 obs)
##   Primary splits:
##       Landsat_Band1 < 52.5       to the right, improve=0.09342719, (0 missing)
##       NDVI          < -0.1709515 to the left,  improve=0.06335787, (0 missing)
##       Hillshading   < 3.976624   to the left,  improve=0.05009109, (0 missing)
##       MRVBF         < 0.629809   to the left,  improve=0.04625149, (0 missing)
##       AACN          < 29.02258   to the left,  improve=0.04051094, (0 missing)
##   Surrogate splits:
##       NDVI        < -0.2207    to the right, agree=0.675, adj=0.170, (0 split)
##       AACN        < 18.27993   to the left,  agree=0.658, adj=0.128, (0 split)
##       Elevation   < 104.9371   to the left,  agree=0.642, adj=0.085, (0 split)
##       MRVBF       < 0.0307535  to the left,  agree=0.642, adj=0.085, (0 split)
##       Hillshading < 9.895688   to the right, agree=0.625, adj=0.043, (0 split)
## 
## Node number 19: 25 observations
##   mean=7.0208, MSE=1.647543 
## 
## Node number 36: 47 observations
##   mean=5.614681, MSE=0.744344 
## 
## Node number 37: 73 observations,    complexity param=0.03220587
##   mean=6.373562, MSE=1.7095 
##   left son=74 (50 obs) right son=75 (23 obs)
##   Primary splits:
##       NDVI              < -0.176242  to the left,  improve=0.20079140, (0 missing)
##       Hillshading       < 3.976624   to the left,  improve=0.13627740, (0 missing)
##       MRVBF             < 0.6187865  to the left,  improve=0.03184879, (0 missing)
##       Mid_Slope_Positon < 0.6320945  to the right, improve=0.02794618, (0 missing)
##       Elevation         < 142.8409   to the right, improve=0.01737568, (0 missing)
##   Surrogate splits:
##       Elevation < 117.1724   to the right, agree=0.699, adj=0.043, (0 split)
## 
## Node number 74: 50 observations,    complexity param=0.0219522
##   mean=5.9762, MSE=1.051936 
##   left son=148 (26 obs) right son=149 (24 obs)
##   Primary splits:
##       Hillshading   < 5.843943   to the left,  improve=0.25909640, (0 missing)
##       Landsat_Band1 < 47.5       to the left,  improve=0.11351230, (0 missing)
##       AACN          < 29.47343   to the left,  improve=0.10966650, (0 missing)
##       MRVBF         < 0.478392   to the left,  improve=0.10308040, (0 missing)
##       NDVI          < -0.295995  to the left,  improve=0.08545253, (0 missing)
##   Surrogate splits:
##       Elevation         < 128.4046   to the left,  agree=0.72, adj=0.417, (0 split)
##       TWI               < 9.897047   to the right, agree=0.68, adj=0.333, (0 split)
##       AACN              < 33.61438   to the left,  agree=0.66, adj=0.292, (0 split)
##       MRVBF             < 0.0682475  to the right, agree=0.64, adj=0.250, (0 split)
##       Mid_Slope_Positon < 0.6506405  to the right, agree=0.62, adj=0.208, (0 split)
## 
## Node number 75: 23 observations
##   mean=7.237391, MSE=2.049532 
## 
## Node number 148: 26 observations
##   mean=5.474615, MSE=0.2419556 
## 
## Node number 149: 24 observations
##   mean=6.519583, MSE=1.361596

The summary output provides detailed information of the data splitting as well as information as to the relative importance of the covariates.

printcp(hv.RT.Exp)

## 
## Regression tree:
## rpart(formula = pH60_100cm ~ AACN + Landsat_Band1 + Elevation + 
##     Hillshading + Mid_Slope_Positon + MRVBF + NDVI + TWI, data = DSM_data[training, 
##     ], control = rpart.control(minsplit = 50))
## 
## Variables actually used in tree construction:
## [1] AACN              Elevation         Hillshading       Landsat_Band1    
## [5] Mid_Slope_Positon NDVI              TWI              
## 
## Root node error: 620.79/354 = 1.7536
## 
## n= 354 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.067948      0   1.00000 1.00504 0.063607
## 2 0.033177      2   0.86410 1.02280 0.067616
## 3 0.032206      3   0.83093 0.99354 0.066986
## 4 0.021952      6   0.73431 0.96458 0.066262
## 5 0.021552      7   0.71236 0.96803 0.070022
## 6 0.012879      8   0.69081 0.97684 0.071423
## 7 0.012343      9   0.67793 0.97539 0.071514
## 8 0.010000     10   0.66558 0.97546 0.071648

The printcp function provides the useful output of indicating which covariates were included in the final model. For the visually inclined, a plot of the tree assists a lot to interpret the model diagnostics and assessing the important covariates too.

plot(hv.RT.Exp)
text(hv.RT.Exp)
rconsole
Fitted decision tree model based on Hunter Valley soil pH (60-100cm) data.

Back to top

Cross-validation

As before, we can use the goof function to evaluate the performance of the model fit both internally and externally.

## Goodness of fit evaluation 
# Internal validation
RT.pred.C <- predict(hv.RT.Exp, DSM_data[training, ])
goof(observed = DSM_data$pH60_100cm[training], predicted = RT.pred.C)

##          R2 concordance     MSE     RMSE bias
## 1 0.3344169   0.4998021 1.16719 1.080366    0

# External validation
RT.pred.V <- predict(hv.RT.Exp, DSM_data[-training, ])
goof(observed = DSM_data$pH60_100cm[-training], predicted = RT.pred.V)

##          R2 concordance      MSE    RMSE       bias
## 1 0.1537061   0.3426565 1.690599 1.30023 0.04590521
rconsole
Decision tree xy-plot plot of predicted soil pH (60-100cm) (validation data set).

The decision tree model performance is not too dissimilar to the MLR model. Looking at the xy-plot from the external validation above and the decision tree, it becomes clear that a potential issue is apparent. This is, there are only a finite number of possible outcomes in terms of the predictions.

Back to top

Spatial prediction

This finite property becomes obviously apparent once we make a map by applying the fitted model to the covariates (using the terra::predict function and hv.sub.rasters object).

## Create the map
map.RT.r1 <- terra::predict(object = hv.sub.rasters, model = hv.RT.Exp, filename = "soilpH_60_100_RT.tif", datatype = "FLT4S", overwrite = TRUE)

plot(map.RT.r1, main = "Decision tree predicted Hunter Valley soil pH (60-100cm")
rconsole
Decision tree predicted Hunter Valley soil pH (60-100cm).

Back to top

Updated on Smart Digital Agriculture