Multiple Linear Regression

CODE:

Multiple linear regression model applied to digital soil mapping

Introduction
Data preparation
Perform the covariate intersection
Model fitting (full model)
Model fitting (step-wise covariate selection)
Cross-validation
Applying the model spatially
Other possible spatial predictions (prediction intervals)

Introduction

Multiple linear regression (MLR) is where we regress a target variable against more than one covariate. In terms of soil spatial prediction functions, MLR is a least-squares model whereby we want to to predict a continuous soil variable from a suite of covariates.

There are a couple of ways to go about this. We could just put everything (all the covariates) in the model and then fit it (estimate the model parameters).

We could also perform a step-wise regression model where we only enter variables that are statistically significant, based on some selection criteria.

Alternatively we could fit what could be termed, an ‘expert’ model, such that based on some pre-determined knowledge of the soil variable we are trying to model, we include covariates that best describe this knowledge. In some ways this is a biased model because we really don’t know everything about (the spatial characteristics) the soil property under investigation. Yet in many situations it is better to rely on expert knowledge that is gained in the field as opposed to a purely data driven approach.

Data preparation

So lets firstly get the data organized. Recall from before in the data preparatory exercises that we were working with the soil point data and environmental covariates for the Hunter Valley area. These data are stored in the HV_subsoilpH object and supplied raster files from the ithir package.

For the succession of models examined in these various pages, we will concentrate on modelling and mapping the soil pH for the 60-100cm depth interval. To refresh, lets load the data in, then intersect the data with the available covariates.

# load libraries
library(ithir)
library(terra)
library(sf)
library(viridis)

# point data
data(HV_subsoilpH)

# Start afresh round pH data to 2 decimal places
HV_subsoilpH$pH60_100cm <- round(HV_subsoilpH$pH60_100cm, 2)

# remove already intersected data
HV_subsoilpH <- HV_subsoilpH[, 1:3]

# add an id column
HV_subsoilpH$id <- seq(1, nrow(HV_subsoilpH), by = 1)

# re-arrange order of columns
HV_subsoilpH <- HV_subsoilpH[, c(4, 1, 2, 3)]

# Change names of coordinate columns
names(HV_subsoilpH)[2:3] <- c("x", "y")
# save a copy of coordinates
HV_subsoilpH$x2 <- HV_subsoilpH$x
HV_subsoilpH$y2 <- HV_subsoilpH$y

# convert data to sf object
HV_subsoilpH <- sf::st_as_sf(x = HV_subsoilpH, coords = c("x", "y"))

# grids (covariate rasters from ithir package)
hv.sub.rasters <- list.files(path = system.file("extdata/", package = "ithir"), pattern = "hunterCovariates_sub", full.names = TRUE)

# read them into R as SpatRaster objects
hv.sub.rasters <- terra::rast(hv.sub.rasters)

Perform the covariate intersection

DSM_data <- terra::extract(x = hv.sub.rasters, y = HV_subsoilpH, bind = T, method = "simple")
DSM_data <- as.data.frame(DSM_data)

Often it is handy to check to see whether there are missing values both in the target variable and of the covariates. It is possible that a point location does not fit within the extent of the available covariates. In these cases the data should be excluded. A quick way to assess whether there are missing or NA values in the data is to use the complete.cases function.

which(!complete.cases(DSM_data))

## integer(0)

DSM_data <- DSM_data[complete.cases(DSM_data), ]

There do not appear to be any missing data as indicated by the integer(0) output above i.e there are zero rows with missing information.

Model fitting (full model)

With the soil point data prepared, lets fit a model with everything in it (all covariates) to get an idea of how to parameterise the MLR models in R. Remember the soil variable we are making a model for is soil pH for the 60-100cm depth interval.

hv.MLR.Full <- lm(pH60_100cm ~ +Terrain_Ruggedness_Index + AACN + Landsat_Band1 +
Elevation + Hillshading + Light_insolation + Mid_Slope_Positon + MRVBF + NDVI +
TWI + Slope, data = DSM_data)
summary(hv.MLR.Full)

## 
## Call:
## lm(formula = pH60_100cm ~ +Terrain_Ruggedness_Index + AACN + 
##     Landsat_Band1 + Elevation + Hillshading + Light_insolation + 
##     Mid_Slope_Positon + MRVBF + NDVI + TWI + Slope, data = DSM_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2380 -0.7843 -0.1225  0.7057  3.4641 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               5.372451   2.147673   2.502 0.012689 *  
## Terrain_Ruggedness_Index  0.075084   0.054893   1.368 0.171991    
## AACN                      0.034747   0.007241   4.798 2.12e-06 ***
## Landsat_Band1            -0.037712   0.009355  -4.031 6.42e-05 ***
## Elevation                -0.013535   0.005550  -2.439 0.015079 *  
## Hillshading               0.152819   0.053655   2.848 0.004580 ** 
## Light_insolation          0.001329   0.001178   1.127 0.260081    
## Mid_Slope_Positon         0.928823   0.268625   3.458 0.000592 ***
## MRVBF                     0.324041   0.084942   3.815 0.000154 ***
## NDVI                      4.982413   0.887322   5.615 3.28e-08 ***
## TWI                       0.085150   0.045976   1.852 0.064615 .  
## Slope                    -0.102262   0.062391  -1.639 0.101838    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.178 on 494 degrees of freedom
## Multiple R-squared:  0.2501, Adjusted R-squared:  0.2334 
## F-statistic: 14.97 on 11 and 494 DF,  p-value: < 2.2e-16

Model fitting (step-wise covariate selection)

From the summary output above, it seems a few of the covariates are significant in describing the spatial variation of the target variable. To derive a parsimonious model we could perform a step-wise regression using the step function. With this function we can also specify what direction we want step wise algorithm to proceed.

hv.MLR.Step <- step(hv.MLR.Full, trace = 0, direction = "both")
summary(hv.MLR.Step)

## 
## Call:
## lm(formula = pH60_100cm ~ AACN + Landsat_Band1 + Elevation + 
##     Hillshading + Mid_Slope_Positon + MRVBF + NDVI + TWI, data = DSM_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1202 -0.8055 -0.1286  0.7443  3.4407 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        7.449369   0.930288   8.008 8.36e-15 ***
## AACN               0.037413   0.006986   5.356 1.31e-07 ***
## Landsat_Band1     -0.037795   0.009134  -4.138 4.12e-05 ***
## Elevation         -0.012042   0.005299  -2.273 0.023481 *  
## Hillshading        0.089275   0.018576   4.806 2.04e-06 ***
## Mid_Slope_Positon  0.982066   0.263538   3.726 0.000216 ***
## MRVBF              0.307179   0.083361   3.685 0.000254 ***
## NDVI               5.111642   0.882036   5.795 1.21e-08 ***
## TWI                0.092169   0.045241   2.037 0.042149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.179 on 497 degrees of freedom
## Multiple R-squared:  0.245,  Adjusted R-squared:  0.2329 
## F-statistic: 20.16 on 8 and 497 DF,  p-value: < 2.2e-16

Comparing the outputs of both the full and step-wise MLR models, there is very little difference in the model diagnostics such as the R². Both models explain about 25% of variation of the target variable.

The ‘full’ model is more complex as it has more parameters than the ‘step’ model. If we apply Occam’s Razor philosophy, the ‘step’ model is preferable.

Cross-validation

As described earlier in the cross-validation page, it is more acceptable to test the performance of a model based upon an external or out-of-bag model evaluation.

Lets fit a new model using the covariates selected in the step wise regression to a random subset of the available data. We will sample 70% of the available rows for the model calibration data set.

set.seed(123)
training <- sample(nrow(DSM_data), 0.7 * nrow(DSM_data))
hv.MLR.rh <- lm(pH60_100cm ~ AACN + Landsat_Band1 + Elevation + Hillshading + Mid_Slope_Positon +
MRVBF + NDVI + TWI, data = DSM_data[training, ])
# calibration predictions
hv.pred.rhC <- predict(hv.MLR.rh, DSM_data[training, ])

# apply predictions to out-of-bag cases
hv.pred.rhV <- predict(hv.MLR.rh, DSM_data[-training, ])

Now we can evaluate the test statistics of the calibration model using the ithir::goof function.

# calibration
goof(observed = DSM_data$pH60_100cm[training], predicted = hv.pred.rhC)

##          R2 concordance      MSE     RMSE          bias
## 1 0.2380964   0.3835304 1.336101 1.155898 -7.105427e-15

# validation
goof(observed = DSM_data$pH60_100cm[-training], predicted = hv.pred.rhV)

##          R2 concordance      MSE     RMSE       bias
## 1 0.2419371   0.3791514 1.445582 1.202323 0.06681039

In this situation the calibration model does not appear to be over fitting because the evaluations for the out-of-bag cases are similar to those of the calibration data. While this is a good result, the prediction model performs only moderately well by the fact there is a noticeable deviation between observations and corresponding model predictions. Examining other candidate models is a way to try to improve upon these results.

Applying the model spatially

There are a few ways to go about applying a model spatially, but here we will just use the terra::predict function.

In order to eliminate any errors it is necessary to have all the predictive covariates arranged as a stacked SpatRaster object. This is useful as it also ensures all the rasters are of the same extent and resolution. Here we can use the terra::predict function such as below using the hv.sub.rasters loaded earlier.

map.MLR.r1 <- terra::predict(object = hv.sub.rasters, model = hv.MLR.rh, filename = "soilpH_60_100_MLR.tif", datatype = "FLT4S", overwrite = TRUE)
# check working directory for presence of raster

The useful feature of the terra::predict function in this context is that we can write to file directly. This entails leveraging some of the parameters used within the writeRaster function that are worth noting. The parameter datatype is specified as FLT4S which indicates that a 4 byte, signed floating point values are to be written to file. Look at the function dataType to look at other alternatives, for example for categorical data where we may be interested in logical or integer values.

rconsole — MLR predicted soil pH 60-100cm across the Hunter Valley.

Other possible spatial predictions (prediction intervals)

The prediction function is quite versatile. For example we can also map the standard error of prediction or the confidence interval or the prediction interval even.

The script below is an example of creating maps of the 90% prediction intervals for the model. We need to explicitly create a function called in this case predfun which will direct the terra::predict function to output the predictions plus the upper and lower prediction limits. In the predict function we insert predfun for the fun parameter and control the output by changing the index value to either 1, 2, or 3 to request either the prediction, lower limit, upper limit respectively. Setting the level parameter to 0.90 indicates that we want to return the 90% prediction interval.

# prediction interval function
predfun <- function(model, data) {
v <- terra::predict(model, data, interval = "prediction", level = 0.9)}

# perform the predictions lower prediction limit
map.MLR.r.1ow <- terra::predict(object = hv.sub.rasters, model = hv.MLR.rh, filename = "soilPh_60_100_MLR_low.tif", fun = predfun, index = 2, overwrite = TRUE)

# mean prediction
map.MLR.r.pred <- predict(object = hv.sub.rasters, model = hv.MLR.rh, filename = "soilPh_60_100_MLR_pred.tif", fun = predfun, index = 1, overwrite = TRUE)

# upper prediction limit
map.MLR.r.up <- predict(object = hv.sub.rasters, model = hv.MLR.rh, filename = "soilPh_60_100_MLR_pred.tif", fun = predfun, index = 3, overwrite = TRUE)
# check working directory for presence of rasters

## plot map on the same scale for better comparison stack
stacked.preds <- c(map.MLR.r.1ow, map.MLR.r.pred, map.MLR.r.up)
stacked.preds

rmin <- min(terra::minmax(stacked.preds))
rmax <- max(terra::minmax(stacked.preds))

# plotting parameter
par(mfrow = c(3, 1))

plot(stacked.preds[[1]], col = viridis(10), breaks = seq(rmin, rmax, length.out = 10),
legend = T, bty = "n", box = FALSE, main = "MLR predicted soil pH (60-100cm) lower limit")

plot(stacked.preds[[2]], col = viridis(10), breaks = seq(rmin, rmax, length.out = 10),
legend = T, bty = "n", box = FALSE, main = "MLR predicted soil pH (60-100cm) mean prediction")

plot(stacked.preds[[3]], col = viridis(10), breaks = seq(rmin, rmax, length.out = 10),
legend = T, bty = "n", box = FALSE, main = "MLR predicted soil pH (60-100cm) upper limit")