Code is here
Exploratory data analysis
We will again use the
soil.data set to demonstrate calculation of
summary statistics. Just for recall, lets see what is in this data
library(ithir) data(USYD_soil1) soil.data <- USYD_soil1 names(soil.data) ##  "PROFILE" "Landclass" "Upper.Depth" "Lower.Depth" "clay" ##  "silt" "sand" "pH_CaCl2" "Total_Carbon" "EC" ##  "ESP" "ExchNa" "ExchK" "ExchCa" "ExchMg" ##  "CEC"
Here are some useful functions (and note the usage of the
argument) for calculation of means (
mean), medians (
standard deviations (
sd) and variances (
mean(soil.data$clay, na.rm = TRUE) ##  26.95302 median(soil.data$clay, na.rm = TRUE) ##  21 sd(soil.data$clay, na.rm = TRUE) ##  15.6996 var(soil.data$clay, na.rm = TRUE) ##  246.4775
R has a built-in function for summarizing vectors or data frames
summary. This function is a generic function—what it returns is
dependent on the type of data set to it. Applying the
to the first 6 columns in the
soil.data set results in the following
summary(soil.data[, 1:6]) ## PROFILE Landclass Upper.Depth Lower.Depth ## Min. : 1.00 Cropping :49 Min. :0.0000 Min. :0.0200 ## 1st Qu.: 8.00 Forest :50 1st Qu.:0.0200 1st Qu.:0.0500 ## Median :15.00 improved pasture:35 Median :0.0500 Median :0.1000 ## Mean :14.73 native pasture :32 Mean :0.1816 Mean :0.2464 ## 3rd Qu.:22.00 3rd Qu.:0.2000 3rd Qu.:0.3000 ## Max. :29.00 Max. :0.7000 Max. :0.8000 ## ## clay silt ## Min. : 5.00 Min. : 6.0 ## 1st Qu.:15.00 1st Qu.:11.0 ## Median :21.00 Median :15.0 ## Mean :26.95 Mean :16.5 ## 3rd Qu.:37.00 3rd Qu.:20.0 ## Max. :68.00 Max. :32.0 ## NA's :17 NA's :1
Notice the difference between numerical and categorical variables. The
summary function should probably be your first stop after organizing
your data, and before analyzing it as it provides an easy way to check
for wildly erroneous values.
Histograms and box plots
Box plots and histograms are simple but useful ways of summarizing data.
You can generate a histogram in
R using the function
The histogram above can be made to look nicer, by applying some of the
plotting parameters or arguments that were covered in Part 4 about the
plot function. There are also some additional plotting arguments
that can be sourced in the
hist help file. One of these is the ability
to specify the number or location of breaks in the histogram.
Box plots are also a useful way to summarize data. We can use it simply,
for example, summarize the clay content in the
By default, the heavy line shows the median, the box shows the 25t**h and 75t**h percentiles, the whiskers show the extreme values, and points show outliers beyond these.
Another approach is to plot a single variable by some factor. Here we
boxplot(Total_Carbon ~ Landclass, data = soil.data)
Note the use of the tilde symbol
∼ in the above command. The code
Landclass is analogous to a model formula in this
case, and simply indicates that
Total_Carbon is described by
Landclass and should be split up based on the category of this
variable. We will see more of this character with the specification of
soil spatial prediction functions later on.
Normal quantile and cumulative probability plots
One way to assess the normality of the distribution of a given variable is with a quantile-quantile plot. This plot shows data values vs. quantiles based on a normal distribution.
qqnorm(soil.data$Total_Carbon, plot.it = TRUE, pch = 4, cex = 0.7) qqline(soil.data$Total_Carbon, col = "red", lwd = 2)
There definitely seems to be some deviation from normality here. This is not unusual for soil carbon information. It is common (in order to proceed with statistical modelling) to perform a transformation of sorts in order to get these data to conform to a normal distribution. Often a log transformation will be sutiable.
qqnorm(log(soil.data$Total_Carbon), plot.it = TRUE, pch = 4, cex = 0.7) qqline(log(soil.data$Total_Carbon), col = "red", lwd = 2)
Finally, another useful data exploratory tool is quantile calculations.
R will return the quantiles of a given data set with the
function. Note that there are nine different algorithms available for
doing this. You can find descriptions in the help file for
By default, the quartiles of a data distribution are given when using
this function, but any quantiles can be estimated using the
quantile(soil.data$Total_Carbon, na.rm = TRUE) ## 0% 25% 50% 75% 100% ## 0.09 0.39 1.05 1.60 12.74 quantile(soil.data$Total_Carbon, na.rm = TRUE, probs = seq(0, 1, 0.05)) ## 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% ## 0.090 0.170 0.230 0.270 0.328 0.390 0.502 0.604 0.730 0.866 1.050 ## 55% 60% 65% 70% 75% 80% 85% 90% 95% 100% ## 1.150 1.268 1.448 1.548 1.600 1.762 2.026 2.928 4.494 12.740 quantile(soil.data$Total_Carbon, na.rm = TRUE, probs = seq(0.9, 1, 0.01)) ## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% ## 2.9280 3.3208 3.9128 4.0152 4.2388 4.4940 5.5920 6.4488 7.0536 9.8344 ## 100% ## 12.7400
soil.dataset firstly determine the summary statistics for each of the numerical or quantitative variables. You want to calculate things like maximum, minimum, mean, median, standard deviation, and variance. There are a couple of ways to do this. However, put all the results into a data frame and export as a text file.
Generate histograms and QQ plots for each of the quantitative variables. Do any need some sort of transformation so that their distribution is normal. If so, do the transformation and perform the plots again.