---
title: "Mapping the area of applicability (AOA) of prediction models"
subtitle: "Questions to OpenGeoHub Summer school 2021 participants"
author: "Hanna Meyer"
date: "08/18/2021"
output:
  html_document:
    toc: true
    toc_float: true
    toc_depth: 3
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
This exercise should help you get a better understanding of how the AOA estimation works, why it is important, and what you might want to consider in order to develop models that are applicable for their respective purpose.
# Prediction task
The task is to perform a supervised land cover classification for Münster in Germany.
The challenge is: there are no reference data available... at least none for training; you'll get some later for validation ;-)
Instead, reference data from a few regions across Germany are provided for model training (thanks to my Master Landscape Ecology group at WWU!). This is a realistic scenario given the large variety of prediction maps, e.g. in the context of ecology, where predictions are usually made far beyond the training samples.
```{r start, message=FALSE, warning=FALSE,echo=FALSE}
#First, load a number of packages we'll need.
library(raster)
library(caret)
library(mapview)
library(sf)
#library(devtools)
#install_github("HannaMeyer/CAST")
library(CAST)
#additional required packages:
library(tmap)
library(latticeExtra)
library(doParallel)
library(parallel)
library(Orcs)
```
## Data
To start with, let's load and explore the data...
### Raster data (predictor variables)
As predictor variables for spatial mapping of the area of Münster, you have a Sentinel-2 image available together with some derived artificial channels (NDVI, standard deviation of NDVI in a 5x5 pixel environment, latitude, longitude).
```{r load, message=FALSE,warning=FALSE}
sen_ms <- stack("data/Sen_Muenster_sub.grd")
```
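The derived channels are already contained in the provided stack, but as a reference, here is a minimal sketch of how NDVI and its focal standard deviation could be computed from the raw bands. The band names `B08` and `B04` are assumptions (check `names(sen_ms)` for the actual layer names), so the chunk is not evaluated.
```{r derive_channels_sketch, eval=FALSE}
# Sketch (not run): derive NDVI from red (B04) and near-infrared (B08) reflectance
ndvi <- (sen_ms[["B08"]] - sen_ms[["B04"]]) / (sen_ms[["B08"]] + sen_ms[["B04"]])
# standard deviation of NDVI in a 5x5 pixel moving window
ndvi_sd <- focal(ndvi, w = matrix(1, 5, 5), fun = sd, na.rm = TRUE)
```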
```{r rgbplots,echo=FALSE}
rgbplot_ms <- spplot(sen_ms[[1]], col.regions="transparent",sp.layout =rgb2spLayout(sen_ms[[3:1]], quantiles = c(0.02, 0.98), alpha = 1))
```
Let's plot the RasterStack to get an idea of what the variables look like.
```{r visPredictors}
plot(sen_ms)
plotRGB(sen_ms,stretch="lin",r=3,g=2,b=1)
```
### Reference data
As reference data we have digitized and labeled training polygons for some spatially distinct regions across Germany. Sentinel-2 data have been extracted for the training polygons, and each pixel covered by a polygon is used as a potential training data point.
We start with splitting the data into validation (Münster) and training (other regions) data.
```{r loadPoly}
trainSites <- readRDS("data/data_combined_ll.RDS")
trainDat <- trainSites[trainSites$Region!="Muenster",]
validationDat <- trainSites[trainSites$Region=="Muenster",]
head(trainSites)
#see unique regions in train set:
unique(trainDat$Region)
```
### Predictors and response
In order to speed things up, for this example we will reduce the data. Therefore, from each training polygon only 15% of the pixels will be used for model training.
```{r subset}
set.seed(100)
trainids <- createDataPartition(trainDat$ID,list=FALSE,p=0.15)
trainDat <- trainDat[trainids,]
trainDat <- trainDat[complete.cases(trainDat),]
```
For model training we need to define the predictor and response variables. To start with, as predictors we simply use all information from the raster stack without further consideration. As response variable we use the "Label" column of the data frame.
```{r vars}
predictors <- names(sen_ms)
response <- "Label"
```
## Model training and validation
We then train a random forest algorithm to learn relationships between the predictors and the land cover.
As a first, naive approach we train the model in the default way, using a random cross-validation.
```{r train_basic, warning=FALSE, message=FALSE}
# train the model
ctrl_default <- trainControl(method="cv", number = 3, savePredictions = TRUE)
set.seed(100)
model <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_default,
importance=TRUE,
ntree=50)
model
```
The performance looks great: apparently a nearly perfect prediction model (keep in mind that Kappa can take values up to 1 = perfect fit).
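If you want to see where the (seemingly few) errors occur, caret can also aggregate the held-out predictions into a cross-validated confusion matrix; a minimal sketch, relying on the `savePredictions = TRUE` option set above:
```{r cv_confusion}
# confusion matrix aggregated over the held-out folds of the random CV
confusionMatrix(model)
```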
## Model prediction
To do the classification we can then use the trained model and apply it to each pixel of the raster stack using the predict function.
```{r predict, message=FALSE, warning=FALSE}
prediction <- predict(sen_ms,model)
```
```{r vispredict, message=FALSE, warning=FALSE, echo=FALSE}
prediction <- predict(sen_ms,model)
cols <- rev(c("blue","red","darkorange","lightgreen","brown","green", "grey","green2","forestgreen","darkgreen"))
tm_shape(deratify(prediction)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
```
However, quite obviously, the result doesn't look very good compared to the RGB composite. Most strikingly, the city of Münster is mainly predicted as "open Soil". How is this possible, keeping in mind that the cross-validation results indicated a nearly perfect model?
### Can the model predict beyond training regions?
The reason is that we apply the model to make predictions for a new region, but the random CV doesn't help to estimate the performance for Münster (or any other region). This is because the data are heavily clustered in space (i.e. several data points come from the same training polygon; see the explanation in the recordings of the previous summer schools), hence random CV only answers the question of how well the model can make predictions for the same clusters as used during training.
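A quick way to see this clustering in the data is to count the training points per region; a minimal sketch using the `Region` column we already used for splitting:
```{r cluster_check}
# number of training pixels contributed by each region
table(trainDat$Region)
```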
To get a more suitable performance estimate, we re-train the model with spatial CV. Since we want to apply the model to a new region within Germany, an obvious way of splitting the data is to split by region: In each iteration of cross-validation, one region is left out and used for validation instead of training.
```{r train_sp, warning=FALSE, message=FALSE, results='hide'}
# train the model
indices <- CreateSpacetimeFolds(trainDat,spacevar = "Region",k=length(unique(trainDat$Region)))
ctrl_sp <- trainControl(method="cv", index = indices$index, savePredictions = TRUE)
set.seed(100)
model_sp <- train(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_sp,
importance=TRUE,
ntree=50)
```
Let's compare the performance estimates.
```{r compare_performance, warning=FALSE, message=FALSE, results='hide'}
boxplot(model$resample$Kappa, model_sp$resample$Kappa,
names=c("random CV", "spatial CV"),ylab="Kappa Index")
```
Based on the spatial CV, we expect the performance for Münster to be considerably lower compared to the random CV (Kappa around 0.4 to 0.6).
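To see how much the estimate varies between the held-out regions, you can inspect the fold-wise results of the spatial CV directly (a minimal sketch; one row per left-out region):
```{r foldwise_sp}
# performance estimates per fold (i.e. per left-out region)
model_sp$resample
```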
### Validation with external validation data of Münster
We see large differences between random and spatial CV. Recently, arguments have been raised against spatial CV for estimating map accuracy (https://www.sciencedirect.com/science/article/abs/pii/S0304380021002489?via%3Dihub). However, as outlined above, a spatial CV is the only meaningful way to estimate the performance for clustered data if no probability sample is available. Let's test whether the spatial CV estimates are reasonable: we now validate the predictions with the external validation data for Münster.
```{r definition, echo = FALSE}
validation <- function(x,y=validationDat){
  # use the union of predicted and observed classes as common factor levels
  lev <- union(unique(as.character(x)), unique(as.character(y$Label)))
  confusionMatrix(factor(x, levels = lev),
                  factor(y$Label, levels = lev))$overall[1:2]
}
```
```{r validation, warning=FALSE, message=FALSE}
pred_muenster_valid <- predict(model_sp,validationDat)
head(pred_muenster_valid)
head(validationDat$Label)
validation(pred_muenster_valid)
```
It seems that the external validation is well in line with the spatial CV estimate (and, as expected, not at all in line with the random CV estimate).
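As a quick numerical check (sketch), we can put the mean leave-region-out Kappa next to the externally validated accuracy and Kappa for Münster:
```{r compare_estimates}
# mean Kappa over the leave-region-out folds vs. external validation for Münster
mean(model_sp$resample$Kappa)
validation(pred_muenster_valid)
```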
## Area of Applicability
Technically it was no problem to make predictions for Münster. But let's assess if we really should have done this. A trained model should only be applied to locations that feature predictor properties that are comparable to those of the training data. Read https://besjournals.onlinelibrary.wiley.com/doi/10.1111/2041-210X.13650 to get familiar with the idea. If dissimilarity to the training data is larger than the dissimilarity within the training data, Meyer and Pebesma (2021) suggest that the model should not be applied to this location.
The calculation of the AOA is quite time consuming. To make it a bit faster, we use parallelization.
```{r aoa}
cl <- makeCluster(4)
registerDoParallel(cl)
AOA <- aoa(sen_ms,model,cl=cl)
plot(AOA)
```
We see that the AOA contains only values of 0, which means that no pixel falls inside the AOA. Now we can visualize the prediction for the AOA only.
```{r aoavis, echo=FALSE}
cols_aoa <- c("black","transparent")
if (sum(values(AOA$AOA))==0){
cols_aoa <- c("black")
}
predplot <- spplot(deratify(prediction),col.regions=cols, main = list(label="Prediction (left), prediction only for the AOA (right) and RGB composite (bottom)",cex=0.8))
predplotaoa <- spplot(deratify(prediction),col.regions=cols)+
spplot(AOA$AOA,col.regions=cols_aoa)
latticeCombineGrid(list(predplot,predplotaoa,rgbplot_ms),layout=c(2,2))
```
We see that no predictions are shown because the model is not considered applicable to any part of the area of Münster.
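To get a feeling for how far outside the AOA the Münster pixels are, we can inspect the underlying dissimilarity index (DI). This is a minimal sketch; depending on your CAST version the structure of the `aoa()` result may differ slightly, so check `str(AOA)` for the DI layer and the threshold derived from the training data.
```{r di_check}
# distribution of the dissimilarity index for the Münster pixels;
# pixels with a DI above the training-data-derived threshold fall outside the AOA
summary(values(AOA$DI))
hist(values(AOA$DI), main = "DI of the Münster pixels", xlab = "DI")
```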
### Percentage of AOA
As a starting point for further improvement, let's calculate the percentage of the AOA for Münster.
```{r aoa_performance,echo=FALSE}
print(paste0("Percentage of Münster that is within the AOA: ",
round(sum(values(AOA$AOA)==1)/ncell(AOA),2)*100
," %"))
```
# Problem to be solved
Obviously, the model was not applicable to any part of Münster. The prediction patterns confirm that the model cannot make any reliable predictions here. Now it's up to you to solve this.
# Sample solution
To find explanations for the patterns, let's look at the variable importance within the model.
```{r varimp}
plot(varImp(model))
```
It's interesting that latitude and longitude are so important here. However, we would not expect these to be meaningful predictors for predicting beyond the training regions (see discussions at previous summer schools). If this is the case, the solution is quite easy: removing these variables should improve the spatial model performance and should also increase the AOA, because it leads to a model that is less overfitted to the training data.
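A quick manual check would be to simply drop the coordinate layers and re-train with the spatial CV. This is only a sketch and not evaluated: the layer names `lat` and `lon` are assumptions, so check `names(sen_ms)` for the actual names.
```{r drop_coords_sketch, eval=FALSE}
# Sketch (not run): remove the coordinate layers from the predictor set
predictors_nocoords <- predictors[!predictors %in% c("lat", "lon")]
set.seed(100)
model_nocoords <- train(trainDat[, predictors_nocoords],
                        trainDat[, response],
                        method = "rf",
                        metric = "Kappa",
                        trControl = ctrl_sp,
                        ntree = 50)
```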
For a more systematic approach, let's do a spatial variable selection that selects the combination of predictor variables leading to the best leave-region-out model performance. In this way we expect to train a model that is as generalized as possible for application to a new region.
```{r train_ffs, warning=FALSE, message=FALSE, results='hide'}
#ffs
set.seed(100)
model_ffs <- ffs(trainDat[,predictors],
trainDat[,response],
method="rf",
metric="Kappa",
trControl=ctrl_sp,
importance=TRUE,
ntree=50)
plot_ffs(model_ffs)
plot_ffs(model_ffs,plotType = "selected")
```
Let's also have a look at the estimated spatial cross-validation performance in comparison to the original model.
```{r cv_ffs, warning=FALSE, message=FALSE, results='hide'}
boxplot(model_sp$resample$Kappa,model_ffs$resample$Kappa,
names=c("original model", "after variable selection"),ylab="Kappa Index")
```
Next, we make predictions for Münster.
```{r vispredict_ffs, message=FALSE, warning=FALSE, echo=FALSE}
prediction_ffs <- predict(sen_ms,model_ffs)
cols <- rev(c("blue","red","darkorange","lightgreen","brown","green", "grey","green2","forestgreen","darkgreen"))
tm_shape(deratify(prediction_ffs)) +
tm_raster(palette = cols,title = "LUC")+
tm_scale_bar(bg.color="white",bg.alpha=0.75)+
tm_layout(legend.bg.color = "white",
legend.bg.alpha = 0.75)
```
Let's then estimate the AOA of the new model.
```{r aoa_ffs, warning=FALSE, message=FALSE, results='hide'}
cl <- makeCluster(4)
registerDoParallel(cl)
AOA_ffs <- aoa(sen_ms,model_ffs,cl=cl)
plot(AOA_ffs)
```
Let's visualize the results:
```{r vis_ffs, warning=FALSE, message=FALSE, echo=FALSE}
predplot <- spplot(deratify(prediction_ffs),col.regions=cols, main = list(label="Prediction (left), prediction only for the AOA (right) and RGB composite (bottom)",cex=0.8))
cols_aoa <- c("black","transparent")
if (sum(values(AOA_ffs$AOA))==0){
cols_aoa <- c("black")
}
predplotaoa <- spplot(deratify(prediction_ffs),col.regions=cols)+
spplot(AOA_ffs$AOA,col.regions=cols_aoa)
latticeCombineGrid(list(predplot,predplotaoa,rgbplot_ms),layout=c(2,2))
```
... and validate the predictions and calculate percentage within the AOA.
```{r aoa_performance2,echo=FALSE}
pred_muenster_valid <- predict(model_ffs,validationDat)
validation(pred_muenster_valid)
print(paste0("Percentage of Münster that is within the AOA: ",round(sum(values(AOA_ffs$AOA)==1)/ncell(AOA),2)*100," %"))
```
Using the spatial variable selection we were able to develop a more generalizable model, which is reflected by (see the summary sketch below):
1. more meaningful prediction patterns
2. higher spatial CV performance
3. a larger AOA.
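As a final sketch, the effect of the variable selection can be summarized in one small table, reusing the objects computed above:
```{r summary_comparison}
# spatial CV Kappa and AOA coverage of Münster before and after variable selection
data.frame(
  model = c("all predictors", "after ffs"),
  Kappa_spatialCV = c(mean(model_sp$resample$Kappa), mean(model_ffs$resample$Kappa)),
  AOA_percent = c(round(sum(values(AOA$AOA) == 1) / ncell(AOA$AOA) * 100, 1),
                  round(sum(values(AOA_ffs$AOA) == 1) / ncell(AOA_ffs$AOA) * 100, 1))
)
```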