Merge branch 'main' of https://github.com/e-sensing/sitsbook

e-sensing · Mar 14, 2024 · fdee48e
2 parents 3c54340 + 96daa9e
commit fdee48e
Showing 8 changed files with 30 additions and 30 deletions.
diff --git a/03-intro.Rmd b/03-intro.Rmd
@@ -78,7 +78,7 @@ sits_api <- data.frame(
                "Accuracy assessment"))
 
 kableExtra::kbl(sits_api,
-                caption = "The sits API workflow for land classification",
+                caption = "The sits API workflow for land classification.",
                 booktabs = TRUE) |>
     kableExtra::kable_styling(position = "center",
                               font_size = 14,
@@ -159,9 +159,9 @@ samples_matogrosso_mod13q1[1:2,]
 
 The time series tibble contains data and metadata. The first six columns contain the metadata: spatial and temporal information, the label assigned to the sample, and the data cube from where the data has been extracted. The `time_series` column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with a column with the dates and the other columns with the values for each spectral band. For more details on handling time series data, please see the Chapter [Working with time series](https://e-sensing.github.io/sitsbook/working-with-time-series.html).
 
-It is helpful to plot the dispersion of the time series. In what follows, for brevity, we will filter only one label (Forest) and select one index (NDVI). Note that for filtering the label we use a function from `dplyr` package, while for selecting the index we use `sits_select()`.  The resulting plot shows all the time series associated with the label Forest and index NDVI, highlighting the median and the first and third quartiles.
+It is helpful to plot the dispersion of the time series. In what follows, for brevity, we will filter only one label (Forest) and select one index (NDVI). Note that for filtering the label we use a function from `dplyr` package, while for selecting the index we use `sits_select()`. Figure \ref{fig:timeseriesforest}, the resulting plot, shows all the time series associated with the label Forest and index NDVI, highlighting the median and the first and third quartiles.
 
-```{r, out.width = "80%", tidy="styler", fig.align = 'center', fig.cap="Joint plot of all samples in band NDVI for label Forest (Source: Authors).", strip.white = FALSE}
+```{r, timeseriesforest, out.width = "80%", tidy="styler", fig.align = 'center', fig.cap="Joint plot of all samples in band NDVI for label Forest (Source: Authors).", strip.white = FALSE}
 samples_forest <- dplyr::filter(
     samples_matogrosso_mod13q1, 
     label == "Forest"
@@ -241,7 +241,7 @@ sinop_map <- sits_label_classification(
 plot(sinop_map, title = "Sinop Classification Map")
 ```
 
-When plotting the classified map, users can control the map display by setting various options associated to `tmap_options`. These options include: (a) `scale` (default = 0.5); (b) `graticules_labels_size` (default: 0.7); (c) `legend_title_size` (default: 1.0); (d) `legend_text_size`  (default: 1.0); (e) `legend_width` (default: 0.5); (f) `legend_height` (default: 0.7);  (g) `legend_position` (default: c("left", "bottom"). The `scale` parameter affect all others. Users should first try to adjust it before fine-tuning the other options.
+When plotting the classified map, users can control the map display by setting various options associated to `tmap_options`. These options include: (a) `scale` (default: 0.5); (b) `graticules_labels_size` (default: 0.7); (c) `legend_title_size` (default: 1.0); (d) `legend_text_size`  (default: 1.0); (e) `legend_width` (default: 0.5); (f) `legend_height` (default: 0.7);  (g) `legend_position` (default: `c("left", "bottom")`). The `scale` parameter affect all others. Users should first try to adjust it before fine-tuning the other options.
 
 The resulting classification files can be read by QGIS. Links to the associated files are available in the `sinop_map` object in the nested table `file_info`.
 
@@ -266,6 +266,6 @@ In this chapter, we used `plot()` to produce a graphical display of data cubes,
 sits_view(sinop, band = "NDVI", class_cube = sinop_map)
 ```
 
-```{r, echo = FALSE, out.width="90%", fig.caption = "Leaflet visualization of classification of Sinop, MT, Brasil", fig.align="center"}
+```{r, echo = FALSE, out.width="90%", fig.caption = "Leaflet visualization of classification of Sinop, MT, Brasil (Source: Authors).", fig.align="center"}
 knitr::include_graphics("images/view_sinop.png")
 ```
diff --git a/04-datacubes.Rmd b/04-datacubes.Rmd
@@ -51,7 +51,7 @@ Machine learning and deep learning (ML/DL) classification algorithms require the
 3. The temporal dimension is a set of continuous and equally-spaced intervals. 
 4. For every combination of dimensions, a cell has a single value.
 
-All cells of a data cube have the same spatiotemporal extent. The spatial resolution of each cell is the same in X and Y dimensions. All temporal intervals are the same. Each cell contains a valid set of measures. Each pixel is associated to a unique coordinate in a zone of the coordinate system.   For each position in space, the data cube should provide a set of valid time series. For each time interval, the regular data cube should provide a valid 2D image (see Figure \@ref(fig:dc). 
+All cells of a data cube have the same spatiotemporal extent. The spatial resolution of each cell is the same in X and Y dimensions. All temporal intervals are the same. Each cell contains a valid set of measures. Each pixel is associated to a unique coordinate in a zone of the coordinate system.   For each position in space, the data cube should provide a set of valid time series. For each time interval, the regular data cube should provide a valid 2D image (see Figure \@ref(fig:dc)). 
 
 ```{r dc, echo = FALSE, out.width="100%", fig.align="center", fig.cap="Conceptual view of data cubes (Source: Authors)."}
 knitr::include_graphics("images/datacube_conception.png")
@@ -83,7 +83,7 @@ The result of `sits_cube()` is a tibble with a description of the selected image
 
 Amazon Web Services (AWS) holds two kinds of collections: *open-data* and *requester-pays*. Open data collections can be accessed without cost. Requester-pays collections require payment from an AWS account. Currently, `sits` supports collection `SENTINEL-2-L2A` which is open data.  The bands in 10 m resolution are B02, B03, B04, and B08. The  20 m bands are B05, B06, B07, B8A, B11, and B12. Bands B01 and B09 are available at 60 m resolution. A CLOUD band is also available. The example below shows how to access one tile of the open data `SENTINEL-2-L2A` collection.  The `tiles` parameter allows selecting the desired area according to the MGRS reference system. 
 
-```{r, tidy="styler", out.width="100%", fig.align="center", fig.cap= "Sentinel-2 image in an area of the Northeastern coast of Brazil."}
+```{r, tidy="styler", out.width="100%", fig.align="center", fig.cap= "Sentinel-2 image in an area of the Northeastern coast of Brazil (Source: Authors)."}
 # Create a data cube covering an area in Brazil
 s2_23MMU_cube <- sits_cube(
     source = "AWS",
@@ -147,7 +147,7 @@ plot(s2_L8_cube_MPC, red = "RED", green = "GREEN", blue = "BLUE",
 ## Assessing Digital Earth Africa{-}
 
 Digital Earth Africa (DEAFRICA) is a cloud service that provides open-access Earth observation data for the African continent. The ARD image collections in `sits` are `S2_L2A` (Sentinel-2 level 2A) and `LS8_SR` (Landsat-8). Since the STAC interface for DEAFRICA does not implement the concept of tiles, users need to specify their area of interest using the `roi` parameter. The requested `roi` produces a cube that contains one MGRS tiles ("35LPH") covering an area of Madagascar that includes the Betsiboka Estuary.  
-```{r, tidy="styler", out.width="100%", fig.align="center", fig.cap="Sentinel-2 image in an area over Madagascar."}
+```{r, tidy="styler", out.width="100%", fig.align="center", fig.cap="Sentinel-2 image in an area over Madagascar (Source: Authors)."}
 dea_s2_cube <- sits_cube(
     source = "DEAFRICA",
     collection = "S2_L2A",
@@ -178,11 +178,11 @@ The BDC uses three hierarchical grids based on the Albers Equal Area projection
 knitr::include_graphics("images/bdc_grid.png")
 ```
 
-To access the BDC, users must provide their credentials using environment variables, as shown below. Obtaining a BDC access key is free. Users must register at the [BDC site](https://brazildatacube.dpi.inpe.br/portal/explore) to obtain the key.
+To access the BDC, users must provide their credentials using environment variables, as shown below. Obtaining a BDC access key is free. Users must register at the [BDC site](https://brazildatacube.dpi.inpe.br/portal/explore) to obtain a key.
 
 In the example below, the data cube is defined as one tile ("005004") of `CBERS-WFI-16D` collection, which holds CBERS AWFI images at 16 days resolution.
 
-```{r, tidy="styler", out.width="100%", fig.align="center", fig.cap="CBERS-4 WFI image in a Cerrado area in Brazil."}
+```{r, tidy="styler", out.width="100%", fig.align="center", fig.cap="CBERS-4 WFI image in a Cerrado area in Brazil (Source: Authors)."}
 # Define a tile from the CBERS-4/4A AWFI collection
 cbers_tile <- sits_cube(
     source = "BDC",
@@ -203,7 +203,7 @@ plot(cbers_tile,
 
 ## Accessing Harmonized Landsat-Sentinel collections {-}
 
-Harmonized Landsat Sentinel (HLS) is a NASA initiative that processes and harmonizes Landsat 8 and Sentinel-2 imagery to a common standard, including atmospheric correction, alignment, resampling, and corrections for BRDF (bidirectional reflectance distribution function). The purpose of the HLS project is to create a unified and consistent dataset that integrates the advantages of both systems, making it easier for users to work with the data.
+Harmonized Landsat Sentinel (HLS) is a NASA initiative that processes and harmonizes Landsat 8 and Sentinel-2 imagery to a common standard, including atmospheric correction, alignment, resampling, and corrections for BRDF (bidirectional reflectance distribution function). The purpose of the HLS project is to create a unified and consistent dataset that integrates the advantages of both systems, making it easier to work with the data.
 
 The NASA Harmonized Landsat and Sentinel (HLS) service provides two image collections:
 
@@ -242,7 +242,7 @@ hls_cube_s2 <- sits_cube(
 plot(hls_cube_s2, red = "RED", green = "GREEN", blue = "BLUE", date = "2020-06-20")
 ```
 
-```{r, echo = FALSE, eval = TRUE, out.width="100%", fig.align="center", fig.cap="Plot of Sentinel-2 image obtained from the NASA HLS collection for date 2020-06-15 showing the island of Ilhabela in the Brazilian coast."}
+```{r, echo = FALSE, eval = TRUE, out.width="100%", fig.align="center", fig.cap="Plot of Sentinel-2 image obtained from the NASA HLS collection for date 2020-06-15 showing the island of Ilhabela in the Brazilian coast  (Source: Authors)."}
 
 knitr::include_graphics("images/hls_ilhabela_s2.png") 
 ```
@@ -299,7 +299,7 @@ plot(hls_cube_merged,
      date = "2020-07-11")
 ```
 
-```{r, echo = FALSE, eval = TRUE, out.width="100%", fig.align="center", fig.cap="Plot of Sentinel-2 image obtained from the NASA HLS collection for date 2020-06-15 showing the island of Ilhabela in the Brazilian coast."}
+```{r, echo = FALSE, eval = TRUE, out.width="100%", fig.align="center", fig.cap="Plot of Sentinel-2 image obtained from merging NASA HLS collection and Sentinel-2 collection for date 2020-06-15 showing the island of Ilhabela in the Brazilian coast (Source: Authors)."}
 knitr::include_graphics("images/hls_ilhabela_l8.png") 
 ```
 

diff --git a/06-timeseries.Rmd b/06-timeseries.Rmd
@@ -18,7 +18,7 @@ data("samples_matogrosso_mod13q1")
 samples_matogrosso_mod13q1[1:4,]
 ```
 
-The time series tibble contains data and metadata. The first six columns contain spatial and temporal information, the label assigned to the sample, and the data cube from where the data has been extracted. The first sample has been labeled Pasture at location ($-58.5631$, $-13.8844$), being valid for the period (2006-09-14, 2007-08-29). Informing the dates where the label is valid is crucial for correct classification. In this case, the researchers labeling the samples used the agricultural calendar in Brazil. The relevant dates for other applications and other countries will likely differ from those used in the example. The `time_series` column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with a column with the dates and the other columns with the values for each spectral band. 
+The time series tibble contains data and metadata. The first six columns contain spatial and temporal information, the label assigned to the sample, and the data cube from where the data has been extracted. The first sample has been labeled Pasture at location (-58.5631, -13.8844), being valid for the period (2006-09-14, 2007-08-29). Informing the dates where the label is valid is crucial for correct classification. In this case, the researchers labeling the samples used the agricultural calendar in Brazil. The relevant dates for other applications and other countries will likely differ from those used in the example. The `time_series` column contains the time series data for each spatiotemporal location. This data is also organized as a tibble, with a column with the dates and the other columns with the values for each spectral band. 
 
 ## Utilities for handling time series{-}
 
@@ -89,7 +89,7 @@ $$
 
 The function `sits_patterns()` uses a GAM to predict an idealized approximation to the time series associated with each class for all bands. The resulting patterns can be viewed using `plot()`.
 
-```{r, tidy="styler", out.width = "100%", fig.align="center", fig.cap="Patterns for the samples for Mato Grosso."}
+```{r, tidy="styler", out.width = "100%", fig.align="center", fig.cap="Patterns for the samples for Mato Grosso (Source: Authors)."}
 # Estimate the patterns for each class and plot them
 samples_matogrosso_mod13q1 |>  
     sits_patterns() |> 

diff --git a/07-clustering.Rmd b/07-clustering.Rmd
@@ -11,7 +11,7 @@ Selecting good training samples for machine learning classification of satellite
 
 It is necessary to distinguish between wrongly labeled samples and differences resulting from the natural variability of class signatures. When training data belongs to a large geographic region, the variability of vegetation phenology leads to different patterns being assigned to the same label. A related issue is the limitation of crisp boundaries to describe the natural world. Class definitions use idealized descriptions (e.g., "a savanna woodland has tree cover of 50% to 90% ranging from 8 to 15 m in height"). In practice, the boundaries between classes are fuzzy and sometimes overlap, making it hard to distinguish between them. To improve sample quality, `sits` provides methods for evaluating the training data.
 
-Given a set of training samples, experts should first perform a cross-validation of the training set, to be able to assess their inherent prediction error. The results indicate whether the data is internally consistent. Since cross-validation is not a predictor of actual model performance, this chapter provides additional tools for improving the quality of training sets. More detailed information is available on Chapter ["Validation and Accuracy Assessment"](https://e-sensing.github.io/sitsbook/validation-and-accuracy-measurements.html).
+Given a set of training samples, experts should first perform a cross-validation of the training set, to be able to assess their inherent prediction error. The results indicate whether the data is internally consistent. Since cross-validation is not a predictor of actual model performance, this chapter provides additional tools for improving the quality of training sets. More detailed information is available on Chapter [Validation and accuracy measurements](https://e-sensing.github.io/sitsbook/validation-and-accuracy-measurements.html).
 
 ## Datasets used in this chapter{-}
 
@@ -202,7 +202,7 @@ plot(new_cluster_mixture)
 
 As expected, the new confusion map shows a significant improvement over the previous one. This result should be interpreted carefully since it may be due to different effects. The most direct interpretation is that Millet_Cotton and Silviculture cannot be easily separated from the other classes, given the current attributes (a time series of NDVI and EVI indices from MODIS images). In such situations, users should consider improving the number of samples from the less represented classes, including more MODIS bands, or working with higher resolution satellites. The results of the SOM method should be interpreted based on the users' understanding of the ecosystems and agricultural practices of the study region. 
 
-A further comparison between the original and clean samples is to run a 5-fold validation on the original and the cleaned sample sets using `sits_kfold_validate()` and a random forest model. The SOM procedure improves the validation results from 95% on the original dataset to 99% in the cleaned one. This improvement should not be interpreted as providing a better fit for the final map accuracy. A 5-fold validation procedure only measures how well the machine learning model fits the samples; it is not an accuracy assessment of classification results. The result only indicates that the training set after the SOM sample removal procedure is more internally consistent than the original one. For more details on accuracy measures, please see Chapter [Validation and accuracy measures](https://e-sensing.github.io/sitsbook/validation-and-accuracy-measurements.html).
+A further comparison between the original and clean samples is to run a 5-fold validation on the original and the cleaned sample sets using `sits_kfold_validate()` and a random forest model. The SOM procedure improves the validation results from 95% on the original dataset to 99% in the cleaned one. This improvement should not be interpreted as providing a better fit for the final map accuracy. A 5-fold validation procedure only measures how well the machine learning model fits the samples; it is not an accuracy assessment of classification results. The result only indicates that the training set after the SOM sample removal procedure is more internally consistent than the original one. For more details on accuracy measures, please see Chapter [Validation and accuracy measurements](https://e-sensing.github.io/sitsbook/validation-and-accuracy-measurements.html).
 
 ```{r, tidy = "styler", message = FALSE, warning = FALSE}
 # Run a k-fold validation