vignettes update

rvalavi · Apr 7, 2023 · 4cf9a89
1 parent 9e72a45
Showing 13 changed files with 242 additions and 163 deletions.
diff --git a/R/cv_spatial.R b/R/cv_spatial.R
@@ -17,8 +17,8 @@
 #' measurement. However, when the input map has a geographic coordinate system (in decimal degrees),
 #' the block size is calculated by dividing the \code{size} parameter by \code{deg_to_metre} (which
 #' defaults to 111325 meters, the standard distance of one degree of latitude on the Equator).
-#' This converts the unit to degrees, which varies along the latitude by a factor of the cosine of
-#' the latitude. So, a better value could be \code{cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325}.
+#' In reality, the length of one degree of longitude shrinks by a factor of the cosine of the
+#' latitude. So, an alternative sensible value could be \code{cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325}.
 #'
 #' The \code{offset} can be used to change the spatial position of the blocks. It can also be used to
 #' assess the sensitivity of analysis results to shifting in the blocking arrangements.
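
As a rough worked example of the latitude adjustment described in this hunk, the base R sketch below computes an adjusted `deg_to_metre` for a made-up bounding box. The `bbox` values and the 350 km `size` are assumptions for illustration only; with real data the bounding box would come from `sf::st_bbox(x)`.

```r
# Hypothetical bounding box in decimal degrees (xmin, ymin, xmax, ymax);
# with real data this would come from sf::st_bbox(x)
bbox <- c(xmin = 140, ymin = -40, xmax = 150, ymax = -30)

# Mean latitude of the extent (elements 2 and 4 are ymin and ymax)
mean_lat <- mean(bbox[c(2, 4)])   # -35 degrees

# Metres per degree, scaled by the cosine of the mean latitude
deg_to_metre <- cos(mean_lat * pi / 180) * 111325

# Block size in degrees for a 350 km block: default vs adjusted value
size <- 350000
size / 111325        # roughly 3.14 degrees
size / deg_to_metre  # roughly 3.84 degrees
```

The cosine factor describes how the east-west length of a degree shrinks away from the Equator, so the adjusted value yields blocks that span more degrees at higher latitudes.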

diff --git a/inst/doc/tutorial_1.R b/inst/doc/tutorial_1.R
@@ -52,7 +52,7 @@ sb1 <- cv_spatial(x = pa_data,
 ## ---- warning=FALSE, message=FALSE, fig.height=5, fig.width=7-----------------
 sb2 <- cv_spatial(x = pa_data,
                   column = "occ",
-                  r = rasters,
+                  r = rasters, # optionally add a raster layer
                   k = 5, 
                   size = 350000, 
                   hexagon = FALSE, # use square blocks
@@ -65,7 +65,7 @@ sb2 <- cv_spatial(x = pa_data,
 ## ----warning=FALSE, message=FALSE, fig.height=5, fig.width=7------------------
 # systematic fold assignment 
 # and also use row/column for creating blocks instead of size
-sb2 <- cv_spatial(x = pa_data,
+sb3 <- cv_spatial(x = pa_data,
                   column = "occ",
                   rows_cols = c(12, 10),
                   hexagon = FALSE,
@@ -128,7 +128,7 @@ cv_plot(cv = scv,
 ## ----warning=FALSE, message=FALSE, fig.height=5, fig.width=8------------------
 cv_plot(cv = bloo,
         x = pa_data,
-        num_plots = c(1, 50, 100)) 
+        num_plots = c(1, 50, 100)) # only show folds 1, 50 and 100
 
 
 ## ----warning=FALSE, message=FALSE, fig.height=5, fig.width=7------------------

diff --git a/inst/doc/tutorial_1.Rmd b/inst/doc/tutorial_1.Rmd
@@ -108,12 +108,14 @@ tm_shape(rasters[[1]]) +
 
 ## Block cross-validation strategies
 
+The `blockCV` package stores training and testing folds in three different formats. The format common to all blocking strategies is a list of the indices of observations in each fold. For `cv_spatial` and `cv_cluster` (but not `cv_buffer` and `cv_nndm`), the folds are also stored in a matrix format suitable for the `biomod2` package and as a vector of fold numbers, one for each observation. The length of this vector equals the number of observations in the spatial sample data (argument `x` in the functions). These three formats are stored in the cv objects as `folds_list`, `biomod_table` and `folds_ids`, respectively.
+
 ### Spatial blocks
-The function `cv_spatial` creates spatial blocks/polygons then assigns blocks to the training and testing folds with *random*, *checkerboard pattern* or a *systematic* way. Spatial blocks can be defined either by `size` or number of rows and columns.
+The function `cv_spatial` creates spatial blocks/polygons, then assigns blocks to the training and testing folds in a *random*, *checkerboard* or *systematic* way (set with the `selection` argument). When `selection = "random"`, the function tries to find evenly distributed records in training and testing folds. Spatial blocks can be defined either by `size` or by the number of rows and columns.
 
-Consistent with other functions, the distance (`size`) should be in **metres**, regardless of the unit of the reference system of the input data. When the input map has *geographic coordinate system* (decimal degrees), the block size is calculated based on dividing `size` by 111325 (the standard distance of a degree in metres, on the Equator) to change metre to degree. This value can be changed by the user via the `deg_to_metre` argument.
+Consistent with other functions, the distance (`size`) should be in **metres**, regardless of the unit of the reference system of the input data. When the input map has a *geographic coordinate system* (i.e. decimal degrees), the block size is calculated by dividing `size` by 111325 (the standard distance of a degree in metres, on the Equator) to convert metres to degrees. In reality, this value varies by a factor of the cosine of the latitude, so an alternative sensible value could be `cos(mean(sf::st_bbox(x)[c(2,4)]) * pi/180) * 111325`.
 
-The `offset` can be used to shift the spatial position of the blocks in horizontal and vertical axes, respectively. This only works when the block have been built based on `size`. The blocks argument allows users to define an external spatial polygon as blocking layer.
+The `offset` argument can be used to shift the spatial position of the blocks along the horizontal and vertical axes, respectively. This only works when the blocks have been built based on `size`. The `extend` option allows users to enlarge the blocks to ensure all points fall inside them (most effective when `rows_cols` is used). The blocks argument allows users to define an external spatial polygon as the blocking layer.
 
 Here are some spatial block settings:
 
@@ -128,12 +130,14 @@ sb1 <- cv_spatial(x = pa_data,
 
 ```
 
-The same setting can be used to create square blocks (by `hexagon = FALSE`). You can optionally add a raster layer (using `r` argument) for to be used for creating blocks and be used in the background of the plot (raster can also be added later only for visualising blocks using `cv_plot`).
+The output object is an R S3 object and you can access its elements with `$`. Explore `sb1$folds_ids`, `sb1$folds_list`, and `sb1$biomod_table` for the three types of generated folds from the `cv_spatial` object `sb1`. Use the one suitable for your modelling practice to evaluate your models. See the help file of the function for an explanation of all its other outputs/elements.
+
+The same settings as in the previous code can be used to create square blocks by using `hexagon = FALSE`. You can optionally add a raster layer (using the `r` argument) to be used for creating blocks and as the background of the plot (a raster can also be added later, only for visualising blocks, using `cv_plot`).
 
 ```{r, warning=FALSE, message=FALSE, fig.height=5, fig.width=7}
 sb2 <- cv_spatial(x = pa_data,
                   column = "occ",
-                  r = rasters,
+                  r = rasters, # optionally add a raster layer
                   k = 5, 
                   size = 350000, 
                   hexagon = FALSE, # use square blocks
@@ -144,20 +148,20 @@ sb2 <- cv_spatial(x = pa_data,
 
 ```
 
-The assignment of folds to each block can also be done in a systematic manner using `selection = "systematic"`, or a checkerboard pattern using `selection = "checkerboard"`. 
-
+The assignment of folds to each block can also be done in a systematic manner using `selection = "systematic"`, or in a checkerboard pattern using `selection = "checkerboard"`. When no `size` is supplied, the blocks can also be created by number of rows and columns, e.g. `rows_cols = c(12, 10)`.
 
 ```{r warning=FALSE, message=FALSE, fig.height=5, fig.width=7}
 # systematic fold assignment 
 # and also use row/column for creating blocks instead of size
-sb2 <- cv_spatial(x = pa_data,
+sb3 <- cv_spatial(x = pa_data,
                   column = "occ",
                   rows_cols = c(12, 10),
                   hexagon = FALSE,
                   selection = "systematic")
 
 ```
 
+The output report of the function shows that when the selection is set to 'random', the presence/absence records are more evenly distributed across the train and test folds than when it is set to 'systematic'.
 
 ```{r warning=FALSE, message=FALSE, fig.height=5, fig.width=7}
 # checkerboard block to CV fold assignment
@@ -178,8 +182,6 @@ tm_shape(sb4$blocks) +
 
 ```
 
-The blocks can also be created by number of rows and columns when no `size` is supplied by e.g. `rows_cols = c(10, 2)`.
-
 
 ### Spatial and environmental clustering
 
@@ -196,7 +198,7 @@ scv <- cv_cluster(x = pa_data,
 ```
 
 
-The clustering can be done in environmental space by supplying `r`. Notice, this could be an extreme case of cross-validation as the testing folds could possibly fall in novel environmental conditions than what the training points are (check `cv_similarity` for testing this). Note that the input raster layer should cover all the species points, otherwise an error will rise. The records with no raster value should be deleted prior to the analysis.
+The clustering can be done in environmental space by supplying `r`. Note that this could be an extreme case of cross-validation, as the testing folds could fall in environmental conditions that are novel relative to those of the training points (check `cv_similarity` for testing this). The input raster layer should cover all the species points, otherwise an error will be raised. Records with no raster value should be deleted prior to the analysis, or a different raster should be used.
 
 ```{r warning=FALSE, message=FALSE}
 # environmental clustering
@@ -209,7 +211,7 @@ ecv <- cv_cluster(x = pa_data,
 
 ```
 
-When `r` is supplied, all the input rasters are first centred and scaled to avoid one raster variable dominate the clusters.
+When `r` is supplied, all the input rasters are first centred and scaled (the `scale = TRUE` option) so that no single raster variable dominates the clusters.
 
 By default, the clustering will be done based only on the values of the predictors at the sample points. In this case, the number of folds will be the same as `k`. If `raster_cluster = TRUE`, the clustering is done in the raster space. In this approach, the clusters will be consistent throughout the region and across species (in the same region). However, this may result in cluster(s) that cover none of the species records, especially when species data is not dispersed throughout the region (or environmental ranges) or the number of clusters (`k` or folds) is high. 
 
@@ -227,9 +229,10 @@ bloo <- cv_buffer(x = pa_data,
 
 ```
 
-For species **presence-absence** data and any other types of data (such as **continuous**, **counts**, and **multi-class** targets) keep `presence_bg = FALSE`. In this case, all sample points other than the target point within the buffer are excluded, and the training set comprises all points outside the buffer.
+When using species **presence-background** data (or presence and pseudo-absence), you need to supply the `column` and set `presence_bg = TRUE`. In this case, only presence points (1s) are considered as target points. For more information read the details section in the help of the function (i.e. `help(cv_buffer)`). 
+
+For species **presence-absence** data and any other types of data (such as **continuous**, **counts**, and **multi-class** targets) keep `presence_bg = FALSE` (default). In this case, all sample points other than the target point within the buffer are excluded, and the training set comprises all points outside the buffer.
 
-When using species **presence-background** data (or presence and pseudo-absence), you need to supply the `column` and set `presence_bg = TRUE`. In this case, only presence points are considered as target points. For more information read the details section in the help of the function (i.e. `help(cv_buffer)`). 
 
 ## Nearest Neighbour Distance Matching (NNDM) LOO
 
@@ -250,7 +253,7 @@ nncv <- cv_nndm(x = pa_data,
 
 ## Visualising the folds
 
-You can visualise the generate folds for all block cross-validation strategies. You can optionally add a raster layer for background maps using `r` option. When `r` is supplied the plots might be slightly slower.
+You can visualise the generated folds for all block cross-validation strategies. You can optionally add a raster layer as a background map using the `r` option. When `r` is supplied, the plots might be slightly slower.
 
 Let's plot spatial clustering folds created in previous section (using `cv_cluster`):
 
@@ -260,12 +263,12 @@ cv_plot(cv = scv,
 
 ```
 
-When `cv_buffer` is used for plotting, only first 10 folds are shown. You can choose any set of CV folds for plotting. If `remove_na = FALSE` (default is `TRUE`), the `NA` in legend shows the excluded points.
+When `cv_buffer` is used for plotting, only the first 10 folds are shown. You can choose any set of CV folds for plotting. If `remove_na = FALSE` (default is `TRUE`), the `NA` in the legend shows the excluded points.
 
 ```{r warning=FALSE, message=FALSE, fig.height=5, fig.width=8}
 cv_plot(cv = bloo,
         x = pa_data,
-        num_plots = c(1, 50, 100)) 
+        num_plots = c(1, 50, 100)) # only show folds 1, 50 and 100
 
 ```
 
@@ -282,8 +285,7 @@ cv_plot(cv = sb1,
 
 ## Check similarity
 
-The `cv_similarity` function can check for environmental similarity between the training and testing folds and thus possible extrapolation. It computes multivariate environmental similarity surface (MESS) as described in Elith et al. (2010). MESS represents how similar a point in a testing fold is to a training fold (as a reference set of points), with respect to a set of predictor variables in `r`. The negative values are the sites where at least one variable has a value that is outside the range of environments over the reference set, so these are novel environments.
-
+The `cv_similarity` function can check for environmental similarity between the training and testing folds and thus possible extrapolation in the testing folds. It computes the multivariate environmental similarity surface (MESS) as described in Elith et al. (2010). MESS represents how similar a point in a testing fold is to a training fold (as a reference set of points), with respect to a set of predictor variables in `r`. Negative values mark sites where at least one variable has a value outside the range of environments over the reference set, so these are novel environments.
 
 ```{r fig.height=4, fig.width=6}
 cv_similarity(cv = ecv, # the environmental clustering
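
To make the three fold formats described under "Block cross-validation strategies" more concrete, here is a minimal base R sketch that derives a list format and a matrix format from a vector of fold numbers. The data are made up, the `train`/`test` element names are chosen for this illustration, and the convention that `TRUE` marks training rows in the matrix is an assumption; check the actual objects returned by `cv_spatial` for the exact layout.

```r
# Hypothetical fold assignment for 8 observations and k = 3 folds
# (plays the role of `folds_ids`: one fold number per observation)
folds_ids <- c(1, 2, 3, 1, 2, 3, 1, 2)
k <- max(folds_ids)

# List format: for each fold, the indices of training and testing observations
folds_list <- lapply(seq_len(k), function(i) {
  list(train = which(folds_ids != i),
       test  = which(folds_ids == i))
})

# Matrix format: one column per fold; assumed here that TRUE = training,
# FALSE = testing, with one row per observation
biomod_table <- sapply(seq_len(k), function(i) folds_ids != i)
colnames(biomod_table) <- paste0("RUN", seq_len(k))

folds_list[[1]]$test   # observations 1, 4 and 7 are tested in fold 1
dim(biomod_table)      # 8 rows (observations) by 3 columns (folds)
```

All three objects carry the same information; which one is convenient depends on whether the modelling code expects indices, a logical split table, or a single grouping vector.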

diff --git a/inst/doc/tutorial_1.html b/inst/doc/tutorial_1.html
diff --git a/inst/doc/tutorial_2.R b/inst/doc/tutorial_2.R
@@ -42,7 +42,7 @@ scv1 <- cv_spatial(
   size = 360000, # size of the blocks in metres
   selection = "random", # random blocks-to-fold
   iteration = 50, # find evenly dispersed folds
-  progress = FALSE, # trun off progress bar
+  progress = FALSE, # turn off progress bar
   biomod2 = TRUE, # also create folds for biomod2
   raster_colors = terrain.colors(10, rev = TRUE) # options from cv_plot for a better colour contrast
 ) 
@@ -64,7 +64,7 @@ scv2 <- cv_nndm(
   r = rasters,
   size = 360000, # range of spatial autocorrelation
   num_sample = 10000, # number of samples of prediction points
-  sampling = "regular", # sampling methods
+  sampling = "regular", # sampling methods; it can be random as well
   min_train = 0.1, # minimum portion to keep in each train fold
   plot = TRUE
 )
@@ -116,7 +116,8 @@ cv_plot(
 #  
 
 ## ----echo=FALSE---------------------------------------------------------------
-read.csv("../man/figures/roc_rf.csv")
+# the model is not run here to reduce run time; precomputed results are loaded instead
+read.csv("../man/figures/roc_rf.csv") 
 
 
 ## ---- eval=FALSE, fig.height=3.7, fig.width=7---------------------------------
@@ -144,7 +145,7 @@ read.csv("../man/figures/roc_rf.csv")
 #  # use generated folds from cv_spatial in previous section
 #  spatial_cv_folds <- scv1$biomod_table
 #  
-#  # 3. Defining Models Options using default options.
+#  # 3. Defining Models Options; using default options here.
 #  biomod_options <- BIOMOD_ModelingOptions()
 #  
 #  # 4. Model fitting
@@ -163,3 +164,8 @@ read.csv("../man/figures/roc_rf.csv")
 #  biomod_model_eval[c("run", "algo", "metric.eval", "calibration", "validation")]
 #  
 
+## ----echo=FALSE---------------------------------------------------------------
+# the model is not run here to reduce run time; precomputed results are loaded instead
+read.csv("../man/figures/evl_biomod.csv") 
+
+
diff --git a/inst/doc/tutorial_2.Rmd b/inst/doc/tutorial_2.Rmd
@@ -76,7 +76,7 @@ tm_shape(rasters[[1]]) +
 
 # Generating block CV folds
 
-Here, we generate two CV strategy, one k-fold CV using `cv_spatial` and one LOO CV using `cv_nndm`. See more options and configurations in the *Tutorial 1 - introduction to `blockCV`*.
+Here, we generate two CV strategies, one k-fold CV using `cv_spatial` and one LOO CV using `cv_nndm`. See more options and configurations in the *Tutorial 1 - introduction to `blockCV`*.
 
 ```{r message=TRUE, warning=TRUE}
 library(blockCV)
@@ -95,7 +95,7 @@ scv1 <- cv_spatial(
   size = 360000, # size of the blocks in metres
   selection = "random", # random blocks-to-fold
   iteration = 50, # find evenly dispersed folds
-  progress = FALSE, # trun off progress bar
+  progress = FALSE, # turn off progress bar
   biomod2 = TRUE, # also create folds for biomod2
   raster_colors = terrain.colors(10, rev = TRUE) # options from cv_plot for a better colour contrast
 ) 
@@ -124,13 +124,13 @@ scv2 <- cv_nndm(
   r = rasters,
   size = 360000, # range of spatial autocorrelation
   num_sample = 10000, # number of samples of prediction points
-  sampling = "regular", # sampling methods
+  sampling = "regular", # sampling methods; it can be random as well
   min_train = 0.1, # minimum portion to keep in each train fold
   plot = TRUE
 )
 ```
 
-You can visualise the generated folds of both methods using `cv_plot` function. Here is threee of the folds for `cv_nndm`:
+You can visualise the generated folds of both methods using the `cv_plot` function. Here are three folds from the `cv_nndm` object:
 
 ```{r}
 # see the number of folds in scv2 object
@@ -156,9 +156,9 @@ In this section, we show how to use the folds generated by `blockCV` in the prev
 
 ### Using `blockCV` with Random Forest model
 
-Folds generated by `cv_nndm` function are used here (a training and testing fold for each record) to show how to use folds from this function (the `cv_buffer` is similar to this appraoch) for evaluation species distribution models.   
+Folds generated by the `cv_nndm` function are used here (a training and testing fold for each record) to show how to use folds from this function (`cv_buffer` is similar to this approach) for evaluating species distribution models.
 
-Note that with `cv_nndm` using presence-absence data (and any other type of data except for presence-background data when `presence_bg = TRUE` is used), there is only one point in each testing fold, and therefore AUC cannot be calculated for each fold separately. Instead, the value of each point is first predicted, and then a unique AUC is calculated for the full set of predictions.
+Note that with `cv_nndm` using presence-absence data (and any other type of data except for presence-background data when `presence_bg = TRUE` is used), there is only one point in each testing fold, and therefore AUC cannot be calculated for each fold separately. Instead, a prediction is first made for the testing point of each fold, and then a single AUC is calculated for the full set of predictions.
 
 ```{r, eval=FALSE}
 # loading the libraries
@@ -197,7 +197,8 @@ auc(precrec_obj)
 
 
 ```{r echo=FALSE}
-read.csv("../man/figures/roc_rf.csv")
+# the model is not run here to reduce run time; precomputed results are loaded instead
+read.csv("../man/figures/roc_rf.csv") 
 
 ```
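
The pooled-AUC idea noted above for leave-one-out folds (one test point per fold, one AUC over all predictions) can be sketched in base R as follows. This is an illustration under stated assumptions, not the tutorial's code: the data are simulated, a `glm` is swapped in for the Random Forest model, each training set simply uses all other points (whereas `cv_nndm` or `cv_buffer` folds would also exclude nearby points), and `auc_pooled` is a hypothetical helper using the rank-based (Mann-Whitney) formula.

```r
set.seed(1)

# Simulated presence-absence data (hypothetical, not the tutorial data)
n <- 60
env <- rnorm(n)
dat <- data.frame(occ = rbinom(n, 1, plogis(1.5 * env)), env = env)

# Leave-one-out: fit on all other points, predict the single held-out point
pred <- numeric(n)
for (i in seq_len(n)) {
  fit <- glm(occ ~ env, data = dat[-i, ], family = binomial)
  pred[i] <- predict(fit, newdata = dat[i, , drop = FALSE], type = "response")
}

# One AUC over the pooled predictions, via the rank (Mann-Whitney) formula
auc_pooled <- function(score, label) {
  r  <- rank(score)            # average ranks handle tied scores
  n1 <- sum(label == 1)        # number of presences
  n0 <- sum(label == 0)        # number of absences
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

auc_pooled(pred, dat$occ)
```

With real folds, the loop would index the training and testing rows from each element of the `folds_list` of the cv object instead of using `dat[-i, ]` and `dat[i, ]`.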
 
@@ -236,7 +237,7 @@ biomod_data <- BIOMOD_FormatingData(resp.var = pa_data$occ,
 # use generated folds from cv_spatial in previous section
 spatial_cv_folds <- scv1$biomod_table
 
-# 3. Defining Models Options using default options.
+# 3. Defining Models Options; using default options here.
 biomod_options <- BIOMOD_ModelingOptions()
 
 # 4. Model fitting
@@ -256,10 +257,14 @@ biomod_model_eval <- get_evaluations(biomod_model_out)
 biomod_model_eval[c("run", "algo", "metric.eval", "calibration", "validation")]
 
 ```
-The `validation` column shows the result of spatial cross-validation.
 
-Note that the result of this section (biomod model evaluation) is not shown.
+```{r echo=FALSE}
+# the model is not run here to reduce run time; precomputed results are loaded instead
+read.csv("../man/figures/evl_biomod.csv") 
+
+```
 
+The `validation` column shows the result of spatial cross-validation, and each RUN is a CV fold.
 
 
 ## References: