--- title: "SDMtune - variable selection" author: "Sergio Vignali, Arnaud Barras & Veronika Braunisch" bibliography: SDMtune.bib output: html_vignette: toc: yes toc_depth: 2 vignette: > %\VignetteIndexEntry{SDMtune - variable selection} %\VignetteEncoding{UTF-8} %\VignetteEngine{knitr::rmarkdown} --- ```{r knitr-options, include=FALSE} knitr::opts_chunk$set(comment = "#>", collapse = TRUE, eval = FALSE, fig.align = "center") ``` # Load data and train model The following steps are described in the **basic-use** vignette, refer to it if the following code is not clear: ```{r load-data} library(SDMtune) library(zeallot) # Prepare data files <- list.files(path = file.path(system.file(package = "dismo"), "ex"), pattern = "grd", full.names = TRUE) predictors <- terra::rast(files) data <- prepareSWD(species = "Virtual species", p = virtualSp$presence, a = virtualSp$background, env = predictors, categorical = "biome") c(train, test) %<-% trainValTest(data, test = 0.2, only_presence = TRUE, seed = 25) # Train model model <- train("Maxent", data = train) # Train cross validation model folds <- randomFolds(data, k = 4, only_presence = TRUE, seed = 25) cv_model <- train("Maxent", data = data, folds = folds) ``` # Variable importance For a **Maxent** model we can get the variable importance values from the output of the MaxEnt Java software. These values are stored in the model object and can be displayed using the following command: ```{r maxent-results} model@model@results ``` The function `maxentVarImp` extracts the variable importance values from the previous output and formats them in a more human readable way: ```{r maxent-var-importance} vi <- maxentVarImp(model) vi ``` As you can see the function returns a `data.frame` with the variable name, the percent contribution and the permutation importance. You can plot the variable importance as a bar chart using the function `plotVarImp`. For example you can plot the percent contribution using: ```{r maxent-percent-contribution-plot} plotVarImp(vi[, 1:2]) ``` or the permutation importance with: ```{r maxent-permutation-importance-plot} plotVarImp(vi[, c(1,3)]) ``` **SDMtune** has its own function to compute the permutation importance that iterates through several permutations and return an averaged value together with the standard deviation. We will use this function to compute the permutation importance of a **Maxnet** model. ### Permutation importance For this example we train a **Maxnet** model: ```{r maxnet-model} maxnet_model <- train("Maxnet", data = train) ``` Now we can calculate the variable importance with the function `varImp()` using 5 permutations: ```{r variable-importance} vi_maxnet <- varImp(maxnet_model, permut = 5) vi_maxnet ``` And plot it with: ```{r plot-var-imp} plotVarImp(vi_maxnet) ``` Next we compute the permutation importance for the **Maxent** model using 10 permutations and compare the results with the Maxent output: ```{r permut-exercise} # Compute the permutation importance vi_maxent <- varImp(model, permut = 10) # Print it vi_maxent # Compare with Maxent output maxentVarImp(model) ``` The difference is probably due to a different shuffling of the presence and background locations during the permutation process and because in this example we performed 10 permutations and averaged the values. ## Jackknife test for variable importance Another method to estimate the variable importance is the leave one out Jackknife test. The test removes one variable at time and records the change in the chosen metric. We use the function `doJk`, the AUC as evaluation metric and the `maxnet_model`: ```{r jk} jk <- doJk(maxnet_model, metric = "auc", test = test) jk ``` We can also plot the output using the function `plotJk`. In the following example we plot the previous result and we add a line representing the AUC of the full model trained using all the variables. First we plot the Jackknife test for the training AUC: ```{r plot-jk-train} plotJk(jk, type = "train", ref = auc(maxnet_model)) ``` and the Jackknife test for the testing AUC: ```{r plot-jk-test} plotJk(jk, type = "test", ref = auc(maxnet_model, test = test)) ``` ## Response curves With the function `plotResponse` is possible to plot the marginal and the univariate response curve. Let's plot the **cloglog** univariate response curve of **bio1**: ```{r plot-bio1} plotResponse(maxnet_model, var = "bio1", type = "cloglog", only_presence = TRUE, marginal = FALSE, rug = TRUE) ``` On top is displayed the rug of the presence locations and on bottom the rug of the background locations. As another example we can plot the **logistic** marginal response curve of **biome** that is a categorical variable, keeping the other variables at the mean value: ```{r plot-biome} plotResponse(maxnet_model, var = "biome", type = "logistic", only_presence = TRUE, marginal = TRUE, fun = mean, color = "blue") ``` In the case of an `SDMmodelCV` the response curve shows the averaged value of the prediction together with one Standard Deviation error interval: ```{r plot-cv-response} plotResponse(cv_model, var = "bio1", type = "cloglog", only_presence = TRUE, marginal = TRUE, fun = mean, rug = TRUE) ``` ## Model report All what you have learned till now con be saved and summarized calling the function `modelReport`. The function will: * save the training, background and testing locations in separated csv files; * train and evaluate the model; * create a report in a html format with the ROC curve, threshold values, response curves, predicted map and Jackknife test; * save the predicted distribution map; * save all the curves in the plot folder; * save the model with `.Rds` extension that can be loaded in R using the `readRDS` function. The function is totally inspired by the default output of the MaxEnt Java software [@Phillips2006] and extends it to other methods. You can decide what to include in the report using dedicated function arguments, like `response_curves`, `jk` and `env` but the function cannot be used with `SDMmodelCV` objects. Run the following code to create a report of the **Maxnet** model we trained before: ```{r report} modelReport(maxnet_model, type = "cloglog", folder = "virtual-sp", test = test, response_curves = TRUE, only_presence = TRUE, jk = TRUE, env = predictors) ``` The output is displayed in the browser and all the files are saved in the **virtual-sp** folder. # Data-driven variable selection To explore correlation among the variables we extract 10000 background locations using the function`randomPoints` included in the `dismo` package (we set the seed to have reproducible results). After we create an `SWD` object using the `prepareSWD` function: ```{r backgrounds} set.seed(25) bg <- terra::spatSample(predictors, size = 10000, method = "random", na.rm = TRUE, xy = TRUE, values = FALSE) bg <- prepareSWD(species = "Bgs", a = bg, env = predictors, categorical = "biome") ``` The environmental variables we downloaded have a coarse resolution and the function can extract a bit less than 10000 random locations (see the warning message). With the function `plotCor` you can create an heat map showing the degree of autocorrelation: ```{r heat-map} plotCor(bg, method = "spearman", cor_th = 0.7) ``` You can select a different correlation method or set a different correlation threshold. Another useful function is `corVar` that instead of creating a heat map prints the pairs of correlated variables according to the given method and correlation threshold: ```{r cor-var} corVar(bg, method = "spearman", cor_th = 0.7) ``` As you can see there are few variables that have a correlation coefficient greater than 0.7 in absolute value. ## Remove highly correlated variables **SDMtune** implements an algorithm that removes highly correlated variables repeating the following steps: 1. ranks the variables according to the permutation importance or the percent contribution (the second method is available only for Maxent models). 2. checks if the variable ranked as most important is highly correlated with other variables, according to the given method and correlation threshold. If the algorithm finds correlated variables it moves to the next step, otherwise checks the other variables in the rank; 3. performs a leave one out Jackknife test among the correlated variables; 4. remove the variable that decreases the less the model performance when removed, according to the given metric on the training dataset. The process is repeated until the remaining variables have a correlation coefficient lower than the provided correlation threshold. In the next example we remove the variables that have a **Spearman** correlation coefficient $|r_s|\leq0.7$ and checking the AUC on the training dataset (we use only one permutation to save time but you are free to increase this value). If you use RStudio, the function creates an interactive real-time chart in the viewer pane. Run the following code and hover over the chart during the execution of the function to get extra information: ```{r varSel} selected_variables_model <- varSel(maxnet_model, metric = "auc", test = test, bg4cor = bg, method = "spearman", cor_th = 0.7, permut = 1) ``` As you can see some variables have been removed. The output of the function is the model trained with the selected variables: ```{r output-varSel} selected_variables_model ``` In the next example we remove the highly correlated variables using the **Maxent** model, ranking the variables with the percent contribution and using the AICc as evaluation metric: ```{r exercise-1} selected_variables_model <- varSel(model, metric = "aicc", bg4cor = bg, method = "spearman", cor_th = 0.7, env = predictors, use_pc = TRUE) ``` ## Remove variables with low importance There are cases in which a model has some environmental variables ranked with very low contribution and you may want to remove some of them to reduce the model complexity. **SDMtune** offers two different strategies implemented in the function `reduceVar`. We will use the **Maxent** model trained with all the variables. Let's first check the permutation importance (we use only one permutation to save time): ```{r permutation} varImp(model, permut = 1) ``` In the first example we want to remove all the environmental variables that have a permutation importance lower than 6\%, no matter if the model performance decreases. The function removes the last ranked environmental variable, trains a new model and computes a new rank. The process is repeated until all the remaining environmental variables have an importance greater than 6\%: ```{r reduce-var-1} cat("Testing AUC before: ", auc(maxnet_model, test = test)) reduced_variables_model <- reduceVar(maxnet_model, th = 6, metric = "auc", test = test, permut = 1) cat("Testing AUC after: ", auc(reduced_variables_model, test = test)) ``` In the second example we want to remove the environmental variables that have a permutation importance lower than 15\% only if removing the variables the model performance does not decrease, according to the given metric. In this case the function performs a leave one out Jackknife test and remove the environmental variables in a step-wise fashion as described in the previous example, but only if the model performance doesn't drop: ```{r reduce-var-2} cat("Testing AUC before: ", auc(maxnet_model, test = test)) reduced_variables_model <- reduceVar(maxnet_model, th = 15, metric = "auc", test = test, permut = 1, use_jk = TRUE) cat("Testing AUC after: ", auc(reduced_variables_model, test = test)) ``` As you can see in this case several variables have been removed and the AUC in the testing dataset didn't decrease. ## References