# Chapter 14 Ecology

## Prerequisites

This chapter assumes you have a strong grasp of geographic data analysis and processing, covered in Chapters 2 to 5. In it you will also make use of R’s interfaces to dedicated GIS software, and spatial cross-validation, topics covered in Chapters 9 and 11, respectively.

The chapter uses the following packages:

## 14.1 Introduction

In this chapter we will model the floristic gradient of fog oases to reveal distinctive vegetation belts that are clearly controlled by water availability. To do so, we will bring together concepts presented in previous chapters and even extend them (Chapters 2 to 5 and Chapters 9 and 11).

Fog oases are one of the most fascinating vegetation formations we have ever encountered. These formations, locally termed lomas, develop on mountains along the coastal deserts of Peru and Chile.76 The deserts’ extreme conditions and remoteness provide the habitat for a unique ecosystem, including species endemic to the fog oases. Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austal winter. This results in green southern-facing mountain slopes along the coastal strip of Peru (Figure 14.1). This fog, which develops below the temperature inversion caused by the cold Humboldt current in austral winter, provides the name for this habitat. Every few years, the El Niño phenomenon brings torrential rainfall to this sun-baked environment (Dillon, Nakazawa, and Leiva 2003). This causes the desert to bloom, and provides tree seedlings a chance to develop roots long enough to survive the following arid conditions.

Unfortunately, fog oases are heavily endangered. This is mostly due to human activity (agriculture and climate change). To effectively protect the last remnants of this unique vegetation ecosystem, evidence is needed on the composition and spatial distribution of the native flora (J. Muenchow, Bräuning, et al. 2013; J. Muenchow, Hauenstein, et al. 2013). Lomas mountains also have economic value as a tourist destination, and can contribute to the well-being of local people via recreation. For example, most Peruvians live in the coastal desert, and lomas mountains are frequently the closest “green” destination.

In this chapter we will demonstrate ecological applications of some of the techniques learned in the previous chapters. This case study will involve analyzing the composition and the spatial distribution of the vascular plants on the southern slope of Mt. Mongón, a lomas mountain near Casma on the central northern coast of Peru (Figure 14.1).

During a field study to Mt. Mongón, we recorded all vascular plants living in 100 randomly sampled 4x4 m2 plots in the austral winter of 2011 (J. Muenchow, Bräuning, et al. 2013). The sampling coincided with a strong La Niña event that year (see ENSO monitoring of the NOASS Climate Prediction Center). This led to even higher levels of aridity than usual in the coastal desert. On the other hand, it also increased fog activity on the southern slopes of Peruvian lomas mountains.

Ordinations are dimension-reducing techniques which allow the extraction of the main gradients from a (noisy) dataset, in our case the floristic gradient developing along the southern mountain slope (see next section). In this chapter we will model the first ordination axis, i.e., the floristic gradient, as a function of environmental predictors such as altitude, slope, catchment area and NDVI. For this, we will make use of a random forest model - a very popular machine learning algorithm (Breiman 2001). The model will allow us to make spatial predictions of the floristic composition anywhere in the study area. To guarantee an optimal prediction, it is advisable to tune beforehand the hyperparameters with the help of spatial cross-validation (see Section 11.5.2).

## 14.2 Data and data preparation

All the data needed for the subsequent analyses is available via the RQGIS package.

study_area is an sf polygon representing the outlines of the study area. random_points is an sf object, and contains the 100 randomly chosen sites. comm is a community matrix of the wide data format (Wickham 2014b) where the rows represent the visited sites in the field and the columns the observed species.77

The values represent species cover per site, and were recorded as the area covered by a species in proportion to the site area in percentage points (%; please note that one site can have >100% due to overlapping cover between individual plants). The rownames of comm correspond to the id column of random_points. dem is the digital elevation model (DEM) for the study area, and ndvi is the Normalized Difference Vegetation Index (NDVI) computed from the red and near-infrared channels of a Landsat scene (see Section 4.3.3 and ?ndvi). Visualizing the data helps to get more familiar with it, as shown in Figure 14.2 where the dem is overplotted by the random_points and the study_area.

The next step is to compute variables which we will not only need for the modeling and predictive mapping (see Section 14.4.2) but also for aligning the Non-metric multidimensional scaling (NMDS) axes with the main gradient in the study area, altitude and humidity, respectively (see Section 14.3).

Specifically, we will compute catchment slope and catchment area from a digital elevation model using R-GIS bridges (see Chapter 9). Curvatures might also represent valuable predictors, in the Exercise section you can find out how they would change the modeling result.

To compute catchment area and catchment slope, we will make use of the saga:sagawetnessindex function.78 get_usage() returns all function parameters and default values of a specific geoalgorithm. Here, we present only a selection of the complete output.

Subsequently, we can specify the needed parameters using R named arguments (see Section 9.2). Remember that we can use a RasterLayer living in R’s global environment to specify the input raster DEM (see Section 9.2). Specifying 1 as the SLOPE_TYPE makes sure that the algorithm will return the catchment slope. The resulting output rasters should be saved to temporary files with an .sdat extension which is a SAGA raster format. Setting load_output to TRUE ensures that the resulting rasters will be imported into R.

This returns a list named ep consisting of two elements: AREA and SLOPE. Let us add two more raster objects to the list, namely dem and ndvi, and convert it into a raster stack (see Section 2.3.3).

### 14.4.2 Predictive mapping

The tuned hyperparameters can now be used for the prediction. We simply have to modify our learner using the result of the hyperparameter tuning, and run the corresponding model.

The last step is to apply the model to the spatially available predictors, i.e., to the raster stack. So far, raster::predict() does not support the output of ranger models, hence, we will have to program the prediction ourselves. First, we convert ep into a prediction data frame which secondly serves as input for the predict.ranger() function. Thirdly, we put the predicted values back into a RasterLayer (see Section 3.3.1 and Figure 14.5).

The predictive mapping clearly reveals distinct vegetation belts (Figure 14.5). Please refer to J. Muenchow, Hauenstein, et al. (2013) for a detailed description of vegetation belts on lomas mountains. The blue color tones represent the so-called Tillandsia-belt. Tillandsia is a highly adapted genus especially found in high quantities at the sandy and quite desertic foot of lomas mountains. The yellow color tones refer to a herbaceous vegetation belt with a much higher plant cover compared to the Tillandsia-belt. The orange colors represent the bromeliad belt, which features the highest species richness and plant cover. It can be found directly beneath the temperature inversion (ca. 750-850 m asl) where humidity due to fog is highest. Water availability naturally decreases above the temperature inversion, and the landscape becomes desertic again with only a few succulent species (succulent belt; red colors). Interestingly, the spatial prediction clearly reveals that the bromeliad belt is interrupted which is a very interesting finding we would have not detected without the predictive mapping.

## 14.5 Conclusions

In this chapter we have ordinated the community matrix of the lomas Mt. Mongón with the help of a NMDS (Section 14.3). The first axis, representing the main floristic gradient in the study area, was modeled as a function of environmental predictors which partly were derived through R-GIS bridges (Section 14.2). The mlr package provided the building blocks to spatially tune the hyperparameters mtry, sample.fraction and min.node.size (Section 14.4.1). The tuned hyperparameters served as input for the final model which in turn was applied to the environmental predictors for a spatial representation of the floristic gradient (Section 14.4.2). The result demonstrates spatially the astounding biodiversity in the middle of the desert. Since lomas mountains are heavily endangered, the prediction map can serve as basis for informed decision-making on delineating protection zones, and making the local population aware of the uniqueness found in their immediate neighborhood.

• It would be interesting to also model the second ordination axis, and to subsequently find an innovative way of visualizing jointly the modeled scores of the two axes in one prediction map.
• If we were interested in interpreting the model in an ecologically meaningful way, we should probably use (semi-)parametric models (J. Muenchow, Bräuning, et al. 2013; Zuur et al. 2009, 2017). However, there are at least approaches that help to interpret machine learning models such as random forests (see, e.g., https://mlr-org.github.io/interpretable-machine-learning-iml-and-mlr/).
• A sequential model-based optimization (SMBO) might be preferable to the random search for hyperparameter optimization used in this chapter (Probst, Wright, and Boulesteix 2018).

Finally, please note that random forest and other machine learning models are frequently used in a setting with lots of observations and many predictors, much more than used in this chapter, and where it is unclear which variables and variable interactions contribute to explaining the response. Additionally, the relationships might be highly non-linear. In our use case, the relationship between response and predictors are pretty clear, there is only a slight amount of non-linearity and the number of observations and predictors is low. Hence, it might be worth trying a linear model. A linear model is much easier to explain and understand than a random forest model, and therefore to be preferred (law of parsimony), additionally it is computationally less demanding (see Exercises). If the linear model cannot cope with the degree of non-linearity present in the data, one could also try a generalized additive model (GAM). The point here is that the toolbox of a data scientist consists of more than one tool, and it is your responsibility to select the tool best suited for the task or purpose at hand. Here, we wanted to introduce the reader to random forest modeling and how to use the corresponding results for spatial predictions. For this purpose, a well-studied dataset with known relationships between response and predictors, is appropriate. However, this does not imply that the random forest model has returned the best result in terms of predictive performance (see Exercises).

## 14.6 Exercises

1. Run a NMDS using the percentage data of the community matrix. Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data. What might explain the observed difference?

2. Compute all the predictor rasters we have used in the chapter (catchment slope, catchment area), and put them into a raster stack. Add dem and ndvi to the raster stack. Next, compute profile and tangential curvature as additional predictor rasters and add them to the raster stack (hint: grass7:r.slope.aspect). Finally, construct a response-predictor matrix. The scores of the first NMDS axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to random_points (use an inner join). To complete the response-predictor matrix, extract the values of the environmental predictor raster stack to random_points.

3. Use the response-predictor matrix of the previous exercise to fit a random forest model. Find the optimal hyperparameters and use them for making a prediction map.

4. Retrieve the bias-reduced RMSE of a random forest model using spatial cross-validation including the estimation of optimal hyperparameter combinations (random search with 50 iterations) in an inner tuning loop (see Section 11.5.2). Parallelize the tuning level (see Section 11.5.2). Report the mean RMSE and use a boxplot to visualize all retrieved RMSEs.

5. Retrieve the bias-reduced RMSE of a simple linear model using spatial cross-validation. Compare the result to the result of the random forest model by making RMSE boxplots for each modeling approach.

### References

Borcard, Daniel, François Gillet, and Pierre Legendre. 2011. Numerical Ecology with R. Use R! New York: Springer.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Dillon, M. O., M. Nakazawa, and S. G. Leiva. 2003. “The Lomas Formations of Coastal Peru: Composition and Biogeographic History.” In El Niño in Peru: Biology and Culture over 10,000 Years, edited by J. Haas and M. O. Dillon, 1–9. Chicago: Field Museum of Natural History.

Hengl, Tomislav, Madlene Nussbaum, Marvin N. Wright, Gerard B.M. Heuvelink, and Benedikt Gräler. 2018. “Random Forest as a Generic Framework for Predictive Modeling of Spatial and Spatio-Temporal Variables.” PeerJ 6 (August): e5518. https://doi.org/10.7717/peerj.5518.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani, eds. 2013. An Introduction to Statistical Learning: With Applications in R. Springer Texts in Statistics 103. New York: Springer.

McCune, Bruce, James B. Grace, and Dean L. Urban. 2002. Analysis of Ecological Communities. Second. Gleneden Beach, OR: MjM Software Design.

Muenchow, Jannes, Achim Bräuning, Eric Frank Rodríguez, and Henrik von Wehrden. 2013. “Predictive Mapping of Species Richness and Plant Species’ Distributions of a Peruvian Fog Oasis Along an Altitudinal Gradient.” Biotropica 45 (5): 557–66. https://doi.org/10.1111/btp.12049.

Muenchow, Jannes, Simon Hauenstein, Achim Bräuning, Rupert Bäumler, Eric Frank Rodríguez, and Henrik von Wehrden. 2013. “Soil Texture and Altitude, Respectively, Largely Determine the Floristic Gradient of the Most Diverse Fog Oasis in the Peruvian Desert.” Journal of Tropical Ecology 29 (05): 427–38. https://doi.org/10.1017/S0266467413000436.

Muenchow, Jannes, Patrick Schratz, and Alexander Brenning. 2017. “RQGIS: Integrating R with QGIS for Statistical Geocomputing.” The R Journal 9 (2): 409–28.

Probst, Philipp, Marvin Wright, and Anne-Laure Boulesteix. 2018. “Hyperparameters and Tuning Strategies for Random Forest.” arXiv:1804.03515 [Cs, Stat], April. http://arxiv.org/abs/1804.03515.

Schratz, Patrick, J. Muenchow, Eugenia Iturritxa, Jakob Richter, and A. Brenning. 2018. “Performance Evaluation and Hyperparameter Tuning of Statistical and Machine-Learning Models Using Spatial Data.”

von Wehrden, Henrik, Jan Hanspach, Helge Bruelheide, and Karsten Wesche. 2009. “Pluralism and Diversity: Trends in the Use and Application of Ordination Methods 1990-2007.” Journal of Vegetation Science 20 (4): 695–705. https://doi.org/10.1111/j.1654-1103.2009.01063.x.

Wickham, Hadley. 2014b. “Tidy Data.” Journal of Statistical Software 59 (10). https://doi.org/10.18637/jss.v059.i10.

Zuur, Alain F., Elena N. Ieno, Anatoly A. Saveliev, and Alain F. Zuur. 2017. Beginner’s Guide to Spatial, Temporal and Spatial-Temporal Ecological Data Analysis with R-INLA. Vol. 1. Newburgh, United Kingdom: Highland Statistics Ltd.

Zuur, Alain, Elena N. Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Statistics for Biology and Health. New York: Springer-Verlag.

1. Similar vegetation formations develop also in other parts of the world, e.g., in Namibia and along the coasts of Yemen and Oman (Galletti, Turner, and Myint 2016).

2. In statistics, this is also called a contingency table or cross-table.

3. Admittedly, it is a bit unsatisfying that the only way of knowing that sagawetnessindex computes the desired terrain attributes is to be familiar with SAGA.

4. One way of choosing k is to try k values between 1 and 6 and then using the result which yields the best stress value (McCune, Grace, and Urban 2002).