10 Bridges to GIS software

Prerequisites

  • This chapter requires QGIS, SAGA and GRASS to be installed and the following packages to be attached:
# remotes::install_github("paleolimbot/qgisprocess")
library(qgisprocess)
library(Rsagacmd)
library(rgrass)
library(rstac)
library(gdalcubes)

10.1 Introduction

A defining feature of R is the way you interact with it: you type commands and hit Enter (or Ctrl+Enter if writing code in the source editor in RStudio) to execute them interactively. This way of interacting with the computer is called a command-line interface (CLI) (see definition in the note below). CLIs are not unique to R.46 In dedicated GIS packages, by contrast, the emphasis tends to be on the graphical user interface (GUI). You can interact with QGIS, SAGA, GRASS and gvSIG from system terminals and embedded CLIs, but ‘pointing and clicking’ is the norm. This means many GIS users miss out on the advantages of the command-line according to Gary Sherman, creator of QGIS (Sherman 2008):

With the advent of ‘modern’ GIS software, most people want to point and click their way through life. That’s good, but there is a tremendous amount of flexibility and power waiting for you with the command line. Many times you can do something on the command line in a fraction of the time you can do it with a GUI.

The ‘CLI vs GUI’ debate can be adversarial but it does not have to be; both options can be used interchangeably, depending on the task at hand and the user’s skillset.47 The advantages of a good CLI such as that provided by R (and enhanced by IDEs such as RStudio) are numerous. A good CLI:

  • Facilitates the automation of repetitive tasks
  • Enables transparency and reproducibility, the backbone of good scientific practice and data science
  • Encourages software development by providing tools to modify existing functions and implement new ones
  • Helps develop future-proof programming skills which are in high demand in many disciplines and industries
  • Is user-friendly and fast, allowing an efficient workflow

On the other hand, GUI-based GIS systems (particularly QGIS) are also advantageous. A good GIS GUI:

  • Has a ‘shallow’ learning curve meaning geographic data can be explored and visualized without hours of learning a new language
  • Provides excellent support for ‘digitizing’ (creating new vector datasets), including trace, snap and topological tools48
  • Enables georeferencing (matching raster images to existing maps) with ground control points and orthorectification
  • Supports stereoscopic mapping (e.g., LiDAR and structure from motion)

Another advantage of dedicated GISs is that they provide access to hundreds of ‘geoalgorithms’ (computational recipes to solve geographic problems — see Chapter 11). Many of these are unavailable from the R command line, except via ‘GIS bridges’, the topic of (and motivation for) this chapter.49

A command-line interface is a means of interacting with computer programs in which the user issues commands via successive lines of text (command lines). bash in Linux and PowerShell in Windows are its common examples. CLIs can be augmented with IDEs such as RStudio for R, which provides code auto-completion and other features to improve the user experience.

R originated as an interface language. Its predecessor S provided access to statistical algorithms in other languages (particularly FORTRAN), but from an intuitive read-evaluate-print loop (REPL) (Chambers 2016). R continues this tradition with interfaces to numerous languages, notably C++, as described in Chapter 1. R was not designed as a GIS. However, its ability to interface with dedicated GISs gives it astonishing geospatial capabilities. R is well known as a statistical programming language, but many people are unaware of its ability to replicate GIS workflows, with the additional benefits of a (relatively) consistent CLI. Furthermore, R outperforms GISs in some areas of geocomputation, including interactive/animated map making (see Chapter 9) and spatial statistical modeling (see Chapter 12). This chapter focuses on ‘bridges’ to three mature open source GIS products (see Table 10.1): QGIS (via the package qgisprocess; Section 10.2), SAGA (via Rsagacmd; Section 10.3) and GRASS (via rgrass; Section 10.4).50

TABLE 10.1: Comparison between three open-source GIS. Hybrid refers to the support of vector and raster operations.
GIS First release No. functions Support
QGIS 2002 >1000 hybrid
SAGA 2004 >600 hybrid
GRASS 1982 >500 hybrid

To complement the R-GIS bridges, the second part of the chapter gives a brief introduction to interfaces to spatial libraries (Section 10.6.1), spatial databases (Section 10.6.2), and cloud-based processing of Earth observation data (Section 10.7).

10.2 QGIS through qgisprocess

QGIS is one of the most popular open-source GIS [Table 10.1; Graser and Olaya (2015)]. Its main advantage lies in the fact that it provides a unified interface to several other open-source GIS. This means that you have access to GDAL, GRASS, and SAGA through QGIS (Graser and Olaya 2015). Since version 3.14, QGIS provides a command line API, qgis_process, that allows to run all these geoalgorithms (frequently more than 1000, depending on your set-up) outside of the QGIS GUI.

The qgisprocess package wraps this QGIS command-line utility, and thus makes it possible to call QGIS, GDAL, GRASS, and SAGA algorithms from the R session. Before running qgisprocess, make sure you have installed QGIS and all its (third-party) dependencies such as SAGA and GRASS.

library(qgisprocess)
#> Using 'qgis_process' in the system PATH.
#> QGIS version: 3.26.1-Buenos Aires
#> ...

This package automatically tries to detect a QGIS installation and complains if it cannot find it.51 There are a few possible solutions when the configuration fails: you can set options(qgisprocess.path = "path/to/your_qgis_process"), or set up the R_QGISPROCESS_PATH environment variable. The above approaches can also be used when you have more than one QGIS installation and want to decide which one to use.

Next, we can find which providers (meaning different software) are available on our computer.

qgis_providers()
#> # A tibble: 6 × 2
#>   provider provider_title   
#>   <chr>    <chr>            
#> 1 3d       QGIS (3D)        
#> 2 gdal     GDAL             
#> 3 grass7   GRASS            
#> 4 native   QGIS (native c++)
#> 5 qgis     QGIS             
#> 6 saga     SAGA

The output table affirms that we can use QGIS geoalgorithms (native, qgis, 3d) and external ones from the third-party providers GDAL, SAGA and GRASS through the QGIS interface.

We are now ready for some QGIS geocomputation from within R! Let’s try two example case studies. The first one shows how to unite two polygonal datasets with different borders (Section 10.2.1). The second one focuses on deriving new information from a digital elevation model represented as a raster (Section 10.2.2).

10.2.1 Vector data

Consider a situation when you have two polygon objects with different spatial units (e.g., regions, administrative units). Our goal is to merge these two objects into one, containing all of the boundary lines and related attributes. We use again the incongruent polygons we have already encountered in Section 4.2.7 (Figure 10.1). Both polygon datasets are available in the spData package, and for both we would like to use a geographic CRS (see also Chapter 7).

data("incongruent", "aggregating_zones", package = "spData")
incongr_wgs = st_transform(incongruent, "EPSG:4326")
aggzone_wgs = st_transform(aggregating_zones, "EPSG:4326")
Illustration of two areal units: incongruent (black lines) and aggregating zones (red borders).

FIGURE 10.1: Illustration of two areal units: incongruent (black lines) and aggregating zones (red borders).

To find an algorithm to do this work, we can search the output of the qgis_algorithms() function. This function returns a data frame containing all of the available providers and the algorithms they contain.52

qgis_algo = qgis_algorithms()

The qgis_algo object has a lot of columns, but usually, we are only interested in the algorithm column that combines information about the provider and the algorithm name. Assuming that the short description of the function contains the word “union”, we can run the following code to find the algorithm of interest:

grep("union", qgis_algo$algorithm, value = TRUE)
#> [1] "native:multiunion" "native:union"      "saga:fuzzyunionor" "saga:polygonunion"

One of the algorithms on the above list, "native:union", sounds promising. The next step is to find out what this algorithm does and how we can use it. This is the role of the qgis_show_help(), which returns a short summary of what the algorithm does, its arguments, and outputs.53 This makes it output rather long. Here, we only present the arguments"native:union" expects.

alg = "native:union"
qgis_arguments(alg) |> 
  dplyr::mutate(acceptable_values = unlist(acceptable_values)) |>
  dplyr::select(name, description, acceptable_values)

These arguments are INPUT, OVERLAY, OVERLAY_FIELDS_PREFIX, and OUTPUT. It seems that some of the above arguments expect a “path to a vector layer” (column: acceptaple_values). However, the qgisprocess package also allows to provide sf objects as well in these cases.54 Though this is really convenient, if your spatial data is already available in your R session, we recommend to provide the path to your spatial data on disk when you only read it in to submit it to a qgisprocess algorithm because the first thing qgisprocess does when executing a geoalgorithm is to export the spatial data living in your R session back to disk in a format known to QGIS such as .gpkg or .tif files.

Finally, we can let QGIS do the work. The main function of qgisprocess is qgis_run_algorithm(). It accepts the used algorithm name and a set of named arguments shown in the help list, and performs expected calculations. In our case, three arguments seem important - INPUT, OVERLAY, and OUTPUT. The first one, INPUT, is our main vector object incongr_wgs, while the second one, OVERLAY, is aggzone_wgs. The last argument, OUTPUT, expects a path to a new vector file; however, if we do not provide the path, qgisprocess will automatically create a temporary file. The .quiet argument only tells qgisprocess to be less verbose.

union = qgis_run_algorithm(alg, INPUT = incongr_wgs, OVERLAY = aggzone_wgs, 
                           .quiet = TRUE)
union

Running the above line of code will save our two input objects into temporary .gpkg files, run the selected algorithm on them, and return a temporary .gpkg file as the output. The qgisprocess package stores the qgis_run_algorithm() result as a list containing, in this case, a path to the output file. We can either read this file back into R with read_sf() (e.g., union_sf = read_sf(union[[1]])) or directly with st_as_sf():

union_sf = st_as_sf(union)

Note that the QGIS union operation merges the two input layers into one layer by using the intersection and the symmetrical difference of the two input layers (which, by the way, is also the default when doing a union operation in GRASS and SAGA). This is not the same as st_union(incongr_wgs, aggzone_wgs) (see Exercises)!

Our result, union_sf, is a multipolygon with a larger number of features than two input objects . Notice, however, that many of these polygons are small and do not represent real areas but are rather a result of our two datasets having a different level of detail. These artifacts of error are called sliver polygons (see red-colored polygons in the left panel of Figure 10.2) One way to identify slivers is to find polygons with comparatively very small areas, here, e.g., 25000 m2, and next remove them. Let’s search for an appropriate algorithm.

grep("clean", qgis_algo$algorithm, value = TRUE)

This time the found algorithm, v.clean, is not included in QGIS, but GRASS GIS. GRASS GIS’s v.clean is a powerful tool for cleaning topology of spatial vector data. Importantly, we can use it through qgisprocess.

Similarly to the previous step, we should start by looking at this algorithm’s help.

qgis_show_help("grass7:v.clean")

We have omitted the output here, because the help text is quite long and contains a lot of arguments.55 This is because v.clean is a multi tool – it can clean different types of geometries and solve different types of topological problems. For this example, let’s focus on just a few arguments, however, we encourage you to visit this algorithm’s documentation to learn more about v.clean capabilities.

qgis_arguments("grass7:v.clean") |>
  dplyr::select(name, description) |>
  dplyr::slice_head(n = 4)
#> # A tibble: 4 × 2
#>   name      description                              
#>   <chr>     <chr>                                    
#> 1 input     Layer to clean                           
#> 2 type      Input feature type                       
#> 3 tool      Cleaning tool                            
#> 4 threshold Threshold (comma separated for each tool)

The main argument for this algorithm is input – our vector object. Next, we need to select a tool – a cleaning method. 56 About a dozen tools exist in v.clean allowing to remove duplicate geometries, remove small angles between lines, or remove small areas, among others. In this case, we are interested in the latter tool, rmarea. Several of the tools, rmarea included, expect an additional argument threshold, whose behavior depends on the selected tool. In our case, the rmarea tool removes all areas smaller or equal to a provided threshold.

Let’s run this algorithm and convert its output into a new sf object clean_sf.

clean = qgis_run_algorithm("grass7:v.clean", input = union_sf,
                           tool = "rmarea", threshold = 25000, .quiet = TRUE)
clean_sf = st_as_sf(clean)

The result, the right panel of 10.2, looks as expected – sliver polygons are now removed.

Sliver polygons colored in red (left panel). Cleaned polygons (right panel).

FIGURE 10.2: Sliver polygons colored in red (left panel). Cleaned polygons (right panel).

10.2.2 Raster data

Digital elevation models (DEMs) contain elevation information for each raster cell. They are used for many purposes, including satellite navigation, water flow models, surface analysis, or visualization. Here, we are interested in deriving new information from a DEM raster that could be used as predictors for statistical learning. Various terrain parameters, for example, can be helpful for the prediction of landslides (see Chapter 12)

For this section, we will use dem.tif – a digital elevation model of the Mongón study area (downloaded from the Land Process Distributed Active Archive Center, see also ?dem.tif) It has a resolution of about 30 by 30 meters and uses a projected CRS.

library(qgisprocess)
library(terra)
dem = rast(system.file("raster/dem.tif", package = "spDataLarge"))

The terra package’s terrain() command already allows the calculation of several fundamental topographic characteristics such as slope, aspect, TPI (Topographic Position Index), TRI (Topographic Ruggedness Index), roughness, and flow directions. However, GIS programs offer many more terrain characteristics, some of which can be more suitable in certain contexts. For example, the topographic wetness index (TWI) was found useful in studying hydrological and biological processes (Sørensen, Zinko, and Seibert 2006). Let’s search the algorithm list for this index using "wetness" as keyword.

qgis_algo = qgis_algorithms()
grep("wetness", qgis_algo$algorithm, value = TRUE)
#> [1] "saga:sagawetnessindex"           "saga:topographicwetnessindextwi"

An output of the above code suggests that the desired algorithm exists in the SAGA GIS software.57 Though SAGA is a hybrid GIS, its main focus has been on raster processing, and here, particularly on digital elevation models (soil properties, terrain attributes, climate parameters). Hence, SAGA is especially good at the fast processing of large (high-resolution) raster datasets (Conrad et al. 2015).

The "saga:sagawetnessindex" algorithm is actually a modified TWI, that results in a more realistic soil moisture potential for the cells located in valley floors (Böhner and Selige 2006).

qgis_show_help("saga:sagawetnessindex")
# output not shown here

Here, we stick with the default values for all arguments. Therefore, we only have to specify one argument – the input DEM. Of course, when applying this algorithm you should make sure that the default values are in correspondence with your study aim.58 "saga:sagawetnessindex" returns not one but four rasters – catchment area, catchment slope, modified catchment area, and topographic wetness index.

dem_wetness = qgis_run_algorithm("saga:sagawetnessindex", DEM = dem, 
                                 .quiet = TRUE)

The result, dem_wetness, is a list with file paths to the four outputs. We can read a selected output by providing an output name in the qgis_as_terra() function.

dem_wetness_twi = qgis_as_terra(dem_wetness$TWI)

You can see the output TWI map on the left panel of Figure 10.3. The topographic wetness index is unitless. Its low values represent areas that will not accumulate water, while higher values show areas that will accumulate water at increasing levels.

Information from digital elevation models can also be categorized, for example, to geomorphons – the geomorphological phonotypes consisting of 10 classes that represent terrain forms, such as slopes, ridges, or valleys (Jasiewicz and Stepinski 2013). These phonotypes are used in many studies, including landslide susceptibility, ecosystem services, human mobility, and digital soil mapping.

The original implementation of the geomorphons’ algorithm was created in GRASS GIS, and we can find it in the qgisprocess list as "grass7:r.geomorphon":

grep("geomorphon", qgis_algo$algorithm, value = TRUE)
#> [1] "grass7:r.geomorphon"
qgis_show_help("grass7:r.geomorphon")
# output not shown

Calculation of geomorphons requires an input DEM (elevation), and can be customized with a set of optional arguments. It includes, search – a length for which the line-of-sight is calculated, and -m – a flag specifying that the search value will be provided in meters (and not the number of cells). More information about additional arguments can be found in the original paper and the GRASS GIS documentation.

dem_geomorph = qgis_run_algorithm("grass7:r.geomorphon", elevation = dem, 
                                    `-m` = TRUE, search = 120, .quiet = TRUE)

Our output, dem_geomorph$forms, contains a raster file with 10 categories – each one representing a terrain form. We can read it into R with qgis_as_terra(), and then visualize it (the right panel of Figure 10.3) or use it in our subsequent calculations.

dem_geomorph_terra = qgis_as_terra(dem_geomorph$forms)

Interestingly, there are connections between some geomorphons and the TWI values, as shown in Figure 10.3. The largest TWI values mostly occur in valleys and hollows, while the lowest values are seen, as expected, on ridges.

Topographic wetness index (TWI, left panel) and geomorphons (right panel) derived for the Mongón study area.

FIGURE 10.3: Topographic wetness index (TWI, left panel) and geomorphons (right panel) derived for the Mongón study area.

10.3 SAGA GIS

The System for Automated Geoscientific Analyses (SAGA; Table 10.1) provides the possibility to execute SAGA modules via the command line interface (saga_cmd.exe under Windows and just saga_cmd under Linux) (see the SAGA wiki on modules). In addition, there is a Python interface (SAGA Python API). Rsagacmd uses the former to run SAGA from within R.

We will use Rsagacmd in this section to delineate areas with similar values of the normalized difference vegetation index (NDVI) of the Mongón study area in Peru from the 22nd of September 2000 (the left panel of Figure 10.4) by using a seeded region growing algorithm from SAGA GIS.59

ndvi = rast(system.file("raster/ndvi.tif", package = "spDataLarge"))

To start using Rsagacmd, we need to run the saga_gis() function. It serves two main purposes:

  • It dynamically60 creates a new object that contains links to all valid SAGA-GIS libraries and tools
  • It sets up general package options, such as raster_backend (R package to use for handling raster data), vector_backend (R package to use for handling vector data), and cores (a maximum number of CPU cores used for processing, default: all)
library(Rsagacmd)
saga = saga_gis(raster_backend = "terra", vector_backend = "sf")

Our saga object contains connections to all of the available SAGA tools. It is organized as a list of libraries (groups of tools), and inside of a library it has a list of tools. We can access any tool with the $ sign (remember to use TAB for autocompletion).

The seeded region growing algorithm works in two main steps (Adams and Bischof 1994; Böhner, Selige, and Ringeler 2006). First, initial cells (“seeds”) are generated by finding the cells with the smallest variance in local windows of a specified size. Then, the region growing algorithm is used to merge neighboring pixels of the seeds to create homogeneous areas.

sg = saga$imagery_segmentation$seed_generation

In the above example, we first opened the imagery_segmentation library and then its seed_generation tool. We also assigned it to the sg object, not to retype the whole tool code in our next steps.61 If we just type sg, we will get a quick summary of the tool and a data frame with its parameters, descriptions, and defaults. You may also use tidy(sg) to extract just the parameters’ table. The seed_generation tool takes a a raster dataset as its first argument (features); optional arguments include band_width that specifies the size of initial polygons.

ndvi_seeds = sg(ndvi, band_width = 2)
# plot(ndvi_seeds$seed_grid)

Our output is a list of three objects: variance – a raster map of local variance, seed_grid – a raster map with the generated seeds, and seed_points – a spatial vector object with the generated seeds.

The second SAGA GIS tool we use is seeded_region_growing.62 The seeded_region_growing tool requires two inputs: our seed_grid calculated in the previous step and the ndvi raster object. Additionally, we can specify several parameters, such as normalize to standardize the input features, neighbour (4 or 8-neighborhood), and method. The last parameter can be set to either 0 or 1 (region growing is based on raster cells’ values and their positions or just the values). For a more detailed description of the method, see Böhner, Selige, and Ringeler (2006).

Here, we will only change method to 1, meaning that our output regions will be created only based on the similarity of their NDVI values.

srg = saga$imagery_segmentation$seeded_region_growing
ndvi_srg = srg(ndvi_seeds$seed_grid, ndvi, method = 1)
plot(ndvi_srg$segments)

The tool returns a list of three objects: segments, similarity, table. The similarity object is a raster showing similarity between the seeds and the other cells, and table is a data frame storing information about the input seeds. Finally, ndvi_srg$segments is a raster with our resulting areas (the right panel of Figure 10.4). We can convert it into polygons with as.polygons() and st_as_sf() (Section 6.5).

ndvi_segments = as.polygons(ndvi_srg$segments) |> 
  st_as_sf()
Normalized difference vegetation index (NDVI, left panel) and NDVi-based segments derived using the seeded region growing algorithm for the Mongón study area.

FIGURE 10.4: Normalized difference vegetation index (NDVI, left panel) and NDVi-based segments derived using the seeded region growing algorithm for the Mongón study area.

The resulting polygons (segments) represent areas with similar values. They can also be further aggregated into larger polygons using various techniques, such as clustering (e.g., k-means), regionalization (e.g., SKATER) or supervised classification methods. You can try to do it in Exercises.

R also has other tools to achieve the goal of creating polygons with similar values (so-called segments). It includes the SegOptim package (Gonçalves et al. 2019) that allows running several image segmentation algorithms and supercells (Nowosad and Stepinski 2022) that implements superpixels algorithm SLIC to work with geospatial data.

10.4 GRASS GIS

The U.S. Army - Construction Engineering Research Laboratory (USA-CERL) created the core of the Geographical Resources Analysis Support System (GRASS) (Table 10.1; Neteler and Mitasova (2008)) from 1982 to 1995. Academia continued this work since 1997. Similar to SAGA, GRASS focused on raster processing in the beginning while only later, since GRASS 6.0, adding advanced vector functionality (Bivand, Pebesma, and Gómez-Rubio 2013).

GRASS stores the input data in a GRASS GIS database. With regard to vector data, GRASS is by default a topological GIS, i.e., it only stores the geometry of adjacent features once. SQLite is the default database driver for vector attribute management, and attributes are linked to the geometry, i.e., to the GRASS GIS database, via keys (GRASS GIS vector management).

Before one can use GRASS, one has to set up the GRASS GIS database (also from within R), and users might find this process a bit intimidating in the beginning. First of all, the GRASS database requires its own directory, which, in turn, contains a location (see the GRASS GIS Database help pages at grass.osgeo.org for further information). The location stores the geodata for one project or one area. Within one location, several mapsets can exist that typically refer to different users or different tasks. Each location also has a PERMANENT mapset – a mandatory mapset that is created automatically. In order to share geographic data with all users of a project, the database owner can add spatial data to the PERMANENT mapset. In addition, the PERMANENT mapset stores the projection, the spatial extent and the default resolution for raster data. So, to sum it all up – the GRASS GIS database may contain many locations (all data in one location have the same CRS), and each location can store many mapsets (groups of datasets). Please refer to Neteler and Mitasova (2008) and the GRASS GIS quick start for more information on the GRASS spatial database system. To quickly use GRASS from within R, we will use the link2GI package, however, one can also set up the GRASS GIS database step-by-step. See GRASS within R for how to do so.

Here, we introduce rgrass with one of the most interesting problems in GIScience - the traveling salesman problem. Suppose a traveling salesman would like to visit 24 customers. Additionally, he would like to start and finish his journey at home which makes a total of 25 locations while covering the shortest distance possible. There is a single best solution to this problem; however, to check all of the possible solutions it is (mostly) impossible for modern computers (Longley 2015). In our case, the number of possible solutions correspond to (25 - 1)! / 2, i.e., the factorial of 24 divided by 2 (since we do not differentiate between forward or backward direction). Even if one iteration can be done in a nanosecond, this still corresponds to 9837145 years. Luckily, there are clever, almost optimal solutions which run in a tiny fraction of this inconceivable amount of time. GRASS GIS provides one of these solutions (for more details, see v.net.salesman). In our use case, we would like to find the shortest path between the first 25 bicycle stations (instead of customers) on London’s streets (and we simply assume that the first bike station corresponds to the home of our traveling salesman).

data("cycle_hire", package = "spData")
points = cycle_hire[1:25, ]

Aside from the cycle hire points data, we need a street network for this area. We can download it with from OpenStreetMap with the help of the osmdata package (see also Section 8.2). To do this, we constrain the query of the street network (in OSM language called “highway”) to the bounding box of points, and attach the corresponding data as an sf-object. osmdata_sf() returns a list with several spatial objects (points, lines, polygons, etc.), but here, we only keep the line objects with their related ids.63

library(osmdata)
b_box = st_bbox(points)
london_streets = opq(b_box) |>
  add_osm_feature(key = "highway") |>
  osmdata_sf() 
london_streets = london_streets[["osm_lines"]]
london_streets = dplyr::select(london_streets, osm_id)

Now that we have the data, we can go on and initiate a GRASS session. Luckily, linkGRASS() of the link2GI packages lets one set up the GRASS environment with just one line of code. The only thing you need to provide is a spatial object which determines the projection and the extent of the spatial database. First, linkGRASS() finds all GRASS installations on your computer. Since we have set ver_select to TRUE, we can interactively choose one of the found GRASS-installations. If there is just one installation, the linkGRASS() automatically chooses it. Second, linkGRASS() establishes a connection to GRASS GIS.

library(rgrass)
link2GI::linkGRASS(london_streets, ver_select = TRUE)

Before we can use GRASS geoalgorithms, we need to add data to GRASS’s spatial database. Luckily, the convenience function write_VECT() does this for us. (Use write_RAST() for raster data.) In our case, we add the street and cycle hire point data while using only the first attribute column, and name them as london_streets and points in GRASS.

write_VECT(terra::vect(london_streets), vname = "london_streets")
write_VECT(terra::vect(points[, 1]), vname = "points")

The rgrass package expects its inputs and gives its outputs as terra objects. Therefore, we need to convert our sf spatial vectors to terra’s SpatVectors using the vect() function to be able to use write_VECT().64

Now, both datasets exist in the GRASS GIS database. To perform our network analysis, we need a topological clean street network. GRASS’s "v.clean" takes care of the removal of duplicates, small angles and dangles, among others. Here, we break lines at each intersection to ensure that the subsequent routing algorithm can actually turn right or left at an intersection, and save the output in a GRASS object named streets_clean.

execGRASS(cmd = "v.clean", input = "london_streets", output = "streets_clean",
          tool = "break", flags = "overwrite")
To learn about the possible arguments and flags of the GRASS GIS modules you can you the help flag. For example, try execGRASS("g.region", flags = "help").

It is likely that a few of our cycling station points will not lie exactly on a street segment. However, to find the shortest route between them, we need to connect them to the nearest streets segment. "v.net"’s connect-operator does exactly this. We save its output in streets_points_con.

execGRASS(cmd = "v.net", input = "streets_clean", output = "streets_points_con",
          points = "points", operation = "connect", threshold = 0.001,
          flags = c("overwrite", "c"))

The resulting clean dataset serves as input for the "v.net.salesman" algorithm, which finally finds the shortest route between all cycle hire stations. One of its arguments is center_cats, which requires a numeric range as input. This range represents the points for which a shortest route should be calculated. Since we would like to calculate the route for all cycle stations, we set it to 1-25. To access the GRASS help page of the traveling salesman algorithm, run execGRASS("g.manual", entry = "v.net.salesman").

execGRASS(cmd = "v.net.salesman", input = "streets_points_con",
          output = "shortest_route", center_cats = paste0("1-", nrow(points)),
          flags = "overwrite")

To see our result, we read the result into R, convert it into an sf-object keeping only the geometry, and visualize it with the help of the mapview package (Figure 10.5 and Section 9.4).

route = read_VECT("shortest_route") |>
  st_as_sf() |>
  st_geometry()
mapview::mapview(route) + points
Shortest route (blue line) between 24 cycle hire stations (blue dots) on the OSM street network of London.

FIGURE 10.5: Shortest route (blue line) between 24 cycle hire stations (blue dots) on the OSM street network of London.

There are a few important considerations to note in the process:

  • We could have used GRASS’s spatial database which allows faster processing. That means we have only exported geographic data at the beginning. Then we created new objects but only imported the final result back into R. To find out which datasets are currently available, run execGRASS("g.list", type = "vector,raster", flags = "p").
  • We could have also accessed an already existing GRASS spatial database from within R. Prior to importing data into R, you might want to perform some (spatial) subsetting. Use "v.select" and "v.extract" for vector data. "db.select" lets you select subsets of the attribute table of a vector layer without returning the corresponding geometry.
  • You can also start R from within a running GRASS session (for more information please refer to Bivand, Pebesma, and Gómez-Rubio 2013).
  • Refer to the excellent GRASS online help or execGRASS("g.manual", flags = "i") for more information on each available GRASS geoalgorithm.

10.5 When to use what?

To recommend a single R-GIS interface is hard since the usage depends on personal preferences, the tasks at hand and your familiarity with different GIS software packages which in turn probably depends on your field of study. As mentioned previously, SAGA is especially good at the fast processing of large (high-resolution) raster datasets, and frequently used by hydrologists, climatologists and soil scientists (Conrad et al. 2015). GRASS GIS, on the other hand, is the only GIS presented here supporting a topologically based spatial database which is especially useful for network analyses but also simulation studies. QGIS is much more user-friendly compared to GRASS- and SAGA-GIS, especially for first-time GIS users, and probably the most popular open-source GIS. Therefore, qgisprocess is an appropriate choice for most use cases. Its main advantages are:

  • A unified access to several GIS, and therefore the provision of >1000 geoalgorithms (Table 10.1) including duplicated functionality, e.g., you can perform overlay-operations using QGIS-, SAGA- or GRASS-geoalgorithms
  • Automatic data format conversions (SAGA uses .sdat grid files and GRASS uses its own database format but QGIS will handle the corresponding conversions)
  • Its automatic passing of geographic R objects to QGIS geoalgorithms and back into R
  • Convenience functions to support named arguments and automatic default value retrieval (as inspired by rgrass)

By all means, there are use cases when you certainly should use one of the other R-GIS bridges. Though QGIS is the only GIS providing a unified interface to several GIS software packages, it only provides access to a subset of the corresponding third-party geoalgorithms (for more information please refer to Muenchow, Schratz, and Brenning (2017)). Therefore, to use the complete set of SAGA and GRASS functions, stick with Rsagacmd and rgrass. Finally, if you need topological correct data and/or spatial database management functionality such as multi-user access, we recommend the usage of GRASS. In addition, if you would like to run simulations with the help of a geodatabase (Krug, Roura-Pascual, and Richardson 2010), use rgrass directly since qgisprocess always starts a new GRASS session for each call.

Please note that there are a number of further GIS software packages that have a scripting interface but for which there is no dedicated R package that accesses these: gvSig, OpenJump, and the Orfeo Toolbox.65

10.6 Other bridges

The focus of this chapter, so far, was on R interfaces to Desktop GIS software. We emphasize these bridges because dedicated GIS software is well-known and a common ‘way in’ to understanding geographic data. They also provide access to many geoalgorithms.

Other ‘bridges’ include interfaces to spatial libraries (Section 10.6.1 shows how to access the GDAL CLI from R), spatial databases (see Section 10.6.2) and web mapping services (see Chapter 9). This section provides only a snippet of what is possible. Thanks to R’s flexibility, with its ability to call other programs from the system and integration with other languages (notably via Rcpp and reticulate), many other bridges are possible. The aim is not to be comprehensive, but to demonstrate other ways of accessing the ‘flexibility and power’ in the quote by Sherman (2008) at the beginning of the chapter.

10.6.1 Bridges to GDAL

As discussed in Chapter 8, GDAL is a low-level library that supports many geographic data formats. GDAL is so effective that most GIS programs use GDAL in the background for importing and exporting geographic data, rather than re-inventing the wheel and using bespoke read-write code. But GDAL offers more than data I/O. It has geoprocessing tools for vector and raster data, functionality to create tiles for serving raster data online, and rapid rasterization of vector data. Since GDAL is a command line tool, all its commands can be accessed from within R via the system() command.

The code chunk below demonstrates this functionality: linkGDAL() searches the computer for a working GDAL installation and adds the location of the executable files to the PATH variable, allowing GDAL to be called (usually only needed under Windows).

link2GI::linkGDAL()

Now we can use the system() function to call any of the GDAL tools. For example, ogrinfo provides metadata of a vector dataset. Here we will call this tool with two additional flags: -al to list all features of all layers and -so to get a summary only (and not a complete geometry list):

our_filepath = system.file("shapes/world.gpkg", package = "spData")
cmd = paste("ogrinfo -al -so", our_filepath)
system(cmd)
#> INFO: Open of `.../spData/shapes/world.gpkg'
#>       using driver `GPKG' successful.
#> 
#> Layer name: world
#> Geometry: Multi Polygon
#> Feature Count: 177
#> Extent: (-180.000000, -89.900000) - (179.999990, 83.645130)
#> Layer SRS WKT:
#> ...

Other commonly used GDAL tools include:

  • gdalinfo: provides metadata of a raster dataset
  • gdal_translate: converts between different raster file formats
  • ogr2ogr: converts between different vector file formats
  • gdalwarp: reprojects, transform, and clip raster datasets
  • gdaltransform: transforms coordinates

Visit https://gdal.org/programs/ to see the complete list of GDAL tools and to read their help files.

The ‘link’ to GDAL provided by link2GI could be used as a foundation for doing more advanced GDAL work from the R or system CLI. TauDEM (http://hydrology.usu.edu/taudem) and the Orfeo Toolbox (https://www.orfeo-toolbox.org/) are other spatial data processing libraries/programs offering a command line interface – the above example shows how to access these libraries from the system command line via R. This in turn could be the starting point for creating a proper interface to these libraries in the form of new R packages.

Before diving into a project to create a new bridge, however, it is important to be aware of the power of existing R packages and that system() calls may not be platform-independent (they may fail on some computers). Furthermore, sf brings most of the power provided by GDAL, GEOS and PROJ to R via the R/C++ interface provided by Rcpp, which avoids system() calls.

10.6.2 Bridges to spatial databases

Spatial database management systems (spatial DBMS) store spatial and non-spatial data in a structured way. They can organize large collections of data into related tables (entities) via unique identifiers (primary and foreign keys) and implicitly via space (think for instance of a spatial join). This is useful because geographic datasets tend to become big and messy quite quickly. Databases enable storing and querying large datasets efficiently based on spatial and non-spatial fields, and provide multi-user access and topology support.

The most important open source spatial database is PostGIS (Obe and Hsu 2015).66 R bridges to spatial DBMSs such as PostGIS are important, allowing access to huge data stores without loading several gigabytes of geographic data into RAM, and likely crashing the R session. The remainder of this section shows how PostGIS can be called from R, based on “Hello real world” from PostGIS in Action, Second Edition (Obe and Hsu 2015).67

The subsequent code requires a working internet connection, since we are accessing a PostgreSQL/PostGIS database which is living in the QGIS Cloud (https://qgiscloud.com/).68 Our first step here is to create a connection to a database by providing its name, host name, and user information.

library(RPostgreSQL)
conn = dbConnect(drv = PostgreSQL(), 
                 dbname = "rtafdf_zljbqm", host = "db.qgiscloud.com",
                 port = "5432", user = "rtafdf_zljbqm", password = "d3290ead")

Our new object, conn, is just an established link between our R session and the database. It does not store any data.

Often the first question is, ‘which tables can be found in the database?’. This can be answered with dbListTables() as follows:

dbListTables(conn)
#> [1] "spatial_ref_sys" "topology"        "layer"           "restaurants"    
#> [5] "highways" 

The answer is five tables. Here, we are only interested in the restaurants and the highways tables. The former represents the locations of fast-food restaurants in the US, and the latter are principal US highways. To find out about attributes available in a table, we can run dbListFields:

dbListFields(conn, "highways")
#> [1] "qc_id"        "wkb_geometry" "gid"          "feature"     
#> [5] "name"         "state"   

Now, as we know the available datasets, we can perform some queries – ask the database some questions. The query needs to be provided in a language understandable by the database – usually, it is SQL. The first query will select US Route 1 in the state of Maryland (MD) from the highways table. Note that read_sf() allows us to read geographic data from a database if it is provided with an open connection to a database and a query. Additionally, read_sf() needs to know which column represents the geometry (here: wkb_geometry).

query = paste(
  "SELECT *",
  "FROM highways",
  "WHERE name = 'US Route 1' AND state = 'MD';")
us_route = read_sf(conn, query = query, geom = "wkb_geometry")

This results in an sf-object named us_route of type MULTILINESTRING.

As we mentioned before, it is also possible to not only ask non-spatial questions, but also query datasets based on their spatial properties. To show this, the next example adds a 35-kilometer (35,000 m) buffer around the selected highway (Figure 10.6).

query = paste(
  "SELECT ST_Union(ST_Buffer(wkb_geometry, 35000))::geometry",
  "FROM highways",
  "WHERE name = 'US Route 1' AND state = 'MD';")
buf = read_sf(conn, query = query)

Note that this was a spatial query using functions (ST_Union(), ST_Buffer()) you should be already familiar with. You find them also in the sf-package, though here they are written in lowercase characters (st_union(), st_buffer()). In fact, function names of the sf package largely follow the PostGIS naming conventions.69

The last query will find all Hardee’s restaurants (HDE) within the 35 km buffer zone (Figure 10.6).

query = paste(
  "SELECT *",
  "FROM restaurants r",
  "WHERE EXISTS (",
  "SELECT gid",
  "FROM highways",
  "WHERE",
  "ST_DWithin(r.wkb_geometry, wkb_geometry, 35000) AND",
  "name = 'US Route 1' AND",
  "state = 'MD' AND",
  "r.franchise = 'HDE');"
)
hardees = read_sf(conn, query = query)

Please refer to Obe and Hsu (2015) for a detailed explanation of the spatial SQL query. Finally, it is good practice to close the database connection as follows:70

RPostgreSQL::postgresqlCloseConnection(conn)
Visualization of the output of previous PostGIS commands showing the highway (black line), a buffer (light yellow) and four restaurants (red points) within the buffer.

FIGURE 10.6: Visualization of the output of previous PostGIS commands showing the highway (black line), a buffer (light yellow) and four restaurants (red points) within the buffer.

Unlike PostGIS, sf only supports spatial vector data. To query and manipulate raster data stored in a PostGIS database, use the rpostgis package (Bucklin and Basille 2018) and/or use command-line tools such as rastertopgsql which comes as part of the PostGIS installation.

This subsection is only a brief introduction to PostgreSQL/PostGIS. Nevertheless, we would like to encourage the practice of storing geographic and non-geographic data in a spatial DBMS while only attaching those subsets to R’s global environment which are needed for further (geo-)statistical analysis. Please refer to Obe and Hsu (2015) for a more detailed description of the SQL queries presented and a more comprehensive introduction to PostgreSQL/PostGIS in general. PostgreSQL/PostGIS is a formidable choice as an open-source spatial database. But the same is true for the lightweight SQLite/SpatiaLite database engine and GRASS which uses SQLite in the background (see Section 10.4).

If your datasets are too big for PostgreSQL/PostGIS and you require massive spatial data management and query performance, it may be worth exploring large-scale geographic querying on distributed computing systems. Such systems are outside the scope of this book but it worth mentioning that open source software providing this functionality exists. Prominent projects in this space include GeoMesa and Apache Sedona. The apache.sedona package provides an interace to the latter.

10.7 Bridges to cloud technologies and services

In recent years, cloud technologies have become more and more prominent on the internet. This also includes their use to store and process spatial data. Major cloud computing providers (Amazon Web Services, Microsoft Azure / Planetary Computer, Google Cloud Platform, and others) offer vast catalogs of open Earth observation data, such as the complete Sentinel-2 archive, on their platforms. We can use R and directly connect to and process data from these archives, ideally from a machine in the same cloud and region.

Three promising developments that make working with such image archives on cloud platforms easier and more efficient are the SpatioTemporal Asset Catalog (STAC), the cloud-optimized GeoTIFF (COG) image file format, and the concept of data cubes. Section 10.7.1 introduces these individual developments and briefly describes how they can be used from R.

Besides hosting large data archives, numerous cloud-based services to process Earth observation data have been launched during the last few years. It includes the OpenEO initiative – a unified interface between programming languages (including R) and various cloud-based services. You can find more information about OpenEO in Section 10.7.2.

10.7.1 STAC, COGs, and data cubes in the cloud

The SpatioTemporal Asset Catalog (STAC) is a general description format for spatiotemporal data that is used to describe a variety of datasets on cloud platforms including imagery, synthetic aperture radar (SAR) data, and point clouds. Besides simple static catalog descriptions, STAC-API presents a web service to query items (e.g. images) of catalogs by space, time, and other properties. In R, the rstac package (Simoes, Souza, et al. 2021) allows to connect to STAC-API endpoints and search for items. In the example below, we request all images from the Sentinel-2 Cloud-Optimized GeoTIFF (COG) dataset on Amazon Web Services that intersect with a predefined area and time of interest. The result contains all found images and their metadata (e.g. cloud cover) and URLs pointing to actual files on AWS.

library(rstac)
# Connect to the STAC-API endpoint for Sentinel-2 data
# and search for images intersecting our AOI
s = stac("https://earth-search.aws.element84.com/v0")
items = s |>
  stac_search(collections = "sentinel-s2-l2a-cogs",
              bbox = c(7.1, 51.8, 7.2, 52.8), 
              datetime = "2020-01-01/2020-12-31") |>
  post_request() |> items_fetch()

Cloud storage differs from local hard disks and traditional image file formats do not perform well in cloud-based geoprocessing. Broadly speaking, the cloud-optimized GeoTIFF format is a specific type of GeoTIFF that makes it possible to efficiently read only parts of an image from cloud storage. As a result, reading rectangular subsets of an image or reading images at lower resolution becomes much more efficient. As an R user, you don’t have to install anything to work with COGs because GDAL (and any package using it) can already work with COGs. However, keep in mind that the availability of COGs is a big plus while browsing through catalogs of data providers.

For larger areas of interest, requested images are still relatively difficult to work with: they may use different map projections, may spatially overlap, and the spatial resolution often depends on the spectral band. The gdalcubes package (Appel and Pebesma 2019) can be used to abstract from individual images and to create and process image collections as four-dimensional data cubes.

The code below shows a minimal example to create a lower resolution (250m) maximum NDVI composite from the Sentinel-2 images returned by the previous STAC-API search.

library(gdalcubes)
# Filter images from STAC response by cloud cover 
# and create an image collection object
collection = stac_image_collection(items$features, 
                  property_filter = function(x) {x[["eo:cloud_cover"]] < 10})
# Define extent, resolution (250m, daily) and CRS of the target data cube
v = cube_view(srs = "EPSG:3857", extent = collection, dx = 250, dy = 250,
              dt = "P1D") # "P1D" is an ISO 8601 duration string
# Create and process the data cube
cube = raster_cube(collection, v) |>
  select_bands(c("B04", "B08")) |>
  apply_pixel("(B08-B04)/(B08+B04)", "NDVI") |>
  reduce_time("max(NDVI)")
# gdalcubes_options(parallel = 8)
# plot(cube, zlim = c(0, 1))

To filter images by cloud cover, we provide a property filter function that is applied on each STAC result item while creating the image collection. The function receives available metadata of an image as input list and returns a single logical value such that only images for which the function yields TRUE will be considered. In this case, we ignore images with 10% or more cloud cover. For more details, please refer to this tutorial presented at OpenGeoHub summer school 2021.

The combination of STAC, COGs, and data cubes forms a cloud-native workflow to analyze (large) collections of satellite imagery in the cloud. These tools already form a backbone, for example, of the sits R package, which allows land use and land cover classification of big Earth observation data. The package builds EO data cubes from image collections available in cloud services and performs land classification of data cubes using various machine learning algorithms. For more information about sits visit https://e-sensing.github.io/sitsbook/ or read the related article (Simoes, Camara, et al. 2021).

10.7.2 openEO

OpenEO (Schramm et al. 2021) is an initiative to support interoperability among cloud services by defining a common language for processing the data. The initial idea has been described in an r-spatial.org blog post and aims at making it possible for users to change between cloud services easily with as little code changes as possible. The standardized processes use a multidimensional data cube model as an interface to the data. Implementations are available for eight different backends (see https://hub.openeo.org) to which users can connect with R, Python, JavaScript, QGIS, or a web editor and define (and chain) processes on collections. Since the functionality and data availability differs among the backends, the openeo R package (Lahn 2021) dynamically loads available processes and collections from the connected backend. Afterwards, users can load image collections, apply and chain processes, submit jobs, and explore and plot results.

The following code will connect to the openEO platform backend, request available datasets, processes, and output formats, define a process graph to compute a maximum NDVI image from Sentinel-2 data, and finally executes the graph after logging in to the backend. The openEO platform backend includes a free tier and registration is possible from existing institutional or social platform accounts.

library(openeo)
con = connect(host = "https://openeo.cloud")
p = processes() # load available processes
collections = list_collections() # load available collections
formats = list_file_formats() # load available output formats
# Load Sentinel-2 collection
s2 = p$load_collection(id = "SENTINEL2_L2A",
                       spatial_extent = list(west = 7.5, east = 8.5,
                                             north = 51.1, south = 50.1),
                       temporal_extent = list("2021-01-01", "2021-01-31"),
                       bands = list("B04","B08")) 
# Compute NDVI vegetation index
compute_ndvi = p$reduce_dimension(data = s2, dimension = "bands",
                                  reducer = function(data, context) {
                                    (data[2] - data[1]) / (data[2] + data[1])
                                  })
# Compute maximum over time
reduce_max = p$reduce_dimension(data = compute_ndvi, dimension = "t",
                                reducer = function(x, y) {max(x)})
# Export as GeoTIFF
result = p$save_result(reduce_max, formats$output$GTiff)
# Login, see https://docs.openeo.cloud/getting-started/r/#authentication
login(login_type = "oidc", provider = "egi", 
      config = list(client_id = "...", secret = "..."))
# Execute processes
compute_result(graph = result, output_file = tempfile(fileext = ".tif"))

10.8 Exercises

E1. Compute global solar irradiation for an area of system.file("raster/dem.tif", package = "spDataLarge") for March 21 at 11:00 AM using the r.sun GRASS GIS through qgisprocess.

E2. Compute catchment area and catchment slope of system.file("raster/dem.tif", package = "spDataLarge") using Rsagacmd.

E3. Continue working on the ndvi_segments object created in the SAGA GIS section. Extract average NDVI values from the ndvi raster and group them into six clusters using kmeans(). Visualize the results.

E4. Attach data(random_points, package = "spDataLarge") and read system.file("raster/dem.tif", package = "spDataLarge") into R. Select a point randomly from random_points and find all dem pixels that can be seen from this point (hint: viewshed can be calculated using GRASS GIS). Visualize your result. For example, plot a hillshade, the digital elevation model, your viewshed output, and the point. Additionally, give mapview a try.

E5. Use gdalinfo via a system call for a raster file stored on disk of your choice. What kind of information you can find there?

E6. Use gdalwarp to decrease the resolution of your raster file (for example, if the resolution is 0.5, change it into 1). Note: -tr and -r flags will be used in this exercise.

E7. Query all Californian highways from the PostgreSQL/PostGIS database living in the QGIS Cloud introduced in this chapter.

E8. The ndvi.tif raster (system.file("raster/ndvi.tif", package = "spDataLarge")) contains NDVI calculated for the Mongón study area based on Landsat data from September 22nd, 2000. Use rstac, gdalcubes, and terra to download Sentinel-2 images for the same area from 2020-08-01 to 2020-10-31, calculate its NDVI, and then compare it with the results from ndvi.tif.