12 Algorithms and functions for geocomputation

Prerequisites

In the introduction we promised to teach not only how to use existing tools for Geocomputation in R, but also develop new ones: “in the form of shareable R scripts and functions”. This chapter aims to deliver on that promise.

We will consider example R scripts for geographic data and how to make them more reproducible in section 12.1. Algorithms (or ‘geoalgorithms’ for geographic processes) are recipes for modifying inputs using a series of steps, resulting in an output, as described in section 12.2. To ease sharing and reproducibility algorithms can be placed into R function, which can then be distributed either in script files or as R packages, the building blocks of reproducible code. That is the topic of section 12.3.

It should be noted at the outset that none of these topics are specific to geographic data. Although geoalgorithms do have a specific meaning originating in GIS software, most of the concepts apply in other domains. For that reason instead of going into detail, our approach in this chapter is to provide illustrative examples and direct the reader to established resources, to avoid reinventing the wheel.

The approach taken in this chapter was partly inspired by Xiao (2016), who advocates explanations that are neither highly theoretical (as many academic papers are) nor entirely focussed on implementations via a GUI or CLI in a particular sofware package (as the first part of this book is, with its focus on implementations in various R packages). Instead the focus is on understanding, using reproducible code and clear explanation. The methods demonstrated in GIS Algorithms (including finding the centroid of a polygon, an example used in this chapter) are readily available in GIS software. However, the aim is “not to duplicate what is available out there, but to show how things out there work” (Xiao 2016). This chapter takes a similar approach and is therefore the most low-level and potentially advanced (in terms of the code, not application) so far.

12.1 Scripts

If packages are the building blocks of reproducible code, scripts are the glue that holds them together. There are no strict rules on what can and cannot go into script files. and nothing to prevent you from saving broken, non-reproducible code There are, however, some rules of thumb and conventions worth following when writing R scipts, outlined below:

  • Write the script in order. Just like the script of a play, scripts should have a clear order such as ‘setup’, ‘data processing’ and ‘save results’ (roughly equivalent to ‘beginning’, ‘middle’ and ‘end’ in a film).
  • Make the script reproducible. Scripts will be of more use to you and others if they are self-contained and can be run by other people. This involves stating dependencies (loading required packages at the outset, like the ‘Prerequisites’ section), reading-in data from persistent sources (e.g. from a reliable website or API) and mentioning any code that must be run before running the script (e.g. with a comment # run script0.R before this).
  • Comment the script sufficiently for others (and your future self) to understand it but not so much that the comments themselves become hard to maintain. At a minimum a good script file should contain information on the purpose of the script (see Figure 12.1) and division into chunks, perhaps by appending ---- to section headings, which allows ‘folding’ of R scripts in RStudio.

Although there is no way to enforce reproducibility in R scripts, there are tools that can help. By default RStudio ‘code-checks’ R scripts and underlines faulty code with a red way line, as illustrated below:

Illustration of 'code checking' in RStudio, which identifies the incorrect dublicate pipe operator at the outset of a script.

Figure 12.1: Illustration of ‘code checking’ in RStudio, which identifies the incorrect dublicate pipe operator at the outset of a script.

A useful tool for reproducibility is the reprex package. Its main function reprex() tests of lines of R code to check if they are reproduible, and provides markdown output to facilitate communication on sites such as GitHub. See reprex.tidyverse.org/ for details.

12.2 Geographic algorithms

Algorithms can be understood as the computing equivalent of a cooking recipe. An algorithm is a series of steps which, when taken on appropriate ingredients, results in an output that is more useful (or tasty) than the raw ingredients. Before considering ‘geoalgorithms’, it is worth taking a brief detour to understand how algorithms relate to scripts and functions which are covered next.

The word algorithm comes from Baghdad when, in the 9th Century AD, an early maths book was published called Hisab al-jabr w’al-muqabala. The book was translated into Latin and became so popular that the author Muḥammad ibn Mūsā al-Khwārizmī “was immortalized as a scientific term: Al-Khwarizmi [sic] became Alchoarismi, Algorismi and, eventually, algorithm” (Bellos 2011).65

In the the computing age algorithm refers to a series of steps that take clearly defined input to produce an output. Algorithms are often first envisioned flow charts and psuedocode showing the aim of the process before being implemented in a formal language such as R. Because the same algorithm will be used many times on the different inputs it rarely makes sense to type out the entire algorithm each time: algorithms are most easily used when they are implemented inside functions (see section 12.3).

Geoalgorithms (also referred to as GIS Algorithms, in a book of the same name) are a special case of algorithm that take geographic data as input and, generally, return geographic results (Xiao 2016). A simple example is an algorithm that finds the centroid of an object. This may sound like a simple task but in fact it involves some work, even for the simple case of single polygons containing no holes. The basic representation of a polygon object is in a matrix representing the vertices between which straight lines are drawn (the first and last points must be the same, something we’ll touch on later). In this case we’ll create a polygon with 10 vertices, following an example from GIS Algorithms (see github.com/gisalgs for Python source code):

As with many computational (or other) problems, it makes sense to break the problem into smaller chunks. With this in mind, the following commands create a new triangle (T1), the first that can be split-out from the polygon and finds its centroid based on the formula \(1/3(a + b + c)\) where \(a\) to \(c\) are coordinates representing the triangles vertices:

If we calculate the centroids of all such polygons the solution should be the average x and y values of all centroids. There is one problem though: some triangles are more important (larger) than others. Therefore to find the geographic centroid we need to take the weighted mean of all sub-triangles, with weigths proportional to area. This is still straightforward computationally, with the formula to calculate the area of a triangle being:

\[ \frac{Ax ( B y − C y ) + B x ( C y − A y ) + C x ( A y − B y )} { 2 } \]

Where \(A\) to \(C\) are the triangle’s three points and \(x\) and \(y\) refer to the x and y dimensions. A translation of this formula into R code that works with the data in the matrix representation of a triangle T1 is:

This code chunk works and outputs the correct result.66 The problem with the previous code chunk is that it is very verbose and difficult to re-run on another object with the same ‘triangle matrix’ format. To make the code more generalizable, let’s convert the code into a function (something described in 12.3):

The function t_area generalizes the solution by taking any object x, assumed to have the same dimensions as the triangle represented in T1. We can verify it works not only on the triangle matrix T1 as follows:

Likewise we can create a function that find’s the triangle’s centroid:

With these functions created and tested on the first triangle of the polygon, we can we can apply the solution to many triangles, which will be created with the decido package:

Illustration of centroid calculation.

Figure 12.2: Illustration of centroid calculation.

12.3 Functions

12.4 Case study

12.5 Exercises

  1. In section 12.2 we created a function that finds the geographic centroid of a shape, which is implemented in the sf function st_centroid(). Building on this example, write a function only using base R functions that can find the total length of linestrings represented in matrix form.

References

Xiao, Ningchuan. 2016. GIS Algorithms: Theory and Applications for Geographic Information Science & Technology. 55 City Road, London. https://doi.org/10.4135/9781473921498.

Bellos, Alex. 2011. Alex’s Adventures in Numberland. London: Bloomsbury Paperbacks.


  1. The book’s title was also influential, forming the basis of the word algebra.

  2. as can be verified with the formula for the area of a triangle whose base is horizontal: area equals half of the base width times its height — \(A = B * H / 2\) — (\(10 * 10 / 2\) in this case, as can be seen in Figure 12.2).