10 Algorithms and functions

Prerequisites

Chapter 1 states that geocomputation is not only about using existing tools, but developing new ones, “in the form of shareable R scripts and functions”. The aim of this chapter is to teach how to develop reproducible scripts, functions and algorithms, key components in the process of tool creation. The chapter assumes you have an understanding of the geographic data classes introduced in Chapter 2, and have already imported the datasets needed for your work, for example from sources outlined in Chapter 7.

We will consider example R scripts for geographic data and how to make them more reproducible in section 10.1. Algorithms are recipes for modifying inputs using a series of steps, resulting in an output, as described in section 10.2. To ease sharing and reproducibility algorithms can be placed into functions, which can then be distributed either in script files or as R packages, the building blocks of reproducible code. That is the topic of section 10.3.

The algorithms developed (including finding the centroid of a polygon, an example used in this chapter) are readily available in GIS software. However, the aim is “not to duplicate what is available out there, but to show how things out there work” (Xiao 2016). This chapter takes a similar approach and is therefore the most low-level and potentially advanced (in terms of the code, not application) so far.

10.1 Scripts

If packages are the building blocks of reproducible code, scripts are the glue that holds them together. There are no strict rules on what can and cannot go into script files and nothing to prevent you from saving broken, non-reproducible code. There are, however, some rules of thumb and conventions worth following when writing R scripts, outlined below:

  • Write the script in order: just like the script of a play, scripts should have a clear order such as ‘setup’, ‘data processing’ and ‘save results’ (roughly equivalent to ‘beginning’, ‘middle’ and ‘end’ in a film).
  • Make the script reproducible: scripts will be of more use to you and others if they are self-contained and can be run by other people. This involves stating dependencies (loading required packages at the outset, like the ‘Prerequisites’ section), reading-in data from persistent sources (e.g. from a reliable website or API) and mentioning any code that must be run before running the script (e.g. with a comment # run script0.R before this).
  • Comment the script sufficiently for others (and your future self) to understand it but not so much that the comments themselves become hard to maintain: at a minimum a good script file should contain information on the purpose of the script (see Figure 10.1) and division into chunks, perhaps by appending ---- to section headings, which allows ‘folding’ of R scripts in RStudio.

Although there is no way to enforce reproducibility in R scripts, there are tools that can help. By default RStudio ‘code-checks’ R scripts and underlines faulty code with a red way line, as illustrated below:

Illustration of 'code checking' in RStudio, which identifies the incorrect dublicate pipe operator at the outset of a script.

Figure 10.1: Illustration of ‘code checking’ in RStudio, which identifies the incorrect dublicate pipe operator at the outset of a script.

A useful tool for reproducibility is the reprex package. Its main function reprex() tests of lines of R code to check if they are reproduible, and provides markdown output to facilitate communication on sites such as GitHub. See reprex.tidyverse.org/ for details.

10.2 Geographic algorithms

Algorithms can be understood as the computing equivalent of a cooking recipe: a series of instructions which, when taken on appropriate ingredients, results in an output that is more useful (or tasty) than the raw ingredients. Before considering ‘geoalgorithms’, it is worth taking a brief detour around the history of the algorithms, to understand how they relate to scripts and functions which are covered next.

The word algorithm comes from Baghdad when, in the 9th Century AD, an early maths book was published called Hisab al-jabr w’al-muqabala. The book was translated into Latin and became so popular that the author Muḥammad ibn Mūsā al-Khwārizmī “was immortalized as a scientific term: Al-Khwarizmi [sic] became Alchoarismi, Algorismi and, eventually, algorithm” (Bellos 2011).41

In the computing age algorithm refers to a series of steps that take a clearly defined input to produce an output. Algorithms often start as in flow charts and psuedocode showing the aim of the process before being implemented in a formal language such as R. Because the same algorithm will be used many times on the different inputs it rarely makes sense to type out the entire algorithm each time: algorithms are most easily used when they are implemented inside functions (see section 10.3).

Geoalgorithms are a special case: they take geographic data in and, generally, return geographic results. Also referred to as GIS algorithms and geometric algorithms, an entire academic field — Computational Geometry, a branch of computer science — is dedicated to their study and development (Berg et al. 2008). A simple example is an algorithm that finds the centroid of an object. This may sound like a simple task but in fact it involves some work, even for the simple case of single polygons containing no holes. The basic representation of a polygon object is in a matrix representing the vertices between which straight lines are drawn (the first and last points must be the same, something we’ll touch on later). In this case we’ll create a polygon with 5 vertices in base R, building on an example from GIS Algorithms (Xiao 2016 see github.com/gisalgs for Python code):

As with many computational (or other) problems, it makes sense to break the problem into smaller chunks. All polygons can be broken-down into a finite number of triangles, which have simple rules defining their centroid and area. With this in mind, the following commands create a new triangle (T1), that can be split-out from the polygon represented by poly_mat, and finds its centroid based on the formula \(1/3(a + b + c)\) where \(a\) to \(c\) are coordinates representing the triangles vertices:

Illustration of polygon centroid calculation problem.

Figure 10.2: Illustration of polygon centroid calculation problem.

If we calculate the centroids of all such polygons the solution should be the average x and y values of all centroids. There is one problem though: some triangles are more important (larger) than others. Therefore to find the geographic centroid we need to take the weighted mean of all sub-triangles, with weigths proportional to area. The formula to calculate the area of a triangle is:

\[ \frac{Ax ( B y − C y ) + B x ( C y − A y ) + C x ( A y − B y )} { 2 } \]

Where \(A\) to \(C\) are the triangle’s three points and \(x\) and \(y\) refer to the x and y dimensions. A translation of this formula into R code that works with the data in the matrix representation of a triangle T1 is as follows (the function abs() ensures a positive result):

This code chunk works and outputs the correct result.42 The problem with the previous code chunk is that it is very and difficult to re-run on another object with the same ‘triangle matrix’ format. To make the code more generalizable, let’s convert the code into a function (something described in 10.3) that works with any matrix represenations of a triangle that we’ll call x:

The function t_area generalizes the solution by taking any object x, assumed to have the same dimensions as the triangle represented in T1. We can verify it works on the triangle matrix T1 as follows:

Likewise we can create a function that find’s the triangle’s centroid:

The next stage is to create the second triangle and calculate its area and centroid. This is done in the code chunk below, which binds the 3rd and 4th coordinates onto the 1st, and uses the same functions we created above to calculate its area and width:

From this point it is not a big leap to see how to create the 3rd and final triangle that constitutes the polygon represented by poly_mat (see exercises below). However, an aim of algorithms is often to generalise and automate the solution. In the name of automation (and to avoid creating individual triangles manually!) we use iteration to create all triangles representing the polygon in a single line. We could use a for() loop or lapply() for this work but have chosen map() from the purrr package because it allows concise code: the . in the commands below refer to ‘each element of the object i’ (see ?purrr::map for details):

We are now in a position to calculate the total area and geographic centroid of the polygon as follows:

Is this right? We can verify the answer by converting poly_mat into a simple feature collection as follows:

Illustration of centroid calculation.

Figure 10.3: Illustration of centroid calculation.

10.3 Functions

Like algorithms functions take an input and return an output. The difference is that functions are ‘packaged’ so that are R objects in their own right. We have already seen the advantages of using functions in the previous section: the function t_area() contains the steps needed find the area of any ‘triangle matrix’ and can be called with a single line, whereas the full underlying code requires many lines of code. Functions are thus a mechanism for generalizing code. We can use the function to find the area of a triangle with a base 3 units wide and a height of 1, for example, as follows:

A useful feature of functions is that they are modular. Providing you know what the output will be, one function can be used as the building block of another. This is exactly what we will do in this section. Building on the content of the previous section, in which it was shown how the area of a polygon can be found by following a series of steps in order, this section will create a function to calculate the area of any polygon (with caveats that will become clear). This function, that we’ll call poly_centroid() will mimick the behaviour of sf::st_centroid() from the sf package, with a few additions to show how arguments work.

Low-level function such as poly_centroid() can be built-on to provide different types of output. If a common need is to return the result as an object of class sfg, for example, this can be done by creating a ‘wrapper’ function that modifies the output of poly_centroid() before returning the result:

We can verify that the output is the same as the output from sf::st_centroid() as follows:

An important concept to consider when developing your own function is type stability. Functions are type stable if they always return objects of the same class and, generally, this means returning objects of the same type as the input object. To illustrate this concept in practice we will create a type stable version poly_centroid() that always takes a matrix with 2 columns as an input and always returns a matrix with 2 columns representing x and y coordinates:

The first line of the function is an example of ‘defensive programming’: it checks the input is in the right format (a matrix with 2 columns) before proceeding. Such checks can ensure that the code is robust and does not silently fail. We can verify it works with matrix(centroid_coords, ncol = 2). To see the input data type check working we can try running the function on a matrix with 3 columns:

10.4 Case study

10.5 Exercises

  1. In section 10.2 we created a function that finds the geographic centroid of a shape, which is implemented in the sf function sf::st_centroid(). Building on this example, write a function only using base R functions that can find the total length of linestrings represented in matrix form.
  2. In section 10.3 we created a different versions of the poly_centroid() function that generated outputs of class sfg (poly_centroid_sfg()) and type-stable matrix outputs (poly_centroid_type_stable()). Further extend the function by creating a version (e.g. called poly_centroid_sf()) that is type stable (only accepts inputs of class sf) and returns sf objects (hint: you may need to convert the object x into a matrix with the command sf::st_coordinates(x).
    • Verify it works by running poly_centroid_sf(sf::st_sf(sf::st_sfc(poly_sfc)))
    • What error message do you get when you try to run poly_centroid_sf(poly_mat)?

References

Xiao, Ningchuan. 2016. GIS Algorithms: Theory and Applications for Geographic Information Science & Technology. 55 City Road, London. https://doi.org/10.4135/9781473921498.

Bellos, Alex. 2011. Alex’s Adventures in Numberland. London: Bloomsbury Paperbacks.

Berg, Mark de, Otfried Cheong, Marc van Kreveld, and Mark Overmars. 2008. Computational Geometry: Algorithms and Applications. Springer Science & Business Media.


  1. The book’s title was also influential, forming the basis of the word algebra.

  2. as can be verified with the formula for the area of a triangle whose base is horizontal: area equals half of the base width times its height — \(A = B * H / 2\) — (\(10 * 10 / 2\) in this case, as can be seen in Figure 10.3).