# 10 Algorithms and functions

## Prerequisites

Chapter 1 states that geocomputation is not only about using existing tools, but developing new ones, “in the form of shareable R scripts and functions”. The aim of this chapter is to teach how to develop reproducible scripts, functions and algorithms, key components in the process of tool creation. The chapter assumes you have an understanding of the geographic data classes introduced in Chapter 2, and have already imported the datasets needed for your work, for example from sources outlined in Chapter 7.

We will consider example R scripts for geographic data and how to make them more reproducible in section 10.1. Algorithms are recipes for modifying inputs using a series of steps, resulting in an output, as described in section 10.2. To ease sharing and reproducibility algorithms can be placed into functions, which can then be distributed either in script files or as R packages, the building blocks of reproducible code. That is the topic of section 10.3.

The algorithms developed (including finding the centroid of a polygon, an example used in this chapter) are readily available in GIS software. However, the aim is “not to duplicate what is available out there, but to show how things out there work” (Xiao 2016). This chapter takes a similar approach and is therefore the most low-level and potentially advanced (in terms of the code, not application) so far.

## 10.1 Scripts

If packages are the building blocks of reproducible code, scripts are the glue that holds them together. There are no strict rules on what can and cannot go into script files and nothing to prevent you from saving broken, non-reproducible code. There are, however, some rules of thumb and conventions worth following when writing R scripts, outlined below:

- Write the script in order: just like the script of a play, scripts should have a clear order such as ‘setup’, ‘data processing’ and ‘save results’ (roughly equivalent to ‘beginning’, ‘middle’ and ‘end’ in a film).
- Make the script reproducible: scripts will be of more use to you and others if they are self-contained and can be run by other people. This involves stating dependencies (loading required packages at the outset, like the ‘Prerequisites’ section), reading-in data from persistent sources (e.g. from a reliable website or API) and mentioning any code that must be run before running the script (e.g. with a comment
`# run script0.R before this`

). - Comment the script sufficiently for others (and your future self) to understand it but not so much that the comments themselves become hard to maintain: at a minimum a good script file should contain information on the purpose of the script (see Figure 10.1) and division into chunks, perhaps by appending
`----`

to section headings, which allows ‘folding’ of R scripts in RStudio.

Although there is no way to enforce reproducibility in R scripts, there are tools that can help. By default RStudio ‘code-checks’ R scripts and underlines faulty code with a red way line, as illustrated below:

**reprex**package. Its main function

`reprex()`

tests of lines of R code to check if they are reproduible, and provides markdown output to facilitate communication on sites such as GitHub.
See reprex.tidyverse.org/ for details.
## 10.2 Geographic algorithms

Algorithms can be understood as the computing equivalent of a cooking recipe: a series of instructions which, when taken on appropriate ingredients, results in an output that is more useful (or tasty) than the raw ingredients. Before considering ‘geoalgorithms’, it is worth taking a brief detour around the history of the algorithms, to understand how they relate to scripts and functions which are covered next.

The word algorithm comes from Baghdad when, in the 9^{th} Century AD, an early maths book was published called *Hisab al-jabr w’al-muqabala*.
The book was translated into Latin and became so popular that the author Muḥammad ibn Mūsā al-Khwārizmī “was immortalized as a scientific term: Al-Khwarizmi [sic] became Alchoarismi, Algorismi and, eventually, algorithm” (Bellos 2011).^{41}

In the computing age algorithm refers to a series of steps that take a clearly defined input to produce an output. Algorithms often start as in flow charts and psuedocode showing the aim of the process before being implemented in a formal language such as R. Because the same algorithm will be used many times on the different inputs it rarely makes sense to type out the entire algorithm each time: algorithms are most easily used when they are implemented inside functions (see section 10.3).

Geoalgorithms are a special case: they take geographic data in and, generally, return geographic results.
Also referred to as *GIS algorithms* and *geometric algorithms*, an entire academic field — *Computational Geometry*, a branch of computer science — is dedicated to their study and development (Berg et al. 2008).
A simple example is an algorithm that finds the centroid of an object.
This may sound like a simple task but in fact it involves some work, even for the simple case of single polygons containing no holes.
The basic representation of a polygon object is in a matrix representing the vertices between which straight lines are drawn (the first and last points must be the same, something we’ll touch on later).
In this case we’ll create a polygon with 5 vertices in base R, building on an example from *GIS Algorithms* (Xiao 2016 see github.com/gisalgs for Python code):

```
x_coords = c(10, 0, 0, 2, 20, 10)
y_coords = c(0, 0, 10, 12, 15, 0)
poly_mat = cbind(x_coords, y_coords)
```

As with many computational (or other) problems, it makes sense to break the problem into smaller chunks.
All polygons can be broken-down into a finite number of triangles, which have simple rules defining their centroid and area.
With this in mind, the following commands create a new triangle (`T1`

), that can be split-out from the polygon represented by `poly_mat`

, and finds its centroid based on the formula
\(1/3(a + b + c)\) where \(a\) to \(c\) are coordinates representing the triangles vertices:

```
O = poly_mat[1, ] # create a point representing the origin
T1 = rbind(O, poly_mat[2:3, ], O) # create 'triangle matrix'
C1 = (T1[1, ] + T1[2, ] + T1[3, ]) / 3 # find centroid
```

If we calculate the centroids of all such polygons the solution should be the average x and y values of all centroids.
There is one problem though: some triangles are more important (larger) than others.
Therefore to find the geographic centroid we need to take the *weighted mean* of all sub-triangles, with weigths proportional to area.
The formula to calculate the area of a triangle is:

\[ \frac{Ax ( B y − C y ) + B x ( C y − A y ) + C x ( A y − B y )} { 2 } \]

Where \(A\) to \(C\) are the triangle’s three points and \(x\) and \(y\) refer to the x and y dimensions.
A translation of this formula into R code that works with the data in the matrix representation of a triangle `T1`

is as follows (the function `abs()`

ensures a positive result):

```
abs(T1[1, 1] * (T1[2, 2] - T1[3, 2]) +
T1[2, 1] * (T1[3, 2] - T1[1, 2]) +
T1[3, 1] * (T1[1, 2] - T1[2, 2]) ) / 2
#> [1] 50
```

This code chunk works and outputs the correct result.^{42}
The problem with the previous code chunk is that it is very and difficult to re-run on another object with the same ‘triangle matrix’ format.
To make the code more generalizable, let’s convert the code into a function (something described in 10.3) that works with any matrix represenations of a triangle that we’ll call `x`

:

```
t_area = function(x) {
abs(
x[1, 1] * (x[2, 2] - x[3, 2]) +
x[2, 1] * (x[3, 2] - x[1, 2]) +
x[3, 1] * (x[1, 2] - x[2, 2])
) / 2
}
```

The function `t_area`

generalizes the solution by taking any object `x`

, assumed to have the same dimensions as the triangle represented in `T1`

.
We can verify it works on the triangle matrix `T1`

as follows:

Likewise we can create a function that find’s the triangle’s centroid:

```
t_centroid = function(x) {
(x[1, ] + x[2, ] + x[3, ]) / 3
}
t_centroid(T1)
#> x_coords y_coords
#> 3.33 3.33
```

The next stage is to create the second triangle and calculate its area and centroid.
This is done in the code chunk below, which binds the 3^{rd} and 4^{th} coordinates onto the 1^{st}, and uses the same functions we created above to calculate its area and width:

From this point it is not a big leap to see how to create the 3^{rd} and final triangle that constitutes the polygon represented by `poly_mat`

(see exercises below).
However, an aim of algorithms is often to *generalise* and *automate* the solution.
In the name of automation (and to avoid creating individual triangles manually!) we use *iteration* to create all triangles representing the polygon in a single line.
We could use a `for()`

loop or `lapply()`

for this work but have chosen `map()`

from the **purrr** package because it allows concise code:
the `.`

in the commands below refer to ‘each element of the object `i`

’ (see `?purrr::map`

for details):

```
# Aim: create all triangles representing a polygon
i = 2:(nrow(poly_mat) - 2)
Ti = purrr::map(i, ~rbind(O, poly_mat[.:(. + 1), ], O))
A = purrr::map_dbl(Ti, ~t_area(.))
C = t(sapply(Ti, t_centroid))
```

We are now in a position to calculate the total area and geographic centroid of the polygon as follows:

Is this right?
We can verify the answer by converting `poly_mat`

into a simple feature collection as follows:

```
poly_sfc = sf::st_polygon(list(poly_mat))
sf::st_area(poly_sfc)
#> [1] 190
sf::st_centroid(poly_sfc)
#> POINT (8.04 7.33)
```

## 10.3 Functions

Like algorithms functions take an input and return an output.
The difference is that functions are ‘packaged’ so that are R objects in their own right.
We have already seen the advantages of using functions in the previous section:
the function `t_area()`

*contains* the steps needed find the area of any ‘triangle matrix’ and can be called with a single line, whereas the full underlying code requires many lines of code.
Functions are thus a mechanism for *generalizing* code.
We can use the function to find the area of a triangle with a base 3 units wide and a height of 1, for example, as follows:

A useful feature of functions is that they are modular.
Providing you know what the output will be, one function can be used as the building block of another.
This is exactly what we will do in this section.
Building on the content of the previous section, in which it was shown how the area of a polygon can be found by following a series of steps in order, this section will *create a function* to calculate the area of any polygon (with caveats that will become clear).
This function, that we’ll call `poly_centroid()`

will mimick the behaviour of `sf::st_centroid()`

from the **sf** package, with a few additions to show how arguments work.

```
poly_centroid = function(x, output = "matrix") {
i = 2:(nrow(x) - 2)
Ti = purrr::map(i, ~rbind(O, x[.:(. + 1), ], O))
A = purrr::map_dbl(Ti, ~t_area(.))
C = t(sapply(Ti, t_centroid))
centroid_coords = c(weighted.mean(C[, 1], A), weighted.mean(C[, 2], A))
if(output == "matrix") {
return(centroid_coords)
} else if(output == "area")
return(sum(A))
}
```

Low-level function such as `poly_centroid()`

can be built-on to provide different types of output.
If a common need is to return the result as an object of class `sfg`

, for example, this can be done by creating a ‘wrapper’ function that modifies the output of `poly_centroid()`

before returning the result:

```
poly_centroid_sfg = function(x) {
centroid_coords = poly_centroid(x)
centroid_sfg = sf::st_point(centroid_coords)
centroid_sfg
}
```

We can verify that the output is the same as the output from `sf::st_centroid()`

as follows:

An important concept to consider when developing your own function is *type stability*.
Functions are type stable if they always return objects of the same class and, generally, this means returning objects of the same type as the input object.
To illustrate this concept in practice we will create a type stable version `poly_centroid()`

that always takes a matrix with 2 columns as an input and always returns a matrix with 2 columns representing x and y coordinates:

```
poly_centroid_type_stable = function(x) {
stopifnot(is.matrix(x) & ncol(x) == 2)
centroid_coords = poly_centroid(x)
return(matrix(centroid_coords, ncol = 2))
}
```

The first line of the function is an example of ‘defensive programming’:
it checks the input is in the right format (a matrix with 2 columns) before proceeding.
Such checks can ensure that the code is robust and does not silently fail.
We can verify it works with `matrix(centroid_coords, ncol = 2)`

.
To see the input data type check working we can try running the function on a matrix with 3 columns:

## 10.4 Case study

## 10.5 Exercises

- In section 10.2 we created a function that finds the geographic centroid of a shape, which is implemented in the
**sf**function`sf::st_centroid()`

. Building on this example, write a function only using base R functions that can find the total length of linestrings represented in matrix form. - In section 10.3 we created a different versions of the
`poly_centroid()`

function that generated outputs of class`sfg`

(`poly_centroid_sfg()`

) and type-stable`matrix`

outputs (`poly_centroid_type_stable()`

). Further extend the function by creating a version (e.g. called`poly_centroid_sf()`

) that is type stable (only accepts inputs of class`sf`

)*and*returns`sf`

objects (hint: you may need to convert the object`x`

into a matrix with the command`sf::st_coordinates(x)`

.- Verify it works by running
`poly_centroid_sf(sf::st_sf(sf::st_sfc(poly_sfc)))`

- What error message do you get when you try to run
`poly_centroid_sf(poly_mat)`

?

- Verify it works by running

### References

Xiao, Ningchuan. 2016. *GIS Algorithms: Theory and Applications for Geographic Information Science & Technology*. 55 City Road, London. https://doi.org/10.4135/9781473921498.

Bellos, Alex. 2011. *Alex’s Adventures in Numberland*. London: Bloomsbury Paperbacks.

Berg, Mark de, Otfried Cheong, Marc van Kreveld, and Mark Overmars. 2008. *Computational Geometry: Algorithms and Applications*. Springer Science & Business Media.

The book’s title was also influential, forming the basis of the word

*algebra*.↩as can be verified with the formula for the area of a triangle whose base is horizontal: area equals half of the base width times its height — \(A = B * H / 2\) — (\(10 * 10 / 2\) in this case, as can be seen in Figure 10.3).↩