# 10 Scripts, algorithms and functions

## Prerequisites

This chapter primarily uses base R; the **sf** package is used to check the result of an algorithm we will develop.
It assumes you have an understanding of the geographic classes introduced in Chapter 2 and how they can be used to represent a wide range of input file formats (see Chapter 7).

## 10.1 Introduction

Chapter 1 established that geocomputation is not only about using existing tools, but developing new ones, “in the form of shareable R scripts and functions”. This chapter teaches these building blocks of reproducible code. It also introduces low-level geographic algorithms, of the type used in Chapter 9. Reading it should help you to understand how such algorithms work and to write code that can be used many times, by many people, on multiple datasets. The chapter cannot, by itself, make you a skilled programmer. Programming is hard and requires plenty of practice (Abelson, Sussman, and Sussman 1996):

To appreciate programming as an intellectual activity in its own right you must turn to computer programming; you must read and write computer programs — many of them.

There are strong reasons for moving in that direction, however.^{53}
The advantages of reproduciblity go beyond allowing others to replicate your work:
reproducible code is often better in every way than code written to be run only once, including in terms of computational efficiency, scalability and ease of adapting and maintaining it.

Scripts are the basis of reproducible R code, a topic covered in section 10.2.
Algorithms are recipes for modifying inputs using a series of steps, resulting in an output, as described in section 10.3.
To ease sharing and reproducibility, algorithms can be placed into functions.
That is the topic of section 10.4.
The example of finding the centroid of a polygon will be used to tie these concepts together.
Chapter 5 already introduced a centroid function `st_centroid()`

, but this example highlights how seemingly simple operations are the result of comparatively complex code, affirming the following observation (Wise 2001):

One of the most intriguing things about spatial data problems is that things which appear to be trivially easy to a human being can be surprisingly difficult on a computer.

The example also reflects a secondary aim of the chapter which, following Xiao (2016), is “not to duplicate what is available out there, but to show how things out there work”.

## 10.2 Scripts

If functions distributed in packages are the building blocks of R code, scripts are the glue that holds them together, in a logical order, to create reproducible workflows.
To programming novices scripts may sound intimidating but they are simply plain text files, typically saved with an extension representing the language they contain.
R scripts are generally saved with a `.R`

extension and named to reflect what they do.
An example is `10-hello.R`

, a script file stored in the `code`

folder of the book’s repository, which contains the following two lines of code:

The lines of code may not be particularly exciting but they demonstrate the point: scripts do not need to be complicated.
Saved scripts can be called and executed in their entirety with `source()`

, as demonstrated below which shows how the comment is ignored but the instruction is executed:

There are no strict rules on what can and cannot go into script files and nothing to prevent you from saving broken, non-reproducible code.^{54}
There are, however, some conventions worth following:

- Write the script in order: just like the script of a film, scripts should have a clear order such as ‘setup’, ‘data processing’ and ‘save results’ (roughly equivalent to ‘beginning’, ‘middle’ and ‘end’ in a film).
- Comment the script sufficiently for others (and your future self) to understand it, but not too much. At a minimum a comment should state the purpose of the script (see Figure 10.1) and (for long scripts) divide it into sections (e.g. with
`Ctrl+Shift+R`

in RStudio which creates comments ending in`----`

that can be ‘folded’ in the editor). - Above all scripts should be reproducible: self-contained scripts that will work on any computer are more useful than scripts that only run on your computer, on a good day. This involves attaching required packages at the beginning, reading-in data from persistent sources (e.g. from a reliable website or API) and ensuring that previous steps have been taken.
^{55}

It is hard to enforce reproducibility in R scripts, but there are tools that can help. By default RStudio ‘code-checks’ R scripts and underlines faulty code with a red wavy line, as illustrated below:

**reprex**package. Its main function

`reprex()`

tests lines of R code to check if they are reproducible, and provides markdown output to facilitate communication on sites such as GitHub.
See reprex.tidyverse.org/ for details.
The contents of this section apply to any type of R script.
A particular consideration with scripts for geocomputation is that they tend to have external dependencies, such as the QGIS dependency to run code in Chapter 9, and require input data in a specific format.
Such dependencies should be mentioned as comments in the script or elsewhere in the project of which it is a part, as illustrated in the script `10-centroid-alg.R`

.
The work undertaken by this script is demonstrated in the reproducible example below, which works on a pre-requisite object named `poly_mat`

, a square with sides 9 units in length (the meaning of this will become apparent in the next section):^{56}

```
poly_mat = cbind(
x = c(0, 0, 9, 9, 0),
y = c(0, 9, 9, 0, 0)
)
source("https://git.io/10-centroid-alg.R") # short url
```

```
#> [1] "The area is: 81"
#> [1] "The coordinates of the centroid are: 4.5, 4.5"
```

## 10.3 Geographic algorithms

Algorithms can be understood as the computing equivalent of a cooking recipe. They are a complete set of instructions which, when undertaken on the input (ingredients) , result in useful (tasty) outputs. Before diving into a concrete case study, a brief history will show how they relate to scripts (covered in section 10.2) and functions (which can be used to generalize algorithms, as we’ll see in section 10.4).

The word algorithm originated in 9^{th} Century Baghdad with the publication of *Hisab al-jabr w’al-muqabala*, an early maths textbook.
The book was translated into Latin and became so popular that the author’s last name al-Khwārizmī “was immortalized as a scientific term: Al-Khwarizmi
became Alchoarismi, Algorismi and, eventually, algorithm” (Bellos 2011).
In the computing age algorithm refers to a series of steps that solves a problem, resulting in a pre-defined output.
Inputs must be formally defined in a suitable data structure (Wise 2001).
Algorithms often start as flow charts or psuedocode showing the aim of the process before being implemented in code.
To ease usability, common algorithms are often packaged inside functions, which may hide some or all of the steps taken (unless you look at the function’s source code, see section 10.4).

Geoalgorithms, such as those we encountered in Chapter 9, are algorithms that take geographic data in and, generally, return geographic results (alternative terms for the same thing include *GIS algorithms* and *geometric algorithms*).
That may sound simple but it is a deep subject with an entire academic field — *Computational Geometry*, a branch of computer science — dedicated to their study (Berg et al. 2008).

An example is an algorithm that finds the centroid of a polygon. There are many approaches to centroid calculation, some of which work only on specific types of spatial data. For the purposes of this section, we choose an approach that is easy to visualize: breaking the polygon into many triangles and finding the centroid of each of these, an approach discussed by Kaiser and Morin (1993) alongside other centroid algorithms. It helps to further break-down this approach into discrete tasks before writing any code (subsequently referred to as step 1 to step 4, these could also be presented as a schematic diagram or pseudocode):

- Divide the polygon into contiguous triangles
- Find the centroid of each triangle
- Find the area of each triangle
- Find the area-weighted mean of triangle centroids

These steps may sound straightforward, but converting words into working code requires some work and plenty of trial-and-error, even when the inputs are constrained:
The algorithm will only work for *convex polygons*, which contain no internal angles greater than 180°, no star shapes allowed (packages **decido** and **sfdct** can triangulate non-convex polygons using external libraries, as shown in the algorithms vignette at geocompr.github.io).

The simplest data structure of a polygon is a matrix of x and y coordinates in which each row represents a vertex tracing the polygon’s border in order where the first and last rows are identical (Wise 2001).
In this case we’ll create a polygon with 5 vertices in base R, building on an example from *GIS Algorithms* (Xiao 2016 see github.com/gisalgs for Python code), as illustrated in Figure 10.2:

```
x_coords = c(10, 0, 0, 12, 20, 10)
y_coords = c(0, 0, 10, 20, 15, 0)
poly_mat = cbind(x_coords, y_coords)
```

Now we have an example dataset we are ready to undertake step 1 outlined above.
The code below shows how this can be done by creating a single triangle (`T1`

), that demonstrates the method; it also demonstrates step 2 by calculating its centroid based on the formula \(1/3(a + b + c)\) where \(a\) to \(c\) are coordinates representing the triangle’s vertices:

```
O = poly_mat[1, ] # create a point representing the origin
T1 = rbind(O, poly_mat[2:3, ], O) # create 'triangle matrix'
C1 = (T1[1, ] + T1[2, ] + T1[3, ]) / 3 # find centroid
```

Step 3 is to find the area of each triangle, so a *weighted mean* accounting for the disproportionate impact of large triangles is accounted for.
The formula to calculate the area of a triangle is as follows (Kaiser and Morin 1993):

\[ \frac{Ax ( B y − C y ) + B x ( C y − A y ) + C x ( A y − B y )} { 2 } \]

Where \(A\) to \(C\) are the triangle’s three points and \(x\) and \(y\) refer to the x and y dimensions.
A translation of this formula into R code that works with the data in the matrix representation of a triangle `T1`

is as follows (the function `abs()`

ensures a positive result):

```
abs(T1[1, 1] * (T1[2, 2] - T1[3, 2]) +
T1[2, 1] * (T1[3, 2] - T1[1, 2]) +
T1[3, 1] * (T1[1, 2] - T1[2, 2]) ) / 2
#> [1] 50
```

This code chunk outputs the correct result.^{57}
The problem is that code is clunky and must by re-typed if we want to run it on another triangle matrix.
To make the code more generalizable, we will see how it can be converted into a function in the next section (10.4).

Step 4 requires steps 2 and 3 to be undertaken not just on one triangle (as demonstrated above) but on all triangles.
This requires *iteration* to create all triangles representing the polygon, illustrated in Figure 10.3.
`lapply()`

and `vapply()`

are used to iterate over each triangle here because they provide a concise solution in base R:^{58}

```
i = 2:(nrow(poly_mat) - 2)
T_all = lapply(i, function(x) {
rbind(O, poly_mat[x:(x + 1), ], O)
})
C_list = lapply(T_all, function(x) (x[1, ] + x[2, ] + x[3, ]) / 3)
C = do.call(rbind, C_list)
A = vapply(T_all, function(x) {
abs(x[1, 1] * (x[2, 2] - x[3, 2]) +
x[2, 1] * (x[3, 2] - x[1, 2]) +
x[3, 1] * (x[1, 2] - x[2, 2]) ) / 2
}, FUN.VALUE = double(1))
```

We are now in a position to complete step 4 to calculate the total area with `sum(A)`

and the centroid coordinates of the polygon with `weighted.mean(C[, 1], A)`

and `weighted.mean(C[, 2], A)`

(exercise for alert readers: verify these commands work).
To demonstrate the link between algorithms and scripts the contents of this section have been condensed into `10-centroid-alg.R`

.
We saw at the end of section 10.2 how this script can calculate the centroid of a square.
The great thing about *scripting* the algorithm is that it works on the new `poly_mat`

object (see exercises below to verify these results with reference to `st_centroid()`

):

```
source("code/10-centroid-alg.R")
#> [1] "The area is: 245"
#> [1] "The coordinates of the centroid are: 8.83, 9.22"
```

The example above shows that low-level geographic operations *can* be developed from first principles with base R.
It also shows that if a tried-and-tested solution already exists, it may not be worth re-inventing the wheel:
if our aim was simply to find the centroid of a polygon it would have been quicker to represent `poly_mat`

as an **sf** object and use the pre-existing `sf::st_centroid()`

function instead.
However, the great benefit of writing algorithms from 1^{st} principles is that you will understand every step of the process, something that cannot be guaranteed when using other peoples’ code.
A further consideration is performance: R is slow compared with low level languages such as C++ for number crunching (see section 1.3) and optimization is difficult.
Still, if the aim is to develop new methods computational efficiency should not be a primary consideration, as encapsulated in the saying “premature optimization is the root of all evil (or at least most of it) in programming” (Knuth 1974).

Algorithm development is hard.
This should be apparent from the amount of work that has gone into developing a centroid algorithm in base R that is just one, rather inefficient, approach to the problem with limited real-world applications (in the real world convex polygons are uncommon).
The experience should lead to an appreciation of low-level geographic libraries such as GEOS (which underlies `sf::st_centroid()`

) and CGAL (the Computational Geometry Algorithms Library) which not only run fast but work on a wide range of input geometry types.
A great advantage of the open source nature of such libraries is that their source code is readily available for study, comprehension and (for those with the skills and confidence) modification.^{59}

## 10.4 Functions

Like algorithms, functions take an input and return an output. The difference is that functions are ‘first class’ objects in R and are more flexible than scripts. We can, for example, create function that undertakes step 2 of our centroid generation algorithm as follows:

The above example demonstrates two key components of functions:
1) the function *body*, the code inside the curly brackets that define what the function does with the inputs; and 2) the *formals*, the list of arguments the function works with — `x`

in this case (the third key component, the environment, is beyond the scope of this section).
By default, functions return the last object that has been calculated (the coordinates of the centroid in the case of `t_centroid()`

).^{60}

The function now works on any inputs you pass it, as illustrated in the below command which calculates the area of the 1^{st} triangle from the example polygon in the previous section (see Figure 10.3):

We can also create a function to calculate a triangle’s area, which we will name `t_area()`

:

```
t_area = function(x) {
abs(
x[1, 1] * (x[2, 2] - x[3, 2]) +
x[2, 1] * (x[3, 2] - x[1, 2]) +
x[3, 1] * (x[1, 2] - x[2, 2])
) / 2
}
```

Note that after the function’s creation, a triangle’s area can be calculated in a single line of code, avoiding duplication of verbose code:
functions are a mechanism for *generalizing* code.
The newly created function `t_area()`

takes any object `x`

, assumed to have the same dimensions as the ‘triangle matrix’ data structure we’ve been using, and returns its area, as illustrated on `T1`

as follows:

We can test the generalizability of the function by using it to find the area of a new triangle matrix, which has a height of 1 and a base of 3:

A useful feature of functions is that they are modular.
Providing you know what the output will be, one function can be used as the building block of another.
Thus, the functions `t_centroid()`

and `t_area()`

can be used as sub-components of a larger function to do the work of the script `10-centroid-alg.R`

: calculate the area of any convex polygon.
The code chunk below creates the function `poly_centroid()`

to mimic the behavior of `sf::st_centroid()`

:

```
poly_centroid = function(x) {
i = 2:(nrow(x) - 2)
T_all = T_all = lapply(i, function(x) {
rbind(O, poly_mat[x:(x + 1), ], O)
})
C_list = lapply(T_all, t_centroid)
C = do.call(rbind, C_list)
A = vapply(T_all, t_area, FUN.VALUE = double(1))
c(weighted.mean(C[, 1], A), weighted.mean(C[, 2], A))
}
```

Functions such as `poly_centroid()`

can further be built-on to provide different types of output.
To return the result as an object of class `sfg`

, for example, a ‘wrapper’ function can be used to modify the output of `poly_centroid()`

before returning the result:

```
poly_centroid_sfg = function(x) {
centroid_coords = poly_centroid(x)
sf::st_point(centroid_coords)
}
```

We can verify that the output is the same as the output from `sf::st_centroid()`

as follows:

## 10.5 Exercises

- Read the script
`10-centroid-alg.R`

in the`code`

folder of the book’s GitHub repo.- Which of the best practices covered in section 10.2 does it follow?
- Create a version of the script on your computer in an IDE such as RStudio (preferably by typing-out the script line-by-line, in your own coding style and with your own comments, rather than copy-pasting — this will help you learn how to type scripts). Using the example of a square polygon (e.g. created with
`poly_mat = cbind(x = c(0, 0, 9, 9, 0), y = c(0, 9, 9, 0, 0))`

) execute the script line-by-line. - What changes could be made to the script to make it more reproducible?
- How could the documentation be improved?

- In section 10.3 we calculated that the area and geographic centroid of the polygon represented by
`poly_mat`

was 245 and 8.8, 9.2, respectively.- Reproduce the results on your own computer with reference to the script
`10-centroid-alg.R`

, an implementation of this algorithm (bonus: type out the commands - try to avoid copy-pasting). - Are the results correct? Verify them by converting
`poly_mat`

into an`sfc`

object (named`poly_sfc`

) with`st_polygon()`

(hint: this function takes objects of class`list()`

) and then using`st_area()`

and`st_centroid()`

.

- Reproduce the results on your own computer with reference to the script
- It was stated that the algorithm we created only works for
*convex hulls*. Define convex hulls (see Chapter 5) and test the algorithm on a polygon that is*not*a convex hull.- Bonus 1: Think about why the method only works for convex hulls and note changes that would need to be made to the algorithm for other types of shape to be calculated.
- Bonus 2: Building on the contents of
`10-centroid-alg.R`

, write an algorithm only using base R functions that can find the total length of linestrings represented in matrix form.

- In section 10.4 we created different versions of the
`poly_centroid()`

function that generated outputs of class`sfg`

(`poly_centroid_sfg()`

) and type-stable`matrix`

outputs (`poly_centroid_type_stable()`

). Further extend the function by creating a version (e.g. called`poly_centroid_sf()`

) that is type stable (only accepts inputs of class`sf`

)*and*returns`sf`

objects (hint: you may need to convert the object`x`

into a matrix with the command`sf::st_coordinates(x)`

).- Verify it works by running
`poly_centroid_sf(sf::st_sf(sf::st_sfc(poly_sfc)))`

- What error message do you get when you try to run
`poly_centroid_sf(poly_mat)`

?

- Verify it works by running

### References

Abelson, Harold, Gerald Jay Sussman, and Julie Sussman. 1996. *Structure and Interpretation of Computer Programs*. Second. The Mit Electrical Engineering and Computer Science Series. Cambridge, Massachusetts: MIT Press.

Wise, Stephen. 2001. *GIS Basics*. CRC Press.

Xiao, Ningchuan. 2016. *GIS Algorithms: Theory and Applications for Geographic Information Science & Technology*. 55 City Road, London. https://doi.org/10.4135/9781473921498.

Bellos, Alex. 2011. *Alex’s Adventures in Numberland*. London: Bloomsbury Paperbacks.

Berg, Mark de, Otfried Cheong, Marc van Kreveld, and Mark Overmars. 2008. *Computational Geometry: Algorithms and Applications*. Springer Science & Business Media.

Kaiser, M.J., and T.L. Morin. 1993. “Algorithms for Computing Centroids.” *Computers & Operations Research* 20 (2): 151–65. https://doi.org/10.1016/0305-0548(93)90071-P.

Knuth, Donald E. 1974. “Computer Programming as an Art.” *Commun. ACM* 17 (12): 667–73. https://doi.org/10.1145/361604.361612.

This chapter does not teach programming itself, but the guidance and exercises within should provide a foundation for learning to program. For more on programming see Wickham (2014a), Gillespie and Lovelace (2016) and, for a GIS-focussed book based on Python, Xiao (2016). ↩

Lines of code that do not contain valid R should be commented to prevent errors, as with line 1 of the

`10-hello.R`

script.↩Prior steps can be referred to with a comment or with an if statement such as

`if(!exists("x")) source("x.R")`

(which would run the script file`x.R`

if the object`x`

is missing).↩This example shows that

`source()`

works with URLs (a shortened version is used here), assuming you have an internet connection. If you do not, the same script can be called with`source("code/10-centroid-alg.R")`

, assuming you are running R from the root directory of the`geocompr`

folder, which can downloaded from https://github.com/Robinlovelace/geocompr.↩The result can be verified using the formula for the area of a triangle whose base is horizontal, as is the case for T1 (see Figure 10.3): area is half of the base width times its height or \(A = B * H / 2\). In this case \(10 * 10 / 2 = 50\).↩

See

`?lapply`

for documentation. Alternative functions for iteration include`map()`

from the**purrr**package or a`for()`

loop (see Chapter 13).`do.call()`

is used in the code chunk as a base R equivalent of`dplyr::bind_rows()`

: it coerces the list elements into a single matrix. ↩The CGAL function

`CGAL::centroid()`

is in fact composed of 7 sub-functions as described at https://doc.cgal.org/latest/Kernel_23/group__centroid__grp.html allowing it to work on a wide range of input data types, whereas the solution we created works only on a very specific input data type. The source code underlying GEOS function`Centroid::getCentroid()`

can be found at https://github.com/libgeos/geos/search?q=getCentroid.↩You can also explicitly set the output of a function by adding

`return(output)`

into the body of the function, where`output`

is the result to be returned.↩