Like the introduction, this concluding chapter contains few code chunks. But its prerequisites are demanding. It assumes that you have:
- Read through and attempted the exercises in all the chapters of Part 1 (Foundations).
- Grasped the diversity of methods that build on these foundations, by following the code and prose in Part 2 (Extensions).
- Considered how you can use geocomputation to solve real-world problems, at work and beyond, after engaging with Part 3 (Applications).
The aim of this chapter is to synthesize the contents, with reference to recurring themes/concepts, and to inspire future directions of application and development. Section 15.2 discusses the wide range of options for handling geographic data in R. Choice is a key feature of open source software; the section provides guidance on choosing between the various options. Section 15.3 describes gaps in the book’s contents and explains why some areas of research were deliberately omitted while others were emphasized. This discussion leads to the question (which is answered in section 15.4): having read this book, where next? Section 15.5 returns to the wider issues raised in Chapter 1. In it we consider geocomputation as part of a wider ‘open source approach’ that ensures methods are publicly accessible, reproducible and supported by collaborative communities. This final section of the book also provides some pointers on how to get involved.
15.2 Package choice
A characteristic of R is that there are often multiple ways to achieve the same result. The code chunk below illustrates this by using three functions, covered in Chapters 3 and 5, to combine the 16 regions of New Zealand into a single geometry:
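A minimal sketch of such a chunk, assuming the sf, dplyr and spData packages are installed and that nz is the New Zealand regions dataset shipped with spData. The object names nz_u1 to nz_u3 follow the text; the use of the Population column and the summary column name pop are illustrative assumptions:

```r
library(sf)      # also registers sf methods for aggregate() and summarise()
library(spData)  # assumed source of the nz dataset

# 1. Geometry-only union: returns a simple feature column (sfc)
nz_u1 = st_union(nz)

# 2. Base R aggregate(): a single group for all regions, summing Population
nz_u2 = aggregate(nz["Population"], by = list(rep(1, nrow(nz))), FUN = sum)

# 3. dplyr summarise(): no grouping, so all rows collapse into one
nz_u3 = dplyr::summarise(nz, pop = sum(Population))

# Compare the geometry columns rather than the objects themselves
identical(nz_u1, st_geometry(nz_u2))
identical(nz_u1, st_geometry(nz_u3))
```

Because identical() also compares attributes, st_equals() can be used to test geometric equality alone if the two calls above do not both return TRUE.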
Although the classes, attributes and column names of the resulting objects nz_u1 to nz_u3 differ, their geometries are identical.
This is verified using the base R function identical().
Which to use?
It depends: the first only processes the geometry data contained in nz so is faster, while the other options perform attribute operations, which may be useful for subsequent steps.
The wider point is that there are often multiple options to choose from when working with geographic data in R, even within a single package. The range of options grows further when more R packages are considered: you could achieve the same result using the older sp package, for example. We recommend using sf and the other packages showcased in this book, for reasons outlined in Chapter 2, but it’s worth being aware of alternatives and being able to justify your choice of software.
A common (and sometimes controversial) choice is between tidyverse and base R approaches.
We cover both and encourage you to try both before deciding which is more appropriate for different tasks.
The following code chunk, described in Chapter 3, shows how attribute data subsetting works in each approach, using the base R operator [ and the select() function from the tidyverse package dplyr.
The syntax differs but the results are (in essence) the same:
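A minimal sketch of the two approaches, again assuming that nz comes from spData and using its Name column for illustration (the object names nz_name1 and nz_name2 are hypothetical):

```r
library(sf)
library(dplyr)   # attaches select() and the pipe
library(spData)  # assumed source of the nz dataset

nz_name1 = nz["Name"]            # base R approach
nz_name2 = nz %>% select(Name)   # tidyverse approach

# Both results keep the geometry column alongside Name
identical(nz_name1$Name, nz_name2$Name)
```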
Again the question arises: which to use?
Again the answer is: it depends.
Each approach has advantages: the pipe syntax is popular and appealing to some, while base R is more stable, and is well-known to others.
Choosing between them is therefore largely a matter of preference.
However, if you do choose to use tidyverse functions to handle geographic data, beware of a number of pitfalls (see the supplementary article tidyverse-pitfalls on the website that supports this book).
While commonly needed operators/functions (select(), for example) were covered in depth, there are hundreds of other functions for working with geographic data which have not been mentioned, let alone demonstrated, in the book.
Chapter 1 mentions 20+ influential packages for working with geographic data, and only a handful of these are demonstrated in subsequent chapters.
There are hundreds more.
176 packages are mentioned in the Spatial Task View alone (as of October 2018);
more packages and countless functions for geographic data are developed each year, making it impractical to do justice to all of them in a single book.
The rate of evolution in R’s spatial ecosystem may seem overwhelming but there are strategies to deal with the wide range of options. Our advice is to start by learning one approach in depth but to have a general understanding of the breadth of options available. This advice applies equally to solving geographic problems in R (section 15.4 covers developments in other languages) as it does to other fields of knowledge and application.
Of course, some packages perform much better than others, making package selection an important decision. From this diversity, we have focused on packages that are future-proof (they will work long into the future), high performance (relative to other R packages) and complementary. But there is still overlap in the packages we have used, as illustrated by the diversity of packages for making maps, for example (see Chapter 8).
Package overlap is not necessarily a bad thing. It can increase resilience, performance (partly driven by friendly competition and mutual learning between developers) and choice, a key feature of open source software. In this context the decision to use a particular approach, such as the sf/tidyverse/raster ecosystem advocated in this book, should be made with knowledge of alternatives. The sp/rgdal/rgeos ecosystem that sf is designed to supersede, for example, can do many of the things covered in this book and, due to its age, is built on by many other packages. Although best known for point pattern analysis, the spatstat package also supports raster and other vector geometries (Baddeley and Turner 2005). At the time of writing (October 2018) 69 packages depend on it, making it more than a package: spatstat is an alternative R-spatial ecosystem.
It is also important to be aware of promising alternatives that are under development. The package stars, for example, provides a new class system for working with spatiotemporal data. If you are interested in this topic, you can check for updates on the package’s source code and the broader SpatioTemporal Task View. The same principle applies to other domains: it is important to justify software choices and review software decisions based on up-to-date information.
15.3 Gaps and overlaps
There are a number of gaps in, and some overlaps between, the topics covered in this book. We have been selective, emphasizing some topics while omitting others. We have tried to emphasize topics that are most commonly needed in real-world applications such as geographic data operations, projections, data read/write and visualization. These topics appear repeatedly in the chapters, a substantial area of overlap designed to consolidate these essential skills for geocomputation.
On the other hand, we have omitted topics that are less commonly used, or which are covered in depth elsewhere. Statistical topics including point pattern analysis, spatial interpolation (kriging) and spatial epidemiology, for example, are only mentioned with reference to other topics such as the machine learning techniques covered in Chapter 11 (if at all). There is already excellent material on these methods, including statistically orientated chapters in Bivand, Pebesma, and Gómez-Rubio (2013) and a book on point pattern analysis by Baddeley, Rubak, and Turner (2015). Other topics which received limited attention were remote sensing and using R alongside (rather than as a bridge to) dedicated GIS software. There are many resources on these topics, including Wegmann, Leutner, and Dech (2016) and the GIS-related teaching materials available from Marburg University.
Instead of covering spatial statistical modeling and inference techniques, we focused on machine learning (see Chapters 11 and 14). Again, the reason was that there are already excellent resources on these topics, especially with ecological use cases, including Zuur et al. (2009), Zuur et al. (2017) and freely available teaching material and code on Geostatistics & Open-source statistical computing by David Rossiter, hosted at css.cornell.edu/faculty/dgr2. There are also excellent resources on spatial statistics using Bayesian modeling, a powerful framework for modeling and uncertainty estimation (Blangiardo and Cameletti 2015; Krainski et al. 2018).
Finally, we have largely omitted big data analytics. This might seem surprising, since geographic data in particular can become big really fast. But the prerequisite for doing big data analytics is knowing how to solve a problem on a small dataset. Once you have learned that, you can apply the same techniques to big data questions, though of course you need to expand your toolbox. The first thing to learn is how to handle geographic data queries. This is because big data analytics often boils down to extracting a small amount of data from a database for a specific statistical analysis. For this, we have provided an introduction to spatial databases and how to use a GIS from within R in Chapter 9. If you really have to do the analysis on a big or even the complete dataset, hopefully the problem you are trying to solve is embarrassingly parallel. In that case, you need to learn a system that can do this parallelization efficiently, such as Hadoop, GeoMesa (http://www.geomesa.org/) or GeoSpark (http://geospark.datasyslab.org/; Huang et al. 2017). Even then, you are applying the same techniques and concepts you have used on small datasets to answer a big data question; the only difference is that you do so in a big data setting.
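The idea of extracting only a small subset from a larger data source can be sketched with the query argument of sf’s st_read(), which passes an SQL statement to the underlying GDAL driver. The example is illustrative only: it assumes the nc.shp demo file shipped with sf stands in for a large database, and the threshold on the BIR74 column is arbitrary:

```r
library(sf)

# Demo shapefile bundled with sf, standing in for a much larger data source
nc_path = system.file("shape/nc.shp", package = "sf")

# Only the rows matching the WHERE clause are returned to R
nc_subset = st_read(
  nc_path,
  query = "SELECT NAME, BIR74 FROM nc WHERE BIR74 > 10000",
  quiet = TRUE
)

nrow(nc_subset)  # fewer rows than the 100 counties in the full file
```

The same pattern applies when the data source is a spatial database connection rather than a file, although the SQL dialect then depends on the database.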
15.4 Where next?
As indicated in the previous sections, the book has covered only a fraction of R’s geographic ecosystem, and there is much more to discover. We have progressed quickly, from geographic data models in Chapter 2, to advanced applications in Chapter 14. Consolidation of skills learned, discovery of new packages and approaches for handling geographic data, and application of the methods to new datasets and domains are suggested future directions. This section expands on this general advice by suggesting specific ‘next steps’, highlighted in bold below.
In addition to learning about further geographic methods and applications with R, for example with reference to the work cited in the previous section, deepening your understanding of R itself is a logical next step.
R’s fundamental classes such as matrix are the foundation of raster classes so studying them will improve your understanding of geographic data.
This can be done with reference to documents that are part of R, and which can be found with the command help.start(), and additional resources on the subject such as those by Wickham (2014a) and Chambers (2016).
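As a small illustration of why matrices matter for raster data, the base R sketch below treats a matrix as a grid of cell values. The object name elev and its values are made up for the example, and the comparison with raster objects is only an analogy, not a description of how any particular package stores its data:

```r
# A 3 x 4 matrix standing in for a tiny raster grid (values are made up)
elev = matrix(1:12, nrow = 3, ncol = 4, byrow = TRUE)

dim(elev)        # number of rows and columns, like raster dimensions
elev[2, 3]       # value of the cell in row 2, column 3
elev[elev > 10]  # logical subsetting returns the matching cell values
class(elev)
```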
There is more to geocomputation than software, however. We can recommend exploring and learning new research topics and methods from academic and theoretical perspectives. Many methods that have been written about have yet to be implemented. Learning about geographic methods and potential applications can therefore be rewarding, before writing any code. An example of geographic methods that are increasingly implemented in R is sampling strategies for scientific applications. A next step in this case is to read up on relevant articles in the area such as Brus (2018), which is accompanied by reproducible code and tutorial content hosted at github.com/DickBrus/TutorialSampling4DSM.
15.5 The open source approach
This is a technical book so it makes sense for the next steps, outlined in the previous section, to also be technical. However, there are wider issues worth considering in this final section, which returns to our definition of geocomputation. One of the elements of the term introduced in Chapter 1 was that geographic methods should have a positive impact. Of course, how to define and measure ‘positive’ is a subjective, philosophical question, beyond the scope of this book. Regardless of your worldview, consideration of the impacts of geocomputational work is a useful exercise: the potential for positive impacts can provide a powerful motivation for future learning and, conversely, new methods can open up many possible fields of application. These considerations lead to the conclusion that geocomputation is part of a wider ‘open source approach’, engagement with which can lead to tangible benefits for the people and organizations doing geocomputation and the wider community.
As we saw in section 1.1, other terms, including geographic information systems (GIS) and geographic data science (GDS), capture the range of possibilities opened up by geospatial software. But geocomputation has advantages: it concisely captures the ‘computational’ way of working with geographic data advocated in this book (implemented in code and therefore encouraging reproducibility) and builds on desirable ingredients of its early definition (Openshaw and Abrahart 2000):
- The creative use of geographic data.
- Application to real-world problems.
- Building ‘scientific’ tools.
We added the final ingredient: reproducibility was barely mentioned in early work on geocomputation, yet a strong case can be made for it being a vital component of the first two ingredients. Reproducibility supports creativity, encouraging the focus of methods to shift away from the basics (which are readily available through shared code, avoiding many people ‘reinventing the wheel’) and towards applications. And reproducibility encourages real-world applications because it ensures that methods developed for one purpose (perhaps purely academic) can be used for practical applications.
If reproducibility is the defining feature of geocomputation (or command-line GIS, code-driven geographic data analysis, or any other synonym for the same thing) it is worth considering what makes it reproducible. This brings us to the ‘open source approach’, which has three main components:
- A command-line interface (CLI), encouraging scripts recording geographic work to be shared and reproduced.
- Open source software, which can be inspected and potentially improved by anyone in the world.
- An active developer community, which collaborates and self-organizes to build complementary and modular tools.
Like the term geocomputation, the open source approach is more than a technical entity. It is a community composed of people interacting daily with shared aims: to produce high performance tools, free from commercial or legal restrictions, that are accessible for anyone to use. The open source approach to working with geographic data has advantages that transcend the technicalities of how the software works, encouraging learning, collaboration and an efficient division of labor.
There are many ways to engage in this community, especially with the emergence of code hosting sites such as GitHub, which encourage communication and collaboration.
A good place to start is simply browsing through some of the source code, ‘issues’ and ‘commits’ in a geographic package of interest.
A quick glance at the r-spatial/sf GitHub repository, which hosts the code underlying the sf package, for example, shows that it has 40+ ‘contributors’ (people who have committed code improving the package) and dozens more people contributing by raising issues on its ‘issue tracker’.
More than 600 issues have been closed, documenting a huge amount of work that has gone into making the package faster, more stable and user-friendly.
Considering that sf is only one (relatively small) component in the wider R-spatial community provides a sense of the scale of the intellectual operation underway to make geocomputation with R possible at all, and continuously evolving.
It is fun and instructive to watch the incessant development activity happen in public fora such as GitHub but it is even more rewarding to become an active participant. This is one of the greatest features of the open source approach: it encourages people to get involved. This book itself is a result of the open source approach: it was motivated by the amazing developments in R’s geographic capabilities over the last two decades, but made practically possible by dialogue and code sharing on platforms for collaboration. We hope that in addition to disseminating useful methods for working with geographic data, this book inspires you to take a more open source approach. Whether it’s raising a constructive issue alerting developers to problems in their package; making the work done by you and the organizations you work for open; or simply helping other people by passing on the knowledge you’ve learned, getting involved can be a rewarding experience.
Baddeley, Adrian, and Rolf Turner. 2005. “Spatstat: An R Package for Analyzing Spatial Point Patterns.” Journal of Statistical Software 12 (6): 1–42.
Bivand, Roger, Edzer J Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R. Springer.
Baddeley, Adrian, Ege Rubak, and Rolf Turner. 2015. Spatial Point Patterns: Methodology and Applications with R. CRC Press.
Wegmann, Martin, Benjamin Leutner, and Stefan Dech, eds. 2016. Remote Sensing and GIS for Ecologists: Using Open Source Software. Data in the Wild. Exeter: Pelagic Publishing.
Zuur, Alain, Elena N. Ieno, Neil Walker, Anatoly A. Saveliev, and Graham M. Smith. 2009. Mixed Effects Models and Extensions in Ecology with R. Statistics for Biology and Health. New York: Springer-Verlag.
Zuur, Alain F., Elena N. Ieno, and Anatoly A. Saveliev. 2017. Beginner’s Guide to Spatial, Temporal and Spatial-Temporal Ecological Data Analysis with R-INLA. Volume 1: Using GLM and GLMM. Newburgh, United Kingdom: Highland Statistics Ltd.
Blangiardo, Marta, and Michela Cameletti. 2015. Spatial and Spatio-Temporal Bayesian Models with R-INLA. Chichester, UK: John Wiley & Sons, Ltd. https://doi.org/10.1002/9781118950203.
Krainski, Elias, Virgilio Gómez Rubio, Haakon Bakka, Amanda Lenzi, Daniela Castro-Camilo, Daniel Simpson, Finn Lindgren, and Håvard Rue. 2018. Advanced Spatial Modeling with Stochastic Partial Differential Equations Using R and INLA.
Huang, Zhou, Yiran Chen, Lin Wan, and Xia Peng. 2017. “GeoSpark SQL: An Effective Framework Enabling Spatial Queries on Spark.” ISPRS International Journal of Geo-Information 6 (9): 285. https://doi.org/10.3390/ijgi6090285.
Wickham, Hadley. 2014a. Advanced R. CRC Press.
Chambers, John M. 2016. Extending R. CRC Press.
Garrard, Chris. 2016. Geoprocessing with Python. Shelter Island, NY: Manning Publications.
Brus, D. J. 2018. “Sampling for Digital Soil Mapping: A Tutorial Supported by R Scripts.” Geoderma, August. https://doi.org/10.1016/j.geoderma.2018.07.036.
Openshaw, Stan, and Robert J. Abrahart, eds. 2000. Geocomputation. 1st edition. London; New York: CRC Press.
The first operation, undertaken by the function st_union(), creates an object of class sfc (a simple feature column). The latter two operations create sf objects, each of which contains a simple feature column. Therefore it is the geometries contained in simple feature columns, not the objects themselves, that are identical.
At the time of writing, 452 packages Import sp, showing that its data structures are widely used and have been extended in many directions. The equivalent number for sf was 69 in October 2018; with the growing popularity of sf, this is set to grow.
R’s strengths relevant to our definition of geocomputation include its emphasis on scientific reproducibility, widespread use in academic research and unparalleled support for statistical modeling of geographic data. Furthermore, we advocate learning one language (R) for geocomputation in depth before delving into other languages/frameworks because of the costs associated with context switching. It is preferable to have expertise in one language than basic knowledge of many.