6 Geographic data I/O

Prerequisites

• This chapter requires the following packages:
library(sf)
library(raster)
library(tidyverse)
library(spData)

6.1 Introduction

This chapter is about reading and writing geographic data. Geographic data import is an essential part of geocomputational software because without data real-world applications are impossible. The skills taught in this book will enable you to add value to data meaning that, for others to benefit from the results, data output is also vital. These two processes go hand-in-hand and are referred to as I/O — short for input/output — in Computer Science (Gillespie and Lovelace 2016). Hence the title of this chapter.

Geographic data I/O is almost always part of a wider process. It depends on knowing which datasets are available, where they can be found and how to retrieve them, topics covered in section 6.2. This section demonstrates how to access open access geoportals which collectively contain many terrabytes of data. There is a wide range of geographic file formats, each of which has pros and cons. These are described in section 6.3. The process of actually reading and writing such file formats efficiently is not covered until sections 6.4 and 6.5 respectively. The final section (6.6) demonstrates methods for saving visual outputs (maps), in preparation for a subsequent chapter dedicated to visualization.

6.2 Retrieving open data

Nowadays, a vast amount of spatial data is available on the internet. Best of all, much of it is freely available. You just have to know where to look. While we cannot provide a comprehensive guide to all available geodata, we point to a few of the most important sources. Various ‘geoportals’ (web services providing geographic data such as the geospatial datasets in Data.gov) are a good place to start, providing a wide range of geographic data. Geoportals are a very useful data source but often only contain data for a specific locations (see the Wikipedia page on geoportals for a list of geoportals covering many areas of the world).

To overcome this limitation some global geoportals have been developed. The GEOSS portal, for example, provides global remote sensing data. Additional geoportals with global coverage and an emphasis on raster data include the EarthExplorer and the Copernicus Open Access Hub. A wealth of European data is available from the INSPIRE geoportal.

Typically, geoportals provide an interface that lets you query interactively the existing data (spatial and temporal extent, and product). However, in this book we encourage you to create reproducible workflows. In many cases data downloads can be scripted via download calls to static URLs or APIs (see the Sentinel API for example), saving time and enabling others to repeat and update the unglamorous data download process.

Traditionally, files have been stored on servers. You can easily download such files with the download.file() command. For example, to download National Park Service units in the United States, run:

url = file.path("http://www.naturalearthdata.com/http//www.naturalearthdata.com",
destfile = "USA_parks.zip")
unzip(zipfile = "USA_parks.zip")
usa_parks = st_read("ne_10m_parks_and_protected_lands_area.shp")

Specific R packages providing an interface to spatial libraries or geoportals are even more user-friendly (Table 6.1).

Table 6.1: Selected R packages for spatial data retrieval.
Package name Description
raster The getData() function downloads and imports administrative country, SRTM/ASTER elevation, WorldClim data.
rnaturalearth Functions to download Natural Earth vector and raster data, including world country borders.
rnoaa An R interface to National Oceanic and Atmospheric Administration (NOAA) climate data.

For example, you can get the borders of any country with the ne_countries() function from the rnaturalearth package:

library(rnaturalearth)
usa = ne_countries(country = "United States of America") # United States borders
class(usa)
#> [1] "SpatialPolygonsDataFrame"
#> attr(,"package")
#> [1] "sp"
# you can do the same with raster::getData()
# getData("GADM", country = "USA", level = 0)

As a default, rnaturalearth returns the output as a Spatial* class. You can easily convert it to the sf class with st_as_sf():

usa_sf = st_as_sf(usa)

As a second example, we will download a raster dataset. The code below downloads a series of rasters that contains global monthly precipitation sums. The spatial resolution is ten minutes. The result is a multilayer object of class RasterStack.

library(raster)
worldclim_prec = getData(name = "worldclim", var = "prec", res = 10)
class(worldclim_prec)
#> [1] "RasterStack"
#> attr(,"package")
#> [1] "raster"

A third example uses the recently developed package osmdata (Padgham et al. 2018) to find parks from the OpenStreetMap (OSM) database. As illustrated in the code-chunk below, queries begin with the function opq() (short for OpenStreetMap query), the first argument of which is bounding box, or text string representing a bounding box (the city of Leeds in this case). The result is passed to a function for selecting which OSM elements we’re interested in (parks in this case), represented by key-value pairs, which in turn is passed to the function osmdata_sf() which does the work of downloading the data and converting it into a list of sf objects (see vignette('osmdata') for further details):

library(osmdata)
parks = opq(bbox = "leeds uk") %>%
add_osm_feature(key = "leisure", value = "park") %>%
osmdata_sf()

OpenStreetMap is a vast global database of crowd-sourced data and it is growing by the minute. Although the quality is not as spatially consistent as many official datasets, OSM data have many advantages: they are globally available free of charge and using crowd-source data can encourage ‘citizen science’ and contributions back to the digital commons. Further examples of osmdata in action are provided in Chapters 7 and 8.

Finally, R packages might contain or just consist of spatial data, such as spData which provides example datasets used in this book. You can access such data with the data() function. For example, you can get a list of dataset in a package, data(package = "spData"). To attach the dataset to the global environment specify the name of a dataset (data("cycle_hire", package = "spData")). Sometimes, packages come also with the original files.37 To load such a file from the package, you need to specify the package name and the relative path to the dataset, for example:

world_raw_filepath = system.file("shapes/world.gpkg", package = "spData")
#> Reading layer wrld.gpkg' from data source /home/travis/R/Library/spData/shapes/world.gpkg' using driver GPKG'
#> Simple feature collection with 177 features and 10 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.6
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs

Find more information on getting data using R packages in section 5.5 and section 5.6 of Gillespie and Lovelace (2016).

6.3 File formats

Spatial datasets are usually stored as files or in spatial databases. File formats can either store vector or raster data, while spatial databases such as PostGIS can store both. Today file formats may seem bewildering but there has been much consolidation and standardisation since the beginnings of GIS software in the 1960s when the first widely distributed program (SYMAP) for spatial analysis was created at Harvard University (Coppock and Rhind 1991).

GDAL,38 the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between file formats since its release in 2000. GDAL provides a unified and high-performance interface for reading and writing of many raster and vector data formats. Many open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting-out geographic data in appropriate formats.

An important development ensuring greater standardisation and open-sourcing of file formats was the founding of the Open Geospatial Consortium (OGC) in 1994.39 The OGC coordinates the development of open standards for geospatial content including file formats such as KML and GeoPackage. As described in Chapter 2 the OGC publishes the simple feature data model, which underlies the vector data classes provided by sf and used in this book. Open file formats of the kind endorsed by the OGC have several advantages over proprietary formats: the standards are published, ensuring transparency and enabling innovation to improve the file formats.

There are more than 100 spatial data formats exist available to R users via GDAL. Table 6.2 presents some basic information about selected and often used spatial file formats.

Table 6.2: Selected spatial file formats.
Name Extension Info Type Model
ESRI Shapefile .shp (the main file) One of the most popular vector file formats. Consists of at least three files. The main files size cannot exceed 2 GB. It lacks support for mixed type. Column names are limited to 10 characters, and number of columns are limited at 255. It has poor support for Unicode standard. Vector Partially open
GeoJSON .geojson Extends the JSON exchange format by including a subset of the simple feature representation. Vector Open
KML .kml XML-based format for spatial visualization, developed for use with Google Earth. Zipped KML file forms the KMZ format. Vector Open
GPX .gpx XML schema created for exchange of GPS data. Vector Open
GeoTIFF .tiff GeoTIFF is one of the most popular raster formats. Its structure is similar to the regular .tif format, however, additionally stores the raster header. Raster Open
Arc ASCII .asc Text format where the first six lines represent the raster header, followed by the raster cell values arranged in rows and columns. Raster Open
R-raster .gri, .grd Native raster format of the R-package raster. Raster Open
SQLite/SpatiaLite .sqlite SQLite is a standalone, relational database management system. It is used as a default database driver in GRASS GIS 7. SpatiaLite is the spatial extension of SQLite providing support for simple features. Vector and raster Open
ESRI FileGDB .gdb Collection of spatial and nonspatial objects created in the ArcGIS software. It allows storage of multiple feature classes and enables use of topological definitions. Limited access to this format is provided by GDAL with the use of the OpenFileGDB and FileGDB drivers. Vector and raster Proprietary
GeoPackage .gpkg Lightweight database container based on SQLite allowing an easy and platform-independent exchange of geodata Vector and raster Open

6.4 Data Input (I)

Executing commands such as sf::st_read() (the main function we use for loading vector data) or raster::raster() (the main function used for loading raster data) silently sets off a chain of events that reads data from files. Moreover, there are many R packages containing a wide range of spatial data or providing simple access to different data sources. All of them load the data into R or, more precisely, assign objects to your workspace, stored in RAM accessible from the .GlobalEnv40 of your current R session.

6.4.1 Vector data

Spatial vector data comes in a wide variety of file formats, most of which can be read-in via the sf function st_read(). Behind the scenes this calls GDAL. To find out which data formats sf supports, run st_drivers(). Here, we show only the first five drivers (see Table 6.3):

sf_drivers = st_drivers()
head(sf_drivers, n = 5)
Table 6.3: Sample of available drivers for reading/writing vector data (it could vary between different GDAL versions).
name long_name write copy is_raster is_vector vsi
ESRI Shapefile ESRI Shapefile TRUE FALSE FALSE TRUE TRUE
GPX GPX TRUE FALSE FALSE TRUE TRUE
KML Keyhole Markup Language (KML) TRUE FALSE FALSE TRUE TRUE
GeoJSON GeoJSON TRUE FALSE FALSE TRUE TRUE
GPKG GeoPackage TRUE TRUE TRUE TRUE TRUE

The first argument of st_read() is dsn, which should be a text string or an object containing a single text string. The content of a text string could vary between different drivers. In most cases, as with the ESRI Shapefile (.shp) or the GeoPackage format (.gpkg), the dsn would be a file name. st_read() guesses the driver based on the file extension, as illustrated for a .gpkg file below:

vector_filepath = system.file("shapes/world.gpkg", package = "spData")
#> Reading layer wrld.gpkg' from data source /home/travis/R/Library/spData/shapes/world.gpkg' using driver GPKG'
#> Simple feature collection with 177 features and 10 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.6
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs

For some drivers, dsn could be provided as a folder name, access credentials for a database, or a GeoJSON string representation (see the examples of the st_read() help page for more details).

Some vector driver formats can store multiple data layers. By default, st_read automatically reads the first layer of the file specified in dsn, however, using the layer argument you can specify any other layer.

Naturally, some options are specific to certain drivers.41 For example, think of coordinates stored in a spreadsheet format (.csv). To read in such files as spatial objects, we naturally have to specify the names of the columns (X and Y in our example below) representing the coordinates. We can do this with the help of the options parameter. To find out about possible options, please refer to the ‘Open Options’ section of the corresponding GDAL driver description. For the comma-separated value (csv) format, visit http://www.gdal.org/drv_csv.html.

cycle_hire_txt = system.file("misc/cycle_hire_xy.csv", package = "spData")
cycle_hire_xy = st_read(cycle_hire_txt, options = c("X_POSSIBLE_NAMES=X",
"Y_POSSIBLE_NAMES=Y"))

Instead of columns describing xy-coordinates, a single column can also contain the geometry information. Well-known text (WKT), well-known binary (WKB), and the GeoJSON formats are examples of this. For instance, the world_wkt.csv file has a column named WKT representing polygons of the world’s countries. We will again use the options parameter to indicate this. Here, we will use read_sf() which does exactly the same as st_read() except it does not print the driver name to the console and stores strings as characters instead of factors.

world_txt = system.file("misc/world_wkt.csv", package = "spData")
world_wkt = read_sf(world_txt, options = "GEOM_POSSIBLE_NAMES=WKT")
# the same as
world_wkt = st_read(world_txt, options = "GEOM_POSSIBLE_NAMES=WKT",
quiet = TRUE, stringsAsFactors = FALSE)
Not all of the supported vector file formats store information about their coordinate reference system. In these situations, it is possible to add the missing information using the st_set_crs() function. Please refer also to section 2.3 for more information.

As a final example, we will show how st_read() also reads KML files. A KML file stores geographic information in XML format - a data format for the creation of web pages and the transfer of data in an application-independent way (Nolan and Lang 2014). Here, we access a KML file from the web. This file contains more than one layer. st_layers() lists all available layers. We choose the first layer Placemarks and say so with the help of the layer parameter in read_sf().

url = "https://developers.google.com/kml/documentation/KML_Samples.kml"
st_layers(url)
#> Driver: LIBKML
#> Available layers:
#>               layer_name geometry_type features fields
#> 1             Placemarks                      3     11
#> 2      Styles and Markup                      1     11
#> 3       Highlighted Icon                      1     11
#> 4        Ground Overlays                      1     11
#> 5        Screen Overlays                      0     11
#> 6                  Paths                      6     11
#> 7               Polygons                      0     11
#> 8          Google Campus                      4     11
#> 9       Extruded Polygon                      1     11
#> 10 Absolute and Relative                      4     11
kml = read_sf(url, layer = "Placemarks")

6.4.2 Raster data

Similar to vector data, raster data comes in many file formats with some of them supporting even multilayer files. raster’s raster() command reads in a single layer.

raster_filepath = system.file("raster/srtm.tif", package = "spDataLarge")
single_layer = raster(raster_filepath)

In case you want to read in a single band from a multilayer file use the band parameter to indicate a specific layer.

multilayer_filepath = system.file("raster/landsat.tif", package = "spDataLarge")
band3 = raster(multilayer_filepath, band = 3)

If you want to read in all bands, use brick() or stack().


multilayer_brick = brick(multilayer_filepath)
multilayer_stack = stack(multilayer_filepath)

Please refer to section 2.2.3 for information on the difference between raster stacks and bricks.

6.5 Data output (O)

Writing spatial data allows you to convert from one format to another and to save newly created objects. Depending on the data type (vector or raster), object class (e.g multipoint or RasterLayer), and type and amount of stored information (e.g. object size, range of values) - it is important to know how to store spatial files in the most efficient way. The next two sections will demonstrate how to do this.

6.5.1 Vector data

The counterpart of st_read() is st_write(). It allows you to write sf objects to a wide range of geographic vector file formats, including the most common such as .geojson, .shp and .gpkg. Based on the file name, st_write() decides automatically which driver to use. The speed of the writing process depends also on the driver.

st_write(obj = world, dsn = "world.gpkg")
#> Writing layer world' to data source world.gpkg' using driver GPKG'
#> features:       177
#> fields:         10
#> geometry type:  Multi Polygon

Note: if you try to write to the same data source again, the function will fail:

st_write(obj = world, dsn = "world.gpkg")
#> Updating layer world' to data source /home/travis/build/Robinlovelace/geocompr/world.gpkg' using driver GPKG'
#> Warning in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : GDAL Error 1: Layer world already exists, CreateLayer failed.
#> Use the layer creation option OVERWRITE=YES to replace it.
#> Creating layer world failed.
#> Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : Layer creation failed.

The error message provides some information as to why the function failed. The GDAL Error 1 statement makes clear that the failure occurred at the GDAL level. Additionally, the suggestion to use OVERWRITE=YES provides a clue about how to fix the problem. However, this is not a st_write() argument, it is a GDAL option. Luckily, st_write provides a layer_options argument through which we can pass driver-dependent options:

st_write(obj = world, dsn = "world.gpkg", layer_options = "OVERWRITE=YES")

Another solution is to use the st_write() argument delete_layer. Setting it to TRUE deletes already existing layers in the data source before the function attempts to write (note there is also a delete_dsn argument):

st_write(obj = world, dsn = "world.gpkg", delete_layer = TRUE)

You can achieve the same with write_sf() since it is equivalent to (technically an alias for) st_write(), except that its defaults for delete_layer and quiet is TRUE.

write_sf(obj = world, dsn = "world.gpkg")

The layer_options argument could be also used for many different purposes. One of them is to write spatial data to a text file. This can be done by specifying GEOMETRY inside of layer_options. It could be either AS_XY for simple point datasets (it creates two new columns for coordinates) or AS_WKT for more complex spatial data (one new column is created which contains the well-known-text representation of spatial objects).

st_write(cycle_hire_xy, "cycle_hire_xy.csv", layer_options = "GEOMETRY=AS_XY")
st_write(world_wkt, "world_wkt.csv", layer_options = "GEOMETRY=AS_WKT")

6.5.2 Raster data

The writeRaster() function saves Raster* objects to files on disk. The function expects input regarding output datatype and file format, but also accepts GDAL options specific to a selected file format (see ?writeRaster for more details).

The raster package offers nine datatypes when saving a raster: LOG1S, INT1S, INT1U, INT2S, INT2U, INT4S, INT4U, FLT4S, and FLT8S.42 The datatype determines the bit representation of the raster object written to disk (6.4). Which datatype to use depends on the range of the values of your raster object. The more values a datatype can represent, the larger the file will get on disk. Commonly, one would use LOG1S for bitmap (binary) rasters. Unsigned integers (INT1U, INT2U, INT4U) are suitable for categorical data, while float numbers (FLT4S and FLTS8S) usually represent continuous data. writeRaster() uses FLT4S as the default. While this works in most cases, the size of the output file will be unnecessarly large if you save binary or categorical data. Therefore, we would recommend to use the datatype that needs the least storage space but is still able to represent all values (check the range of values with the summary() function).

Table 6.4: Datatypes supported by the raster package.
Datatype Minimum value Maximum value
LOG1S FALSE (0) TRUE (1)
INT1S -127 127
INT1U 0 255
INT2S -32,767 32,767
INT2U 0 65,534
INT4S -2,147,483,647 2,147,483,647
INT4U 0 4,294,967,296
FLT4S -3.4e+38 3.4e+38
FLT8S -1.7e+308 1.7e+308

The file extension determines the output file when saving a Raster* object to disk. For example, the .tif extension will create a GeoTIFF file:

writeRaster(x = single_layer,
filename = "my_raster.tif",
datatype = "INT2U")

The raster file format (native to the raster package) is used when a file extension is invalid or missing. Some raster file formats come with additional options. You can use them with the options parameter.43 For example, GeoTIFF allows you to compress the output raster with the COMPRESS option44:

writeRaster(x = single_layer,
filename = "my_raster.tif",
datatype = "INT2U",
options = c("COMPRESS=DEFLATE"),
overwrite = TRUE)

Note that writeFormats() returns a list with all supported file formats on your computer.

6.6 Visual outputs

R supports many different static and interactive graphics formats. The most general method to save a static plot is to open a graphic device, create a plot, and close it, for example:

png(filename = "lifeExp.png", width = 500, height = 350)
plot(world["lifeExp"])
dev.off()

Other available graphic devices include pdf(), bmp(), jpeg(), png(), and tiff(). You can specify several properties of the output plot, including width, height and resolution.

Additionally, several graphic packages provide thier own functions to save a graphical output. For example, the tmap package has the tmap_save() function. You can save a tmap object to different graphic formats by specifying the object name and a file path to a new graphic file.

library(tmap)
tmap_obj = tm_shape(world) +
tm_polygons(col = "lifeExp")
tmap_save(tm  = tmap_obj, filename = "lifeExp_tmap.png")

On the other hand, you can save interactive maps created in the mapview package as an HTML file or image using the mapshot() function:

library(mapview)
mapview_obj = mapview(world, zcol = "lifeExp", legend = TRUE)
mapshot(mapview_obj, file = "my_interactive_map.html")

6.7 Exercises

1. List and describe three types of vector, raster, and geodatabase formats.

2. Name at least two differences between read_sf() and the more well-known function st_read().

3. Read the cycle_hire_xy.csv file from the spData package (Hint: it is located in the misc\ folder). What is a geometry type of the loaded object?

4. Download the borders of Germany using rnaturalearth, and create a new object called germany_borders. Write this new object to a file of the GeoPackage format.

5. Download the global monthly minimum temperature with a spatial resolution of five minutes using the raster package. Extract the June values, and save them to a file named tmin_june.tif file (hint: use raster::subset()).

6. Create a static map of Germany’s borders, and save it to a PNG file.

7. Create an interactive map using data from the cycle_hire_xy.csv file. Export this map to a file called cycle_hire.html.

References

Gillespie, Colin, and Robin Lovelace. 2016. Efficient R Programming: A Practical Guide to Smarter Programming. O’Reilly Media.

Padgham, Mark, Bob Rudis, Robin Lovelace, and Maëlle Salmon. 2018. Osmdata: Import ’Openstreetmap’ Data as Simple Features or Spatial Objects. https://github.com/ropensci/osmdata.

Coppock, J Terry, and David W Rhind. 1991. “The History of GIS.” Geographical Information Systems: Principles and Applications, Vol. 1. 1 (1): 21–43.

Nolan, Deborah, and Duncan Temple Lang. 2014. XML and Web Technologies for Data Sciences with R. Use R! New York, NY: Springer.

1. Data loaded with data() often is a R dataset (.Rdata ord .rda).

2. GDAL should be prounounced “goo-dal”, with the double o making a reference to object-orientation.