7 Geographic data I/O

Prerequisites

  • This chapter requires the following packages:

7.1 Introduction

This chapter is about reading and writing geographic data. Geographic data import is an essential part of geocomputational software because without data real-world applications are impossible. The skills taught in this book will enable you to add value to data meaning that, for others to benefit from the results, data output is also vital. These two processes go hand-in-hand and are referred to as I/O — short for input/output — in Computer Science (Gillespie and Lovelace 2016). Hence the title of this chapter.

Geographic data I/O is almost always part of a wider process. It depends on knowing which datasets are available, where they can be found and how to retrieve them, topics covered in section 7.2. This section demonstrates how to access open access geoportals which collectively contain many terabytes of data. There is a wide range of geographic file formats, each of which has pros and cons. These are described in section 7.4. The process of actually reading and writing such file formats efficiently is not covered until sections 7.5 and 7.6 respectively. The final section (7.7) demonstrates methods for saving visual outputs (maps), in preparation for Chapter 8 on visualization.

7.2 Retrieving open data

A vast and ever-increasing amount of spatial data is available on the internet, much of which is free to access and use (with appropriate credit given to its providers). In some ways there is now too much data in the sense that there are often multiple places to access the same dataset, some of which may be poor quality. In this context it vital know where to look. Instead of attempting to provide a comprehensive guide to this data deluge this section some of the most important sources. Various ‘geoportals’ (web services providing geospatial datasets such as Data.gov) are a good place to start, providing a wide range of data but often only for specific locations (as illustrated in the updated Wikipedia page on the topic).

Some global geoportals overcome this issue. The GEOSS portal and the Copernicus Open Access Hub, for example, contain many raster datasets with global coverage. A wealth of vector datasets can be accessed from the National Space Agency (NASA) SEDAC portal and the European Union’s INSPIRE geoportal, with global and regional coverage.

Most geoportals provide a graphical interface allowing datasets to be queried based on characteristics such spatial and temporal extent, the United States Geological Services’ EarthExplorer being a prime example. Exploring datasets interactively on a browser is an effective way of understanding available layers. Downloading data is best done with code, however, from reproducibility and efficiency perspectives. Downloads can be initiated from the command line using a variety of techniques, primarily via URLs and APIs (see the Sentinel API for example). Files hosted on static URLs can be downloaded with download.file(), as illustrated in the code chunk below which accesses US National Parks data from catalog.data.gov/dataset/national-parks:

7.3 Geographic data packages

A multitude of R packages have been developed for accessing geographic data, some of which are presented in Table 7.1). These provide interfaces to one or more spatial libraries or geoportals and aim make data access even quicker from the command line.

Table 7.1: Selected R packages for spatial data retrieval.
Package name Description
getlandsat Provides access to Landsat 8 data.
osmdata Download and import of OpenStreetMap data.
raster The getData() function downloads and imports administrative country, SRTM/ASTER elevation, WorldClim data.
rnaturalearth Functions to download Natural Earth vector and raster data, including world country borders.
rnoaa An R interface to National Oceanic and Atmospheric Administration (NOAA) climate data.
rWBclimate An access to the World Bank climate data.

It should be emphasised that Table 7.1 represents only a small of available geographic data packages. Other notable packages include GSODR, which provides Global Summary Daily Weather Data in R (see the package’s README for an overview of weather data sources); tidycensus and tigris, which provide socio-demographic vector data for the USA; and hddtools, which provides access to a range of hydrological datasets.

Each data package has its own syntax for accessing data. This diversity is demonstrated in the subsequent code chunks, which show how to get data using three packages from Table 7.1. Country borders are often useful and these can be accessed with the ne_countries() function from the rnaturalearth package as follows:

By default rnaturalearth returns objects of class Spatial. The result can be converted into an sf objects with st_as_sf() as follows:

A second example downloads a series of rasters containing global monthly precipitation sums with spatial resolution is ten minutes. The result is a multilayer object of class RasterStack.

A third example uses the osmdata package (Padgham et al. 2018) to find parks from the OpenStreetMap (OSM) database. As illustrated in the code-chunk below, queries begin with the function opq() (short for OpenStreetMap query), the first argument of which is bounding box, or text string representing a bounding box (the city of Leeds in this case). The result is passed to a function for selecting which OSM elements we’re interested in (parks in this case), represented by key-value pairs, which in turn is passed to the function osmdata_sf() which does the work of downloading the data and converting it into a list of sf objects (see vignette('osmdata') for further details):

OpenStreetMap is a vast global database of crowd-sourced data and it is growing daily. Although the quality is not as spatially consistent as many official datasets, OSM data have many advantages: they are globally available free of charge and using crowd-source data can encourage ‘citizen science’ and contributions back to the digital commons. Further examples of osmdata in action are provided in Chapters 12, 13 and 9.

Sometimes, packages come with inbuilt datasets. These can be accessed in four ways: by attaching the package (if the package uses ‘lazy loading’ as spData does), with data(dataset), by referring to the dataset with pkg::dataset or with system.file() to access raw data files. The following code chunk illustrates the latter two options using the world (already loaded by attaching its parent package with library(spData)):33

7.4 File formats

Spatial datasets are usually stored as files or in spatial databases. File formats can either store vector or raster data, while spatial databases such as PostGIS can store both (see also section 9.6.2). Today the variety of file formats may seem bewildering but there has been much consolidation and standardization since the beginnings of GIS software in the 1960s when the first widely distributed program (SYMAP) for spatial analysis was created at Harvard University (Coppock and Rhind 1991).

GDAL (which should be pronounced “goo-dal”, with the double o making a reference to object-orientation), the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between geographic file formats since its release in 2000. GDAL provides a unified and high-performance interface for reading and writing of many raster and vector data formats. Many open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting-out geographic data in appropriate formats.

GDAL provides access to more than 200 vector and raster data formats. Table 7.2 presents some basic information about selected and often used spatial file formats.

Table 7.2: Selected spatial file formats.
Name Extension Info Type Model
ESRI Shapefile .shp (the main file) One of the most popular vector file formats. Consists of at least three files. The main files size cannot exceed 2 GB. It lacks support for mixed type. Column names are limited to 10 characters, and number of columns are limited at 255. It has poor support for Unicode standard. Vector Partially open
GeoJSON .geojson Extends the JSON exchange format by including a subset of the simple feature representation. Vector Open
KML .kml XML-based format for spatial visualization, developed for use with Google Earth. Zipped KML file forms the KMZ format. Vector Open
GPX .gpx XML schema created for exchange of GPS data. Vector Open
GeoTIFF .tiff GeoTIFF is one of the most popular raster formats. Its structure is similar to the regular .tif format, however, additionally stores the raster header. Raster Open
Arc ASCII .asc Text format where the first six lines represent the raster header, followed by the raster cell values arranged in rows and columns. Raster Open
R-raster .gri, .grd Native raster format of the R-package raster. Raster Open
SQLite/SpatiaLite .sqlite SQLite is a standalone, relational database management system. It is used as a default database driver in GRASS GIS 7. SpatiaLite is the spatial extension of SQLite providing support for simple features. Vector and raster Open
ESRI FileGDB .gdb Collection of spatial and nonspatial objects created in the ArcGIS software. It allows storage of multiple feature classes and enables use of topological definitions. Limited access to this format is provided by GDAL with the use of the OpenFileGDB and FileGDB drivers. Vector and raster Proprietary
GeoPackage .gpkg Lightweight database container based on SQLite allowing an easy and platform-independent exchange of geodata Vector and raster Open

An important development ensuring the standardization and open-sourcing of file formats was the founding of the Open Geospatial Consortium (OGC) in 1994. Beyond defining the simple features data model (see section 2.1.1), the OGC also coordinates the development of open standards, for example as used in file formats such as KML and GeoPackage. Open file formats of the kind endorsed by the OGC have several advantages over proprietary formats: the standards are published, ensure transparency and open up the possibility for users to further develop and adjust the file formats to their specific needs.

ESRI’ Shapefile is the most popular vector data exchange format. However, it is not an open format (though its specification is open). It was developed in the early 1990s and has a number of limitations. First of all, it is a multi-file format, which consists of at least three files. It only supports 255 columns, column names are restricted to ten characters and the file size limit is to 2GB. Furthermore, Shapefile does not support all possible geometry types, for example, it is unable to distinguish between a polygon and a multipolygon.34 Despite these limitations, a viable alternative had been missing for a long time. In the meantime, GeoPackage emerged, and seems to be a more than suitable replacement candidate for ESRI’s Shapefile. Geopackage is a format for exchanging geospatial information and an OGC standard. The GeoPackage standard describes the rules how to store geospatial information in a tiny SQLite container. Hence, GeoPackage is a lightweight spatial database container, which allows the storage of vector and raster data but also of non-spatial data and extensions. Aside from GeoPackage there are other geospatial data exchange formats worth checking out (Table 7.2).

7.5 Data Input (I)

Executing commands such as sf::st_read() (the main function we use for loading vector data) or raster::raster() (the main function used for loading raster data) silently sets off a chain of events that reads data from files. Moreover, there are many R packages containing a wide range of spatial data or providing simple access to different data sources. All of them load the data into R or, more precisely, assign objects to your workspace, stored in RAM accessible from the .GlobalEnv of the R session.

7.5.1 Vector data

Spatial vector data comes in a wide variety of file formats, most of which can be read-in via the sf function st_read(). Behind the scenes this calls GDAL. To find out which data formats sf supports, run st_drivers(). Here, we show only the first five drivers (see Table 7.3):

Table 7.3: Sample of available drivers for reading/writing vector data (it could vary between different GDAL versions).
name long_name write copy is_raster is_vector vsi
ESRI Shapefile ESRI Shapefile TRUE FALSE FALSE TRUE TRUE
GPX GPX TRUE FALSE FALSE TRUE TRUE
KML Keyhole Markup Language (KML) TRUE FALSE FALSE TRUE TRUE
GeoJSON GeoJSON TRUE FALSE FALSE TRUE TRUE
GPKG GeoPackage TRUE TRUE TRUE TRUE TRUE

The first argument of st_read() is dsn, which should be a text string or an object containing a single text string. The content of a text string could vary between different drivers. In most cases, as with the ESRI Shapefile (.shp) or the GeoPackage format (.gpkg), the dsn would be a file name. st_read() guesses the driver based on the file extension, as illustrated for a .gpkg file below:

For some drivers, dsn could be provided as a folder name, access credentials for a database, or a GeoJSON string representation (see the examples of the st_read() help page for more details).

Some vector driver formats can store multiple data layers. By default, st_read automatically reads the first layer of the file specified in dsn, however, using the layer argument you can specify any other layer.

Naturally, some options are specific to certain drivers.35 For example, think of coordinates stored in a spreadsheet format (.csv). To read in such files as spatial objects, we naturally have to specify the names of the columns (X and Y in our example below) representing the coordinates. We can do this with the help of the options parameter. To find out about possible options, please refer to the ‘Open Options’ section of the corresponding GDAL driver description. For the comma-separated value (csv) format, visit http://www.gdal.org/drv_csv.html.

Instead of columns describing xy-coordinates, a single column can also contain the geometry information. Well-known text (WKT), well-known binary (WKB), and the GeoJSON formats are examples of this. For instance, the world_wkt.csv file has a column named WKT representing polygons of the world’s countries. We will again use the options parameter to indicate this. Here, we will use read_sf() which does exactly the same as st_read() except it does not print the driver name to the console and stores strings as characters instead of factors.

Not all of the supported vector file formats store information about their coordinate reference system. In these situations, it is possible to add the missing information using the st_set_crs() function. Please refer also to section 2.3 for more information.

As a final example, we will show how st_read() also reads KML files. A KML file stores geographic information in XML format - a data format for the creation of web pages and the transfer of data in an application-independent way (Nolan and Lang 2014). Here, we access a KML file from the web. This file contains more than one layer. st_layers() lists all available layers. We choose the first layer Placemarks and say so with the help of the layer parameter in read_sf().

7.5.2 Raster data

Similar to vector data, raster data comes in many file formats with some of them supporting even multilayer files. raster’s raster() command reads in a single layer.

In case you want to read in a single band from a multilayer file use the band parameter to indicate a specific layer.

If you want to read in all bands, use brick() or stack().

Please refer to section 2.2.3 for information on the difference between raster stacks and bricks.

7.6 Data output (O)

Writing spatial data allows you to convert from one format to another and to save newly created objects. Depending on the data type (vector or raster), object class (e.g multipoint or RasterLayer), and type and amount of stored information (e.g. object size, range of values) - it is important to know how to store spatial files in the most efficient way. The next two sections will demonstrate how to do this.

7.6.1 Vector data

The counterpart of st_read() is st_write(). It allows you to write sf objects to a wide range of geographic vector file formats, including the most common such as .geojson, .shp and .gpkg. Based on the file name, st_write() decides automatically which driver to use. The speed of the writing process depends also on the driver.

Note: if you try to write to the same data source again, the function will fail:

The error message provides some information as to why the function failed. The GDAL Error 1 statement makes clear that the failure occurred at the GDAL level. Additionally, the suggestion to use OVERWRITE=YES provides a clue about how to fix the problem. However, this is not a st_write() argument, it is a GDAL option. Luckily, st_write provides a layer_options argument through which we can pass driver-dependent options:

Another solution is to use the st_write() argument delete_layer. Setting it to TRUE deletes already existing layers in the data source before the function attempts to write (note there is also a delete_dsn argument):

You can achieve the same with write_sf() since it is equivalent to (technically an alias for) st_write(), except that its defaults for delete_layer and quiet is TRUE.

The layer_options argument could be also used for many different purposes. One of them is to write spatial data to a text file. This can be done by specifying GEOMETRY inside of layer_options. It could be either AS_XY for simple point datasets (it creates two new columns for coordinates) or AS_WKT for more complex spatial data (one new column is created which contains the well-known-text representation of spatial objects).

7.6.2 Raster data

The writeRaster() function saves Raster* objects to files on disk. The function expects input regarding output data type and file format, but also accepts GDAL options specific to a selected file format (see ?writeRaster for more details).

The raster package offers nine data types when saving a raster: LOG1S, INT1S, INT1U, INT2S, INT2U, INT4S, INT4U, FLT4S, and FLT8S.36 The data type determines the bit representation of the raster object written to disk (7.4). Which data type to use depends on the range of the values of your raster object. The more values a data type can represent, the larger the file will get on disk. Commonly, one would use LOG1S for bitmap (binary) rasters. Unsigned integers (INT1U, INT2U, INT4U) are suitable for categorical data, while float numbers (FLT4S and FLTS8S) usually represent continuous data. writeRaster() uses FLT4S as the default. While this works in most cases, the size of the output file will be unnecessarily large if you save binary or categorical data. Therefore, we would recommend to use the data type that needs the least storage space but is still able to represent all values (check the range of values with the summary() function).

Table 7.4: Data types supported by the raster package.
Data type Minimum value Maximum value
LOG1S FALSE (0) TRUE (1)
INT1S -127 127
INT1U 0 255
INT2S -32,767 32,767
INT2U 0 65,534
INT4S -2,147,483,647 2,147,483,647
INT4U 0 4,294,967,296
FLT4S -3.4e+38 3.4e+38
FLT8S -1.7e+308 1.7e+308

The file extension determines the output file when saving a Raster* object to disk. For example, the .tif extension will create a GeoTIFF file:

The raster file format (native to the raster package) is used when a file extension is invalid or missing. Some raster file formats come with additional options. You can use them with the options parameter. GeoTIFF files, for example, can be compressed using COMPRESS:

Note that writeFormats() returns a list with all supported file formats on your computer.

7.7 Visual outputs

R supports many different static and interactive graphics formats. The most general method to save a static plot is to open a graphic device, create a plot, and close it, for example:

Other available graphic devices include pdf(), bmp(), jpeg(), png(), and tiff(). You can specify several properties of the output plot, including width, height and resolution.

Additionally, several graphic packages provide their own functions to save a graphical output. For example, the tmap package has the tmap_save() function. You can save a tmap object to different graphic formats by specifying the object name and a file path to a new graphic file.

On the other hand, you can save interactive maps created in the mapview package as an HTML file or image using the mapshot() function:

7.8 Exercises

  1. List and describe three types of vector, raster, and geodatabase formats.

  2. Name at least two differences between read_sf() and the more well-known function st_read().

  3. Read the cycle_hire_xy.csv file from the spData package (Hint: it is located in the misc\ folder). What is a geometry type of the loaded object?

  4. Download the borders of Germany using rnaturalearth, and create a new object called germany_borders. Write this new object to a file of the GeoPackage format.

  5. Download the global monthly minimum temperature with a spatial resolution of five minutes using the raster package. Extract the June values, and save them to a file named tmin_june.tif file (hint: use raster::subset()).

  6. Create a static map of Germany’s borders, and save it to a PNG file.

  7. Create an interactive map using data from the cycle_hire_xy.csv file. Export this map to a file called cycle_hire.html.

References

Gillespie, Colin, and Robin Lovelace. 2016. Efficient R Programming: A Practical Guide to Smarter Programming. O’Reilly Media.

Padgham, Mark, Bob Rudis, Robin Lovelace, and Maëlle Salmon. 2018. Osmdata: Import ’Openstreetmap’ Data as Simple Features or Spatial Objects. https://CRAN.R-project.org/package=osmdata.

Coppock, J Terry, and David W Rhind. 1991. “The History of GIS.” Geographical Information Systems: Principles and Applications, Vol. 1. 1 (1): 21–43.

Nolan, Deborah, and Duncan Temple Lang. 2014. XML and Web Technologies for Data Sciences with R. Use R! New York, NY: Springer.


  1. For more information on data import with R packages see sections 5.5 and 5.6 of Gillespie and Lovelace (2016).

  2. To learn more about ESRI Shapefile limitations and possible alternative file formats, visit http://switchfromshapefile.org/.

  3. A list of supported vector formats and options can be found at http://gdal.org/ogr_formats.html.

  4. Using INT4U is not recommended as R does not support 32-bit unsigned integers.