7 Geographic data I/O
- This chapter requires the following packages:
This chapter is about reading and writing geographic data. Geographic data import is an essential part of geocomputational software because without data real-world applications are impossible. The skills taught in this book will enable you to add value to data meaning that, for others to benefit from the results, data output is also vital. These two processes go hand-in-hand and are referred to as I/O — short for input/output — in Computer Science (Gillespie and Lovelace 2016). Hence the title of this chapter.
Geographic data I/O is almost always part of a wider process. It depends on knowing which datasets are available, where they can be found and how to retrieve them, topics covered in section 7.2. This section demonstrates how to access open access geoportals which collectively contain many terabytes of data. There is a wide range of geographic file formats, each of which has pros and cons. These are described in section 7.4. The process of actually reading and writing such file formats efficiently is not covered until sections 7.5 and 7.6 respectively. The final section (7.7) demonstrates methods for saving visual outputs (maps), in preparation for Chapter 8 on visualization.
7.2 Retrieving open data
A vast and ever-increasing amount of spatial data is available on the internet, much of which is free to access and use (with appropriate credit given to its providers). In some ways there is now too much data in the sense that there are often multiple places to access the same dataset, some of which may be poor quality. In this context it vital know where to look. Instead of attempting to provide a comprehensive guide to this data deluge this section some of the most important sources. Various ‘geoportals’ (web services providing geospatial datasets such as Data.gov) are a good place to start, providing a wide range of data but often only for specific locations (as illustrated in the updated Wikipedia page on the topic).
Some global geoportals overcome this issue. The GEOSS portal and the Copernicus Open Access Hub, for example, contain many raster datasets with global coverage. A wealth of vector datasets can be accessed from the National Space Agency (NASA) SEDAC portal and the European Union’s INSPIRE geoportal, with global and regional coverage.
Most geoportals provide a graphical interface allowing datasets to be queried based on characteristics such spatial and temporal extent, the United States Geological Services’ EarthExplorer being a prime example.
Exploring datasets interactively on a browser is an effective way of understanding available layers.
Downloading data is best done with code, however, from reproducibility and efficiency perspectives.
Downloads can be initiated from the command line using a variety of techniques, primarily via URLs and APIs (see the Sentinel API for example).
Files hosted on static URLs can be downloaded with
download.file(), as illustrated in the code chunk below which accesses US National Parks data from catalog.data.gov/dataset/national-parks:
7.3 Geographic data packages
A multitude of R packages have been developed for accessing geographic data, some of which are presented in Table 7.1). These provide interfaces to one or more spatial libraries or geoportals and aim make data access even quicker from the command line.
|getlandsat||Provides access to Landsat 8 data.|
|osmdata||Download and import of OpenStreetMap data.|
|rnaturalearth||Functions to download Natural Earth vector and raster data, including world country borders.|
|rnoaa||An R interface to National Oceanic and Atmospheric Administration (NOAA) climate data.|
|rWBclimate||An access to the World Bank climate data.|
It should be emphasised that Table 7.1 represents only a small of available geographic data packages. Other notable packages include GSODR, which provides Global Summary Daily Weather Data in R (see the package’s README for an overview of weather data sources); tidycensus and tigris, which provide socio-demographic vector data for the USA; and hddtools, which provides access to a range of hydrological datasets.
Each data package has its own syntax for accessing data.
This diversity is demonstrated in the subsequent code chunks, which show how to get data using three packages from Table 7.1.
Country borders are often useful and these can be accessed with the
ne_countries() function from the rnaturalearth package as follows:
By default rnaturalearth returns objects of class
The result can be converted into an
sf objects with
st_as_sf() as follows:
A second example downloads a series of rasters containing global monthly precipitation sums with spatial resolution is ten minutes.
The result is a multilayer object of class
A third example uses the osmdata package (Padgham et al. 2018) to find parks from the OpenStreetMap (OSM) database.
As illustrated in the code-chunk below, queries begin with the function
opq() (short for OpenStreetMap query), the first argument of which is bounding box, or text string representing a bounding box (the city of Leeds in this case).
The result is passed to a function for selecting which OSM elements we’re interested in (parks in this case), represented by key-value pairs, which in turn is passed to the function
osmdata_sf() which does the work of downloading the data and converting it into a list of
sf objects (see
vignette('osmdata') for further details):
OpenStreetMap is a vast global database of crowd-sourced data and it is growing daily. Although the quality is not as spatially consistent as many official datasets, OSM data have many advantages: they are globally available free of charge and using crowd-source data can encourage ‘citizen science’ and contributions back to the digital commons. Further examples of osmdata in action are provided in Chapters 12, 13 and 9.
Sometimes, packages come with inbuilt datasets.
These can be accessed in four ways: by attaching the package (if the package uses ‘lazy loading’ as spData does), with
data(dataset), by referring to the dataset with
pkg::dataset or with
system.file() to access raw data files.
The following code chunk illustrates the latter two options using the
world (already loaded by attaching its parent package with
7.4 File formats
Spatial datasets are usually stored as files or in spatial databases. File formats can either store vector or raster data, while spatial databases such as PostGIS can store both (see also section 9.6.2). Today the variety of file formats may seem bewildering but there has been much consolidation and standardization since the beginnings of GIS software in the 1960s when the first widely distributed program (SYMAP) for spatial analysis was created at Harvard University (Coppock and Rhind 1991).
GDAL (which should be pronounced “goo-dal”, with the double o making a reference to object-orientation), the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between geographic file formats since its release in 2000. GDAL provides a unified and high-performance interface for reading and writing of many raster and vector data formats. Many open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting-out geographic data in appropriate formats.
GDAL provides access to more than 200 vector and raster data formats. Table 7.2 presents some basic information about selected and often used spatial file formats.
|ESRI Shapefile||.shp (the main file)||One of the most popular vector file formats. Consists of at least three files. The main files size cannot exceed 2 GB. It lacks support for mixed type. Column names are limited to 10 characters, and number of columns are limited at 255. It has poor support for Unicode standard.||Vector||Partially open|
|GeoJSON||.geojson||Extends the JSON exchange format by including a subset of the simple feature representation.||Vector||Open|
|KML||.kml||XML-based format for spatial visualization, developed for use with Google Earth. Zipped KML file forms the KMZ format.||Vector||Open|
|GPX||.gpx||XML schema created for exchange of GPS data.||Vector||Open|
|GeoTIFF||.tiff||GeoTIFF is one of the most popular raster formats. Its structure is similar to the regular
|Arc ASCII||.asc||Text format where the first six lines represent the raster header, followed by the raster cell values arranged in rows and columns.||Raster||Open|
|R-raster||.gri, .grd||Native raster format of the R-package raster.||Raster||Open|
|SQLite/SpatiaLite||.sqlite||SQLite is a standalone, relational database management system. It is used as a default database driver in GRASS GIS 7. SpatiaLite is the spatial extension of SQLite providing support for simple features.||Vector and raster||Open|
|ESRI FileGDB||.gdb||Collection of spatial and nonspatial objects created in the ArcGIS software. It allows storage of multiple feature classes and enables use of topological definitions. Limited access to this format is provided by GDAL with the use of the OpenFileGDB and FileGDB drivers.||Vector and raster||Proprietary|
|GeoPackage||.gpkg||Lightweight database container based on SQLite allowing an easy and platform-independent exchange of geodata||Vector and raster||Open|
An important development ensuring the standardization and open-sourcing of file formats was the founding of the Open Geospatial Consortium (OGC) in 1994. Beyond defining the simple features data model (see section 2.1.1), the OGC also coordinates the development of open standards, for example as used in file formats such as KML and GeoPackage. Open file formats of the kind endorsed by the OGC have several advantages over proprietary formats: the standards are published, ensure transparency and open up the possibility for users to further develop and adjust the file formats to their specific needs.
ESRI’ Shapefile is the most popular vector data exchange format. However, it is not an open format (though its specification is open). It was developed in the early 1990s and has a number of limitations. First of all, it is a multi-file format, which consists of at least three files. It only supports 255 columns, column names are restricted to ten characters and the file size limit is to 2GB. Furthermore, Shapefile does not support all possible geometry types, for example, it is unable to distinguish between a polygon and a multipolygon.34 Despite these limitations, a viable alternative had been missing for a long time. In the meantime, GeoPackage emerged, and seems to be a more than suitable replacement candidate for ESRI’s Shapefile. Geopackage is a format for exchanging geospatial information and an OGC standard. The GeoPackage standard describes the rules how to store geospatial information in a tiny SQLite container. Hence, GeoPackage is a lightweight spatial database container, which allows the storage of vector and raster data but also of non-spatial data and extensions. Aside from GeoPackage there are other geospatial data exchange formats worth checking out (Table 7.2).
7.5 Data Input (I)
Executing commands such as
sf::st_read() (the main function we use for loading vector data) or
raster::raster() (the main function used for loading raster data) silently sets off a chain of events that reads data from files.
Moreover, there are many R packages containing a wide range of spatial data or providing simple access to different data sources.
All of them load the data into R or, more precisely, assign objects to your workspace, stored in RAM accessible from the
.GlobalEnv of the R session.
7.5.1 Vector data
Spatial vector data comes in a wide variety of file formats, most of which can be read-in via the sf function
Behind the scenes this calls GDAL.
To find out which data formats sf supports, run
Here, we show only the first five drivers (see Table 7.3):
|ESRI Shapefile||ESRI Shapefile||TRUE||FALSE||FALSE||TRUE||TRUE|
|KML||Keyhole Markup Language (KML)||TRUE||FALSE||FALSE||TRUE||TRUE|
The first argument of
dsn, which should be a text string or an object containing a single text string.
The content of a text string could vary between different drivers.
In most cases, as with the ESRI Shapefile (
.shp) or the
GeoPackage format (
dsn would be a file name.
st_read() guesses the driver based on the file extension, as illustrated for a
.gpkg file below:
vector_filepath = system.file("shapes/world.gpkg", package = "spData") world = st_read(vector_filepath) #> Reading layer `world' from data source `/home/travis/R/Library/spData/shapes/world.gpkg' using driver `GPKG' #> Simple feature collection with 177 features and 10 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -180 ymin: -90 xmax: 180 ymax: 83.6 #> epsg (SRID): 4326 #> proj4string: +proj=longlat +datum=WGS84 +no_defs
For some drivers,
dsn could be provided as a folder name, access credentials for a database, or a GeoJSON string representation (see the examples of the
st_read() help page for more details).
Some vector driver formats can store multiple data layers.
st_read automatically reads the first layer of the file specified in
dsn, however, using the
layer argument you can specify any other layer.
Naturally, some options are specific to certain drivers.35
For example, think of coordinates stored in a spreadsheet format (
To read in such files as spatial objects, we naturally have to specify the names of the columns (
Y in our example below) representing the coordinates.
We can do this with the help of the
To find out about possible options, please refer to the ‘Open Options’ section of the corresponding GDAL driver description.
For the comma-separated value (csv) format, visit http://www.gdal.org/drv_csv.html.
Instead of columns describing xy-coordinates, a single column can also contain the geometry information.
Well-known text (WKT), well-known binary (WKB), and the GeoJSON formats are examples of this.
For instance, the
world_wkt.csv file has a column named
WKT representing polygons of the world’s countries.
We will again use the
options parameter to indicate this.
Here, we will use
read_sf() which does exactly the same as
st_read() except it does not print the driver name to the console and stores strings as characters instead of factors.
st_set_crs()function. Please refer also to section 2.3 for more information.
As a final example, we will show how
st_read() also reads KML files.
A KML file stores geographic information in XML format - a data format for the creation of web pages and the transfer of data in an application-independent way (Nolan and Lang 2014).
Here, we access a KML file from the web.
This file contains more than one layer.
st_layers() lists all available layers.
We choose the first layer
Placemarks and say so with the help of the
layer parameter in
url = "https://developers.google.com/kml/documentation/KML_Samples.kml" st_layers(url) #> Driver: LIBKML #> Available layers: #> layer_name geometry_type features fields #> 1 Placemarks 3 11 #> 2 Styles and Markup 1 11 #> 3 Highlighted Icon 1 11 #> 4 Ground Overlays 1 11 #> 5 Screen Overlays 0 11 #> 6 Paths 6 11 #> 7 Polygons 0 11 #> 8 Google Campus 4 11 #> 9 Extruded Polygon 1 11 #> 10 Absolute and Relative 4 11 kml = read_sf(url, layer = "Placemarks")
7.5.2 Raster data
Similar to vector data, raster data comes in many file formats with some of them supporting even multilayer files.
raster() command reads in a single layer.
In case you want to read in a single band from a multilayer file use the
band parameter to indicate a specific layer.
If you want to read in all bands, use
Please refer to section 2.2.3 for information on the difference between raster stacks and bricks.
7.6 Data output (O)
Writing spatial data allows you to convert from one format to another and to save newly created objects.
Depending on the data type (vector or raster), object class (e.g
RasterLayer), and type and amount of stored information (e.g. object size, range of values) - it is important to know how to store spatial files in the most efficient way.
The next two sections will demonstrate how to do this.
7.6.1 Vector data
The counterpart of
It allows you to write sf objects to a wide range of geographic vector file formats, including the most common such as
Based on the file name,
st_write() decides automatically which driver to use.
The speed of the writing process depends also on the driver.
Note: if you try to write to the same data source again, the function will fail:
st_write(obj = world, dsn = "world.gpkg") #> Updating layer `world' to data source `/home/travis/build/Robinlovelace/geocompr/world.gpkg' using driver `GPKG' #> Warning in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : GDAL Error 1: Layer world already exists, CreateLayer failed. #> Use the layer creation option OVERWRITE=YES to replace it. #> Creating layer world failed. #> Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : Layer creation failed.
The error message provides some information as to why the function failed.
GDAL Error 1 statement makes clear that the failure occurred at the GDAL level.
Additionally, the suggestion to use
OVERWRITE=YES provides a clue about how to fix the problem.
However, this is not a
st_write() argument, it is a GDAL option.
st_write provides a
layer_options argument through which we can pass driver-dependent options:
Another solution is to use the
delete_layer. Setting it to
TRUE deletes already existing layers in the data source before the function attempts to write (note there is also a
You can achieve the same with
write_sf() since it is equivalent to (technically an alias for)
st_write(), except that its defaults for
layer_options argument could be also used for many different purposes.
One of them is to write spatial data to a text file.
This can be done by specifying
GEOMETRY inside of
It could be either
AS_XY for simple point datasets (it creates two new columns for coordinates) or
AS_WKT for more complex spatial data (one new column is created which contains the well-known-text representation of spatial objects).
7.6.2 Raster data
writeRaster() function saves
Raster* objects to files on disk.
The function expects input regarding output data type and file format, but also accepts GDAL options specific to a selected file format (see
?writeRaster for more details).
The raster package offers nine data types when saving a raster: LOG1S, INT1S, INT1U, INT2S, INT2U, INT4S, INT4U, FLT4S, and FLT8S.36
The data type determines the bit representation of the raster object written to disk (7.4).
Which data type to use depends on the range of the values of your raster object.
The more values a data type can represent, the larger the file will get on disk.
Commonly, one would use LOG1S for bitmap (binary) rasters.
Unsigned integers (INT1U, INT2U, INT4U) are suitable for categorical data, while float numbers (FLT4S and FLTS8S) usually represent continuous data.
writeRaster() uses FLT4S as the default.
While this works in most cases, the size of the output file will be unnecessarily large if you save binary or categorical data.
Therefore, we would recommend to use the data type that needs the least storage space but is still able to represent all values (check the range of values with the
|Data type||Minimum value||Maximum value|
|LOG1S||FALSE (0)||TRUE (1)|
The file extension determines the output file when saving a
Raster* object to disk.
For example, the
.tif extension will create a GeoTIFF file:
raster file format (native to the
raster package) is used when a file extension is invalid or missing.
Some raster file formats come with additional options.
You can use them with the
GeoTIFF files, for example, can be compressed using
writeFormats() returns a list with all supported file formats on your computer.
7.7 Visual outputs
R supports many different static and interactive graphics formats. The most general method to save a static plot is to open a graphic device, create a plot, and close it, for example:
Other available graphic devices include
You can specify several properties of the output plot, including width, height and resolution.
Additionally, several graphic packages provide their own functions to save a graphical output.
For example, the tmap package has the
You can save a
tmap object to different graphic formats by specifying the object name and a file path to a new graphic file.
On the other hand, you can save interactive maps created in the
mapview package as an HTML file or image using the
List and describe three types of vector, raster, and geodatabase formats.
Name at least two differences between
read_sf()and the more well-known function
cycle_hire_xy.csvfile from the spData package (Hint: it is located in the
misc\folder). What is a geometry type of the loaded object?
Download the borders of Germany using rnaturalearth, and create a new object called
germany_borders. Write this new object to a file of the GeoPackage format.
Download the global monthly minimum temperature with a spatial resolution of five minutes using the raster package. Extract the June values, and save them to a file named
tmin_june.tiffile (hint: use
Create a static map of Germany’s borders, and save it to a PNG file.
Create an interactive map using data from the
cycle_hire_xy.csvfile. Export this map to a file called
Gillespie, Colin, and Robin Lovelace. 2016. Efficient R Programming: A Practical Guide to Smarter Programming. O’Reilly Media.
Padgham, Mark, Bob Rudis, Robin Lovelace, and Maëlle Salmon. 2018. Osmdata: Import ’Openstreetmap’ Data as Simple Features or Spatial Objects. https://CRAN.R-project.org/package=osmdata.
Coppock, J Terry, and David W Rhind. 1991. “The History of GIS.” Geographical Information Systems: Principles and Applications, Vol. 1. 1 (1): 21–43.
Nolan, Deborah, and Duncan Temple Lang. 2014. XML and Web Technologies for Data Sciences with R. Use R! New York, NY: Springer.
Using INT4U is not recommended as R does not support 32-bit unsigned integers.↩