6 Geographic data I/O
- This chapter requires the following packages:
This chapter is about reading and writing geographic data. Geographic data import is an essential part of geocomputational software because without data real-world applications are impossible. The skills taught in this book will enable you to add value to data meaning that, for others to benefit from the results, data output is also vital. These two processes go hand-in-hand and are referred to as I/O — short for input/output — in Computer Science (Gillespie and Lovelace 2016). Hence the title of this chapter.
Geographic data I/O is almost always part of a wider process. It depends on knowing which datasets are available, where they can be found and how to retrieve them, topics covered in section 6.2. This section demonstrates how to access open access geoportals which collectively contain many terrabytes of data. There is a wide range of geographic file formats, each of which has pros and cons. These are described in section 6.3. The process of actually reading and writing such file formats efficiently is not covered until sections 6.4 and 6.5 respectively. The final section (6.6) demonstrates methods for saving visual outputs (maps), in preparation for a subsequent chapter dedicated to visualization.
6.2 Retrieving open data
Nowadays, a vast amount of spatial data is available on the internet. Best of all, much of it is freely available. You just have to know where to look. While we cannot provide a comprehensive guide to all available geodata, we point to a few of the most important sources. Various ‘geoportals’ (web services providing geographic data such as the geospatial datasets in Data.gov) are a good place to start, providing a wide range of geographic data. Geoportals are a very useful data source but often only contain data for a specific locations (see the Wikipedia page on geoportals for a list of geoportals covering many areas of the world).
To overcome this limitation some global geoportals have been developed. The GEOSS portal, for example, provides global remote sensing data. Additional geoportals with global coverage and an emphasis on raster data include the EarthExplorer and the Copernicus Open Access Hub. A wealth of European data is available from the INSPIRE geoportal.
Typically, geoportals provide an interface that lets you query interactively the existing data (spatial and temporal extent, and product). However, in this book we encourage you to create reproducible workflows. In many cases data downloads can be scripted via download calls to static URLs or APIs (see the Sentinel API for example), saving time and enabling others to repeat and update the unglamorous data download process.
Traditionally, files have been stored on servers.
You can easily download such files with the
For example, to download National Park Service units in the United States, run:
Specific R packages providing an interface to spatial libraries or geoportals are even more user-friendly (Table 6.1).
|osmdata||Download and import of OpenStreetMap data.|
|rnaturalearth||Functions to download Natural Earth vector and raster data, including world country borders.|
|rnoaa||An R interface to National Oceanic and Atmospheric Administration (NOAA) climate data.|
|rWBclimate||An access to the World Bank climate data.|
For example, you can get the borders of any country with the
ne_countries() function from the rnaturalearth package:
As a default, rnaturalearth returns the output as a
You can easily convert it to the
sf class with
As a second example, we will download a raster dataset.
The code below downloads a series of rasters that contains global monthly precipitation sums.
The spatial resolution is ten minutes.
The result is a multilayer object of class
A third example uses the recently developed package osmdata (Padgham et al. 2018) to find parks from the OpenStreetMap (OSM) database.
As illustrated in the code-chunk below, queries begin with the function
opq() (short for OpenStreetMap query), the first argument of which is bounding box, or text string representing a bounding box (the city of Leeds in this case).
The result is passed to a function for selecting which OSM elements we’re interested in (parks in this case), represented by key-value pairs, which in turn is passed to the function
osmdata_sf() which does the work of downloading the data and converting it into a list of
sf objects (see
vignette('osmdata') for further details):
OpenStreetMap is a vast global database of crowd-sourced data and it is growing by the minute. Although the quality is not as spatially consistent as many official datasets, OSM data have many advantages: they are globally available free of charge and using crowd-source data can encourage ‘citizen science’ and contributions back to the digital commons. Further examples of osmdata in action are provided in Chapters 7 and 8.
Finally, R packages might contain or just consist of spatial data, such as spData which provides example datasets used in this book.
You can access such data with the
For example, you can get a list of dataset in a package,
data(package = "spData").
To attach the dataset to the global environment specify the name of a dataset (
data("cycle_hire", package = "spData")).
Sometimes, packages come also with the original files.37
To load such a file from the package, you need to specify the package name and the relative path to the dataset, for example:
world_raw_filepath = system.file("shapes/world.gpkg", package = "spData") world_raw = st_read(world_raw_filepath) #> Reading layer `wrld.gpkg' from data source `/home/travis/R/Library/spData/shapes/world.gpkg' using driver `GPKG' #> Simple feature collection with 177 features and 10 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -180 ymin: -90 xmax: 180 ymax: 83.6 #> epsg (SRID): 4326 #> proj4string: +proj=longlat +datum=WGS84 +no_defs
6.3 File formats
Spatial datasets are usually stored as files or in spatial databases. File formats can either store vector or raster data, while spatial databases such as PostGIS can store both. Today file formats may seem bewildering but there has been much consolidation and standardisation since the beginnings of GIS software in the 1960s when the first widely distributed program (SYMAP) for spatial analysis was created at Harvard University (Coppock and Rhind 1991).
GDAL,38 the Geospatial Data Abstraction Library, has resolved many issues associated with incompatibility between file formats since its release in 2000. GDAL provides a unified and high-performance interface for reading and writing of many raster and vector data formats. Many open and proprietary GIS programs, including GRASS, ArcGIS and QGIS, use GDAL behind their GUIs for doing the legwork of ingesting and spitting-out geographic data in appropriate formats.
An important development ensuring greater standardisation and open-sourcing of file formats was the founding of the Open Geospatial Consortium (OGC) in 1994.39 The OGC coordinates the development of open standards for geospatial content including file formats such as KML and GeoPackage. As described in Chapter 2 the OGC publishes the simple feature data model, which underlies the vector data classes provided by sf and used in this book. Open file formats of the kind endorsed by the OGC have several advantages over proprietary formats: the standards are published, ensuring transparency and enabling innovation to improve the file formats.
There are more than 100 spatial data formats exist available to R users via GDAL. Table 6.2 presents some basic information about selected and often used spatial file formats.
|ESRI Shapefile||.shp (the main file)||One of the most popular vector file formats. Consists of at least three files. The main files size cannot exceed 2 GB. It lacks support for mixed type. Column names are limited to 10 characters, and number of columns are limited at 255. It has poor support for Unicode standard.||Vector||Partially open|
|GeoJSON||.geojson||Extends the JSON exchange format by including a subset of the simple feature representation.||Vector||Open|
|KML||.kml||XML-based format for spatial visualization, developed for use with Google Earth. Zipped KML file forms the KMZ format.||Vector||Open|
|GPX||.gpx||XML schema created for exchange of GPS data.||Vector||Open|
|GeoTIFF||.tiff||GeoTIFF is one of the most popular raster formats. Its structure is similar to the regular
|Arc ASCII||.asc||Text format where the first six lines represent the raster header, followed by the raster cell values arranged in rows and columns.||Raster||Open|
|R-raster||.gri, .grd||Native raster format of the R-package raster.||Raster||Open|
|SQLite/SpatiaLite||.sqlite||SQLite is a standalone, relational database management system. It is used as a default database driver in GRASS GIS 7. SpatiaLite is the spatial extension of SQLite providing support for simple features.||Vector and raster||Open|
|ESRI FileGDB||.gdb||Collection of spatial and nonspatial objects created in the ArcGIS software. It allows storage of multiple feature classes and enables use of topological definitions. Limited access to this format is provided by GDAL with the use of the OpenFileGDB and FileGDB drivers.||Vector and raster||Proprietary|
|GeoPackage||.gpkg||Lightweight database container based on SQLite allowing an easy and platform-independent exchange of geodata||Vector and raster||Open|
6.4 Data Input (I)
Executing commands such as
sf::st_read() (the main function we use for loading vector data) or
raster::raster() (the main function used for loading raster data) silently sets off a chain of events that reads data from files.
Moreover, there are many R packages containing a wide range of spatial data or providing simple access to different data sources.
All of them load the data into R or, more precisely, assign objects to your workspace, stored in RAM accessible from the
.GlobalEnv40 of your current R session.
6.4.1 Vector data
Spatial vector data comes in a wide variety of file formats, most of which can be read-in via the sf function
Behind the scenes this calls GDAL.
To find out which data formats sf supports, run
Here, we show only the first five drivers (see Table 6.3):
|ESRI Shapefile||ESRI Shapefile||TRUE||FALSE||FALSE||TRUE||TRUE|
|KML||Keyhole Markup Language (KML)||TRUE||FALSE||FALSE||TRUE||TRUE|
The first argument of
dsn, which should be a text string or an object containing a single text string.
The content of a text string could vary between different drivers.
In most cases, as with the ESRI Shapefile (
.shp) or the
GeoPackage format (
dsn would be a file name.
st_read() guesses the driver based on the file extension, as illustrated for a
.gpkg file below:
vector_filepath = system.file("shapes/world.gpkg", package = "spData") world = st_read(vector_filepath) #> Reading layer `wrld.gpkg' from data source `/home/travis/R/Library/spData/shapes/world.gpkg' using driver `GPKG' #> Simple feature collection with 177 features and 10 fields #> geometry type: MULTIPOLYGON #> dimension: XY #> bbox: xmin: -180 ymin: -90 xmax: 180 ymax: 83.6 #> epsg (SRID): 4326 #> proj4string: +proj=longlat +datum=WGS84 +no_defs
For some drivers,
dsn could be provided as a folder name, access credentials for a database, or a GeoJSON string representation (see the examples of the
st_read() help page for more details).
Some vector driver formats can store multiple data layers.
st_read automatically reads the first layer of the file specified in
dsn, however, using the
layer argument you can specify any other layer.
Naturally, some options are specific to certain drivers.41
For example, think of coordinates stored in a spreadsheet format (
To read in such files as spatial objects, we naturally have to specify the names of the columns (
Y in our example below) representing the coordinates.
We can do this with the help of the
To find out about possible options, please refer to the ‘Open Options’ section of the corresponding GDAL driver description.
For the comma-separated value (csv) format, visit http://www.gdal.org/drv_csv.html.
Instead of columns describing xy-coordinates, a single column can also contain the geometry information.
Well-known text (WKT), well-known binary (WKB), and the GeoJSON formats are examples of this.
For instance, the
world_wkt.csv file has a column named
WKT representing polygons of the world’s countries.
We will again use the
options parameter to indicate this.
Here, we will use
read_sf() which does exactly the same as
st_read() except it does not print the driver name to the console and stores strings as characters instead of factors.
st_set_crs()function. Please refer also to section 2.3 for more information.
As a final example, we will show how
st_read() also reads KML files.
A KML file stores geographic information in XML format - a data format for the creation of web pages and the transfer of data in an application-independent way (Nolan and Lang 2014).
Here, we access a KML file from the web.
This file contains more than one layer.
st_layers() lists all available layers.
We choose the first layer
Placemarks and say so with the help of the
layer parameter in
url = "https://developers.google.com/kml/documentation/KML_Samples.kml" st_layers(url) #> Driver: LIBKML #> Available layers: #> layer_name geometry_type features fields #> 1 Placemarks 3 11 #> 2 Styles and Markup 1 11 #> 3 Highlighted Icon 1 11 #> 4 Ground Overlays 1 11 #> 5 Screen Overlays 0 11 #> 6 Paths 6 11 #> 7 Polygons 0 11 #> 8 Google Campus 4 11 #> 9 Extruded Polygon 1 11 #> 10 Absolute and Relative 4 11 kml = read_sf(url, layer = "Placemarks")
6.4.2 Raster data
Similar to vector data, raster data comes in many file formats with some of them supporting even multilayer files.
raster() command reads in a single layer.
In case you want to read in a single band from a multilayer file use the
band parameter to indicate a specific layer.
If you want to read in all bands, use
Please refer to section 2.2.3 for information on the difference between raster stacks and bricks.
6.5 Data output (O)
Writing spatial data allows you to convert from one format to another and to save newly created objects.
Depending on the data type (vector or raster), object class (e.g
RasterLayer), and type and amount of stored information (e.g. object size, range of values) - it is important to know how to store spatial files in the most efficient way.
The next two sections will demonstrate how to do this.
6.5.1 Vector data
The counterpart of
It allows you to write sf objects to a wide range of geographic vector file formats, including the most common such as
Based on the file name,
st_write() decides automatically which driver to use.
The speed of the writing process depends also on the driver.
Note: if you try to write to the same data source again, the function will fail:
st_write(obj = world, dsn = "world.gpkg") #> Updating layer `world' to data source `/home/travis/build/Robinlovelace/geocompr/world.gpkg' using driver `GPKG' #> Warning in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : GDAL Error 1: Layer world already exists, CreateLayer failed. #> Use the layer creation option OVERWRITE=YES to replace it. #> Creating layer world failed. #> Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), : Layer creation failed.
The error message provides some information as to why the function failed.
GDAL Error 1 statement makes clear that the failure occurred at the GDAL level.
Additionally, the suggestion to use
OVERWRITE=YES provides a clue about how to fix the problem.
However, this is not a
st_write() argument, it is a GDAL option.
st_write provides a
layer_options argument through which we can pass driver-dependent options:
Another solution is to use the
delete_layer. Setting it to
TRUE deletes already existing layers in the data source before the function attempts to write (note there is also a
You can achieve the same with
write_sf() since it is equivalent to (technically an alias for)
st_write(), except that its defaults for
layer_options argument could be also used for many different purposes.
One of them is to write spatial data to a text file.
This can be done by specifying
GEOMETRY inside of
It could be either
AS_XY for simple point datasets (it creates two new columns for coordinates) or
AS_WKT for more complex spatial data (one new column is created which contains the well-known-text representation of spatial objects).
6.5.2 Raster data
writeRaster() function saves
Raster* objects to files on disk.
The function expects input regarding output datatype and file format, but also accepts GDAL options specific to a selected file format (see
?writeRaster for more details).
The raster package offers nine datatypes when saving a raster: LOG1S, INT1S, INT1U, INT2S, INT2U, INT4S, INT4U, FLT4S, and FLT8S.42
The datatype determines the bit representation of the raster object written to disk (6.4).
Which datatype to use depends on the range of the values of your raster object.
The more values a datatype can represent, the larger the file will get on disk.
Commonly, one would use LOG1S for bitmap (binary) rasters.
Unsigned integers (INT1U, INT2U, INT4U) are suitable for categorical data, while float numbers (FLT4S and FLTS8S) usually represent continuous data.
writeRaster() uses FLT4S as the default.
While this works in most cases, the size of the output file will be unnecessarly large if you save binary or categorical data.
Therefore, we would recommend to use the datatype that needs the least storage space but is still able to represent all values (check the range of values with the
|Datatype||Minimum value||Maximum value|
|LOG1S||FALSE (0)||TRUE (1)|
The file extension determines the output file when saving a
Raster* object to disk.
For example, the
.tif extension will create a GeoTIFF file:
raster file format (native to the
raster package) is used when a file extension is invalid or missing.
Some raster file formats come with additional options.
You can use them with the
For example, GeoTIFF allows you to compress the output raster with the
writeFormats() returns a list with all supported file formats on your computer.
6.6 Visual outputs
R supports many different static and interactive graphics formats. The most general method to save a static plot is to open a graphic device, create a plot, and close it, for example:
Other available graphic devices include
You can specify several properties of the output plot, including width, height and resolution.
Additionally, several graphic packages provide thier own functions to save a graphical output.
For example, the tmap package has the
You can save a
tmap object to different graphic formats by specifying the object name and a file path to a new graphic file.
On the other hand, you can save interactive maps created in the
mapview package as an HTML file or image using the
List and describe three types of vector, raster, and geodatabase formats.
Name at least two differences between
read_sf()and the more well-known function
cycle_hire_xy.csvfile from the spData package (Hint: it is located in the
misc\folder). What is a geometry type of the loaded object?
Download the borders of Germany using rnaturalearth, and create a new object called
germany_borders. Write this new object to a file of the GeoPackage format.
Download the global monthly minimum temperature with a spatial resolution of five minutes using the raster package. Extract the June values, and save them to a file named
tmin_june.tiffile (hint: use
Create a static map of Germany’s borders, and save it to a PNG file.
Create an interactive map using data from the
cycle_hire_xy.csvfile. Export this map to a file called
Gillespie, Colin, and Robin Lovelace. 2016. Efficient R Programming: A Practical Guide to Smarter Programming. O’Reilly Media.
Padgham, Mark, Bob Rudis, Robin Lovelace, and Maëlle Salmon. 2018. Osmdata: Import ’Openstreetmap’ Data as Simple Features or Spatial Objects. https://github.com/ropensci/osmdata.
Coppock, J Terry, and David W Rhind. 1991. “The History of GIS.” Geographical Information Systems: Principles and Applications, Vol. 1. 1 (1): 21–43.
Nolan, Deborah, and Duncan Temple Lang. 2014. XML and Web Technologies for Data Sciences with R. Use R! New York, NY: Springer.
Data loaded with
data()often is a R dataset (
Using INT4U is not recommended as R does not support 32-bit unsigned integers.↩