# 5 Geometric operations

## Prerequisites

• This chapter uses the same packages as Chapter 4 but with the addition of spDataLarge, which was installed in Chapter 2 (lwgeom is also used, but does not need to be attached):
library(sf)
library(raster)
library(tidyverse)
library(spData)
library(spDataLarge)

## 5.1 Introduction

The previous three chapters have demonstrated how geographic datasets are structured in R (Chapter 2) and how to manipulate them based on their non-geographic attributes (Chapter 3) and spatial properties (Chapter 4). This chapter extends these skills. After reading it — and attempting the exercises at the end — you should understand and have control over the geometry column in sf objects and the geographic location of pixels represented in rasters.

Section 5.3 covers transforming vector geometries with ‘unary’ and ‘binary’ operations. Unary operations work on a single geometry in isolation. This includes simplification (of lines and polygons), the creation of buffers and centroids, and shifting/scaling/rotating single geometries using ‘affine transformations’ (sections 5.3.1 to 5.3.4). Binary transformations modify one geometry based on the shape of another. This includes clipping and geometry unions, covered in sections 5.3.5 and 5.3.6 respectively. Type transformations (from a polygon to a line, for example) are demonstrated in section 5.3.7.

Section 5.4 covers geometric transformations on raster objects. This involves changing the size and number of the underlying pixels, and assigning them new values. It teaches how to change the resolution (also called raster aggregation and disaggregation), the extent and the origin of a raster. These operations are especially useful if one would like to align raster datasets from diverse sources. Aligned raster objects share the same header information, allowing them to be processed using map algebra operations, described in section 4.3.2.

A vital type of geometry transformation is reprojecting from one coordinate reference system (CRS) to another. Because of the importance of reprojection, introduced in Chapter 2, and the fact that it applies to raster and vector geometries alike, it is the topic of the first section in this chapter.

## 5.2 Reprojecting geographic data

Section 2.3 introduced coordinate reference systems (CRSs) and demonstrated their importance. This section goes further. It highlights issues that can arise when using inappropriate CRSs and how to transform data from one CRS to another.

As illustrated in Figure 2.1, there are two types of CRS: geographic (‘lon/lat’, with units in degrees longitude and latitude) and projected (typically in units of meters from a datum). This has consequences because many geometric operations a projected CRS: most geometric operations in sf, for example, assume a projected CRS and generate warnings if the data is geographic, using the function st_is_longlat() (this is because under the hood GEOS assumes projected data). Unfortunately R does not always know the CRS of an object, as shown below using the example of London introduced in section 2.1:

london = data.frame(lon = -0.1, lat = 51.5) %>%
st_as_sf(coords = c("lon", "lat"))
st_is_longlat(london)
#> [1] NA

This shows that unless a CRS is manually specifified or is loaded from a source that has CRS metadata, the CRS is NA. A CRS can be added to sf objects with st_set_crs() as follows:26

london_geo = st_set_crs(london, 4326)
st_is_longlat(london_geo)
#> [1] TRUE

If the CRS of an object with lon/lat coordinates is not specified CRSs, this can lead to problems. An example is provided below, which creates a buffer of one unit around london and london_geo objects:

london_buff_no_crs = st_buffer(london, dist = 1)
london_buff = st_buffer(london_geo, dist = 1)
#> Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs): st_buffer does
#> not correctly buffer longitude/latitude data
#> dist is assumed to be in decimal degrees (arc_degrees).

Only the second operation generates a warning. The warning message it useful, telling us that result may be of limited use because it is in units of latitude and longitude, rather than meters or some other suitable measure of distance assumed by st_buffer(). The consequences of a failure to work on projected data are illustrated in Figure 5.1 (left panel): the buffer is elongated in the north-south direction because lines of longitude converge towards the Earth’s poles.

The distance between two lines of longitude, called meridians, is around 111 km at the equator (execute geosphere::distGeo(c(0, 0), c(1, 0)) to find the precise distance). This shrinks to zero at the poles. At the latitude of London, for example, meridians are less than 70 km apart (challenge: execute code that verifies this). Lines of latitude, by contrast, have constant distance from each other irrespective of latitude: they are always around 111 km apart, including at the equator and near the poles. This is illustrated in Figures 5.1 and 5.3.

Do not interpret the warning about the geographic (longitude/latitude) CRS as “the CRS should not be set”: it almost always should be! It is better understood as a suggestion to reproject the data onto a projected CRS. This suggestion does not always need to be heeded: performing spatial and geometric operations makes little or no difference in some cases (e.g. spatial subsetting). But for operations involving distances such as buffering, the only way to ensure a good result is to create a projected copy of the data and run the operation on that. This is done in the code chunk below:

london_proj = data.frame(x = 530000, y = 180000) %>%
st_as_sf(coords = 1:2, crs = 27700)

The result is a new object that is identical to london, but reprojected onto a suitable CRS (the British National Grid, which has an EPSG code of 27700 in this case) that has units of meters. We can verify that the CRS has changed using st_crs() as follows (some of the output has been replace by ...):

st_crs(london_proj)
#> Coordinate Reference System:
#>   EPSG: 27700
#>   proj4string: "+proj=tmerc +lat_0=49 +lon_0=-2 ... +units=m +no_defs"

Notable components of this CRS description include the EPSG code (EPSG: 27700), the projection (transverse Mercator, +proj=tmerc), the origin (+lat_0=49 +lon_0=-2) and units (+units=m).27 The fact that the units of the CRS are meters (rather than degrees) tells us that this is a projected CRS: st_is_longlat(london_proj) now returns FALSE and geometry operations on london_proj will work without a warning, meaning buffers can be produced from it using proper units of distance. As pointed out above, moving one degree means moving a bit more than 111 km at the equator (to be precise: 111,320 meters). This is used as the new buffer distance:

london_proj_buff = st_buffer(london_proj, 111320)

The result in Figure 5.1 (right panel) shows that buffers based on a projected CRS are not distorted: every part of the buffer’s border is equidistant to London.

The importance of CRSs (primarily whether they are projected or geographic) has been demonstrated using the example of London. The subsequent sections go into more depth, exploring which CRS to use and the details of reprojecting vector and raster objects.

### 5.2.1 When to reproject?

The previous section showed how to set the CRS manually, with st_set_crs(london, 4326). In real world applications, however, CRSs are usually set automatically when data is read-in. The main task involving CRSs is often to transform objects, from one CRS into another. But when should data be transformed? And into which CRS? There are no clear-cut answers to these questions and CRS selection always involves trade-offs (Maling 1992). However there are some general principles, provided in this section, that can help decide.

First it’s worth considering when to transform. In some cases transformation to a projected CRS is essential, such as when using geometric functions such as st_buffer(), Figure 5.1 shows. Conversely, publishing data online with the leaflet package may require a geographic CRS. Another case is when two objects with different CRSs must be compared or combined, as shown when we try to find the distance between two objects with different CRSs:

st_distance(london_geo, london_proj)
# > Error: st_crs(x) == st_crs(y) is not TRUE

To make the london and london_proj objects geographically comparable one of them must be transformed into the CRS of the other. But which CRS to use? The answer is usually ‘to the projected CRS’, which in this case is the British National Grid (BNG, EPSG:27700):

london2 = st_transform(london_geo, 27700)

Now that a transformed version of london has been created, using the sf function st_transform(), the distance between the two representations of London can be found. It may come as a surprise that london and london2 are just over 2 km apart!28

st_distance(london2, london_proj)
#> Units: m
#>      [,1]
#> [1,] 2018

### 5.2.2 Which CRS to use?

The question of which CRS is tricky, and there is rarely a ‘right’ answer: “There exist no all-purpose projections, all involve distortion when far from the center of the specified frame” (Bivand, Pebesma, and Gómez-Rubio 2013). For geographic CRSs the answer is often WGS84, not only for web mapping (covered in the previous paragraph) but also because GPS datasets and thousands of raster and vector datasets are provided in this CRS by default. WGS84 is the most common CRS in the world, so it is worth knowing its EPSG code: 4326. This ‘magic number’ can be used to convert objects with unusual projected CRSs into something that is widely understood.

What about when a projected CRS is required? In some cases it is not something that we are free to decide: “often the choice of projection is made by a public mapping agency” (Bivand, Pebesma, and Gómez-Rubio 2013). This means that when working with local data sources, it is likely preferable to work with the CRS in which the data was provided, to ensure compatibility, even if the official CRS is not the most accurate. The example of London was easy to answer because a) the CRS ‘BNG’ (with its associated EPSG code 27700) is well-known and b) the original dataset (london) already had that CRS.

In cases where an appropriate CRS is not immediately clear, the choice of CRS should depend on the properties that are most important to preserve in the subsequent maps and analysis. All CRSs are either equal area, equi-distant, conformal (with shapes remaining unchanged), or some combination of compromises of those. Custom CRSs with local parameters can be created for a region of interest and multiple CRSs can be used in projects when no single CRS suits all tasks. ‘Geodesic calculations’ can provide a fall-back if no CRSs are appropriate (see proj4.org/geodesic.html). For any projected CRS the results may not be accurate when used on geometries covering hundreds of kilometers.

When deciding a custom CRS we recommend the following:29

• A Lambert azimuthal equal-area (LAEA) projection for a custom local projection (set lon_0 and lat_0 to the center of the study area), which is an equal-area projection at all locations but distorts shapes beyond thousands of kilometres.
• Azimuthal equidistant (AEQD) projections for a specifically accurate straight-line distance between a point and the centre point of the local projection.
• Lambert conformal conic (LCC) projections for regions covering thousands of kilometres, with the cone set to keep distance and area properties reasonable between the secant lines.
• Stereographic (STERE) projections for polar regions, but taking care not to rely on area and distance calculations thousands of kilometres from the center.

A commonly used default is Universal Transverse Mercator (UTM), a set of CRSs that divides the Earth into 60 longitudinal wedges and 20 latitudinal segments. The transverse Mercator projection used by UTM CRSs is conformal but distorts areas and distances with increasing severity with distance from the center of the UTM zone. Documentation from the GIS software Manifold therefore suggests restricting the longitudinal extent of projects using UTM zones to 6 degrees from the central meridian (source: manifold.net).

Almost everywhere on Earth has a UTM code, such as “60H” which refers to northern New Zealand where R was invented. All UTM projections have the same datum (WGS84) and their EPSG codes run sequentially from 32601 to 32660 (for northern hemisphere locations) and 32701 to 32760 (southern hemisphere locations).

To show how the system works let’s create a function, lonlat2UTM() to calculate the EPSG code associated with any point on the planet as follows:30

lonlat2UTM = function(lonlat) {
utm = (floor((lonlat[1] + 180) / 6) %% 60) + 1
if(lonlat[2] > 0) utm + 32600 else
utm + 32700
}

The following commands uses this function to identify the UTM zone and associated EPSG code for Auckland and London:

epsg_utm_auk = lonlat2UTM(c(174.7, -36.9))
epsg_utm_lnd = lonlat2UTM(st_coordinates(london))
st_crs(epsg_utm_auk)$proj4string #> [1] "+proj=utm +zone=60 +south +datum=WGS84 +units=m +no_defs" st_crs(epsg_utm_lnd)$proj4string
#> [1] "+proj=utm +zone=30 +datum=WGS84 +units=m +no_defs"

Maps of UTM zones such as that provided by dmap.co.uk confirm: the London is in UTM zone 30U.

Another approach to automatically selecting a projected CRS specific to a local dataset is to create an azimuthal equidistant (AEQD) projection for the center-point of the study area. This involves creating a custom CRS (with no EPSG code) with units of meters based on the centerpoint of a dataset. This approach should be used with caution: no other datasets will be compatible with the custom CRS created and results may not be accurate when used on extensive datasets covering hundreds of kilometers.

Although we used vector datasets to illustrate the points outlined in this section, the principles apply equally to raster datasets. The subsequent sections explain features of CRS transformation that are unique to each geographic data model, continuing with vector data in the next section (section 5.2.3) and moving-on to explain how raster transformation is different, in section 5.2.5.

### 5.2.3 Reprojecting vector geometries

Chapter 2 demonstrated how vector geometries are made-up of points, and how points form the basis of more complex objects such as lines and polygons. Reprojecting vectors thus consists of transforming the coordinates of these points. This is illustrated by cycle_hire_osm, an sf object from spData that represents cycle hire locations across London. The previous section showed how the CRS of vector data can be queried with st_crs(). Although the output of this function is printed as a single entity, the result is in fact a named list of class crs, with names proj4string (which contains full details of the CRS) and epsg for its code. This is demonstrated below:

crs_lnd = st_crs(cycle_hire_osm)
class(crs_lnd)
#> [1] "crs"
crs_lnd$epsg #> [1] 4326 This duality of CRS objects means that they can be set either using an EPSG code or a proj4string. This means that st_crs("+proj=longlat +datum=WGS84 +no_defs") is equivalent to st_crs(4326), although not all proj4strings have an associated EPSG code. Both elements of the CRS are changed by transforming the object to a projected CRS: cycle_hire_osm_projected = st_transform(cycle_hire_osm, 27700) The resulting object has a new CRS with an EPSG code 27700. But how to find out more details about this EPSG code, or any code? One option is to search for it online. Another option is to use a function from the rgdal package to find the name of the CRS: crs_codes = rgdal::make_EPSG()[1:2] dplyr::filter(crs_codes, code == 27700) #> code note #> 1 27700 # OSGB 1936 / British National Grid The result shows that the EPSG code 27700 represents the British National Grid, a result that could have been found by searching online for “EPSG 27700”. But what about the proj4string element? proj4strings are text strings in a particular format the describe the CRS. They can be seen as formulas for converting a projected point into a point on the surface of the Earth and can be accessed from crs objects as follows (see proj4.org for further details of what the output means): st_crs(27700)$proj4string
#> [1] "+proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 ...
Printing a spatial object in the console, automatically returns its coordinate reference system. To access and modify it explicitly, use the st_crs function, for example, st_crs(cycle_hire_osm).

### 5.2.4 Modifying map projections

Established CRSs captured by EPSG codes are well-suited for many applications. However in some cases it is desirable to create a new CRS, using a custom proj4string. This system allows a very wide range of projections to be created, as we’ll see in some of the custom map projections in this section.

A long and growing list of projections has been developed and many of these these can be set with the +proj= element of proj4strings.31 When mapping the world while preserving areal relationships, the Mollweide projection is a good choice (Jenny et al. 2017) (Figure 5.2). To use this projection, we need to specify it using the proj4string element, "+proj=moll", in the st_transform function:

world_mollweide = st_transform(world, crs = "+proj=moll")

On the other hand, when mapping the world, it is often desirable to have as little distortion as possible for all spatial properties (area, direction, distance). One of the most popular projections to achieve this is the Winkel tripel projection (Figure 5.3).32 st_transform_proj() from the lwgeom package allows for coordinate transformations to a wide range of CRSs, including the Winkel tripel projection:

world_wintri = lwgeom::st_transform_proj(world, crs = "+proj=wintri")
The two main functions for transformation of simple features coordinates are sf::st_transform() and sf::sf_project(). The st_transform function uses the GDAL interface to PROJ.4, while sf_project() (which works with two-column numeric matrices, representing points) and lwgeom::st_transform_proj() use the PROJ.4 API directly. The first one is appropriate for most situations, and provides a set of the most often used parameters and well defined transformations. The second one allows for greater customization of a projection, which includes cases when some of the PROJ.4 parameters (e.g., +over) or projection (+proj=wintri) is not available in st_transform().

Moreover, PROJ.4 parameters can be modified in most CRS definitions. The below code transforms the coordinates to the Lambert azimuthal equal-area projection centered on longitude and latitude of 0 (Figure 5.4).

world_laea1 = st_transform(world, crs = "+proj=laea +x_0=0 +y_0=0 +lon_0=0 +lat_0=0")

We can change the PROJ.4 parameters, for example the center of the projection using the +lon_0 and +lat_0 parameters. The code below gives the map centered on New York City (Figure 5.5).

world_laea2 = st_transform(world, crs = "+proj=laea +x_0=0 +y_0=0 +lon_0=-74 +lat_0=40")

More information on CRS modifications can be found in the Using PROJ.4 documentation.

### 5.2.5 Reprojecting raster geometries

The projection concepts described in the previous section apply equally to rasters. However, there are important differences in reprojection of vectors and rasters: transforming a vector object involves changing the coordinates of every vertex but this does not apply to raster data. Rasters are composed of rectangular cells of the same size (expressed by map units, such as degrees or meters), so it is impossible to transform coordinates of pixels separately. Raster reprojection involves creating a new raster object, often with a different number of columns and rows than the original. The attributes must subsequently be re-estimated, allowing the new pixels to be ‘filled’ with appropriate values. Thus in most cases when both raster and vector data are used, it is better to avoid reprojecting rasters and reproject vectors instead.

The raster reprojection process is done with projectRaster() from the raster package. Like the st_transform() function demonstrated in the previous section, projectRaster() takes a geographic object (a raster dataset in this case) and a crs argument. However, projectRaster() only accepts the lengthy proj4string definitions of a CRS rather than concise EPSG codes.

It is possible to use a EPSG code in a proj4string definition with "+init=epsg:MY_NUMBER". For example, one can use the "+init=epsg:4326" definition to set CRS to WGS84 (EPSG code of 4326). The PROJ.4 library automatically adds the rest of parameters and converts it into "+init=epsg:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0",

Let’s take a look at two examples of raster transformation - using categorical and continuous data. Land cover data are usually represented by categorical maps. The nlcd2011.tif file provides information for a small area in Utah, USA obtained from National Land Cover Database 2011 in the NAD83 / UTM zone 12N CRS.

cat_raster = raster(system.file("raster/nlcd2011.tif", package = "spDataLarge"))
crs(cat_raster)
#> CRS arguments:
#>  +proj=utm +zone=12 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m
#> +no_defs

In this region, 14 land cover classes were distinguished33:

unique(cat_raster)
#>  [1] 11 21 22 23 31 41 42 43 52 71 81 82 90 95

When reprojecting categorical raster, we need to ensure that our new estimated values would still have values of our original classes. This could be done using the nearest neighbor method (ngb). In this method, value of the output cell is calculated based on the nearest cell center of the input raster.

For example, we want to change the CRS to WGS 84. It can be desired when we want to visualize a raster data on top of a web basemap, such as the Google or OpenStreetMap map tiles. The first step is to obtain the proj4 definition of this CRS, which can be done using the http://spatialreference.org webpage. The second and last step is to define the reprojection method in the projectRaster() function, which in case of categorical data is the nearest neighbor method (ngb):

wgs84 = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
cat_raster_wgs84 = projectRaster(cat_raster, crs = wgs84, method = "ngb")

Many properties of the new object differ from the previous one, which include the number of columns and rows (and therefore number of cells), resolution (transformed from meters into degrees), and extent, as illustrated in Table 5.1 (note that the number of categories increases from 14 to 15 because of the addition of NA values, not because a new category has been created — the land cover classes are preserved).

Table 5.1: Key attributes in the original (‘cat_raster’) and projected (‘cat_raster_wgs84’) categorical raster datasets.
CRS nrow ncol ncell resolution unique_categories
NAD83 1359 1073 1458207 31.5275 14
WGS84 1394 1111 1548734 0.0003 15

Reprojecting raster data with continuous (numeric or in this case integer) values follows an almost identical procedure. srtm.tif in spDataLarge contains a digital elevation model from the Shuttle Radar Topography Mission (SRTM) representing height above sea level (elevation) in meters. Its CRS is WGS84:

con_raster = raster(system.file("raster/srtm.tif", package = "spDataLarge"))
crs(con_raster)
#> CRS arguments:
#>  +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0

We will reproject this dataset into a projected CRS, but not with the nearest neighbor method which is appropriate for categorical data. Instead we will use the bilinear method which computes the output cell value based on the four nearest cells in the original raster. The values in the projected dataset are the distance-weighted average of the values from these four cells: the closer the input cell is to the center of the output cell, the greater its weight. The following commands create a text string representing the Oblique Lambert azimuthal equal-area projection, and reproject the raster into this CRS, using the bilinear method:

equalarea = "+proj=laea +lat_0=37.32 +lon_0=-113.04"
con_raster_ea = projectRaster(con_raster, crs = equalarea, method = "bilinear")
crs(con_raster_ea)
#> CRS arguments:
#>  +proj=laea +lat_0=37.32 +lon_0=-113.04 +ellps=WGS84

Raster reprojection on numeric variables also leads to small changes values and spatial properties, such as the number of cells, resolution, and extent. These changes are demonstrated in Table 5.234:

Table 5.2: Key attributes original (‘con_raster’) and projected (‘con_raster’) continuous raster datasets.
CRS nrow ncol ncell resolution mean
WGS84 457 465 212505 31.5275 1843
Equal-area 467 478 223226 0.0003 1842
Of course, the limitations of 2D Earth projections apply as much to vector as to raster data. At best we can comply with two out of three spatial properties (distance, area, direction). Therefore, the task at hand determines which projection to choose. For instance, if we are interested in a density (points per grid cell or inhabitants per grid cell) we should use an equal-area projection (see also chapter 13).

There is more to learn about CRSs. An excellent resource in this area, also implemented in R, is the website R Spatial. Chapter 6 for this free online book is recommended reading — see rspatial.org/spatial/rst/6-crs.html

## 5.3 Geometric operations on vector data

This section is about operations that in some way change the geometry of vector (sf) objects. It is more advanced than the spatial data operations presented in the previous chapter (in section 4.2) because here we drill down into the geometry: the functions discussed in this section work on objects of class sfc in addition to objects of class sf.

### 5.3.1 Simplification

Simplification is a process for generalization of vector objects (lines and polygons) usually for use in smaller scale maps. Another reason for simplifying objects is to reduce the amount of memory, disk space and network bandwidth they consume: it may be wise to simplify complex geometries before publishing them as interactive maps. The sf package provides st_simplify(), which uses the GEOS implementation of the Douglas-Peucker algorithm to reduce the vertex count. st_simplify() uses the dTolerance to control the level of generalization in map units (see Douglas and Peucker 1973 for details). Figure 5.6 illustrates simplification of a LINESTRING geometry representing the river Seine and tributaries. The simplified geometry was created by the following command:

seine_simp = st_simplify(seine, dTolerance = 2000)  # 2000 m

The resulting seine_simp object is a copy of the original seine but with fewer vertices. This is apparent, with the result being visually simpler (Figure 5.6, right) and consuming less memory than the original object, as verified below:

object.size(seine)
#> 17304 bytes
object.size(seine_simp)
#> 8320 bytes

Simplification is also applicable for polygons. This is illustrated using us_states, representing the contiguous United States. As we showed in section 5.2, GEOS assumes that the data is in a projected CRS and this could lead to unexpected results when using a geographic CRS. Therefore, the first step is to project the data into some adequate projected CRS, such as US National Atlas Equal Area (epsg = 2163) (on the left in Figure 5.7):

us_states2163 = st_transform(us_states, 2163)

st_simplify() works equally well with projected polygons:

us_states_simp1 = st_simplify(us_states2163, dTolerance = 100000)  # 100 km

A limitation with st_simplify() is that it simplifies objects on a per-geometry basis. This means the ‘topology’ is lost, resulting in overlapping and ‘holy’ areal units illustrated in Figure 5.7 (middle panel). ms_simplify() from rmapshaper provides an alternative that overcomes this issue. By default it uses the Visvalingam algorithm, which overcomes some limitations of the Douglas-Peucker algorithm (Visvalingam and Whyatt 1993). The following code chunk uses this function to simplify us_states2163. The result has only 1% of the vertices of the input (set using the argument keep) but its number of objects remains intact because we set keep_shapes = TRUE35:

# proportion of points to retain (0-1; default 0.05)
us_states2163$AREA = as.numeric(us_states2163$AREA)
us_states_simp2 = rmapshaper::ms_simplify(us_states2163, keep = 0.01,
keep_shapes = TRUE)

Finally, the visual comparison of the original dataset and the two simplified versions shows differences between the Douglas-Peucker (st_simplify) and Visvalingam (ms_simplify) algorithm outputs (Figure 5.7):

### 5.3.2 Centroids

Centroid operations identify the center of geographic objects. Like statistical measures of central tendency (including mean and median definitions of ‘average’), there are many ways to define the geographic center of an object. All of create single point representations of more complex vector objects.

The most commonly used centroid operation is the geographic centroid. This type of centroid operation (often referred to as ‘the centoid’) represents the centre of mass in a spatial object (think of balancing a plate on your finger). Geographic centroids have many uses, for example to create a simple point representation of complex geometries, or to estimate distances between polygons. They can be calculated with the sf function st_centroid() as demonstrated in the code below, which generates the geographic centroids of regions in New Zealand and tributaries to the River Seine, illustrated with black points in Figure 5.8.

nz_centroid = st_centroid(nz)
seine_centroid = st_centroid(seine)

Sometimes the geographic centroid falls outside the boundaries of their parent objects (think of a dohnut). In such cases point on surface operations can be used to guarantee the point will be in the parent object (e.g. for labelling irregular multipolygon objects such as island states), as illustrated by the red points in Figure 5.8. Notice that these red points always lie on their parent objects. They were created with st_point_on_surface() as follows:36

nz_pos = st_point_on_surface(nz)
seine_pos = st_point_on_surface(seine)

Other types of centroid exist, including the Chebyshev center and the visual center. We will not explore these here but it is possible to calculate them using R, as we’ll see Chapter 10.

### 5.3.3 Buffers

Buffers are polygons representing the area within a given distance of a geometric feature: regardless of whether the input is a point, line or polygon, the output is polygon. Unlike simplification (which is often used for visualization and reducing file size) buffering tends to be used for geographic data analysis. How many points are within a given distance of this line? Which demographic groups are within travel distance of this new shop? These kinds of questions can be answered and visualized by creating buffers around the geographic entities of interest.

Figure 5.9 illustrates buffers of different sizes (5 and 20 km) surrounding the river Seine and tributaries. These buffers were created with commands below, which show that the command st_buffer() requires at least two arguments: an input geometry and a distance, provided in the units of the CRS (in this case meters):

seine_buff_5km = st_buffer(seine, dist = 5000)
seine_buff_50km = st_buffer(seine, dist = 50000)
The third and final argument of st_buffer() is nQuadSegs, which means ‘number of segments per quadrant’ and is set by default to 30 (meaning circles created by buffers are composed of $$4 \times 30 = 120$$ lines). This argument rarely needs be set. Unusual cases where it may be useful include when the memory consumed by the output of a buffer operation is a major concern (in which case it should be reduced) or when very high precision is needed (in which case it should be increased).

### 5.3.4 Affine transformations

Affine transformation is any transformation that preserves lines and parallelism. However, angles or length are not necessarily preserved. Affine transformations include, among others, shifting (translation), scaling and rotation. Additionally, it is possible to use any combination of these. Affine transformations are an essential part of geocomputation, e.g. when reprojecting or when improving the geometry of a vector dataset that was created based on a distorted or wrongly projected map.

The sf package implements affine transformation for objects of classes sfg and sfc.

nz_sfc = st_geometry(nz)

Shifting moves every point by the same distance in map units. It could be done by adding a numerical vector to a vector object. For example, the code below shifts all y-coordinates by 100,000 meters to the north but leaves the x-coordinates untouched (left panel on the Fig. 5.10).

nz_shift = nz_sfc + c(0, 100000)

Scaling enlarges or shrinks objects by a factor. It can be applied either globally or locally. Global scaling increases or decreases all coordinates values in relation to the origin coordinates, while keeping all geometries topological relations intact. It can by done by subtraction or multiplication of asfg or sfc object.

Local scaling treats geometries independently and requires points around which geometries are going to be scaled, e.g. centroids. In the example below, each geometry is shrunk by a factor of two around the centroids (central panel on the Fig. 5.10).

nz_centroid_sfc = st_centroid(nz_sfc)
nz_scale = (nz_sfc - nz_centroid_sfc) * 0.5 + nz_centroid_sfc

Rotation of two-dimensional coordinates requires a rotation matrix:

$R = \begin{bmatrix} \cos \theta & -\sin \theta \\ \sin \theta & \cos \theta \\ \end{bmatrix}$

It rotates points in a counterclockwise direction. The rotation matrix could be implemented in R as:

rotation = function(a){
r = a * pi/180 #degrees to radians
matrix(c(cos(r), sin(r), -sin(r), cos(r)), nrow = 2, ncol = 2)
} 

The rotation function accepts one argument a - a rotation angle in degrees. Rotation could be done around selected points, such as centroids (right panel on the Fig. 5.10). See vignette("sf3") for more examples.

nz_rotate = (nz_sfc - nz_centroid_sfc) * rotation(30) + nz_centroid_sfc

Finally, the newly created geometries can replace the old ones with the st_set_geometry() function:

nz_scale_sf = st_set_geometry(nz, nz_scale)

### 5.3.5 Clipping

Spatial clipping is a form of spatial subsetting that involves changes to the geometry columns of at least some of the affected features.

Clipping can only apply to features more complex than points: lines, polygons and their ‘multi’ equivalents. To illustrate the concept we will start with a simple example: two overlapping circles with a center point one unit away from each other and a radius of one:

b = st_sfc(st_point(c(0, 1)), st_point(c(1, 1))) # create 2 points
b = st_buffer(b, dist = 1) # convert points to circles
l = c("x", "y")
plot(b)
text(x = c(-0.5, 1.5), y = 1, labels = l) # add text

Imagine you want to select not one circle or the other, but the space covered by both x and y. This can be done using the function st_intersection(), illustrated using objects named x and y which represent the left and right-hand circles:

x = b[1]
y = b[2]
x_and_y = st_intersection(x, y)
plot(b)
plot(x_and_y, col = "lightgrey", add = TRUE) # color intersecting area

The subsequent code chunk demonstrates how this works for all combinations of the ‘Venn’ diagram representing x and y, inspired by Figure 5.1 of the book R for Data Science (Grolemund and Wickham 2016).

To illustrate the relationship between subsetting and clipping spatial data, we will subset points that cover the bounding box of the circles x and y in Figure 5.12. Some points will be inside just one circle, some will be inside both and some will be inside neither. st_sample() is used below to generate a simple random distribution of points within the extent of circles x and y, resulting in output illustrated in Figure 5.13.

bb = st_bbox(st_union(x, y))
pmat = matrix(c(bb[c(1, 2, 3, 2, 3, 4, 1, 4, 1, 2)]), ncol = 2, byrow = TRUE)
box = st_polygon(list(pmat))
set.seed(2017)
p = st_sample(x = box, size = 10)
plot(box)
text(x = c(-0.5, 1.5), y = 1, labels = l)

The logical operator way would find the points inside both x and y using a spatial predicate such as st_intersects() whereas the intersection method simply finds the points inside the intersecting region created above as x_and_y. As demonstrated below the results are identical, but the method that uses the clipped polygon is more concise:

sel_p_xy = st_intersects(p, x, sparse = FALSE)[, 1] &
st_intersects(p, y, sparse = FALSE)[, 1]
p_xy1 = p[sel_p_xy]
p_xy2 = p[x_and_y]
identical(p_xy1, p_xy2)
#> [1] TRUE

### 5.3.6 Geometry unions

As we saw in section 3.2.2, spatial aggregation can silently dissolve the geometries of touching polygons in the same group. This is demonstrated in the code chunk below in which 49 us_states are aggregated into 4 regions using base and tidyverse functions (see results in Figure 5.14):

regions = aggregate(x = us_states[, "total_pop_15"], by = list(us_states$REGION), FUN = sum, na.rm = TRUE) regions2 = us_states %>% group_by(REGION) %>% summarize(pop = sum(total_pop_15, na.rm = TRUE)) What is going on in terms of the geometries? Behind the scenes, both aggregate() and summarize() combine the geometries and disolve the boundaries between them using st_union(). This is demonstrated in the code chunk below which creates a united western US: us_west = us_states[us_states$REGION == "West", ]
us_west_union = st_union(us_west)

The function can take two geometries and union them, as demonstrated in the code chunk below which creates a united western block incorporating Texas (challenge: reproduce and plot the result):

linestring_sf2length = st_length(linestring_sf2) linestring_sf2 #> Simple feature collection with 3 features and 2 fields #> geometry type: LINESTRING #> dimension: XY #> bbox: xmin: 1 ymin: 1 xmax: 4 ymax: 5 #> epsg (SRID): NA #> proj4string: NA #> geometry name length #> 1 LINESTRING (1 5, 4 3) Riddle Rd 3.61 #> 2 LINESTRING (4 4, 4 1) Marshall Ave 3.00 #> 3 LINESTRING (2 2, 4 2) Foulke St 2.00 ## 5.4 Geometric operations on raster data Geometric raster operations include the shift, flipping, mirroring, scaling, rotation or warping of images. These operations are e.g. necessary when geolocating a raster image. In turn, geolocating requires the rectification of the image, which usually includes some manual intervention one or several of the following steps depending on the task at hand (see also Liu and Mason 2009): • Georeferencing with ground control points. • Orthorectification also georeferences an image but additionally takes into account local topography. • Image (co-)registration is the process of aligning one image with another (in terms of CRS, origin and resolution). Registration becomes necessary for images from the same scene but shot from different sensors or from different angles or at different points in time. R is unsuitable for the first two points since these often require manual intervention which is why they are usually done with the help of a Desktop GIS (see also Chapter 9). On the other hand, aligning several images is possible in R and this section shows among other how to do so. This often includes changing the extent, the resolution and the origin of an image. A matching projection is of course also required but is already covered in section 5.2.5. In any case, there are other reasons why to perform a geometric operation on a single raster image. For instance, in chapter 13 we define metropolitan areas in Germany as 20 km2 pixels with more than 500,000 inhabitants. The original inhabitant raster, however, has a resolution of 1 km2 which is why we will decrease (aggregate) the resolution by a factor of 20 (see section 13.5). Another reason for aggregating a raster is simply to decrease runtime or save disk space. Of course, this is only possible if the task at hand allows a coarser resolution. Sometimes a coarser resolution is sufficient for the task at hand. ### 5.4.1 Geometric intersections In section 3.3.1 we have shown how to extract values from a raster overlaid by other spatial objects. To retrieve a spatial output, we can use almost the same subsetting syntax. The only difference is that we have to make clear that we would like to keep the matrix structure by setting the drop-parameter to FALSE. This will return a raster object containing the cells whose midpoints overlap with clip. data("elev", package = "spData") clip = raster(nrow = 3, ncol = 3, res = 0.3, xmn = 0.9, xmx = 1.8, ymn = -0.45, ymx = 0.45, vals = rep(1, 9)) elev[clip, drop = FALSE] #> class : RasterLayer #> dimensions : 2, 1, 2 (nrow, ncol, ncell) #> resolution : 0.5, 0.5 (x, y) #> extent : 1, 1.5, -0.5, 0.5 (xmin, xmax, ymin, ymax) #> coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0 #> data source : in memory #> names : layer #> values : 18, 24 (min, max) For the same operation we can also use the intersect() and crop() command. ### 5.4.2 Extent and origin When merging or performing map algebra on rasters, their resolution, projection, origin and/or extent has to match. Otherwise, how should we add the values of one raster with a resolution of 0.2 decimal degrees to a second with a resolution of 1 decimal degree? The same problem arises when we would like to merge satellite imagery from different sensors with different projections and resolutions. We can deal with such mismatches by aligning the rasters. In the simplest case, two images only differ in a mismatch in extent. Following code adds one row and two columns to each side of the raster while setting all new values to an elevation of 1000 meters (5.17). data(elev, package = "spData") elev_2 = extend(elev, c(1, 2), value = 1000) plot(elev_2) Performing an algebraic operation on two objects with differing extents in R, the raster package returns the result for the intersection, and says so in a warning. elev_3 = elev + elev_2 #> Warning in elev + elev_2: Raster objects have different extents. Result for #> their intersection is returned However, we can also align the extent of two rasters with extend(). Instead of telling the function how many rows or columns should be added (as done before), we allow it to figure it out by using another raster object. Here, we extend the elev object to the extent of elev_2. The newly added rows and column receive the default value of the value parameter, i.e. NA. elev_4 = extend(elev, elev_2) The origin is the point closest to (0, 0) if you moved towards it (starting from any given cell corner of the raster) in steps of x and y resolution. origin(elev_4) #> [1] 0 0 If two rasters have different origins, their cells do not overlap completely which would make map algebra impossible. To change the origin , use origin().37 Looking at figure 5.18 reveals the effect of changing the origin. # change the origin origin(elev_4) = c(0.25, 0.25) plot(elev_4) # and add the original raster plot(elev, add = TRUE) Note that changing the resolution frequently (next section) also changes the origin. ### 5.4.3 Aggregation and disaggregation Raster datasets can also differ with regard to their resolution. To match resolutions, one can either decrease (aggregate()) or increase (disaggregate()) the resolution of one raster.38 As an example, we here change the spatial resolution of dem (found in the RQGIS package) by a factor of 5 (Fig. 5.19). Additionally, the output cell value should correspond to the mean of the input cells (note that one could use other functions as well, such as median(), sum() etc.): data("dem", package = "RQGIS") dem_agg = aggregate(dem, fact = 5, fun = mean) par(mfrow = c(1, 2)) plot(dem) plot(dem_agg) By contrast, thedisaggregate() function increases the resolution. However, we have to specify a method how to fill the new cells. The disaggregate() function provides two methods. The first (nearest neighbor, method = "") simply gives all output cells the value of the nearest input cell, and hence duplicates values which leads to a blocky output image. The bilinear method, in turn, is an interpolation technique that uses the four nearest pixel centers of the input image (salmon colored points in Fig. 5.20) to compute an average weighted by distance (arrows in Fig. 5.20 as the value of the output cell - square in the upper left corner in Fig. 5.20). dem_disagg = disaggregate(dem_agg, fact = 5, method = "bilinear") identical(dem, dem_disagg) #> [1] FALSE Comparing the values of dem and dem_disagg tells us that they are not identical (you can also use compareRaster() or all.equal()). However, this was hardly to be expected, since disaggregating is a simple interpolation technique. It is important to keep in mind that disaggregating results in a finer resolution, the corresponding values, however, are only as accurate as their lower resolution source. The process of computing values for new pixel locations is also called resampling. In fact, the raster package provides a resample() function. It lets you align several raster properties in one go, namely origin, extent and resolution. By default, it uses the bilinear-interpolation. # add 2 rows and columns, i.e. change the extent dem_agg = extend(dem_agg, 2) dem_disagg_2 = resample(dem_agg, dem) Finally, in order to align many (possibly hundreds or thousands of) images stored on disk, you could use the gdalUtils::align_rasters() function. However, you may also use raster with very large datasets. This is because raster: 1. Lets you work with raster datasets that are too large to fit into the main memory (RAM) by only processing chunks of it. 2. Tries to facilitate parallel processing. For more information see the help pages of beginCluster() and clusteR(). Additionally, check out the Multi-core functions section in vignette("functions", package = "raster"). ## 5.5 Raster-vector interactions This section focuses on interactions between raster and vector geographic data models, introduced in Chapter 2. It includes four main techniques: raster cropping and masking using vector objects (section 5.5.1); extracting raster values using different types of vector data (section 5.5.2); and raster-vector conversion (sections 5.5.3 and 5.5.4). The above concepts are demonstrated using data used in previous chapters to understand their potential real-world applications. ### 5.5.1 Raster cropping Many geographic data projects involve integrating data from many different sources, such as remote sensing images (rasters) and administrative boundaries (vectors). Often the extent of input raster datasets is larger than the area of interest. In this case raster cropping and masking are useful for unifying the spatial extent of input data. Both operations reduce object memory use and associated computational resources for subsequent analysis steps, and may be a necessary preprocessing step before creating attractive maps involving raster data. We will use two objects to illustrate raster cropping: • A raster object srtm representing elevation (meters above sea level) in Southwestern Utah. • A vector (sf) object zion representing Zion National Park. Both target and cropping objects must have the same projection. The following code chunk therefore not only loads the datasets, from the spDataLarge package installed in Chapter 2. It also reprojects zion (see section 5.2 for more on reprojection): srtm = raster(system.file("raster/srtm.tif", package = "spDataLarge")) zion = read_sf(system.file("vector/zion.gpkg", package = "spDataLarge")) zion = st_transform(zion, projection(srtm)) We will use crop() from the raster package to crop the srtm raster. crop() reduces the rectangular extent of the object passed to its first argument based on the extent of the object passed to its second argument, as demonstrated in the command below (which generates Figure 5.21:B — note the smaller extent of the raster background): srtm_cropped = crop(srtm, as(zion, "Spatial")) Related to crop() is the raster function mask(), which sets values outside of the bounds a the object passed to its second argument to NA. The following command therefore masks every cell outside of the the Zion National Park boundaries (Figure 5.21:C): srtm_masked = mask(srtm, as(zion, "Spatial")) Changing the settings of mask() yields in different results. Setting maskvalue = 0, for example, would set all pixels outside the national park to 0. Setting inverse = TRUE would mask everything inside the bounds of the park (see ?mask for details). ### 5.5.2 Raster extraction Raster extraction is the process of identifying and returning the values associated with a ‘target’ raster at specific locations, based on a (typically vector) geographic ‘selector’ object. The results depend on the type of selector used (points, lines or polygons) and arguments passed to the raster::extract() function, which we use to demonstrate raster extraction. The reverse of raster extraction — assigning raster cell values based on vector objects — is rasterization, described in section 5.5.3. The simplest example is extracting the value of a raster cell at specific points. For this purpose we will use zion_points, which contain a sample of 30 locations with the Zion National Park (Figure 5.22). The following command extracts elevation values from srtm and assigns the resulting vector to a new column (elevation) in the zion_points dataset: zion_pointselevation = raster::extract(srtm, as(zion_points, "Spatial"))

The buffer argument can be used to specify a buffer radius (in meters) around each point. The result of raster::extract(srtm, zion_points, buffer = 1000), for example, is a list of vectors, each of which representing the values of cells inside the buffer associated with each point. In practice this example is a special case of extraction with a polygon selector, described below.

Raster extraction also works with line selectors. To demonstrate this, the code below creates zion_transect, a straight line going from northwest to southeast of the Zion National Park, illustrated in Figure 5.23:A (see section 2.1 for a recap on the vector data model):

zion_transect = cbind(c(-113.2, -112.9), c(37.45, 37.2)) %>%
st_linestring() %>%
st_sfc() %>%
st_sf()

The utility of extracting heights from a linear selector is illustrated by imagining that you are planning a hike. The method demonstrated below provides an ‘elevation profile’ of the route (the line does not need to be straight), useful for estimating how long it will take due to long climbs:

transect = raster::extract(srtm, as(zion_transect, "Spatial"), along = TRUE, cellnumbers = TRUE)

Note the use of along = TRUE and cellnumbers = TRUE arguments to return cell IDs along the path. The result is a list containing a matrix of cell IDs in the first column and elevation values in the second. The subsequent code chunk first converts this tricky matrix-in-a-list object into a simple data frame, returns the coordinates associated with each extracted cell and finds the associated distances along the transect (see ?geosphere::distm() for details):

transect_df = map_dfr(transect, as_data_frame, .id = "ID")
transect_coords = xyFromCell(srtm, transect_df$cell) transect_df$dist = c(0, cumsum(geosphere::distGeo(transect_coords)))    

The resulting transect_df can be used to create elevation profiles, as illustrated in Figure 5.23:B.

The final type of geographic vector object for raster extraction is polygons. Like lines and buffers, polygons tend to return many raster values per polygon. This is demonstrated in the command below, which results in a data frame with column names ID (the row number of the polygon) and srtm (associated elevation values):

zion_srtm_values = raster::extract(x = srtm, y = as(zion, "Spatial"), df = TRUE)

Such results can be used to generate summary statistics for raster values per polygon, for example to to characterize a single region or to compare many regions. The generation of summary statistics is demonstrated the code below, which creates the object zion_srtm_df containing summary statistics for elevation values in Zion National Park (see 5.24:A):

group_by(zion_srtm_values, ID) %>%
summarize_at(vars(srtm), funs(min, mean, max))
#> # A tibble: 1 x 4
#>      ID   min  mean   max
#>   <dbl> <dbl> <dbl> <dbl>
#> 1    1. 1122. 1818. 2661.

The preceding code chunk used the tidyverse to provide summary statistics for cell values per polygon ID, as described in Chapter 3. The results provide useful summaries, for example that the maximum height in the park is around 2,661 meters (other summary statistics such as standard deviation can also be calculated in this way). Because there is only one polygon in the example a data frame with a single row is returned, but the method works when multiple selector polygons are used.

The same approach works for counting occurrences of categorical raster values within polygons. This is illustrated with a land cover dataset (nlcd) from the spDataLarge package in 5.24:B and demonstrated in the code below:

zion_nlcd = raster::extract(nlcd, as(zion, "Spatial"), df = TRUE, factors = TRUE)
dplyr::select(zion_nlcd, ID, levels) %>%
gather(key, value, -ID) %>%
group_by(ID, key, value) %>%
tally() %>%
#> # A tibble: 1 x 9
#> # Groups:   ID, key [1]
#>      ID key    Barren Cultivated Developed  Forest Herbaceous Shrubland
#>   <dbl> <chr>   <dbl>      <dbl>     <dbl>   <dbl>      <dbl>     <dbl>
#> 1    1. levels 98285.        62.     4205. 298299.       235.   203701.
#> # ... with 1 more variable: Wetlands <dbl>

So far we have seen how raster::extract() is a flexible way of extracting raster cell values from a range of input geographic objects. An issue with the function, however, is that it is slow. If this is a problem it is useful to know about alternatives and work-arounds, three of which are presented below.

• Parallelization: this approach works when using many geographic vector selector objects by splitting them into groups and extracting cell values independently for each group (see ?raster::clusterR() for details of this approach).
• Use the velox package (Hunziker 2017), which provides a fast method for extracting raster data that fits in memory (see the packages extract vignette for details).
• Using R-GIS bridges (see Chapter 9): efficient calculation of raster statistics from polygons can be found in the SAGA function saga:gridstatisticsforpolygons, for example, which can be accessed via RQGIS.

### 5.5.3 Rasterization

Rasterization is the conversion of vector objects into their representation in raster objects. Usually, the output raster is used for quantitative analysis (e.g. analysis of terrain) or modeling. As we saw in Chapter 2 the raster data model has some characteristics that make it conducive to certain methods. Furthermore, the process of rasterization can help simplify datasets because the resulting values all have the same spatial resolution: rasterization can be seen as a special type of geographic data aggregation.

The raster package contains the function rasterize() for doing this work. Its first two arguments are x, vector object to be rasterized and y, a ‘template raster’ object defining the extent, resolution and CRS of the output. The geographic resolution of the input raster has a major impact on the results: if it is too low (cell size is too large) the result may miss the full geographic variability of the vector data; if it is too high computational times may be excessive. There are no simple rules to follow when deciding an appropriate geographic resolution, which is heavily dependent on the intended use of the results.

To demonstrate rasterization in action, we will use a template raster that has the same extent and CRS as the input vector data cycle_hire_osm_projected (a dataset on cycle hire points in London illustrated in Figure 5.25:A) and spatial resolution of 1000 meters:

cycle_hire_osm_projected = st_transform(cycle_hire_osm, 27700)
raster_template = raster(extent(cycle_hire_osm_projected), resolution = 1000,
crs = st_crs(cycle_hire_osm_projected)$proj4string) Rasterization is a very flexible operation: the results depend not only on the nature of the template raster, but also on the type of input vector (e.g. points, polygons) and a variety arguments taken by the rasterize() function. To illustrate this flexibility we will try three different approaches rasterization. First we create a raster representing the presence or absence of cycle hire points (known as presence/absence rasters). In this case rasterize() requires only one argument in addition to x and y (the aformentioned vector and raster objects): a value to be transferred to all non-empty cells specified by field (results illustrated Figure 5.25:B). ch_raster1 = rasterize(cycle_hire_osm_projected, raster_template, field = 1) The fun argument specifies summary statistics used to covert multiple observations in close proximity into associate cells in the raster object. By default fun = 'last is used but other options such as fun = "count" can be used, in this case to count the number of cycle hire points in each grid cell (the results of this operation are illustrated in Figure 5.25:C). ch_raster2 = rasterize(cycle_hire_osm_projected, raster_template, field = 1, fun = "count") The new output, ch_raster2, shows the number of cycle hire points in each grid cell. The cycle hire locations have different numbers of bicycles described by the capacity variable, raising the question, what’s the capacity in each grid cell? To calculate that we must sum the field ("capacity"), resulting in output illustrated in Figure 5.25:D, calculated with the following command (other summary functions such as mean could be used): ch_raster3 = rasterize(cycle_hire_osm_projected, raster_template, field = "capacity", fun = sum) Another dataset based on California’s polygons and borders (created below) illustrates raterization of lines. After casting the polygon objects into a multilinestring, a template raster is created, with a resolution of a 0.5 degree: california = dplyr::filter(us_states, NAME == "California") california_borders = st_cast(california, "MULTILINESTRING") raster_template2 = raster(extent(california), resolution = 0.5, crs = st_crs(california)$proj4string)

Line rasterization is demonstrated in the code below. In the resulting raster, all cells that are touched by a line get a value, as illustrated in Figure 5.26:A.

california_raster1 = rasterize(california_borders, raster_template2)

Polygon rasterization, by contrast, selects only cells whose centroids are inside the selector polygon, as illustrated in Figure 5.26:B.

california_raster2 = rasterize(california, raster_template2)

As with raster::extract(), raster::rasterize() works well for most cases but is not performance optimized. Fortunately, there are several alternatives, including the fasterize::fasterize() and gdalUtils::gdal_rasterize(). The former is much (100 times+) faster than rasterize() but is currently limited to polygon rasterization. The latter is part of GDAL and therefore requires a vector file (instead of an sf object) and rasterization parameters (instead of a Raster* template object) as inputs.39

### 5.5.4 Spatial vectorization

Spatial vectorization is the counterpart of rasterization 5.5.3, but in the opposite direction. It involves converting spatially continuous raster data into spatially discrete vector data such as points, lines or polygons.

Be careful with the wording! In R vectorization refers to the possibility of replacing for-loops and alike by doing things like 1:10 / 2 (see also Wickham (2014)).

The simplest form of vectorization is to convert the centroids of raster cells into points. rasterToPoints() does exactly this for all non-NA raster grid cells (Figure 5.27). Setting the spatial parameter to TRUE ensures the output is a spatial object, not a matrix.

elev_point = rasterToPoints(elev, spatial = TRUE) %>%
st_as_sf()

Another common type of spatial vectorization is the creation of contour lines representing lines of continuous height or temperatures (isotherms) for example. We will use a real-world digital elevation model (DEM) because the artificial raster elev produces parallel lines (task: verify this and explain why this happens). Contour lines can be created with the raster function rasterToContour(), which is itself a wrapper around contourLines(), as demonstrated below:

# not shown
data(dem, package = "RQGIS")
plot(dem, axes = FALSE)
contour(dem, add = TRUE)

Contours can be added to existing plots with functions such as contour(), rasterVis::contourplot() or tmap::tm_iso(). As illustrated in Figure 5.28 (which was created using the tmap package described in Chapter 8), isolines can be labelled.

The final type of vectorisation involves conversion of rasters to polygons. This can be done with raster::rasterToPolygons(), wich converts each raster cell into a polygon consisting of five coordinates, all of which are stored in memory (explaining why rasters are often fast compared with vectors!)

This is illustrated below by converting the grain object into polygons and subsequently dissolving borders between polygons with the same attribute values. Attributes in this case are stored in a collumn called layer (see section 5.3.6 and Figure 5.29). (Note: a convenient alternative for converting rasters into polygons is spex::polygonize() which by default returns an sf object.)

grain_poly = rasterToPolygons(grain) %>%
st_as_sf()
grain_poly2 = grain_poly %>%
group_by(layer) %>%
summarize()

## 5.6 Exercises

1. Create a new object called nz_wgs by transforming nz object into the WGS84 CRS.
• Create an object of class crs for both and use this to query their CRSs.
• With reference to the bounding box of each object, what units does each CRS use?
• Remove the CRS from nz_wgs and plot the result: what is wrong with this map of New Zealand and why?
2. Transform the world dataset to the transverse Mercator projection ("+proj=tmerc") and plot the result. What has changed and why? Try to transform it back into WGS 84 and plot the new object. Why does the new object differ from the original one?

3. Generate and plot simplified versions of the nz dataset. Experiment with different values of keep (ranging from 0.5 to 0.00005) for ms_simplify() and dTolerance (from 100 to 100,000) st_simplify() .
• At what value does the form of the result start to break-down for each method, making New Zealand unrecognizable?
• Advanced: What is different about the geometry type of the results from st_simplify() compared with the geometry type of ms_simplify()? What problems does this create and how can this be resolved?
4. In the first exercise in Chapter 4 it was established that Canterbury region had 61 of the 101 highest points in New Zealand. Using st_buffer(), how many points in nz_height are within 100 km of Canturbury?

5. Find the geographic centroid of New Zealand. How far is it from the geographic centroid of Canterbury?

6. Most world maps have a north-up orientation. A world map with a south-up orientation could be created by a reflection (one of the affine transformations not mentioned in 5.3.4) of the world object’s geometry. Write code to do so. Hint: you need to use a two-element vector for this transformation.
• Bonus: create a upside down map of your country.
7. Transform the continuous raster (cat_raster) into WGS 84 using the nearest neighbor interpolation method. What has changed? How does it influence the results?

8. Transform the categorical raster (cat_raster) into WGS 84 using the bilinear interpolation method. What has changed? How does it influence the results?

9. Subset the point in p that is contained within x and y (see section 5.3.5 and Figure 5.12).
• Using base subsetting operators.
• Using an intermediary object created with st_intersection().
10. Calculate the length of the boundary lines of US states in meters. Which state has the longest border and which has the shortest? Hint: The st_length function computes the length of a LINESTRING or MULTILINESTRING geometry.

11. Aggregate the raster counting high points in New Zealand (created in the previous exercise), reduce its geographic resolution by half (so cells are 6 by 6 km) and plot the result.
• Resample the lower resolution raster back to a resolution of 3 km. How have the results changed?

The next two exercises will use a vector (random_points) and raster dataset (ndvi) from the RQGIS package. It also uses a polygonal ‘convex hull’ derived from the vector dataset (ch) to represent the area of interest:

library(RQGIS)
data(random_points)
data(ndvi)
ch = st_combine(random_points) %>%
st_convex_hull()
1. Crop the ndvi raster using (1) the random_points dataset and (2) the ch dataset. Are there any difference in the output maps? Next, mask ndvi using these two datasets. Can you see any difference now? How can you explain that?

2. Firstly, extract values from ndvi at the points represented in random_points. Next, extract average values of ndvi using a 90 buffer around each point from random_points and compare these two sets of values. When would extracting values by buffers be more suitable than by points alone?

3. Subset points higher than 3100 meters in New Zealand (the nz_height object) and create a template raster with a resolution of 3km. Using these objects:
• Count numbers of the highest points in each grid cell.
• Find the maximum elevation in each grid cell.
4. Polygonize the grain dataset and filter all squares representing clay.
• Name two advantages and disadvantages of vector data over raster data.
• At which points would it be useful to convert rasters to vectors in your work?

### References

Maling, D. H. 1992. Coordinate Systems and Map Projections. 2nd ed. Oxford ; New York: Pergamon Press.

Bivand, Roger S., Edzer Pebesma, and Virgilio Gómez-Rubio. 2013. Applied Spatial Data Analysis with R. 2nd ed. 2013 edition. New York: Springer.

Jenny, Bernhard, Bojan Šavrič, Nicholas D Arnold, Brooke E Marston, and Charles A Preppernau. 2017. “A Guide to Selecting Map Projections for World and Hemisphere Maps.” In Choosing a Map Projection, edited by Miljenko Lapaine and Lynn Usery, 213–28. Springer.

Douglas, David H, and Thomas K Peucker. 1973. “Algorithms for the Reduction of the Number of Points Required to Represent a Digitized Line or Its Caricature.” Cartographica: The International Journal for Geographic Information and Geovisualization 10 (2): 112–22.

Visvalingam, M., and J. D. Whyatt. 1993. “Line Generalisation by Repeated Elimination of Points.” The Cartographic Journal 30 (1): 46–51. https://doi.org/10.1179/000870493786962263.

Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. 1 edition. O’Reilly Media.

Liu, Jian-Guo, and Philippa J. Mason. 2009. Essential Image Processing and GIS for Remote Sensing. Chichester, West Sussex, UK ; Hoboken, NJ: Wiley-Blackwell.

Hunziker, Philipp. 2017. Velox: Fast Raster Manipulation and Extraction.

1. The CRS can also be added when creating sf objects with the crs argument (e.g. st_sf(geometry = st_sfc(st_point(c(-0.1, 51.5))), crs = 4326)). The same argument can also be used to set the CRS when creating raster datasets (e.g. raster(crs = "+proj=longlat")).

2. For a short description of the most relevant projection parameters and related concepts, see the fourth lecture by Jochen Albrecht: geography.hunter.cuny.edu/~jochen/GTECH361/lectures/ as well as http://proj4.org/parameters.html. Another great resource on projection definitions is http://spatialreference.org/.

3. The difference in location between the two points is not due to imperfections in the transforming operation (which is in fact very accurate) but the low precision of the manually-created coordinates that created london and london_proj. Also surprising may be that the result is provided in a matrix with units of meters. This is because st_distance() can provide distances between many features and because the CRS has units of meters. Use as.numeric() to coerce the result into a regular number.

4. Many thanks to an anonymous reviewer whose comments formed the basis of this advice.

5. Credit: Josh O’Brien’s post on stackoverflow inspired this function.

6. See the Wikipedia page ‘List of map projections’ for 70+ projections, and illustrations.

7. This projection is used, among others, by the National Geographic Society.

8. Full list of NLCD2011 land cover classes can be found at https://www.mrlc.gov/nlcd11_leg.php

9. Another minor change, that is not represented in Table 5.2, is that the class of the values in the new projected raster dataset is numeric. This is because the bilinear method works with continuous data and the results are rarely coerced into whole integer values. This can have implications for file sizes when raster datasets are saved.

10. Simplification of multipolygon objects could remove small internal polygons, even if the keep_shapes argument is set to TRUE. To prevent this, you need to set explode = TRUE. This option converts all mutlipolygons into separate polygons before its simplification.

11. A description of how st_point_on_surface() works is provided at gis.stackexchange.com/q/76498.

12. If the origins of two raster datasets are just marginally apart, it sometimes is sufficient to simply increase the tolerance argument of raster::rasterOptions().

13. It is important to note that we here are solely referring to the spatial resolution but that in remote sensing the spectral (spectral bands), temporal (observations through time of the same area) and radiometric (color depth) resolution are equally important. Check out the stackApply() example in the documentation for getting an idea how to do temporal raster aggregation.

14. See more at http://www.gdal.org/gdal_rasterize.html.