8 Location analysis

Prerequisites

  • This chapter requires the following packages (ggmap must also be installed); a suggested setup chunk follows this list.
  • Required data will be downloaded in due course.
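
The exact package list below is an assumption reconstructed from the functions used later in this chapter; adjust it to your setup.

```r
library(sf)         # vector data classes and operations
library(raster)     # raster data classes and operations
library(tidyverse)  # data wrangling (dplyr, purrr, stringr, ...)
library(osmdata)    # downloading OpenStreetMap data
# ggmap (reverse geocoding) and classInt (class intervals) are called
# later via ::, so they only need to be installed
```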

8.1 Introduction

This chapter demonstrates how the skills learned in Part I can be applied to a particular domain: location analysis (also called geomarketing). This is a broad field of research and commercial application, the aim of which is usually to decide the optimal location for new services. A typical example is where to locate a new shop: the aim is to attract the most visitors and, ultimately, make the most profit. There are also many non-commercial applications that can use the technique for public benefit, for example deciding where to locate new health services (Tomintz, Clarke, and Rigby 2008).

People are fundamental to location analysis, in particular where they are likely to spend their time and other resources. Interestingly, ecological concepts and models are quite similar to those used for store location analysis. Animals and plants can best meet their needs in certain ‘optimal’ locations, based on variables that change over space (Muenchow et al. (2018); see also chapter ??). This is one of the great strengths of geocomputation and GIScience in general: concepts and methods are transferable to other fields. Polar bears, for example, prefer northern latitudes where temperatures are lower and food (seals and sea lions) is plentiful. Similarly, humans tend to congregate in certain places, creating economic niches (and high land prices) analogous to the ecological niche of the Arctic. The main task of location analysis is to find out where such ‘optimal locations’ are for specific services, based on available data. Typical research questions include:

  • Where do target groups live and which areas do they frequent?
  • Where are competing stores or services located?
  • How many people can easily reach specific stores?
  • Do existing services over- or under-exploit the market potential?
  • What is the market share of a company in a specific area?

This chapter demonstrates how geocomputation can answer such questions, using a hypothetical case study based on real data.

8.2 Case study: bike shops in Germany

Imagine you are starting a chain of bike shops in Germany. The stores should be placed in urban areas with as many potential customers as possible. Additionally, a survey[1] suggests that single young males (aged 20 to 40) are most likely to buy your products: this is the target audience. You are in the lucky position to have sufficient capital to open a number of shops. But where should they be placed? Consulting companies (employing location analysts) would happily charge high rates to answer such questions. Luckily, we can do so ourselves with the help of open data and open source software. The following sections will demonstrate how the techniques learned during the first chapters of the book can be applied to undertake the following steps:

  • Tidy the input data from the German census (section 8.3).
  • Convert the tabulated census data into raster objects (section 8.4).
  • Identify metropolitan areas with high population densities (section 8.5).
  • Download detailed geographic data (from OpenStreetMap, with osmdata) for these areas (section 8.6).
  • Create rasters for scoring the relative desirability of different locations using map algebra (section 8.7).

Although we have applied these steps to a specific case study, they could be generalized to many scenarios of store location or public service provision.

8.3 Tidy the input data

The German government provides gridded census data at either 1 km or 100 m resolution. The following code chunk downloads, unzips and reads in the 1 km data.
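
A sketch of such a chunk is shown below; the download URL and the file name pattern are placeholders, so check the German census website (www.zensus2011.de) for the current link to the classified 1 km grid data.

```r
# download the zipped 1 km census data (replace the placeholder URL
# with the actual link from www.zensus2011.de)
url = "https://www.zensus2011.de/.../csv_klassierte_Werte_1km-Gitter.zip"
download.file(url, destfile = "census.zip", mode = "wb")
unzip("census.zip")
# the csv is semicolon-separated with a comma as decimal mark,
# hence read_csv2(); the file name pattern is an assumption
census_de = readr::read_csv2(list.files(pattern = "Gitter.csv$"))
```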

The census_de object is a data frame containing 13 variables for more than 300,000 grid cells across Germany. For our work we only need a subset of these: Easting and Northing, the number of inhabitants (population), average age, proportion of women and average household size. These variables are selected and renamed in the code chunk below and summarized in Table 8.1. Further, mutate_all() is used to convert the values -1 and -9 (meaning unknown) to NA.
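
A possible implementation is sketched below; the German column names are assumptions that should be checked against names(census_de) and the data description in census.zip.

```r
input = dplyr::select(census_de,
                      x = x_mp_1km, y = y_mp_1km,  # cell-center coordinates
                      pop = Einwohner,             # population
                      women = Frauen_A,            # proportion of women
                      mean_age = Alter_D,          # average age
                      hh_size = HHGroesse_D)       # household size
# -1 and -9 encode unknown or anonymized values; convert them to NA
input_tidy = dplyr::mutate_all(input, ~ifelse(. %in% c(-1, -9), NA, .))
```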

Table 8.1: Excerpt from the data description ‘Datensatzbeschreibung_klassierte_Werte_1km-Gitter.xlsx’ located in the downloaded file census.zip describing the classes of the retained variables. The classes -1 and -9 refer to uninhabited areas or areas which have to be kept secret, for example due to the need to preserve anonymity.
Class | Population (number of people) | Women (%) | Mean age (years) | Household size (number of people)
  1   | 3-250                         | 0-40      | 0-40             | 1-2
  2   | 250-500                       | 40-47     | 40-42            | 2-2.5
  3   | 500-2000                      | 47-53     | 42-44            | 2.5-3
  4   | 2000-4000                     | 53-60     | 44-47            | 3-3.5
  5   | 4000-8000                     | >60       | >47              | >3.5
  6   | >8000                         |           |                  |

8.4 Create census rasters

After the preprocessing, the data can be converted into a raster stack or brick (see sections 2.2.3 and 3.3.1). rasterFromXYZ() makes this straightforward: it requires an input data frame where the first two columns represent coordinates on a regular grid. All the remaining columns (here: pop, women, mean_age, hh_size) will serve as input for the raster brick layers (Figure 8.1; see also code/08-location-jm.R).
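
For example, assuming the column order x, y, pop, women, mean_age, hh_size created above:

```r
# convert the tidy table into a raster brick; the first two columns
# are interpreted as cell-center coordinates of a regular grid
input_ras = rasterFromXYZ(input_tidy, crs = st_crs(3035)$proj4string)
```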

Note that we are using an equal-area projection (EPSG:3035; Lambert Equal Area Europe), i.e. a projected CRS in which each grid cell has the same area, here 1000 x 1000 meters (1 km2). Since we are mainly using densities, such as the number of inhabitants or the proportion of women per grid cell, it is of utmost importance that the area of each grid cell is the same, to avoid ‘comparing apples and oranges’. Be careful with geographic CRSs, where grid cell areas decrease towards the poles (see also sections 2.3 and 5.2).

Figure 8.1: Gridded German census data of 2011. See Table 8.1 for a description of the classes.

The next stage is to reclassify the values of the rasters stored in input_ras in accordance with the survey mentioned in section 8.2, using the raster function reclassify(), which was introduced in section 4.3.3. In the case of the population data we convert the classes into a numeric data type using class means. Raster cells are assumed to have a population of 127 if they have a value of 1 (cells in ‘class 1’ contain between 3 and 250 inhabitants) and 375 if they have a value of 2 (containing 250 to 500 inhabitants), and so on (see Table 8.1). A cell value of 8000 inhabitants was chosen for ‘class 6’ because these cells contain more than 8000 people. Of course, these are approximations of the true population, not precise values.[2] However, the level of detail is sufficient to delineate metropolitan areas (see next section).

In contrast to the pop variable, which represents absolute estimates of the total population, the remaining variables were reclassified with weights corresponding to those used in the survey. Class 1 in the variable women, for instance, represents areas in which 0 to 40% of the population is female; these are reclassified with a comparatively high weight of 3 because the target demographic is predominantly male. Similarly, the classes containing the youngest people and the highest proportion of single households are reclassified to have high weights.

We can loop with map2(), the purrr version of base R’s mapply(), which iterates over two vectors (here: lists) in parallel (for more information, please refer to Wickham (2014) and Grolemund and Wickham (2016)). Note that we have to transform the raster brick into a list for the loop to work. Finally, we convert the output list back into a raster stack.
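
The chunk below sketches this workflow. The class means for population follow from Table 8.1 (127, 375, 1250, 3000, 6000 and 8000 for the open-ended class); the weights for the remaining variables are illustrative assumptions in the spirit of the hypothetical survey (e.g. a weight of 3 for women class 1, as mentioned above).

```r
# reclassification matrices (columns: from, to, becomes)
rcl_pop   = matrix(c(1, 1, 127, 2, 2, 375, 3, 3, 1250,
                     4, 4, 3000, 5, 5, 6000, 6, 6, 8000),
                   ncol = 3, byrow = TRUE)
rcl_women = matrix(c(1, 1, 3, 2, 2, 2, 3, 3, 1, 4, 5, 0),
                   ncol = 3, byrow = TRUE)
rcl_age   = matrix(c(1, 1, 3, 2, 2, 0, 3, 5, 0),
                   ncol = 3, byrow = TRUE)
rcl_hh    = rcl_women  # assumption: same weighting scheme for household size
rcl = list(rcl_pop, rcl_women, rcl_age, rcl_hh)

# iterate in parallel over the raster layers and their matrices
reclass = purrr::map2(as.list(input_ras), rcl, function(x, y) {
  reclassify(x = x, rcl = y, right = NA)
}) %>%
  raster::stack()
names(reclass) = names(input_ras)
```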

8.5 Define metropolitan areas

We define metropolitan areas as pixels of 20 km x 20 km inhabited by more than 500,000 people. Pixels at this coarse resolution can rapidly be created using aggregate(), as introduced in section 5.4.2. The command below uses the argument fact = 20 to reduce the resolution of the result twenty-fold (recall the original raster resolution was 1 km):
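
A corresponding call might look like this (fun = sum is an assumption; we want the total number of inhabitants per aggregated cell):

```r
# aggregate the population layer from 1 km to 20 km resolution
pop_agg = aggregate(reclass$pop, fact = 20, fun = sum)
```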

The next stage is to keep only cells with more than half a million people and convert the resulting cells into a vector object of class sf.
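
One way to do this, assuming the pop_agg raster created above:

```r
# set cells with 500,000 inhabitants or fewer to NA, then polygonize
pop_agg = pop_agg[pop_agg > 500000, drop = FALSE]
polys = pop_agg %>%
  rasterToPolygons() %>%   # NA cells are omitted by default
  st_as_sf()
```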

Plotting these polygons reveals eight metropolitan regions (Fig. 8.2). Each region consists of one or more polygons (raster cells). It would be nice if we could join all polygons belonging to one region. One approach is to union the polygons (see section 5.3.5).

This returns one multipolygon feature with its elements corresponding to the metropolitan regions. To extract these polygons from the multipolygon, we can use st_cast().
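
For example:

```r
# union all cell polygons into one multipolygon, then split it into
# its constituent polygons (one per candidate metropolitan region)
metros = polys %>%
  st_union() %>%
  st_cast("POLYGON")
```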

However, visual inspection reveals eight metropolitan areas whereas the unioning-casting approach comes up with nine. This is because one polygon just touches the corner of another polygon (western Germany, Cologne/Düsseldorf area; Fig. 8.2).

One could assign it to the neighboring region using a dissolving procedure; however, we leave this as an exercise for the reader and simply delete the offending polygon.
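
Assuming the offending polygon is the fifth element (as stated in exercise 2), this amounts to:

```r
# drop the polygon that only touches the Cologne/Düsseldorf region
metros = metros[-5]
```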


Figure 8.2: The aggregated population raster (resolution: 20 km) with the identified metropolitan areas (golden polygons) and the corresponding names.

The defined metropolitan areas (Fig. 8.2; see also code/08-location-jm.R) suitable for bike shops are still missing names. A reverse geocoding approach can solve this problem: given a coordinate, reverse geocoding finds the corresponding address. Consequently, the centroid coordinates of each metropolitan area can serve as input for a reverse geocoding API. The ggmap package makes use of the API provided by Google.[3] ggmap::revgeocode() only accepts geographical coordinates (latitude/longitude); therefore, the first requirement is to bring the metropolitan polygons into an appropriate coordinate reference system (chapter 5).

Additionally, ggmap::revgeocode() only accepts one coordinate at a time, which is why we iterate over each coordinate of coords via a loop (map_dfr()). map_dfr() does exactly the same as lapply() except that it returns a data.frame instead of a list.[4] Sometimes, Google’s reverse geocoding API is unable to find an address and returns NA. Usually, trying the same coordinate again returns an address at the second or third attempt (see the while() loop). However, if three attempts have already failed, this is a good indication that the requested information is indeed unavailable. Since we aim to be good cyberspace citizens, we try not to overburden the server with too many queries within a short amount of time, by letting the loop sleep between one and four seconds after each iteration before accessing the reverse geocoding API again.
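
A sketch of this loop is given below. It assumes a registered Google API key (ggmap::register_google()) and that revgeocode() eventually succeeds within three attempts; the output = "more" option reflects the ggmap version used here and may differ in newer releases.

```r
# centroids of the metropolitan areas in lon/lat as geocoding input
metros_wgs = st_transform(metros, 4326)
coords = st_centroid(metros_wgs) %>%
  st_coordinates() %>%
  round(4)

metro_names = purrr::map_dfr(1:nrow(coords), function(i) {
  add = NA
  attempts = 0
  # retry up to three times if the API returns NA
  while (all(is.na(add)) && attempts < 3) {
    attempts = attempts + 1
    add = suppressWarnings(
      ggmap::revgeocode(as.numeric(coords[i, ]), output = "more")
    )
    # be a good citizen: pause 1-4 seconds between requests
    Sys.sleep(sample(seq(1, 4, 0.1), 1))
  }
  add
})
```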

Choosing more as revgeocode()’s output option will give back a data.frame with several columns referring to the location including the address, locality and various administrative levels. Overall, we are satisfied with the locality column serving as metropolitan names (München, Nürnberg, Stuttgart, Frankfurt, Hamburg, Berlin, Leipzig) apart from one exception, namely Velbert. Hence, we replace Velbert with the corresponding name in the administrative_area_level_2 column, that is Düsseldorf (Fig. 8.2). Umlauts like ü might lead to trouble further on, for example when determining the bounding box of a metropolitan area with opq() (see further below), which is why we replace them.

8.6 Points of interest

The osmdata package provides a fantastic and easy-to-use interface to download OSM data (see also section 6.2). Instead of downloading all shops for the whole of Germany, we restrict the download to the defined metropolitan areas. This reduces the load on the OSM servers, reduces download time and, above all, only returns the shop locations we are interested in. The map() loop, purrr’s equivalent of lapply(), runs through all eight metropolitan names, which subsequently define the bounding box in the opq() function (see section 6.2). Alternatively, we could have provided the bounding box in the form of coordinates ourselves. Next, we indicate that we would only like to download shop features (see this page for a full list of OpenStreetMap map features). osmdata_sf() returns a list with several spatial objects (points, lines, polygons, etc.). Here, we will only keep the point objects. As with Google’s reverse geocoding API, the OSM download will sometimes fail at the first attempt. The while loop increases the number of download trials to three. If no features can be downloaded after three attempts, it is likely that either there are none available or that another error has occurred beforehand (e.g. due to erroneous output from opq()).
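
A possible implementation of this download loop is sketched below; the pauses and the retry logic are illustrative choices.

```r
shops = purrr::map(metro_names, function(x) {
  message("Downloading shops of: ", x)
  # pause between requests to avoid overburdening the server
  Sys.sleep(sample(seq(5, 10, 0.1), 1))
  query = osmdata::opq(x) %>%
    osmdata::add_osm_feature(key = "shop")
  points = osmdata::osmdata_sf(query)
  # retry (up to three attempts in total) if nothing was returned
  iter = 1
  while ((is.null(points$osm_points) || nrow(points$osm_points) == 0) &&
         iter < 3) {
    points = osmdata::osmdata_sf(query)
    iter = iter + 1
  }
  # keep only the point objects
  points$osm_points
})
```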

It is highly unlikely that there are no shops in any of our defined metropolitan areas. The following if condition simply checks whether there is at least one shop for each region. If not, we try to download the shops again for the specific region(s).

To make sure that each list element (an sf data frame) comes with the same columns, we only keep the osm_id and the shop columns with the help of another map loop. This is not a given since OSM contributors are not equally meticulous when collecting data. Finally, we rbind all shops into one large sf object.

It would have been easier to simply use map_dfr(). Unfortunately, so far it does not work in harmony with sf objects.
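
Hence a conventional map()/rbind combination, for instance:

```r
# harmonize the columns of each regional download, then bind them
shops = purrr::map(shops, ~ dplyr::select(.x, osm_id, shop))
shops = do.call(rbind, shops)  # one large sf object
```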

The only thing left to do is to convert the spatial point object into a raster (see section 11.4). The sf object, shops, is converted into a raster with the same parameters (dimensions, resolution, CRS) as the reclass object. Importantly, the count() function is used here to calculate the number of shops in each cell.

If the shop column were used instead of the osm_id column, we would have retrieved fewer shops per grid cell. This is because the shop column contains NA values, which the count() function omits when rasterizing vector objects.

The result of the subsequent code chunk is therefore an estimate of shop density (shops/km2). st_transform() is used before rasterize() to ensure the CRS of both inputs match.
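
Assuming the combined shops object and the reclass raster from above:

```r
# match the CRS of the shops to the raster template
shops = st_transform(shops, proj4string(reclass))
# count the number of shops falling into each 1 km cell
poi = rasterize(x = shops, y = reclass, field = "osm_id", fun = "count")
```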

As with the other raster layers (population, women, mean age, household size), the poi raster is reclassified into four classes (see section 8.4). Defining class intervals is, to a certain degree, an arbitrary undertaking: one can use equal breaks, quantile breaks, fixed values or others. Here, we choose the Fisher-Jenks natural breaks approach, which minimizes within-class variance; the result provides an input for the reclassification matrix.
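
One way to derive and apply such breaks is sketched below (it requires the classInt package; the rounding and the open right-hand intervals are implementation choices):

```r
# Fisher-Jenks breaks on the non-missing cell values
brks = classInt::classIntervals(na.omit(values(poi)),
                                n = 4, style = "fisher")$brks
brks = round(brks)
# reclassification matrix: [from, to) -> classes 0 to 3
rcl_poi = cbind(brks[-length(brks)], brks[-1], 0:3)
poi = reclassify(poi, rcl = rcl_poi, right = FALSE, include.lowest = TRUE)
names(poi) = "poi"
```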

8.7 Identifying suitable locations

The only steps that remain before combining all the layers are to add the poi raster and remove the population layer from the raster stack. The reasoning for the latter is twofold. First of all, we have already delineated metropolitan areas, that is areas where the population density is above average compared to the rest of Germany. Secondly, though it is advantageous to have many potential customers within a specific catchment area, the sheer number alone might not actually represent the desired target group. For instance, residential tower blocks are areas with a high population density but not necessarily with a high purchasing power for expensive cycle components. This is achieved with the complementary functions addLayer() and dropLayer():
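
For instance:

```r
# add the poi layer and remove the population layer
reclass = addLayer(reclass, poi)
reclass = dropLayer(reclass, "pop")
```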

In common with other data science projects, data retrieval and ‘tidying’ have consumed much of the overall workload so far. With clean data, the final step of calculating a final score by summing all raster layers can be accomplished in a single line.
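
For example, assuming the reclass stack assembled above:

```r
# sum all layers cell-wise to obtain the final score
result = sum(reclass)
```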

For instance, a score greater than 9 might be a suitable threshold indicating raster cells where a bike shop could be placed (Figure 8.3; see also code/08-location-jm.R).

Figure 8.3: Suitable areas (i.e. raster cells with a score > 9) in accordance with our hypothetical survey for bike stores in Berlin.

8.8 Discussion and next steps

The presented approach is a typical example of the normative usage of a GIS (Longley et al. 2015). We combined survey data with expert-based knowledge and assumptions (definition of metropolitan areas, definition of class intervals, definition of a final score threshold). This approach is not suitable for advancing scientific knowledge, but it is a very applied way of extracting information. That is to say, we can only suspect, based on common sense, that we have identified areas suitable for bike shops; we have no proof that this is in fact the case.

A few other things remained unconsidered but might improve the analysis:

  • We used equal weights when calculating the final scores. But is, for example, household size as important as the proportion of women or the mean age?
  • We used all points of interest. Maybe it would be wiser to use only those which might be interesting for bike shops such as do-it-yourself, hardware, bicycle, fishing, hunting, motorcycles, outdoor and sports shops (see the range of shop values available on the OSM Wiki).
  • Data at a better resolution may change and improve the output. For example, there is also population data at a finer resolution (100 m; see exercises).
  • We have used only a limited set of variables. For example, the INSPIRE geoportal might contain much more data of possible interest to our analysis (see also section 6.2). The density of bike paths might be another interesting variable, as might the purchasing power or, even better, the retail purchasing power for bikes.
  • Interactions remained unconsidered, such as a possible interaction between the proportion of men and single households. However, to find out about such an interaction we would need customer data.

In short, the presented analysis is far from perfect. Nevertheless, it should have given you a first impression and understanding of how to obtain and deal with spatial data in R within a location analysis context.

Finally, we have to point out that the presented analysis would be merely the first step of finding suitable locations. So far we have identified areas, 1 by 1 km in size, potentially suitable for a bike shop in accordance with our survey. We could continue the analysis as follows:

  • Find an optimal location based on the number of inhabitants within a specific catchment area. For example, the shop should be reachable by as many people as possible within a 15-minute bike ride (catchment area routing). In doing so, we should account for the fact that the further people are from the shop, the less likely they are to actually visit it (distance decay function).
  • It would also be a good idea to take competitors into account. That is, if there is already a bike shop in the vicinity of the chosen location, possible customers (or sales potential) have to be distributed between the competitors (Huff 1963; Wieland 2017).
  • We need to find suitable and affordable real estate (accessible, parking spots, frequency of passers-by, big windows, etc.).

8.9 Exercises

  1. We have used raster::rasterFromXYZ() to convert input_tidy into a raster brick. Try to achieve the same with the help of the sp::gridded() function.

  2. In the text we have deleted one polygon of the metros object (polygon number 5) since it only touches the border of another polygon. Recreate the metros object and instead of deleting polygon number 5, make it part of the Cologne/Düsseldorf metropolitan region (hint: create a column named region_id, add polygon number 5 to the Cologne/Düsseldorf area and dissolve).

  3. Download the csv file containing inhabitant information for a 100 m cell resolution (https://www.zensus2011.de/SharedDocs/Downloads/DE/Pressemitteilung/DemografischeGrunddaten/csv_Bevoelkerung_100m_Gitter.zip?__blob=publicationFile&v=3). Please note that the unzipped file has a size of 1.23 GB. To read it into R, you can use readr::read_csv; this takes about 30 seconds on my machine (16 GB RAM). data.table::fread() might be even faster, and returns an object of class data.table. Use as.tibble() to convert it into a tibble. Build an inhabitant raster, aggregate it to a cell resolution of 1 km, and compare the difference with the inhabitant raster (inh) we have created using class mean values.

  4. Suppose our bike shop predominantly sold electric bikes to older people. Change the age raster accordingly, repeat the remaining analyses and compare the changes with our original result.

References

Tomintz, Melanie N., Graham P. Clarke, and Janette E. Rigby. 2008. “The Geography of Smoking in Leeds: Estimating Individual Smoking Rates and the Implications for the Location of Stop Smoking Services.” Area 40 (3): 341–53.

Muenchow, Jannes, Petra Dieker, Jürgen Kluge, Michael Kessler, and Henrik von Wehrden. 2018. “A Review of Ecological Gradient Research in the Tropics: Identifying Research Gaps, Future Directions, and Conservation Priorities.” Biodiversity and Conservation 27 (2): 273–85. https://doi.org/10.1007/s10531-017-1465-y.

Wickham, Hadley. 2014. Advanced R. CRC Press.

Grolemund, Garrett, and Hadley Wickham. 2016. R for Data Science. 1st edition. O’Reilly Media.

Longley, Paul, Michael Goodchild, David Maguire, and David Rhind. 2015. Geographic Information Science & Systems. Fourth edition. Hoboken, NJ: Wiley.

Huff, David L. 1963. “A Probabilistic Analysis of Shopping Center Trade Areas.” Land Economics 39 (1): 81–90. https://doi.org/10.2307/3144521.

Wieland, Thomas. 2017. “Market Area Analysis for Retail and Service Locations with MCI.” The R Journal 9 (1): 298–323.


  1. This is a hypothetical survey, i.e. it never took place.

  2. The potential error introduced during this reclassification stage will be explored in the exercises.

  3. Note that Google allows each user to access its services on a free basis for a maximum of 2500 queries a day.

  4. To learn more about the split-apply-combine strategy for data analysis, we refer the reader to Wickham (2011).