Chapter 4 Geoprocessing in R

For this part, we will use mostly {sf} capabilities. Most spatial functions (if not all) start with the st_ (for spatial type) prefix like in PostGIS.

We will focus on vector data from the {spData} package.

library(dplyr)
library(sf)
library(spData)
library(here)

4.1 Read / write

{sf} provides the st_read and st_write functions to access geospatial files. They can operate with any vector driver provided by GDAL.

We will use the data from the {spData}¹ package. Let’s see what it contains :

list.files(system.file("shapes", package = "spData"))

 [1] "auckland.dbf"           "auckland.shp"          
 [3] "auckland.shx"           "baltim.dbf"            
 [5] "baltim.shp"             "baltim.shx"            
 [7] "boston_tracts.dbf"      "boston_tracts.prj"     
 [9] "boston_tracts.shp"      "boston_tracts.shx"     
[11] "columbus.dbf"           "columbus.shp"          
[13] "columbus.shx"           "cycle_hire.geojson"    
[15] "cycle_hire_osm.geojson" "eire.dbf"              
[17] "eire.shp"               "eire.shx"              
[19] "NY8_bna_utm18.gpkg"     "NY8_utm18.dbf"         
[21] "NY8_utm18.prj"          "NY8_utm18.shp"         
[23] "NY8_utm18.shx"          "sids.dbf"              
[25] "sids.shp"               "sids.shx"              
[27] "wheat.dbf"              "wheat.shp"             
[29] "wheat.shx"              "world.dbf"             
[31] "world.gpkg"             "world.prj"             
[33] "world.shp"              "world.shx"

We will work on cycle hires in London so let’s get that data.

4.1.1 Cycle hire dataset

cycle_hire <- st_read(system.file("shapes/cycle_hire.geojson", package="spData"))

Reading layer `cycle_hire' from data source `D:\Roelandt\Documents\R\win-library\3.5\spData\shapes\cycle_hire.geojson' using driver `GeoJSON'
Simple feature collection with 742 features and 5 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: -0.2367699 ymin: 51.45475 xmax: -0.002275 ymax: 51.54214
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs

Here we can see a couple of functions :

st_read() is the reading function from {sf}
system.file() is a function that allow us to look for data in packages, independently of the operating system.

By default, {sf} provides informations when loading the dataset. We can see it contains 742 features and 5 fields. It is points with lat/lon coordinates (we can see it through dataset SRID : 4326). {sf} also provides the bounding box and if possible the proj4string.

Here is the description of the dataset we can find in the documentation:

cycle_hire dataset

_Description: Points representing cycle hire points accross London.

_Format:

id Id of the hire point
name Name of the point
area Area they are in
nbikes The number of bikes currently parked there
nempty The number of empty places
geometry sfc_POINT

_Source: cyclehireapp.com/cyclehirelive/cyclehire.csv

We can see that there is the bike parked, the count of empty slots but not the total amount of bike slots. Let’s create a new slots column for this.

cycle_hire <- cycle_hire %>% 
  mutate(slots = nbikes +  nempty)

Let’s load polygon data, in this case London’s boroughs stored in the lnd dataset.

4.1.2 Boroughs of London

This dataset is contained in a R data format so the loading is different.

data(lnd) # load the dataset in memory
lnd       # call the dataset to visualise the 10 first features

Simple feature collection with 33 features and 7 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: -0.5103751 ymin: 51.28676 xmax: 0.3340155 ymax: 51.69187
epsg (SRID):    4326
proj4string:    +proj=longlat +datum=WGS84 +no_defs
First 10 features:
                   NAME  GSS_CODE  HECTARES NONLD_AREA ONS_INNER SUB_2009
1  Kingston upon Thames E09000021  3726.117      0.000         F     <NA>
2               Croydon E09000008  8649.441      0.000         F     <NA>
3               Bromley E09000006 15013.487      0.000         F     <NA>
4              Hounslow E09000018  5658.541     60.755         F     <NA>
5                Ealing E09000009  5554.428      0.000         F     <NA>
6              Havering E09000016 11445.735    210.763         F     <NA>
7            Hillingdon E09000017 11570.063      0.000         F     <NA>
8                Harrow E09000015  5046.330      0.000         F     <NA>
9                 Brent E09000005  4323.270      0.000         F     <NA>
10               Barnet E09000003  8674.837      0.000         F     <NA>
   SUB_2006                       geometry
1      <NA> MULTIPOLYGON (((-0.3306791 ...
2      <NA> MULTIPOLYGON (((-0.06402124...
3      <NA> MULTIPOLYGON (((0.01213094 ...
4      <NA> MULTIPOLYGON (((-0.2445624 ...
5      <NA> MULTIPOLYGON (((-0.4118327 ...
6      <NA> MULTIPOLYGON (((0.1586928 5...
7      <NA> MULTIPOLYGON (((-0.404072 5...
8      <NA> MULTIPOLYGON (((-0.404072 5...
9      <NA> MULTIPOLYGON (((-0.1965687 ...
10     <NA> MULTIPOLYGON (((-0.1998964 ...

We can see this dataset contains 33 features and 7 fields, in lat/lon coordinates too.

ldn dataset

The boroughs of London

Description : Polygons representing large administrative zones in London

Format:

NAME Borough name
GSS_CODE Official code
HECTARES How many hectares
NONLD_AREA Area outside London
ONS_INNER Office for national statistics code
SUB_2009 Empty column
SUB_2006 Empty column
geometry sfc_MULTIPOLYGON

Source : https://github.com/Robinlovelace/Creating-maps-in-R

In order to ease spatial calculations, let’s reproject them.

4.2 Reprojection

The Ordnance Survey National Grid is the official one for Great Britain. Its SRID is EPSG:27700.

cycle_hire_27700 <- cycle_hire %>%
  st_transform(crs = st_crs(27700))

london_27700 <- lnd %>%
  st_transform(crs = st_crs(27700))

TO reproject, we used 2 functions: - st_transform() for the reprojection - st_crs() to get CRS definition from EPSG code

We also use the pipe operator : %>%, it is useful to pipe data to another function. This is provided by the {magrittr} package through {dplyr} and {sf}.

Now, we can create a quick map to visualize our data. We can use the plot() function to do this. This function is part of base R.

plot(london_27700$geometry) # we just want to plot the geometry column
plot(cycle_hire_27700$geometry, 
 col = "red",  # color
 cex = 0.5,    # size of symbol
 add = TRUE)   # important parameter to create multilayer plots

4.3 Joins

We can use two ways to link those datasets together, by attributes, they share their area name (area and NAME) or spatially. For the sake of the exercice, let’s do both.

4.3.1 Join by attributes

Let’s join them with a inner join to see how many have corresponding

cycle_hire_27700 %>% inner_join(
  london_27700 %>%
    st_drop_geometry(), # we don't need the geometry here
  by = c( "area" = "NAME")
  
)

Simple feature collection with 33 features and 12 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 523978 ymin: 174408 xmax: 537912.7 ymax: 182972.5
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
First 10 features:
    id                      name        area nbikes nempty slots  GSS_CODE
1   80             Webber Street   Southwark     10     14    24 E09000028
2  108      Abbey Orchard Street Westminster      1     27    28 E09000033
3  118             Rochester Row Westminster     12      1    13 E09000033
4  221           Horseferry Road Westminster     12      3    15 E09000033
5  240            Colombo Street   Southwark      9      5    14 E09000028
6  259 Embankment (Horse Guards) Westminster      0     29    29 E09000033
7  267            Regency Street Westminster      7     12    19 E09000033
8  281              Smith Square Westminster      4     12    16 E09000033
9  299            Vincent Square Westminster      6     11    17 E09000033
10 302               Putney Pier  Wandsworth     18     10    28 E09000032
   HECTARES NONLD_AREA ONS_INNER SUB_2009 SUB_2006
1  2991.340    105.139         T     <NA>     <NA>
2  2203.005     54.308         T     <NA>     <NA>
3  2203.005     54.308         T     <NA>     <NA>
4  2203.005     54.308         T     <NA>     <NA>
5  2991.340    105.139         T     <NA>     <NA>
6  2203.005     54.308         T     <NA>     <NA>
7  2203.005     54.308         T     <NA>     <NA>
8  2203.005     54.308         T     <NA>     <NA>
9  2203.005     54.308         T     <NA>     <NA>
10 3522.022     95.600         T     <NA>     <NA>
                    geometry
1  POINT (531832.1 179680.3)
2  POINT (529756.5 179341.1)
3  POINT (529528.7 179079.5)
4      POINT (529880 178976)
5  POINT (531568.5 180203.8)
6  POINT (530351.7 180115.2)
7  POINT (529765.2 178666.4)
8  POINT (530077.3 179091.2)
9  POINT (529433.2 178872.2)
10   POINT (523978 175723.2)

We can see that only 33 features matched. That’s poor, let’s try this spatially.

4.3.2 Spatial join

For this, we will try to provide a GSS_CODE for all cycle hire points. We will regroup the data afterwards.

For this, we will select only the GSS_CODE column from london_27700 with the select function from {dplyr}, the geometry will follow.

cycle_hire_27700 <- cycle_hire_27700 %>% st_join(london_27700 %>% select(GSS_CODE))

Now if we look at our dataset, there is a GSS_CODE column.

names(cycle_hire_27700)

[1] "id"       "name"     "area"     "nbikes"   "nempty"   "slots"   
[7] "GSS_CODE" "geometry"

How many points doesn’t have a GSS_code ?

cycle_hire_27700 %>% filter(is.na(GSS_CODE))

Simple feature collection with 1 feature and 7 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 534188 ymin: 180210.4 xmax: 534188 ymax: 180210.4
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
   id                name    area nbikes nempty slots GSS_CODE
1 134 Wapping High Street Wapping     20      0    20     <NA>
                 geometry
1 POINT (534188 180210.4)

Only one, that’s more better than before ! I don’t know well enough London to fix this. But that is not blocking.

Now to paraphrase Anita Graser : (“Aggregate all the things!” 2017)

4.4 Aggregation

4.4.1 Count

cycle_hire_by_area <- cycle_hire_27700 %>%
  filter(!is.na(GSS_CODE)) %>% # remove NAs
  st_drop_geometry() %>% # let's put geometry aside
  group_by(GSS_CODE) %>%  # group data by GSS_CODE
  tally(name = "count", sort= TRUE) # Aggregate
cycle_hire_by_area

# A tibble: 11 x 2
   GSS_CODE  count
   <fct>     <int>
 1 E09000033   171
 2 E09000030   117
 3 E09000020    90
 4 E09000032    59
 5 E09000013    58
 6 E09000007    57
 7 E09000022    46
 8 E09000028    40
 9 E09000019    37
10 E09000001    35
11 E09000012    31

4.4.2 Sum

cycle_hire_by_area_sum <- cycle_hire_27700 %>%
  filter(!is.na(GSS_CODE)) %>% # remove NAs
  st_drop_geometry() %>% # let's put geometry aside
  group_by(GSS_CODE) %>%  # group data by GSS_CODE
  summarise(sum = sum(nbikes), count = n()) # Aggregate
cycle_hire_by_area_sum

# A tibble: 11 x 3
   GSS_CODE    sum count
   <fct>     <int> <int>
 1 E09000001    46    35
 2 E09000007   518    57
 3 E09000012   606    31
 4 E09000013   754    58
 5 E09000019   398    37
 6 E09000020   872    90
 7 E09000022   871    46
 8 E09000028   756    40
 9 E09000030  1795   117
10 E09000032  1083    59
11 E09000033  1336   171

We could have use the base function aggregate() which works with {sf} objects.

aggregate(cycle_hire_27700["nbikes"], by = list(cycle_hire_27700$"GSS_CODE"),
                       FUN = sum, na.rm = TRUE)

Simple feature collection with 11 features and 2 fields
Attribute-geometry relationship: 0 constant, 1 aggregate, 1 identity
geometry type:  MULTIPOINT
dimension:      XY
bbox:           xmin: 522502 ymin: 174408 xmax: 538733.2 ymax: 184421
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
First 10 features:
     Group.1 nbikes                       geometry
1  E09000001     46 MULTIPOINT (531116.1 181358...
2  E09000007    518 MULTIPOINT (528392.6 183623...
3  E09000012    606 MULTIPOINT (532312.6 183064...
4  E09000013    754 MULTIPOINT (522502 178727, ...
5  E09000019    398 MULTIPOINT (530339.9 183400...
6  E09000020    872 MULTIPOINT (523651 180842, ...
7  E09000022    871 MULTIPOINT (528837 176040, ...
8  E09000028    756 MULTIPOINT (531235.8 179111...
9  E09000030   1795 MULTIPOINT (533364.4 181747...
10 E09000032   1083 MULTIPOINT (523978 175723.2...

If we want to represents our data with proportionnal symbols, we might want to have centroids. {sf} provides two functions in order to do that:

st_centroid()
st_point_on_surface()

st_point_on_surface() provides a point randomly in the entry shape. That can be useful for irregular shapes where the centroid might be outside the shape.

4.5 Centroids

boroughs_centroids <- london_27700 %>%
  select(NAME, GSS_CODE) %>% # only keep useful columns
  st_centroid()

You can also do buffers and other geometrical operations like st_union() to merge geometries

Spatial equivalents of logical operators (Lovelace, Nowosad, and Muenchow 2019)

4.6 Geometric binary predicates

{sf} provides numerous geometric binary predicates that can be used with the intersection function.

st_intersects()
st_disjoint()
st_touches()
st_crosses()
st_within()
st_contains()
st_contains_properly()
st_overlaps()
st_equals()
st_covers()
st_covered_by()
st_equals_exact()
st_is_within_distance()

You can use it alone or with ̀st_join()`.

For example, if we want to the cycle hires contained in the borough of Wandsworth, we will do like this.

london_27700 %>% 
  filter(NAME == "Wandsworth") %>% 
  st_contains(cycle_hire_27700)

Sparse geometry binary predicate list of length 1, where the predicate was `contains'
 1: 293, 576, 581, 587, 588, 590, 592, 594, 596, 598, ...

That will return a list of cycle hire points id.

In contrary, if we want to find in which borough the hire point with id 614 is we need to do this :

cycle_hire_27700 %>% filter(id == "614") %>% 
  st_within(london_27700) # borough at index 22

Sparse geometry binary predicate list of length 1, where the predicate was `within'
 1: 22

To get the borough data, there is some more work to do.

london_27700[unlist(cycle_hire_27700 %>% filter(id == "614") %>% st_within(london_27700)),]

Simple feature collection with 1 feature and 7 fields
geometry type:  MULTIPOLYGON
dimension:      XY
bbox:           xmin: 521054.9 ymin: 170381.5 xmax: 530193.7 ymax: 177893.4
epsg (SRID):    27700
proj4string:    +proj=tmerc +lat_0=49 +lon_0=-2 +k=0.9996012717 +x_0=400000 +y_0=-100000 +ellps=airy +towgs84=446.448,-125.157,542.06,0.15,0.247,0.842,-20.489 +units=m +no_defs
         NAME  GSS_CODE HECTARES NONLD_AREA ONS_INNER SUB_2009 SUB_2006
22 Wandsworth E09000032 3522.022       95.6         T     <NA>     <NA>
                         geometry
22 MULTIPOLYGON (((523489.6 17...

4.7 Saving results

In the first part, we saw that we can read data but we can also write it !

4.7.1 Writing data

To write data, we will use the st_write() function. It takes the data source name (dsn) as mandatory argument, {sf} will try to find the good driver from the extension (here it is gpkg for GeoPackage).

st_write() can’t save non geospatial data. So we need to join the data from cycle_hire_by_area_sum to the boroughs first.

As we want to save it to GeoPackage², we also need to provide a layer name : london_boroughs_27700. Repeat for all data you want to save.

london_27700 %>% left_join(cycle_hire_by_area_sum) %>%
  st_write(
    dsn = here("foss4g_R_workshop.gpkg"),
    layer = "london_boroughs_27700", 
    layer_options = "OVERWRITE=true")

Updating layer `london_boroughs_27700' to data source `D:/Roelandt/PERSONNEL/FOSS4G2019_Geoprocessing_with_R_workshop/foss4g_R_workshop.gpkg' using driver `GPKG'
options:        OVERWRITE=true 
features:       33
fields:         9
geometry type:  Multi Polygon

boroughs_centroids %>%
  left_join(cycle_hire_by_area_sum) %>%
  st_write(
    dsn = here("foss4g_R_workshop.gpkg"),
    layer = "boroughs_centroids_27700", 
    layer_options = "OVERWRITE=true")

Updating layer `boroughs_centroids_27700' to data source `D:/Roelandt/PERSONNEL/FOSS4G2019_Geoprocessing_with_R_workshop/foss4g_R_workshop.gpkg' using driver `GPKG'
options:        OVERWRITE=true 
features:       33
fields:         4
geometry type:  Point

cycle_hire_27700 %>%
  left_join(cycle_hire_by_area_sum) %>%
  st_write(
    dsn = here("foss4g_R_workshop.gpkg"),
    layer = "cycle_hire_27700",
    layer_options = "OVERWRITE=true")

Updating layer `cycle_hire_27700' to data source `D:/Roelandt/PERSONNEL/FOSS4G2019_Geoprocessing_with_R_workshop/foss4g_R_workshop.gpkg' using driver `GPKG'
options:        OVERWRITE=true 
features:       742
fields:         9
geometry type:  Point

We used the here() function as it preserve the project file hierarchy. It works better in Rstudio but it is still useful with Jupyter notebooks.

The data set where joined by their GSS_CODE. You can specify the “by” statement, but for the sake of readability, it is not show here.

The layer_options = "OVERWRITE=true" ensure you can write on existing layer, it is optionnal.

print(here())  # print the project directory

[1] "D:/Roelandt/PERSONNEL/FOSS4G2019_Geoprocessing_with_R_workshop"

list.files(here()) # list the files in the project directory

 [1] "_bookdown.yml"                                 
 [2] "_bookdown_files"                               
 [3] "_output.yml"                                   
 [4] "01-intro.Rmd"                                  
 [5] "02-R_basics.Rmd"                               
 [6] "03-geoprocessing.Rmd"                          
 [7] "04-maps.Rmd"                                   
 [8] "05-summary.Rmd"                                
 [9] "06-references.Rmd"                             
[10] "book.bib"                                      
[11] "docs"                                          
[12] "foss4g_R_workshop.gpkg"                        
[13] "FOSS4G2019_Geoprocessing_with_R_workshop.Rmd"  
[14] "FOSS4G2019_Geoprocessing_with_R_workshop.Rproj"
[15] "FOSS4G2019_Geoprocessing_with_R_workshop_cache"
[16] "FOSS4G2019_Geoprocessing_with_R_workshop_files"
[17] "images"                                        
[18] "index.Rmd"                                     
[19] "LICENCE.md"                                    
[20] "packages.bib"                                  
[21] "preamble.tex"                                  
[22] "README.md"                                     
[23] "style.css"

4.7.2 Check data

{sf} provides a st_layers() function that is useful to see the content of a dataset.

st_layers(dsn = here("foss4g_R_workshop.gpkg"))

Driver: GPKG 
Available layers:
                layer_name geometry_type features fields
1    london_boroughs_27700 Multi Polygon       33      9
2 boroughs_centroids_27700         Point       33      4
3         cycle_hire_27700         Point      742      9

Now that we have data, let’s do some maps on it in the next chapter !

References

“Aggregate all the things!” 2017. https://anitagraser.com/2017/06/08/aggregate-all-the-things-qgis-expression-edition/.

Lovelace, Robin, Jakub Nowosad, and Jannes Muenchow. 2019. Geocomputation with R. CRC Press.

More details on the datasets here : https://cran.r-project.org/web/packages/spData/spData.pdf ↩
Because GeoPackage are cool !↩