Chapter 1 Getting Started

This chapter gives a quick start to spatial data science with R. It will be easier to follow if you understand R at the level of, say, R for Data Science (Wickham and Grolemund 2017).

1.1 A first map

There is a lot to say about spatial data, but let us first create a map. A simple map is created by:

library(tidyverse)
#> ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.1.0          ✔ purrr   0.2.5.9000
#> ✔ tibble  2.0.0          ✔ dplyr   0.8.0.9000
#> ✔ tidyr   0.8.2          ✔ stringr 1.3.1     
#> ✔ readr   1.2.1          ✔ forcats 0.3.0
#> ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(sf)
#> Linking to GEOS 3.7.0, GDAL 2.3.2, PROJ 5.2.0
system.file("gpkg/nc.gpkg", package="sf") %>%
    read_sf() %>%
    st_transform(32119) %>%
    select(BIR74) %>%
    plot(graticule = TRUE, axes = TRUE)
Figure 1.1: a first map

A lot went on here. We will describe the steps in detail. First, we loaded two R packages:

library(tidyverse)
library(sf)

where tidyverse is needed for the tidyverse functions and methods, and sf is needed for the spatial commands and spatial tidyverse methods. Package sf implements simple features, a standardised way to encode vector data (points, lines, polygons). We will say more about simple features in chapter 3. Most commands in package sf start with st_, short for spatiotemporal, a convention it shares with e.g. PostGIS.
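
To get an impression of these commands, we can list the st_-prefixed functions that sf exports (output not shown here, as it is long and version-dependent):

# list a few of the st_-prefixed functions exported by sf
ls("package:sf", pattern = "^st_") %>% head(10)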

The %>% (pipe) symbols should be read as then: we read

a %>% b() %>% c() %>% d(n = 10)

as with a do b then c then d, and that is just alternative syntax for

d(c(b(a)), n = 10)

or

tmp1 <- b(a)
tmp2 <- c(tmp1)
tmp3 <- d(tmp2, n = 10)

To many, the pipe-form is easier to read because execution order follows reading order (from left to right). Like nested function calls, it avoids the need to choose names for intermediate results.
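
As a concrete, runnable illustration with base R functions (the numbers are arbitrary):

# pipe form: reads left to right
1:10 %>% sqrt() %>% sum() %>% round(digits = 2)
#> [1] 22.47
# equivalent nested form: reads inside-out
round(sum(sqrt(1:10)), digits = 2)
#> [1] 22.47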

For the illustration we picked a data file that comes with sf, the location of which depends on the operating system used. The following will give a different output on your computer:

(file <- system.file("gpkg/nc.gpkg", package="sf"))
#> [1] "/home/edzer/R/x86_64-pc-linux-gnu-library/3.5/sf/gpkg/nc.gpkg"

Never use system.file when you want to read your own data; in that case, the argument should simply be the data source name, typically a file or path (section 1.2). (The parentheses around the expression cause the result to be not only assigned to file, but also printed.)
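
When reading your own data, you pass the file name or path directly; the path below is hypothetical, shown for illustration only:

# hypothetical path: replace with the location of your own file
my_data <- read_sf("data/my_counties.gpkg")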

Then, we read this file into R using read_sf:

(file %>% read_sf()  -> nc)
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: -84.3 ymin: 33.9 xmax: -75.5 ymax: 36.6
#> epsg (SRID):    4267
#> proj4string:    +proj=longlat +datum=NAD27 +no_defs
#> # A tibble: 100 x 15
#>    AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74
#>   <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>
#> 1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1
#> 2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0
#> 3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5
#> 4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1
#> 5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9
#> 6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7
#> # … with 94 more rows, and 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>,
#> #   SID79 <dbl>, NWBIR79 <dbl>, geom <MULTIPOLYGON [°]>

which creates a “spatial tibble”:

class(nc)
#> [1] "sf"         "tbl_df"     "tbl"        "data.frame"

This object is transformed into a new coordinate reference system (North Carolina State Plane, with EPSG code 32119):

(nc %>% st_transform(32119) -> nc.32119)
#> Simple feature collection with 100 features and 14 fields
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 124000 ymin: 14700 xmax: 931000 ymax: 318000
#> epsg (SRID):    32119
#> proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> # A tibble: 100 x 15
#>    AREA PERIMETER CNTY_ CNTY_ID NAME  FIPS  FIPSNO CRESS_ID BIR74 SID74
#>   <dbl>     <dbl> <dbl>   <dbl> <chr> <chr>  <dbl>    <int> <dbl> <dbl>
#> 1 0.114      1.44  1825    1825 Ashe  37009  37009        5  1091     1
#> 2 0.061      1.23  1827    1827 Alle… 37005  37005        3   487     0
#> 3 0.143      1.63  1828    1828 Surry 37171  37171       86  3188     5
#> 4 0.07       2.97  1831    1831 Curr… 37053  37053       27   508     1
#> 5 0.153      2.21  1832    1832 Nort… 37131  37131       66  1421     9
#> 6 0.097      1.67  1833    1833 Hert… 37091  37091       46  1452     7
#> # … with 94 more rows, and 5 more variables: NWBIR74 <dbl>, BIR79 <dbl>,
#> #   SID79 <dbl>, NWBIR79 <dbl>, geom <MULTIPOLYGON [m]>

and a single attribute column is selected:

(nc.32119 %>% select(BIR74) -> nc.32119.bir74)
#> Simple feature collection with 100 features and 1 field
#> geometry type:  MULTIPOLYGON
#> dimension:      XY
#> bbox:           xmin: 124000 ymin: 14700 xmax: 931000 ymax: 318000
#> epsg (SRID):    32119
#> proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> # A tibble: 100 x 2
#>   BIR74                                                                geom
#>   <dbl>                                                  <MULTIPOLYGON [m]>
#> 1  1091 (((387345 278387, 381334 282774, 379438 282943, 373250 290553, 363…
#> 2   487 (((408602 292425, 408565 293985, 406643 296873, 406420 3e+05, 4023…
#> 3  3188 (((478717 277490, 476936 278867, 471503 279173, 470806 281394, 469…
#> 4   508 (((878194 289128, 877381 291117, 875994 290881, 874941 292805, 870…
#> 5  1421 (((769835 277796, 768364 274842, 762616 274401, 763168 269009, 761…
#> 6  1452 (((812328 277876, 791158 277012, 789882 277579, 777724 277107, 769…
#> # … with 94 more rows

Finally, the result is plotted with the command:

nc.32119.bir74 %>% plot(graticule = TRUE, axes = TRUE)

as shown in figure 1.1.

Splitting up the steps lets us see what is happening, as errors arising from combined steps can be hard to attribute to the right step. Repeating the same sequence, but with a wrong argument to st_transform, gives:

system.file("gpkg/nc.gpkg", package="sf") %>%
    read_sf() %>%
    st_transform(32999) %>%
    select(BIR74) %>%
    plot(graticule = TRUE, axes = TRUE)
#> Warning in CPL_crs_from_epsg(as.integer(x)): GDAL Error 6: EPSG PCS/GCS
#> code 32999 not found in EPSG support files. Is this a valid EPSG coordinate
#> system?
#> OGR: Corrupt data

which is informative, but still requires more backtracking to find the cause than taking things step by step in the first place. Similarly, mistyping a column name causes an error:

system.file("gpkg/nc.gpkg", package="sf") %>%
    read_sf() %>%
    st_transform(32119) %>%
    select(BIR75) %>%
    plot(graticule = TRUE, axes = TRUE)
#> Error in .f(.x[[i]], ...) : object 'BIR75' not found

Where do these commands come from? library and system.file are base R. We can ask for help about a particular command by entering e.g.

?library

The command read_sf is an alternative to st_read, which returns a spatial tibble instead of a spatial data frame, and will be discussed in section 1.2. The st_transform method is used here to convert from geographic coordinates (degrees longitude and latitude) into “flat” coordinates, meaning \(x\) and \(y\) coordinates in a planar system. It will be discussed in section 8.1. The plot method for sf objects chooses default colors and legend settings; we instructed it to add a graticule (the grey lines of equal longitude and latitude) and degree labels along the axes. It is described in chapter 9.
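
To inspect a coordinate reference system before or after transforming, we can query it with st_crs (output omitted; its format differs across sf versions):

st_crs(32119)  # the CRS definition behind this EPSG code
st_crs(nc)     # the CRS currently associated with nc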

As witnessed by figure 1.1, the plot command receives the county polygons as well as the BIR74 values for each polygon. How is it possible that we selected only the BIR74 variable, but can still plot the polygons? This is because package sf provides a select method:

methods(select)
#> [1] select.data.frame* select.default*    select.grouped_df*
#> [4] select.list        select.sf*         select.tbl_cube*  
#> see '?methods' for accessing help and source code

and this method (select.sf) makes the geometry (geom) sticky:

nc %>% select(BIR74) %>% names()
#> [1] "BIR74" "geom"

Geometry was similarly sticky in package sp, where subsetting with [ kept the spatial information. We get the “normal” select behaviour if we first coerce to a plain tibble:

nc %>% as_tibble() %>% select(BIR74) %>% names()
#> [1] "BIR74"

A ggplot is created when we use geom_sf:

ggplot() + geom_sf(data = nc.32119) + aes(fill = BIR74) +
    theme(panel.grid.major = element_line(color = "white")) +
    scale_fill_gradientn(colors = sf.colors(20))
Figure 1.2: first ggplot

and a facet plot for a pair of columns in nc.32119 is obtained by gathering the columns:

nc.32119 %>% select(SID74, SID79) %>% gather(VAR, SID, -geom) -> nc2
ggplot() + geom_sf(data = nc2, aes(fill = SID)) + facet_wrap(~VAR, ncol = 1) +
  scale_y_continuous(breaks = 34:36) +
  scale_fill_gradientn(colors = sf.colors(20)) +
  theme(panel.grid.major = element_line(color = "white"))

An interactive, leaflet-type map is obtained by

suppressPackageStartupMessages(library(mapview))
nc.32119 %>% mapview(zcol = "BIR74", legend = TRUE, col.regions = sf.colors)

1.2 Reading and writing

Typical R data science tasks start with reading data from an external source; this may be a file, or a set of files like a “shapefile”, or a database, or a web service. Package sf can read from a large number of different data source types, each having its own driver. The following commands show how many vector and raster drivers we have available:

st_drivers("vector") %>% nrow() # vector drivers
#> [1] 85
st_drivers("raster") %>% nrow() # raster drivers
#> [1] 138

(The output you see may differ, depending on operating system and configuration; when the same version of GDAL is in use, the drivers available in sf are the same as in rgdal.)
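
To check for the availability of a specific driver, e.g. for GeoPackage files, we can filter this list (a minimal sketch; st_drivers returns a data.frame with a name column):

st_drivers("vector") %>% filter(name == "GPKG")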

1.2.1 GDAL

st_drivers lists the drivers available to GDAL, the Geospatial Data Abstraction Library. This library can be seen as the Swiss army knife of spatial data; besides R, it is used by Python, QGIS, PostGIS, and more than 100 other software projects. The dependency of sf on other R packages and system libraries is shown in figure 1.3.

Figure 1.3: sf and its dependencies; arrows indicate strong dependency, dashed arrows weak dependency

Note that the C/C++ libraries used (GDAL, GEOS, PROJ, liblwgeom, udunits2) are all developed, maintained, and used by data science communities that are large, but different from the R community. By using these libraries, we share our understanding of what we are doing with these other communities. Because R (and Python) provide interactive interfaces to this software, many R users get closer to these libraries than users of other software built on them. This is not only important for resolving problems, but also for reaching consensus on which findings are helpful.

GDAL is a “library of libraries”: in order to read all these data sources it needs a large number of other libraries, typically linking to over 100 of them. Binary packages distributed by CRAN contain only statically linked code: CRAN does not want to make any assumptions about the presence of third-party libraries on the host system. As a consequence, when the sf package is installed in binary form from CRAN, it includes a copy of all the required external libraries as well as their dependencies, which may amount to around 100 MB.
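
The versions of the linked external libraries can be queried from within R; the output will differ across systems:

sf_extSoftVersion()  # versions of GEOS, GDAL, PROJ, ... that sf links to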

1.2.2 st_read or read_sf?

The function to read vector data is st_read. Function read_sf is largely the same as st_read, but chooses a few tidyverse-style defaults:

  • it is silent by default, where st_read gives a short report
  • it returns a spatial tibble instead of a spatial data frame
  • it sets as default stringsAsFactors = FALSE, where st_read listens to the global option default.stringsAsFactors() (which is TRUE by default)
  • it accepts list-columns as input

In the same fashion, compared to st_write, function write_sf,

  • is also silent
  • overwrites layers by default (i.e., it sets delete_layer = TRUE), which st_write does not do; see the sketch after this list.
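
A minimal sketch of this difference, writing nc to a temporary GeoPackage file twice (the file name is generated on the fly):

gpkg_file <- tempfile(fileext = ".gpkg")
write_sf(nc, gpkg_file)  # silent
write_sf(nc, gpkg_file)  # also succeeds: the existing layer is overwritten
# st_write(nc, gpkg_file) would now raise an error,
# unless delete_layer = TRUE is passed explicitly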

1.2.3 Reading and writing raster data

Raster data can be read with function read_stars from package stars:

library(stars)
tif = system.file("tif/L7_ETMs.tif", package = "stars")
(x = tif %>% read_stars())
#> stars object with 3 dimensions and 1 attribute
#> attribute(s):
#>   L7_ETMs.tif   
#>  Min.   :  1.0  
#>  1st Qu.: 54.0  
#>  Median : 69.0  
#>  Mean   : 68.9  
#>  3rd Qu.: 86.0  
#>  Max.   :255.0  
#> dimension(s):
#>      from  to  offset delta                       refsys point values    
#> x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
#> y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
#> band    1   6      NA    NA                           NA    NA   NULL

Plotting this object shows the six spectral bands that were read, with color breaks based on quantiles of the pixel values across all bands:

plot(x)
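
A single band can be selected by subsetting the data cube; for stars objects the first index selects the attribute, the following ones the dimensions (a minimal sketch):

plot(x[,,,1])  # plot only the first spectral band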

Similarly, we can write raster data with write_stars:

tif_file = tempfile(fileext = ".tif")
write_stars(x, tif_file)

We can read back the raster metadata (its dimensions and reference system, but not the actual pixel values) by

read_stars(tif_file, proxy = TRUE)
#> stars_proxy object with 1 attribute in file:
#> $file62ec22f89123.tif
#> [1] "/tmp/RtmpzdxmwG/file62ec22f89123.tif"
#> 
#> dimension(s):
#>      from  to  offset delta                       refsys point values    
#> x       1 349  288776  28.5 +proj=utm +zone=25 +south... FALSE   NULL [x]
#> y       1 352 9120761 -28.5 +proj=utm +zone=25 +south... FALSE   NULL [y]
#> band    1   6      NA    NA                           NA    NA   NULL

Raster data analysis and its integration with vector data is explained in detail in chapter 4.

1.2.4 Reading from files, and legacy shapefiles

We saw above that a spatial dataset can be read from a single file by

system.file("gpkg/nc.gpkg", package="sf") %>% read_sf() -> nc

In some cases, spatial datasets are contained in multiple files, as with shapefiles. A “shapefile” should really be understood as a set of files with a common prefix, or even a directory containing several such sets.

Package sf comes with a couple of shapefiles; a directory listing of the shape directory in the package is obtained by

list.files(system.file("shape/", package = "sf"))
#>  [1] "nc.dbf"                  "nc.prj"                 
#>  [3] "nc.shp"                  "nc.shx"                 
#>  [5] "olinda1.dbf"             "olinda1.prj"            
#>  [7] "olinda1.shp"             "olinda1.shx"            
#>  [9] "storms_xyz_feature.dbf"  "storms_xyz_feature.shp" 
#> [11] "storms_xyz_feature.shx"  "storms_xyz.dbf"         
#> [13] "storms_xyz.shp"          "storms_xyz.shx"         
#> [15] "storms_xyzm_feature.dbf" "storms_xyzm_feature.shp"
#> [17] "storms_xyzm_feature.shx" "storms_xyzm.dbf"        
#> [19] "storms_xyzm.shp"         "storms_xyzm.shx"

We can read a single shapefile by

system.file("shape/nc.shp", package="sf") %>% read_sf() -> nc

and it is important to know that in this case, all four files starting with nc are read from this directory.
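
We can verify which files belong to this set by listing them (the pattern matches the common prefix):

list.files(system.file("shape", package = "sf"), pattern = "^nc")
#> [1] "nc.dbf" "nc.prj" "nc.shp" "nc.shx"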

We can also read the directory with shapefiles by

system.file("shape", package="sf") %>% read_sf() -> something
#> Warning in evalq((function (..., call. = TRUE, immediate. = FALSE,
#> noBreaks. = FALSE, : automatically selected the first layer in a data
#> source containing more than one.

but we now see a warning, indicating that we are reading only the first layer from a multi-layer data source (and not nc.shp!). Indeed, this directory contains multiple layers, which can be queried by

system.file("shape", package="sf") %>% st_layers()
#> Driver: ESRI Shapefile 
#> Available layers:
#>            layer_name        geometry_type features fields
#> 1 storms_xyzm_feature Measured Line String       71      1
#> 2          storms_xyz       3D Line String       71      0
#> 3                  nc              Polygon      100     14
#> 4  storms_xyz_feature       3D Line String       71      1
#> 5             olinda1              Polygon      470      6
#> 6         storms_xyzm Measured Line String       71      0

From this list, we could pick one, and use it as the layer argument, as in

dataset <- system.file("shape", package="sf")
layer <- "nc"
nc <- read_sf(dataset, layer)

which is essentially a more convoluted way of doing what we did before: reading nc.shp.
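
Recent versions of sf also accept an OGR SQL query, which selects a layer and can subset it at the same time; a minimal sketch, assuming the query argument is available in your version of sf:

# select a layer and restrict its records in one step, using OGR SQL
read_sf(dataset, query = "SELECT NAME, BIR74 FROM nc WHERE BIR74 > 10000")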

Considering shapefiles in directories as layers in a dataset is not something that sf came up with, but is the way GDAL handles this. Although it is a good idea in general to give up using shapefiles, we cannot always control the format of the spatial data we get to start with.

1.2.5 Reading from a text string

In the special case of a GeoJSON (Butler et al. 2016) dataset, when the dataset is contained in a length-one character vector, it can be directly passed to read_sf and read from memory:

str <- '{
  "type": "FeatureCollection",
  "features": [
    { "type": "Feature",
      "geometry": {
        "type": "Point",
        "coordinates": [102.0, 0.5]
      },
      "properties": {
        "prop0": "value0"
      }
    },
    { "type": "Feature",
      "geometry": {
        "type": "LineString",
        "coordinates": [
          [102.0, 0.0], [103.0, 1.0], [104.0, 0.0], [105.0, 1.0]
        ]
      },
      "properties": {
        "prop0": "value0",
        "prop1": 0.0
      }
    },
    { "type": "Feature",
      "geometry": {
        "type": "Polygon",
        "coordinates": [
          [
            [100.0, 0.0], [101.0, 0.0], [101.0, 1.0],
            [100.0, 1.0], [100.0, 0.0]
          ]
        ]
      },
      "properties": {
        "prop0": "value0",
        "prop1": { "this": "that" }
      }
    }
  ]
}'
(sf_obj <- read_sf(str))
#> Simple feature collection with 3 features and 2 fields
#> geometry type:  GEOMETRY
#> dimension:      XY
#> bbox:           xmin: 100 ymin: 0 xmax: 105 ymax: 1
#> epsg (SRID):    4326
#> proj4string:    +proj=longlat +datum=WGS84 +no_defs
#> # A tibble: 3 x 3
#>   prop0  prop1                                                     geometry
#>   <chr>  <chr>                                               <GEOMETRY [°]>
#> 1 value0 <NA>                                               POINT (102 0.5)
#> 2 value0 0.0                        LINESTRING (102 0, 103 1, 104 0, 105 1)
#> 3 value0 "{ \"this\": \"that\"… POLYGON ((100 0, 101 0, 101 1, 100 1, 100 …
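
In the same spirit, a single geometry can be created from a well-known text (WKT) string with st_as_sfc (a minimal sketch; output omitted):

st_as_sfc("POINT (102 0.5)")  # a length-one geometry list-column (sfc)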

1.2.6 Database

Data can be read from a spatial database directly through two paths. The first is to use R's standard database interface (DBI), for instance with an SQLite database:

library(RSQLite)
db = system.file("sqlite/meuse.sqlite", package = "sf")
dbcon <- dbConnect(dbDriver("SQLite"), db)
(s = st_read(dbcon, "meuse.sqlite"))[1:3,]
#> Simple feature collection with 3 features and 13 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 181000 ymin: 334000 xmax: 181000 ymax: 334000
#> epsg (SRID):    NA
#> proj4string:    NA
#>   ogc_fid cadmium copper lead zinc elev    dist   om ffreq soil lime
#> 1       1    11.7     85  299 1022 7.91 0.00136 13.6     1    1    1
#> 2       2     8.6     81  277 1141 6.98 0.01222 14.0     1    1    1
#> 3       3     6.5     68  199  640 7.80 0.10303 13.0     1    1    1
#>   landuse dist.m              GEOMETRY
#> 1      Ah     50 POINT (181072 333611)
#> 2      Ah     30 POINT (181025 333558)
#> 3      Ah    150 POINT (181165 333537)
dbDisconnect(dbcon)

The second way is to use GDAL's database drivers, e.g. by

st_read(db)[1:3,]
#> Reading layer `meuse.sqlite' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/3.5/sf/sqlite/meuse.sqlite' using driver `SQLite'
#> Simple feature collection with 155 features and 12 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 179000 ymin: 330000 xmax: 181000 ymax: 334000
#> epsg (SRID):    28992
#> proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs
#> Simple feature collection with 3 features and 12 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 181000 ymin: 334000 xmax: 181000 ymax: 334000
#> epsg (SRID):    28992
#> proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +towgs84=565.4171,50.3319,465.5524,-0.398957,0.343988,-1.87740,4.0725 +units=m +no_defs
#>   cadmium copper lead zinc elev    dist   om ffreq soil lime landuse
#> 1    11.7     85  299 1022 7.91 0.00136 13.6     1    1    1      Ah
#> 2     8.6     81  277 1141 6.98 0.01222 14.0     1    1    1      Ah
#> 3     6.5     68  199  640 7.80 0.10303 13.0     1    1    1      Ah
#>   dist.m              GEOMETRY
#> 1     50 POINT (181072 333611)
#> 2     30 POINT (181025 333558)
#> 3    150 POINT (181165 333537)

An advantage of the former approach may be that any query can be passed, allowing for reading only parts of a table into R’s memory.
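
For example, a row subset can be fetched with an SQL query over the DBI connection (a minimal sketch; the query argument of sf's DBI method is assumed to be available in your version of sf):

dbcon <- dbConnect(dbDriver("SQLite"), db)
# read only the records with zinc concentration above 1000 ppm
high_zinc <- st_read(dbcon, query = 'SELECT * FROM "meuse.sqlite" WHERE zinc > 1000')
dbDisconnect(dbcon)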

1.3 Exercises

  1. Read the shapefile storms_xyz_feature from the shape directory in the sf package
  2. Copy this file to another directory on your computer, and read it from there (note: a shapefile consists of more than one file!)
  3. How many features does this dataset contain?
  4. Plot the dataset, with axes = TRUE (hint: before plotting, pipe through st_zm to drop Z and M coordinates; more about this in chapter 3).
  5. Before plotting, pipe the dataset through st_set_crs(4326). What is different in the plot obtained?

References

Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science. O’Reilly. http://r4ds.had.co.nz/.

Butler, H., M. Daly, A. Doyle, S. Gillies, S. Hagen, and T. Schaub. 2016. “The GeoJSON Format.” RFC 7946. Internet Engineering Task Force (IETF). https://tools.ietf.org/html/rfc7946.