Chapter 6 Feature attributes

Feature attributes refer to the properties of features (“things”) that do not describe the feature’s geometry. Feature attributes can be derived from geometry (e.g. length of a LINESTRING, area of a POLYGON) but they can also refer to completely different properties, such as

  • the name of a street or a county,
  • the number of people living in a country,
  • the type of a road
  • the soil type in a polygon from a soil map.
  • the opening hours of a shop
  • the body weight of an animal
  • the NO\(_2\) concentration measured at an air quality monitoring station

Although temporal properties of features are no less fundamental than their spatial properties, the simple feature access standard and consequently the sf package does not give time a similar role as space; more on that in chapter 4.

Most sf objects will contain both geometries and attributes for features. The geometric operations described in the previous chapter (5) operate on geometries only, and may occasionally add attributes, but will not modify attributes present.

In all these cases, attribute values remain unmodified. At first sight, that looks rather harmless. But if we look into a simple case of replacing a county boundary with a county centroid, as in

library(sf)
library(dplyr)
system.file("gpkg/nc.gpkg", package="sf") %>%
    read_sf() %>%
    st_transform(32119) %>%
    select(BIR74, SID74, NAME) %>%
    st_centroid() %>%
    head(n = 1) -> x # save as x
#> Warning in st_centroid.sf(.): st_centroid assumes attributes are constant
#> over geometries of x
st_geometry(x)[[1]]
#> POINT (385605 3e+05)

we receive a warning. This warning is justified for the first two variables shown (total births and number of SID disease cases, 1974) which, as such, are not associated with a feature whose geometry is POINT (385605.4 300303.5). The third variable, NAME is however still the county name for the point indicated, but the point geometry no longer is the county geometry.

6.1 Attribute-geometry relationships

Changing the feature geometry without changing the feature attributes does change the feature, since the feature is characterised by the combination of geometry and attributes. Can we, ahead of time, predict whether the resulting feature will still meaningfully relate to the attribute data when we replace all geometries for instance with their convex hull or centroid? It depends.

Take the example of a road, represented by a LINESTRING, which has an attribute property road width equal to 10 m. What can we say about the road width of an arbitray subsectin of this road? That depends on whether the attribute road length describes, for instance the road width everywhere, meaning that road width is constant along the road, or whether it describes an aggregate property, such as minimum or average road width. In case of the minimum, for an arbitrary subsection of the road one could still argue that the minimum road with must be at least as large as the minimum road width for the whole segment, but it may no longer be the minimum for that subsection. This gives us two “types” for the attribute-geometry relationship (AGR):

  • constant the attribute value is valid everywhere in or over the geometry
  • aggregate the attribute is an aggregate, a summary value over the geometry

For polygon data, typical examples of constant AGR are

  • land use for a land use polygon
  • rock units or geologic strata in a geological map
  • soil type in a soil map
  • elevation class in a elevation map that shows elevation as classes
  • climate zone in a climate zone map

Typical examples for the aggregate AGR are

  • population, either as number of persons or as population density
  • other socio-economic variables, summarised by area
  • total emission of pollutants by region
  • block mean NO\(_2\) concentrations, as e.g. obtained by block kriging or a dispersion model that predicts areal means

A third type of AGR is that where an attribute identifies a feature geometry. The example above is county NAME: the name identifies the county, and is still the county NAME for any sub-area.

  • identity the attribute value uniquely identifies the geometry as a whole, there are no other geometries with the same value

Arbitrary sub-areas will lose the identity property but becomes a constant attribute. An example is:

  • any point inside a county is still part of the county and must have the same value for county name, but it does not longer represent the (entire) geometry corresponding to that county.

We can specify the AGR of an attribute in an sf object by st_set_agr:

nc <- system.file("gpkg/nc.gpkg", package="sf") %>%
    read_sf() %>%
    st_transform(32119)
nc1 <- nc %>% select(BIR74, SID74, NAME) %>%
    st_set_agr(c(BIR74 = "aggregate", SID74 = "aggregate", NAME = "identity"))

This helps to get rid of warnings that a particular attribute is assumed to be constant over a geometry, if it already is. The following no longer generates a warning

nc1 %>% select(NAME) %>%
    st_centroid() %>%
    head(1)
#> Simple feature collection with 1 feature and 1 field
#> Attribute-geometry relationship: 1 constant, 0 aggregate, 0 identity
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 386000 ymin: 3e+05 xmax: 386000 ymax: 3e+05
#> epsg (SRID):    32119
#> proj4string:    +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> # A tibble: 1 x 2
#>   NAME            geom
#>   <chr>    <POINT [m]>
#> 1 Ashe  (385605 3e+05)

and also changes AGR for NAME from identity to constant when replacing the geometry with the geometry’s centroid:

nc1 %>% select(NAME) %>%
    st_centroid() %>% 
    st_agr()
#>     NAME 
#> constant 
#> Levels: constant aggregate identity

Identifying attribute-geometry relationships, and warning against their absence is a first and simple implementation of the notion that the types of phenomena we encounter in spatial data science (like objects, fields, and aggregations) are not identified by their geometrical representations (points, lines, polygons, rasters). Making the wrong assumptions here easily leads to meaningless analysis results (Stasch et al. 2014,Scheider et al. (2016)).

6.2 Spatial join

Spatial joins are similar to regular (left or inner) joins, where the join criterion is not equality of one or more fields, but a spatial predicate, such as that two records have intersecting geometries. As an example, we can create a join between two tables,

a = st_sf(a = 1:2, geom = st_sfc(st_point(c(0,0)), st_point(c(1,1))))
b = st_sf(b = 3:4, geom = st_sfc(st_linestring(rbind(c(2,0), c(0,2))), st_point(c(1,1))))
st_join(a, b)
#> Simple feature collection with 3 features and 2 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> epsg (SRID):    NA
#> proj4string:    NA
#>     a  b        geom
#> 1   1 NA POINT (0 0)
#> 2   2  3 POINT (1 1)
#> 2.1 2  4 POINT (1 1)
st_join(a, b, left = FALSE)
#> Simple feature collection with 2 features and 2 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 1 ymin: 1 xmax: 1 ymax: 1
#> epsg (SRID):    NA
#> proj4string:    NA
#>     a b        geom
#> 2   2 3 POINT (1 1)
#> 2.1 2 4 POINT (1 1)
st_join(b, a)
#> Simple feature collection with 2 features and 2 fields
#> geometry type:  GEOMETRY
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 2 ymax: 2
#> epsg (SRID):    NA
#> proj4string:    NA
#>   b a                  geom
#> 1 3 2 LINESTRING (2 0, 0 2)
#> 2 4 2           POINT (1 1)

We see that unless left = FALSE, we get all elements (and geometries) from the first argument, augmented with fields of the second argument when geometries match. The example shows the case where there are two geometries matching to point (1,1). The spatial join predicate function can be freely chosen, e.g. from the binary predicates listed in section 5.1.2.

When we match two sets of polygons, it may be a bit of a mess to go through all the many matches. One way out of this is to only provide the match with the largest overlap with the target geometry, obtained by adding argument largest = TRUE. An example of this is shown (visually) in figure 6.1.

example of st_join with largest = TRUE: the label of the polygon in the top figure with the largest intersection with polygons in the bottom figure is assigned to the polygons of the bottom figure

Figure 6.1: example of st_join with largest = TRUE: the label of the polygon in the top figure with the largest intersection with polygons in the bottom figure is assigned to the polygons of the bottom figure

6.3 Aggregate and summarise

Package sf provides sf methods for stats::aggregate and dplyr::summarise. Both do essentially the same:

  • given a grouping predicate (for summarise, obtained from group_by)
  • given an aggregation function
  • aggregate the selected attributes using this function, per group
  • aggregate in addition the geometries.
  • if do_union is TRUE (the default), union the aggregated geometries.

Unioning aggregated geometries dissolves for instance internal polygon boundaries, which otherwise would lead to invalid MULTIPOLYGON errors in subsequent analysis, or plotting of potentially unwanted internal polygon boundaries. Figure 6.2 illustrates this.

left: invalid `MULTIPOLYGON` with two external rings with common boundary, right: valid `POLYGON` obtained after unioning the geometry left

Figure 6.2: left: invalid MULTIPOLYGON with two external rings with common boundary, right: valid POLYGON obtained after unioning the geometry left

6.4 Intersections

Suppose we have two datasets with different geometries and attributes (left figure 6.3), and we want to compute their intersections:

p1 = st_polygon(list(rbind(c(0,0), c(4,0), c(4,4), c(0,4), c(0,0))))
d1 = st_sf(a = c(3,1), geom = st_sfc(p1, p1 + c(4, 0)))
d2 = st_sf(b = c(4), geom = st_sfc(p1 * .75 + c(3, 2)))

What will the intersection of these two objects give?

(i = st_intersection(d1, d2))
#> Warning: attribute variables are assumed to be spatially constant
#> throughout all geometries
#> Simple feature collection with 2 features and 2 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: 3 ymin: 2 xmax: 6 ymax: 4
#> epsg (SRID):    NA
#> proj4string:    NA
#>   a b                           geom
#> 1 3 4 POLYGON ((3 4, 4 4, 4 2, 3 ...
#> 2 1 4 POLYGON ((4 2, 4 4, 6 4, 6 ...
plot(d1, xlim = c(0,8), ylim = c(0, 6), col = NA, border = 1, reset = FALSE)
plot(d2, col = NA, border = 'red', add = TRUE, lwd = 2)
plot(d1, xlim = c(0,8), ylim = c(0, 6), col = NA, border = 1, lwd = 2, reset = FALSE)
plot(d2, col = NA, border = 'red', add = TRUE, lwd = 3)
plot(st_geometry(i), add = TRUE, col = grey(c(.7,.9)), , border = 'green', lwd = 1)
left: overlapping geometries (d2: red); right: intersection areas (i: grey)left: overlapping geometries (d2: red); right: intersection areas (i: grey)

Figure 6.3: left: overlapping geometries (d2: red); right: intersection areas (i: grey)

As we see, this gives the areas of intersection, along with the corresponding attributes for both contributing objects, and a warning that attributes were assumed to be spatially constant. Although this may be convenient in some cases, it may be entirely meaningless in others. For instance, in case attribute b in object d2 represents the number of people living in d2, then after the intersection we end up with twice as many people living over a smaller area.

As seen in section 5.5, computing intersections easily leads to errors caused by invalid geometries. Setting precision (section 5.4) may prevent this.

6.5 Area-weighted interpolation

Suppose we want to combine geometries and attributes of two datasets such, that we get attribute values of the first datasets summarised for the geometries of the second. There are various ways we can go for this. The simples one, building on the previous example, would be to obtain for the geometry of d2 the attribute of d1 that has the largest overlap with d2. This is obtained by

st_join(d2, d1, largest = TRUE)
#> Warning: attribute variables are assumed to be spatially constant
#> throughout all geometries
#> Simple feature collection with 1 feature and 2 fields
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: 3 ymin: 2 xmax: 6 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>   b a                           geom
#> 1 4 1 POLYGON ((3 2, 6 2, 6 5, 3 ...

Another option would be to summarise the attribute, e.g. taking its mean, regardless the amount of overlap. This is obtained by

aggregate(d1, d2, mean)
#> Simple feature collection with 1 feature and 1 field
#> geometry type:  POLYGON
#> dimension:      XY
#> bbox:           xmin: 3 ymin: 2 xmax: 6 ymax: 5
#> epsg (SRID):    NA
#> proj4string:    NA
#>   a                       geometry
#> 1 2 POLYGON ((3 2, 6 2, 6 5, 3 ...

A third option is to apply area-weighted interpolation, meaning that we interpolate (average) the variable by taking into account the respective area contributions of overlap (Goodchild and Lam 1980,Do, Thomas-Agnan, and Vanhems (2015a),Do, Thomas-Agnan, and Vanhems (2015b)). This is done e.g. by

d3 = st_sfc(p1 * .75 + c(3, 2), p1 * .75 + c(3,3))
st_interpolate_aw(d1, d3, extensive = FALSE)$a
#> Warning in st_interpolate_aw(d1, d3, extensive = FALSE): st_interpolate_aw
#> assumes attributes are constant over areas of x
#> [1] 1.67 1.67
st_interpolate_aw(d1, d3, extensive = TRUE)$a
#> Warning in st_interpolate_aw(d1, d3, extensive = TRUE): st_interpolate_aw
#> assumes attributes are constant over areas of x
#> [1] 0.625 0.312

6.5.1 Spatially intensive and extensive variables

The difference between the two examples for area-weighted interpolation is how the final weighted sum (value times area of intersection) is normalised: by the target area (extensive), or by the sum of the area covered (intensive, extensive = FALSE). Spatially intensive variables are variables for which the value, when we split an area, does not principally change. An example might be temperature, elevation, or population density. Spatially extensive variables are variables for which the value is also split, according to the area. Examples are population (amount), or area.

6.6 Exercises

  • Add a variable to the nc dataset by nc$State = "North Carolina". Which value should you attach to this variable for the attribute-geometry relationship (agr)?
  • Create a new sf object from the geometry obtained by st_union(nc), and assign "North Carolina" to the variable State. Which agr can you now assign to this attribute variable?
  • Use st_area to add a variable with name area to nc. Compare the area and AREA variables in the nc dataset. What are the units of AREA? Are the two linearly related? If there are discrepancies, what could be the cause?
  • Is the area variable intensive or extensive? Is its agr equal to constant, identity or aggregate?
  • Find the name of the county that contains POINT(-78.34046 35.017)
  • Find the names of all counties with boundaries that touch county Sampson.
  • List the names of all counties that are less than 50 km away from county Sampson.

References

Stasch, Christoph, Simon Scheider, Edzer Pebesma, and Werner Kuhn. 2014. “Meaningful Spatial Prediction and Aggregation.” Environmental Modelling & Software 51: 149–65. doi:10.1016/j.envsoft.2013.09.006.

Scheider, Simon, Benedikt Gräler, Edzer Pebesma, and Christoph Stasch. 2016. “Modeling Spatiotemporal Information Generation.” International Journal of Geographical Information Science 30 (10). Taylor & Francis: 1980–2008. doi:10.1080/13658816.2016.1151520.

Goodchild, Michael F, and Nina Siu Ngan Lam. 1980. Areal Interpolation: A Variant of the Traditional Spatial Problem. Department of Geography, University of Western Ontario London, ON, Canada.

Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015a. “Accuracy of Areal Interpolation Methods for Count Data.” Spatial Statistics 14: 412–38. doi:10.1016/j.spasta.2015.07.005.

Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015b. “Spatial Reallocation of Areal Data: A Review.” Rev. Econ. Rég. Urbaine 1/2: 27–58. https://www.tse-fr.eu/sites/default/files/medias/doc/wp/mad/wp_tse_397_v2.pdf.