Chapter 6 Feature attributes
Feature attributes refer to the properties of features (“things”) that do not describe the feature’s geometry. Feature attributes can be derived from geometry (e.g. length of a
LINESTRING, area of a
POLYGON) but they can also refer to completely different properties, such as
- the name of a street or a county,
- the number of people living in a country,
- the type of a road
- the soil type in a polygon from a soil map.
- the opening hours of a shop
- the body weight of an animal
- the NO\(_2\) concentration measured at an air quality monitoring station
Although temporal properties of features are no less fundamental than their spatial properties, the simple feature access standard and consequently the
sf package does not give time a similar role as space; more on that in chapter 4.
sf objects will contain both geometries and attributes for features. The geometric operations described in the previous chapter (5) operate on geometries only, and may occasionally add attributes, but will not modify attributes present.
In all these cases, attribute values remain unmodified. At first sight, that looks rather harmless. But if we look into a simple case of replacing a county boundary with a county centroid, as in
library(sf) library(dplyr) system.file("gpkg/nc.gpkg", package="sf") %>% read_sf() %>% st_transform(32119) %>% select(BIR74, SID74, NAME) %>% st_centroid() %>% head(n = 1) -> x # save as x #> Warning in st_centroid.sf(.): st_centroid assumes attributes are constant #> over geometries of x st_geometry(x)[] #> POINT (385605 3e+05)
we receive a warning. This warning is justified for the first two variables shown (total births and number of SID disease cases, 1974) which, as such, are not associated with a feature whose geometry is
POINT (385605.4 300303.5). The third variable,
NAME is however still the county name for the point indicated, but the point geometry no longer is the county geometry.
6.1 Attribute-geometry relationships
Changing the feature geometry without changing the feature attributes does change the feature, since the feature is characterised by the combination of geometry and attributes. Can we, ahead of time, predict whether the resulting feature will still meaningfully relate to the attribute data when we replace all geometries for instance with their convex hull or centroid? It depends.
Take the example of a road, represented by a
LINESTRING, which has an attribute property road width equal to 10 m. What can we say about the road width of an arbitray subsectin of this road? That depends on whether the attribute road length describes, for instance the road width everywhere, meaning that road width is constant along the road, or whether it describes an aggregate property, such as minimum or average road width. In case of the minimum, for an arbitrary subsection of the road one could still argue that the minimum road with must be at least as large as the minimum road width for the whole segment, but it may no longer be the minimum for that subsection. This gives us two “types” for the attribute-geometry relationship (AGR):
- constant the attribute value is valid everywhere in or over the geometry
- aggregate the attribute is an aggregate, a summary value over the geometry
For polygon data, typical examples of constant AGR are
- land use for a land use polygon
- rock units or geologic strata in a geological map
- soil type in a soil map
- elevation class in a elevation map that shows elevation as classes
- climate zone in a climate zone map
Typical examples for the aggregate AGR are
- population, either as number of persons or as population density
- other socio-economic variables, summarised by area
- total emission of pollutants by region
- block mean NO\(_2\) concentrations, as e.g. obtained by block kriging or a dispersion model that predicts areal means
A third type of AGR is that where an attribute identifies a feature geometry. The example above is county
NAME: the name identifies the county, and is still the county
NAME for any sub-area.
- identity the attribute value uniquely identifies the geometry as a whole, there are no other geometries with the same value
Arbitrary sub-areas will lose the identity property but becomes a constant attribute. An example is:
- any point inside a county is still part of the county and must have the same value for county name, but it does not longer represent the (entire) geometry corresponding to that county.
We can specify the AGR of an attribute in an
sf object by
nc <- system.file("gpkg/nc.gpkg", package="sf") %>% read_sf() %>% st_transform(32119) nc1 <- nc %>% select(BIR74, SID74, NAME) %>% st_set_agr(c(BIR74 = "aggregate", SID74 = "aggregate", NAME = "identity"))
This helps to get rid of warnings that a particular attribute is assumed to be constant over a geometry, if it already is. The following no longer generates a warning
nc1 %>% select(NAME) %>% st_centroid() %>% head(1) #> Simple feature collection with 1 feature and 1 field #> Attribute-geometry relationship: 1 constant, 0 aggregate, 0 identity #> geometry type: POINT #> dimension: XY #> bbox: xmin: 386000 ymin: 3e+05 xmax: 386000 ymax: 3e+05 #> epsg (SRID): 32119 #> proj4string: +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs #> # A tibble: 1 x 2 #> NAME geom #> <chr> <POINT [m]> #> 1 Ashe (385605 3e+05)
and also changes AGR for
constant when replacing the geometry with the geometry’s centroid:
nc1 %>% select(NAME) %>% st_centroid() %>% st_agr() #> NAME #> constant #> Levels: constant aggregate identity
Identifying attribute-geometry relationships, and warning against their absence is a first and simple implementation of the notion that the types of phenomena we encounter in spatial data science (like objects, fields, and aggregations) are not identified by their geometrical representations (points, lines, polygons, rasters). Making the wrong assumptions here easily leads to meaningless analysis results (Stasch et al. 2014,Scheider et al. (2016)).
6.2 Spatial join
Spatial joins are similar to regular (left or inner) joins, where the join criterion is not equality of one or more fields, but a spatial predicate, such as that two records have intersecting geometries. As an example, we can create a join between two tables,
a = st_sf(a = 1:2, geom = st_sfc(st_point(c(0,0)), st_point(c(1,1)))) b = st_sf(b = 3:4, geom = st_sfc(st_linestring(rbind(c(2,0), c(0,2))), st_point(c(1,1)))) st_join(a, b) #> Simple feature collection with 3 features and 2 fields #> geometry type: POINT #> dimension: XY #> bbox: xmin: 0 ymin: 0 xmax: 1 ymax: 1 #> epsg (SRID): NA #> proj4string: NA #> a b geom #> 1 1 NA POINT (0 0) #> 2 2 3 POINT (1 1) #> 2.1 2 4 POINT (1 1) st_join(a, b, left = FALSE) #> Simple feature collection with 2 features and 2 fields #> geometry type: POINT #> dimension: XY #> bbox: xmin: 1 ymin: 1 xmax: 1 ymax: 1 #> epsg (SRID): NA #> proj4string: NA #> a b geom #> 2 2 3 POINT (1 1) #> 2.1 2 4 POINT (1 1) st_join(b, a) #> Simple feature collection with 2 features and 2 fields #> geometry type: GEOMETRY #> dimension: XY #> bbox: xmin: 0 ymin: 0 xmax: 2 ymax: 2 #> epsg (SRID): NA #> proj4string: NA #> b a geom #> 1 3 2 LINESTRING (2 0, 0 2) #> 2 4 2 POINT (1 1)
We see that unless
left = FALSE, we get all elements (and geometries) from the first argument, augmented with fields of the second argument when geometries match. The example shows the case where there are two geometries matching to point (1,1). The spatial join predicate function can be freely chosen, e.g. from the binary predicates listed in section 5.1.2.
When we match two sets of polygons, it may be a bit of a mess to go through all the many matches. One way out of this is to only provide the match with the largest overlap with the target geometry, obtained by adding argument
largest = TRUE. An example of this is shown (visually) in figure 6.1.
6.3 Aggregate and summarise
sf methods for
dplyr::summarise. Both do essentially the same:
- given a grouping predicate (for
summarise, obtained from
- given an aggregation function
- aggregate the selected attributes using this function, per group
- aggregate in addition the geometries.
TRUE(the default), union the aggregated geometries.
Unioning aggregated geometries dissolves for instance internal polygon boundaries, which otherwise would lead to invalid
MULTIPOLYGON errors in subsequent analysis, or plotting of potentially unwanted internal polygon boundaries. Figure 6.2 illustrates this.
Suppose we have two datasets with different geometries and attributes (left figure 6.3), and we want to compute their intersections:
p1 = st_polygon(list(rbind(c(0,0), c(4,0), c(4,4), c(0,4), c(0,0)))) d1 = st_sf(a = c(3,1), geom = st_sfc(p1, p1 + c(4, 0))) d2 = st_sf(b = c(4), geom = st_sfc(p1 * .75 + c(3, 2)))
What will the intersection of these two objects give?
(i = st_intersection(d1, d2)) #> Warning: attribute variables are assumed to be spatially constant #> throughout all geometries #> Simple feature collection with 2 features and 2 fields #> geometry type: POLYGON #> dimension: XY #> bbox: xmin: 3 ymin: 2 xmax: 6 ymax: 4 #> epsg (SRID): NA #> proj4string: NA #> a b geom #> 1 3 4 POLYGON ((3 4, 4 4, 4 2, 3 ... #> 2 1 4 POLYGON ((4 2, 4 4, 6 4, 6 ...
plot(d1, xlim = c(0,8), ylim = c(0, 6), col = NA, border = 1, reset = FALSE) plot(d2, col = NA, border = 'red', add = TRUE, lwd = 2) plot(d1, xlim = c(0,8), ylim = c(0, 6), col = NA, border = 1, lwd = 2, reset = FALSE) plot(d2, col = NA, border = 'red', add = TRUE, lwd = 3) plot(st_geometry(i), add = TRUE, col = grey(c(.7,.9)), , border = 'green', lwd = 1)
As we see, this gives the areas of intersection, along with the corresponding attributes for both contributing objects, and a warning that attributes were assumed to be spatially constant. Although this may be convenient in some cases, it may be entirely meaningless in others. For instance, in case attribute
b in object
d2 represents the number of people living in
d2, then after the intersection we end up with twice as many people living over a smaller area.
6.5 Area-weighted interpolation
Suppose we want to combine geometries and attributes of two datasets such, that we get attribute values of the first datasets summarised for the geometries of the second. There are various ways we can go for this. The simples one, building on the previous example, would be to obtain for the geometry of
d2 the attribute of
d1 that has the largest overlap with
d2. This is obtained by
st_join(d2, d1, largest = TRUE) #> Warning: attribute variables are assumed to be spatially constant #> throughout all geometries #> Simple feature collection with 1 feature and 2 fields #> geometry type: POLYGON #> dimension: XY #> bbox: xmin: 3 ymin: 2 xmax: 6 ymax: 5 #> epsg (SRID): NA #> proj4string: NA #> b a geom #> 1 4 1 POLYGON ((3 2, 6 2, 6 5, 3 ...
Another option would be to summarise the attribute, e.g. taking its mean, regardless the amount of overlap. This is obtained by
aggregate(d1, d2, mean) #> Simple feature collection with 1 feature and 1 field #> geometry type: POLYGON #> dimension: XY #> bbox: xmin: 3 ymin: 2 xmax: 6 ymax: 5 #> epsg (SRID): NA #> proj4string: NA #> a geometry #> 1 2 POLYGON ((3 2, 6 2, 6 5, 3 ...
A third option is to apply area-weighted interpolation, meaning that we interpolate (average) the variable by taking into account the respective area contributions of overlap (Goodchild and Lam 1980,Do, Thomas-Agnan, and Vanhems (2015a),Do, Thomas-Agnan, and Vanhems (2015b)). This is done e.g. by
d3 = st_sfc(p1 * .75 + c(3, 2), p1 * .75 + c(3,3)) st_interpolate_aw(d1, d3, extensive = FALSE)$a #> Warning in st_interpolate_aw(d1, d3, extensive = FALSE): st_interpolate_aw #> assumes attributes are constant over areas of x #>  1.67 1.67 st_interpolate_aw(d1, d3, extensive = TRUE)$a #> Warning in st_interpolate_aw(d1, d3, extensive = TRUE): st_interpolate_aw #> assumes attributes are constant over areas of x #>  0.625 0.312
6.5.1 Spatially intensive and extensive variables
The difference between the two examples for area-weighted interpolation is how the final weighted sum (value times area of intersection) is normalised: by the target area (extensive), or by the sum of the area covered (intensive,
extensive = FALSE). Spatially intensive variables are variables for which the value, when we split an area, does not principally change. An example might be temperature, elevation, or population density. Spatially extensive variables are variables for which the value is also split, according to the area. Examples are population (amount), or area.
- Add a variable to the
nc$State = "North Carolina". Which value should you attach to this variable for the attribute-geometry relationship (agr)?
- Create a new
sfobject from the geometry obtained by
st_union(nc), and assign
"North Carolina"to the variable
agrcan you now assign to this attribute variable?
st_areato add a variable with name
nc. Compare the
AREAvariables in the
ncdataset. What are the units of
AREA? Are the two linearly related? If there are discrepancies, what could be the cause?
- Is the
areavariable intensive or extensive? Is its agr equal to
- Find the name of the county that contains
- Find the names of all counties with boundaries that touch county
- List the names of all counties that are less than 50 km away from county
Stasch, Christoph, Simon Scheider, Edzer Pebesma, and Werner Kuhn. 2014. “Meaningful Spatial Prediction and Aggregation.” Environmental Modelling & Software 51: 149–65. doi:10.1016/j.envsoft.2013.09.006.
Scheider, Simon, Benedikt Gräler, Edzer Pebesma, and Christoph Stasch. 2016. “Modeling Spatiotemporal Information Generation.” International Journal of Geographical Information Science 30 (10). Taylor & Francis: 1980–2008. doi:10.1080/13658816.2016.1151520.
Goodchild, Michael F, and Nina Siu Ngan Lam. 1980. Areal Interpolation: A Variant of the Traditional Spatial Problem. Department of Geography, University of Western Ontario London, ON, Canada.
Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015a. “Accuracy of Areal Interpolation Methods for Count Data.” Spatial Statistics 14: 412–38. doi:10.1016/j.spasta.2015.07.005.
Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015b. “Spatial Reallocation of Areal Data: A Review.” Rev. Econ. Rég. Urbaine 1/2: 27–58. https://www.tse-fr.eu/sites/default/files/medias/doc/wp/mad/wp_tse_397_v2.pdf.