# Chapter 6 Feature attributes

Feature *attributes* refer to the properties of features (“things”) that do not describe the feature’s geometry. Feature attributes can be *derived* from geometry (e.g. length of a `LINESTRING`

, area of a `POLYGON`

) but they can also refer to completely different properties, such as

- the name of a street or a county,
- the number of people living in a country,
- the type of a road
- the soil type in a polygon from a soil map.
- the opening hours of a shop
- the body weight of an animal
- the NO\(_2\) concentration measured at an air quality monitoring station

Although temporal properties of features are no less fundamental than their spatial properties, the simple feature access standard and consequently the `sf`

package does not give time a similar role as space; more on that in chapter 4.

Most `sf`

objects will contain both geometries and attributes for features. The geometric operations described in the previous chapter (5) operate on geometries *only*, and may occasionally add attributes, but will not modify attributes present.

In all these cases, attribute *values* remain unmodified. At first sight, that looks rather harmless. But if we look into a simple case of replacing a county boundary with a county centroid, as in

```
library(sf)
library(dplyr)
system.file("gpkg/nc.gpkg", package="sf") %>%
read_sf() %>%
st_transform(32119) %>%
select(BIR74, SID74, NAME) %>%
st_centroid() %>%
head(n = 1) -> x # save as x
#> Warning in st_centroid.sf(.): st_centroid assumes attributes are constant
#> over geometries of x
st_geometry(x)[[1]]
#> POINT (385605 3e+05)
```

we receive a warning. This warning is justified for the first two variables shown (total births and number of SID disease cases, 1974) which, as such, are *not associated with* a feature whose geometry is `POINT (385605.4 300303.5)`

. The third variable, `NAME`

is however still the county name for the point indicated, but the point geometry no longer *is* the county geometry.

## 6.1 Attribute-geometry relationships

Changing the feature geometry without changing the feature attributes does change the *feature*, since the feature is characterised by the combination of geometry and attributes. Can we, ahead of time, predict whether the resulting feature will still meaningfully relate to the attribute data when we replace all geometries for instance with their convex hull or centroid? It depends.

Take the example of a road, represented by a `LINESTRING`

, which has an attribute property *road width* equal to 10 m. What can we say about the road width of an arbitray subsectin of this road? That depends on whether the attribute road length describes, for instance the road width everywhere, meaning that road width is constant along the road, or whether it describes an aggregate property, such as minimum or average road width. In case of the minimum, for an arbitrary subsection of the road one could still argue that the minimum road with must be at least as large as the minimum road width for the whole segment, but it may no longer be *the minimum* for that subsection. This gives us two “types” for the attribute-geometry relationship (AGR):

**constant**the attribute value is valid everywhere in or over the geometry**aggregate**the attribute is an aggregate, a summary value over the geometry

For polygon data, typical examples of **constant** AGR are

- land use for a land use polygon
- rock units or geologic strata in a geological map
- soil type in a soil map
- elevation class in a elevation map that shows elevation as classes
- climate zone in a climate zone map

Typical examples for the **aggregate** AGR are

- population, either as number of persons or as population density
- other socio-economic variables, summarised by area
- total emission of pollutants by region
- block mean NO\(_2\) concentrations, as e.g. obtained by block kriging or a dispersion model that predicts areal means

A third type of AGR is that where an attribute identifies a feature geometry. The example above is county `NAME`

: the name identifies the county, and is still the county `NAME`

for any sub-area.

**identity**the attribute value uniquely identifies the geometry as a whole, there are no other geometries with the same value

Arbitrary sub-areas will lose the **identity** property but becomes a **constant** attribute. An example is:

- any point inside a county is still part of the county and must have the same value for county name, but it does not longer represent the (entire) geometry corresponding to that county.

We can specify the AGR of an attribute in an `sf`

object by `st_set_agr`

:

```
nc <- system.file("gpkg/nc.gpkg", package="sf") %>%
read_sf() %>%
st_transform(32119)
nc1 <- nc %>% select(BIR74, SID74, NAME) %>%
st_set_agr(c(BIR74 = "aggregate", SID74 = "aggregate", NAME = "identity"))
```

This helps to get rid of warnings that a particular attribute is assumed to be constant over a geometry, if it already is. The following no longer generates a warning

```
nc1 %>% select(NAME) %>%
st_centroid() %>%
head(1)
#> Simple feature collection with 1 feature and 1 field
#> Attribute-geometry relationship: 1 constant, 0 aggregate, 0 identity
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: 386000 ymin: 3e+05 xmax: 386000 ymax: 3e+05
#> epsg (SRID): 32119
#> proj4string: +proj=lcc +lat_1=36.16666666666666 +lat_2=34.33333333333334 +lat_0=33.75 +lon_0=-79 +x_0=609601.22 +y_0=0 +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs
#> # A tibble: 1 x 2
#> NAME geom
#> <chr> <POINT [m]>
#> 1 Ashe (385605 3e+05)
```

and also changes AGR for `NAME`

from `identity`

to `constant`

when replacing the geometry with the geometry’s centroid:

```
nc1 %>% select(NAME) %>%
st_centroid() %>%
st_agr()
#> NAME
#> constant
#> Levels: constant aggregate identity
```

Identifying attribute-geometry relationships, and warning against their absence is a first and simple implementation of the notion that the types of phenomena we encounter in spatial data science (like objects, fields, and aggregations) are not identified by their geometrical representations (points, lines, polygons, rasters). Making the wrong assumptions here easily leads to meaningless analysis results (Stasch et al. 2014,Scheider et al. (2016)).

## 6.2 Spatial join

Spatial joins are similar to regular (left or inner) joins, where the join criterion is not equality of one or more fields, but a spatial predicate, such as that two records have intersecting geometries. As an example, we can create a join between two tables,

```
a = st_sf(a = 1:2, geom = st_sfc(st_point(c(0,0)), st_point(c(1,1))))
b = st_sf(b = 3:4, geom = st_sfc(st_linestring(rbind(c(2,0), c(0,2))), st_point(c(1,1))))
st_join(a, b)
#> Simple feature collection with 3 features and 2 fields
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: 0 ymin: 0 xmax: 1 ymax: 1
#> epsg (SRID): NA
#> proj4string: NA
#> a b geom
#> 1 1 NA POINT (0 0)
#> 2 2 3 POINT (1 1)
#> 2.1 2 4 POINT (1 1)
st_join(a, b, left = FALSE)
#> Simple feature collection with 2 features and 2 fields
#> geometry type: POINT
#> dimension: XY
#> bbox: xmin: 1 ymin: 1 xmax: 1 ymax: 1
#> epsg (SRID): NA
#> proj4string: NA
#> a b geom
#> 2 2 3 POINT (1 1)
#> 2.1 2 4 POINT (1 1)
st_join(b, a)
#> Simple feature collection with 2 features and 2 fields
#> geometry type: GEOMETRY
#> dimension: XY
#> bbox: xmin: 0 ymin: 0 xmax: 2 ymax: 2
#> epsg (SRID): NA
#> proj4string: NA
#> b a geom
#> 1 3 2 LINESTRING (2 0, 0 2)
#> 2 4 2 POINT (1 1)
```

We see that unless `left = FALSE`

, we get all elements (and geometries) from the first argument, augmented with fields of the second argument when geometries match. The example shows the case where there are two geometries matching to point (1,1). The spatial join predicate function can be freely chosen, e.g. from the binary predicates listed in section 5.1.2.

When we match two sets of polygons, it may be a bit of a mess to go through all the many matches. One way out of this is to only provide the match with the largest overlap with the target geometry, obtained by adding argument `largest = TRUE`

. An example of this is shown (visually) in figure 6.1.

## 6.3 Aggregate and summarise

Package `sf`

provides `sf`

methods for `stats::aggregate`

and `dplyr::summarise`

. Both do essentially the same:

- given a grouping predicate (for
`summarise`

, obtained from`group_by`

) - given an aggregation function
- aggregate the selected attributes using this function, per group
- aggregate in addition the geometries.
- if
`do_union`

is`TRUE`

(the default), union the aggregated geometries.

Unioning aggregated geometries dissolves for instance internal polygon boundaries, which otherwise would lead to invalid `MULTIPOLYGON`

errors in subsequent analysis, or plotting of potentially unwanted internal polygon boundaries. Figure 6.2 illustrates this.

## 6.4 Intersections

Suppose we have two datasets with different geometries and attributes (left figure 6.3), and we want to compute their intersections:

```
p1 = st_polygon(list(rbind(c(0,0), c(4,0), c(4,4), c(0,4), c(0,0))))
d1 = st_sf(a = c(3,1), geom = st_sfc(p1, p1 + c(4, 0)))
d2 = st_sf(b = c(4), geom = st_sfc(p1 * .75 + c(3, 2)))
```

What will the intersection of these two objects give?

```
(i = st_intersection(d1, d2))
#> Warning: attribute variables are assumed to be spatially constant
#> throughout all geometries
#> Simple feature collection with 2 features and 2 fields
#> geometry type: POLYGON
#> dimension: XY
#> bbox: xmin: 3 ymin: 2 xmax: 6 ymax: 4
#> epsg (SRID): NA
#> proj4string: NA
#> a b geom
#> 1 3 4 POLYGON ((3 4, 4 4, 4 2, 3 ...
#> 2 1 4 POLYGON ((4 2, 4 4, 6 4, 6 ...
```

```
plot(d1, xlim = c(0,8), ylim = c(0, 6), col = NA, border = 1, reset = FALSE)
plot(d2, col = NA, border = 'red', add = TRUE, lwd = 2)
plot(d1, xlim = c(0,8), ylim = c(0, 6), col = NA, border = 1, lwd = 2, reset = FALSE)
plot(d2, col = NA, border = 'red', add = TRUE, lwd = 3)
plot(st_geometry(i), add = TRUE, col = grey(c(.7,.9)), , border = 'green', lwd = 1)
```

As we see, this gives the areas of intersection, along with the corresponding attributes for both contributing objects, and a warning that attributes were assumed to be spatially constant. Although this may be convenient in some cases, it may be entirely meaningless in others. For instance, in case attribute `b`

in object `d2`

represents the number of people living in `d2`

, then after the intersection we end up with twice as many people living over a smaller area.

As seen in section 5.5, computing intersections easily leads to errors caused by invalid geometries. Setting precision (section 5.4) may prevent this.

## 6.5 Area-weighted interpolation

Suppose we want to combine geometries and attributes of two datasets such, that we get attribute values of the first datasets summarised for the geometries of the second. There are various ways we can go for this. The simples one, building on the previous example, would be to obtain for the geometry of `d2`

the attribute of `d1`

that has the largest overlap with `d2`

. This is obtained by

```
st_join(d2, d1, largest = TRUE)
#> Warning: attribute variables are assumed to be spatially constant
#> throughout all geometries
#> Simple feature collection with 1 feature and 2 fields
#> geometry type: POLYGON
#> dimension: XY
#> bbox: xmin: 3 ymin: 2 xmax: 6 ymax: 5
#> epsg (SRID): NA
#> proj4string: NA
#> b a geom
#> 1 4 1 POLYGON ((3 2, 6 2, 6 5, 3 ...
```

Another option would be to summarise the attribute, e.g. taking its mean, regardless the amount of overlap. This is obtained by

```
aggregate(d1, d2, mean)
#> Simple feature collection with 1 feature and 1 field
#> geometry type: POLYGON
#> dimension: XY
#> bbox: xmin: 3 ymin: 2 xmax: 6 ymax: 5
#> epsg (SRID): NA
#> proj4string: NA
#> a geometry
#> 1 2 POLYGON ((3 2, 6 2, 6 5, 3 ...
```

A third option is to apply area-weighted interpolation, meaning that we interpolate (average) the variable by taking into account the respective area contributions of overlap (Goodchild and Lam 1980,Do, Thomas-Agnan, and Vanhems (2015a),Do, Thomas-Agnan, and Vanhems (2015b)). This is done e.g. by

```
d3 = st_sfc(p1 * .75 + c(3, 2), p1 * .75 + c(3,3))
st_interpolate_aw(d1, d3, extensive = FALSE)$a
#> Warning in st_interpolate_aw(d1, d3, extensive = FALSE): st_interpolate_aw
#> assumes attributes are constant over areas of x
#> [1] 1.67 1.67
st_interpolate_aw(d1, d3, extensive = TRUE)$a
#> Warning in st_interpolate_aw(d1, d3, extensive = TRUE): st_interpolate_aw
#> assumes attributes are constant over areas of x
#> [1] 0.625 0.312
```

### 6.5.1 Spatially intensive and extensive variables

The difference between the two examples for area-weighted interpolation is how the final weighted sum (value times area of intersection) is normalised: by the target area (extensive), or by the sum of the area covered (intensive, `extensive = FALSE`

). Spatially intensive variables are variables for which the value, when we split an area, does not *principally* change. An example might be temperature, elevation, or population density. Spatially extensive variables are variables for which the value is also split, according to the area. Examples are population (amount), or area.

## 6.6 Exercises

- Add a variable to the
`nc`

dataset by`nc$State = "North Carolina"`

. Which value should you attach to this variable for the attribute-geometry relationship (agr)? - Create a new
`sf`

object from the geometry obtained by`st_union(nc)`

, and assign`"North Carolina"`

to the variable`State`

. Which`agr`

can you now assign to this attribute variable? - Use
`st_area`

to add a variable with name`area`

to`nc`

. Compare the`area`

and`AREA`

variables in the`nc`

dataset. What are the units of`AREA`

? Are the two linearly related? If there are discrepancies, what could be the cause? - Is the
`area`

variable intensive or extensive? Is its agr equal to`constant`

,`identity`

or`aggregate`

? - Find the name of the county that contains
`POINT(-78.34046 35.017)`

- Find the names of all counties with boundaries that touch county
`Sampson`

. - List the names of all counties that are less than 50 km away from county
`Sampson`

.

### References

Stasch, Christoph, Simon Scheider, Edzer Pebesma, and Werner Kuhn. 2014. “Meaningful Spatial Prediction and Aggregation.” *Environmental Modelling & Software* 51: 149–65. doi:10.1016/j.envsoft.2013.09.006.

Scheider, Simon, Benedikt Gräler, Edzer Pebesma, and Christoph Stasch. 2016. “Modeling Spatiotemporal Information Generation.” *International Journal of Geographical Information Science* 30 (10). Taylor & Francis: 1980–2008. doi:10.1080/13658816.2016.1151520.

Goodchild, Michael F, and Nina Siu Ngan Lam. 1980. *Areal Interpolation: A Variant of the Traditional Spatial Problem*. Department of Geography, University of Western Ontario London, ON, Canada.

Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015a. “Accuracy of Areal Interpolation Methods for Count Data.” *Spatial Statistics* 14: 412–38. doi:10.1016/j.spasta.2015.07.005.

Do, Van Huyen, Christine Thomas-Agnan, and Anne Vanhems. 2015b. “Spatial Reallocation of Areal Data: A Review.” *Rev. Econ. Rég. Urbaine* 1/2: 27–58. https://www.tse-fr.eu/sites/default/files/medias/doc/wp/mad/wp_tse_397_v2.pdf.