Chapter 2 Spatial data manipulation in R

Learning Objectives

  • Join attribute data to a polygon vector file
  • Reproject a vector file
  • Select polygons of a vector by location
  • Reproject a raster
  • Perform a raster calculation

There are a wide variety of spatial, topological, and attribute data operations you can perform with R. Lovelace et al’s recent publication2 goes into great depth about this and is highly recommended.

In this section we will look at a few examples for libraries and commands that allow us to process spatial data in R and perform a few commonly used operations.

2.1 Attribute Join

An attribute join on vector data brings tabular data into a geographic context. It refers to the process of joining data in tabular format to data in a format that holds the geometries (polygon, line, or point).

If you have done attribute joins of shapefiles in GIS software like ArcGIS or QGis you know that you need a unique identifier in both the attribute table of the shapefile and the table to be joined.

First we will load the CSV table PhiladelphiaEduAttain.csv into a dataframe in R and name it ph_edu.

ph_edu <- read_csv("data/PhiladelphiaEduAttain.csv")
#> Rows: 384 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr  (1): NAME
#> dbl (12): GEOID, fem_bachelor, fem_doctorate, fem_highschool, fem_noschool, ...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
names(ph_edu)

If you don’t have the object still loaded read the the PhillyTotalPopHHinc shapefile into an object named philly_sf. Check out the column names of philly_sf and of ph_edu to determine which one might contain the unique identifier for the join.

# if you need to read in again:
# philly_sf <- st_read("data/Philly/")
names(philly_sf)

To join the ph_edu data frame with philly_sf we can use merge like this:

philly_sf_merged <- left_join(philly_sf, ph_edu, by = c("GEOID10" = "GEOID"))
names(philly_sf_merged) 
#>  [1] "STATEFP10"       "COUNTYFP10"      "TRACTCE10"       "GEOID10"        
#>  [5] "NAME10"          "NAMELSAD10"      "MTFCC10"         "FUNCSTAT10"     
#>  [9] "ALAND10"         "AWATER10"        "INTPTLAT10"      "INTPTLON10"     
#> [13] "GISJOIN"         "Shape_area"      "Shape_len"       "medHHinc"       
#> [17] "totalPop"        "NAME"            "fem_bachelor"    "fem_doctorate"  
#> [21] "fem_highschool"  "fem_noschool"    "fem_ovr_25"      "male_bachelor"  
#> [25] "male_doctorate"  "male_highschool" "male_noschool"   "male_ovr_25"    
#> [29] "pop_ovr_25"      "geometry"

We see the new attribute columns added, as well as the geometry column.

2.2 Topological Subsetting: Select Polygons by Location

For the next example our goal is to select all Philadelphia census tracts that are approximately within a range of 2 kilometers from the city center.

Think about this for a moment – what might be the steps you’d follow?

## How about:

# 1. Get the census tract polygons.
# 2. Find the Philadelphia city center coordinates.
# 3. Create a buffer around the city center point.
# 4. Select all census tract polygons that intersect with the center buffer

We will use philly_sf for the census tract polygons.

In addition, we need to create a sf Point object with the Philadelphia city center coordinates:

\[x = 1750160\] \[y = 467499.9\]

These coordinates are also in the USA Contiguous Albers Equal Area Conic projected CRS, which is the same as CRS as philly_sf.

With this information, we create a object that holds the coordinates of the city center. Since we don’t have attributes we will just create it as a simple feature collection, scf.

# if you need to read in again:
# philly_sf <- st_read("data/Philly/", quiet = T)

# make a simple feature point with CRS
philly_ctr <- 
  st_point(c(1750160, 467499.9)) %>% # point coordinates
  st_sfc(crs = st_crs(philly_sf))  # create feature collection

For the spatial operations we can recur to the suite of geometric operations that come with the sf package.

We create a 2km buffer around the city center point:

philly_buf <- st_buffer(philly_ctr, 2000)

Ok. Now we can use that buffer to select all census tract polygons that intersect with the center buffer. The sf package has its own st_filter function that we can use here.

philly_sel <- st_filter(philly_sf, philly_buf)

Then we can plot what we created like this:

plot(st_geometry(philly_sf), border="#aaaaaa", main="Census tracts that overlap with 2km buffer\naround city center")
plot(st_geometry(philly_sel), add=T, col="red")
plot(st_geometry(philly_buf), add=T, lwd = 2)

Note the difference to st_intersection, which performs an actual clippin operation and creates an sfc object that cuts out the area of the buffer from the census polygons:

philly_intersection <- st_intersection(philly_buf, philly_sf)
philly_intersection
#> Geometry set for 46 features 
#> Geometry type: GEOMETRY
#> Dimension:     XY
#> Bounding box:  xmin: 1748160 ymin: 465499.9 xmax: 1752160 ymax: 469499.9
#> Projected CRS: Albers
#> First 5 geometries:
#> POLYGON ((1752157 467395.2, 1752153 467339.7, 1...
#> POLYGON ((1752149 467290.8, 1752135 467187, 175...
#> POLYGON ((1751347 466573.5, 1751321 467037.6, 1...
#> POLYGON ((1750298 467616.7, 1750148 467615.2, 1...
#> POLYGON ((1748853 467255, 1748874 467605.3, 174...
plot(st_geometry(philly_sf), border="#aaaaaa", main="Census tracts around city center,\nclipped by 2km buffer ")
plot(philly_intersection, add=T, lwd = 2, border = "red")

2.3 Reprojecting

Occasionally you may have to change the coordinates of your spatial object into a new Coordinate Reference System (CRS). Functions to transform, or reproject spatial objects typically take the following two arguments:

  • the spatial object to reproject
  • a CRS object with the new projection definition

You can reproject

  • a sf object with st_transform()
  • a SpatRaster object with project()

The perhaps trickiest part here is to determine the definition of the projection, which needs to be a character string in proj4 format. You can look it up online. For example for UTM zone 33N (EPSG:32633) the string would be:

+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

You can retrieve the CRS:

  • from an sf object with st_crs()
  • from a SpatRaster object with crs()

Let us go back to the "PhillyHomicides" shapefile we exported earlier. Let’s read it back in and reproject it so it matches the projection of the Philadelphia Census tracts.

Now let us check the CRS for both files.

#If you need to read the file back in:
#philly_homicides_sf <- st_read("data/PhillyHomicides/")

st_crs(philly_sf)$proj4string
#> [1] "+proj=aea +lat_0=37.5 +lon_0=-96 +lat_1=29.5 +lat_2=45.5 +x_0=0 +y_0=0 +ellps=GRS80 +units=m +no_defs"
st_crs(philly_homicides_sf)$proj4string
#> [1] "+proj=longlat +datum=WGS84 +no_defs"

We see that the CRS are different: we have +proj=aea... and +proj=longlat.... AEA refers to USA Contiguous Albers Equal Area Conic which is a projected coordinate system with numeric units. We will need this below for our spatial operations, so we will make sure both files are in that same CRS.

We use st_transform and assign the result to a new object. Note how we also use str_crs to extract the projection definition from philly_sf, so we don’t have to type it out.

philly_homicides_sf_aea <- st_transform(philly_homicides_sf, st_crs(philly_sf))

We can use the range() command from the R base package to compare the coordinates before and after reprojection and confirm that we actually have transformed them. range() returns the min and max value of a vector of numbers.

range(st_coordinates(philly_homicides_sf))
#> [1] -75.26809  40.13086
range(st_coordinates(philly_homicides_sf_aea))
#> [1]  457489.7 1763671.8

We can also compare them visually with:

par(mfrow=c(1,2)) 
plot(st_geometry(philly_homicides_sf), axes=TRUE, main = "before transform - latlon")
plot(st_geometry(philly_homicides_sf_aea), axes=TRUE, main = "after transform - aea")

Lastly, let us save the reprojected file as PhillyHomicides_aea shapefile, as we will use it later on.

st_write(philly_homicides_sf_aea, "data/PhillyHomicides_aea", driver = "ESRI Shapefile")

2.3.1 Raster reprojection

Here is what it would look like to reproject the HARV raster used earlier to a WGS84 projection. We see that see that the original projection is in UTM.

# if you need to load again:
#HARV <- raster("data/HARV_RGB_Ortho.tif")
crs(HARV, proj = TRUE)
#> [1] "+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs"
HARV_WGS84 <- project(HARV, "+init=epsg:4326")
crs(HARV_WGS84, proj = TRUE)
#> [1] "+proj=longlat +datum=WGS84 +no_defs"

Let’s look at the coordinates to see the effect:

ext(HARV)
#> SpatExtent : 731998.5, 732766.75, 4712956.25, 4713535.5 (xmin, xmax, ymin, ymax)
ext(HARV_WGS84)
#> SpatExtent : -72.1750316832584, -72.1654516638594, 42.5339461813323, 42.5393881837189 (xmin, xmax, ymin, ymax)

Due to the reprojection the number of cells has also changed:

ncell(HARV)
#> [1] 7120141
ncell(HARV_WGS84)
#> [1] 6859650

And here is the visual proof:

plot(HARV, main = "before transform - UTM")

plot(HARV_WGS84, main = "after transform - WGS84")

2.4 Spatial Aggregation: Points in Polygons

Now that we have both homicides and census tracts in the same projection we will forge ahead and ask for the density of homicides for each census tract in Philadelphia: \(\frac{{homicides}}{area}\)

To achieve this this we join the points of homicide incidence to the census tract polygon and count them up for each polygon. You might be familiar with this operation from other GIS packages.

We will use piping and build up our object in the following way. First we calculate the area for each tract. We use the st_area function on the geometry column and add the result.

philly_sf %>% 
  mutate(tract_area = st_area(geometry)) %>% 
  head()

Next, we use st_join to perform a spatial join with the points:

philly_sf %>% 
  mutate(tract_area = st_area(geometry)) %>% 
  st_join(philly_homicides_sf_aea) %>%
  head()

Now we can group by a variable that uiquely identifies the census tracts, (we choose GEOID10) and use summarize to count the points for each tract and calculate the homicide rate. Since our units are in sq meter we multiply by by 1000000 to get sq km. We also need to carry over the area, which I do using unique.

We also assign the output to a new object philly_crimes_sf.

philly_crimes_sf <- philly_sf %>%
      mutate(tract_area = st_area(geometry)) %>%
      st_join(philly_homicides_sf_aea) %>%
      group_by(GEOID10) %>%
      summarize(n_homic = n(),
                tract_area = unique(tract_area),
                homic_rate = as.numeric(1e6 * (n_homic/tract_area))) 

Finally, we write this out for later:

st_write(philly_crimes_sf, "data/PhillyCrimerate", driver = "ESRI Shapefile")

2.5 Raster calculations with terra

We often want to perform calculations on two or more rasters to create a new output raster. For example, if we are interested in mapping the heights of trees across an entire field site, we might want to calculate the difference between the Digital Surface Model (DSM, tops of trees) and the Digital Terrain Model (DTM, ground level). The resulting dataset is referred to as a Canopy Height Model (CHM) and represents the actual height of trees, buildings, etc. with the influence of ground elevation removed.

First let’s read in the two datasets.

HARV_DTM <- rast("data/HARV_dtmCrop.tif")
HARV_DSM <- rast("data/HARV_dsmCrop.tif")

Now we can subtract the DTM from the DSM to create a Canopy Height Model. It will for each CHM pixel calculate the difference of the respective DTM and DSM pixels.

HARV_CHM <- HARV_DSM - HARV_DTM
par(mfrow = c(1, 2))
plot(HARV_CHM)
hist(HARV_CHM)
#> Warning: [hist] a sample of 43% of the cells was used (of which 0% was NA)

This works fine for the small rasters in this tutorial. However, the calculation above becomes less efficient when computations are more complex or file sizes become large.

Thet terra package contains a function called lapp()function to make processing more efficient. It takes two or more rasters and applies a function to them. The generic syntax is:

outputRaster <- lapp(x, fun)

where x is a SpatRasterDataset and fun is a custom function for the operation we want to perform.

CHM_ov_HARV <- lapp(sds(list(HARV_DSM, HARV_DTM)), # use sds to create a SpatRasterDataset
                    fun = function(r1, r2) { 
                      return( r1 - r2) 
                      })

As arguments for our lapp operation we use the sds() function and provide it with the list of rasters that we want to operate on. As custom function we provide the function with two arguments (r1 and r1) that subtracts the second (r2) from the first (r1) and returns the difference. The output of lapp is a SpatRaster and we assign it to a new variable CHM_ov_HARV.

The two rasters should be the same.

identical(HARV_CHM, CHM_ov_HARV)
#> [1] TRUE

  1. Lovelace, R., Nowosad, J., & Muenchow, J. (2024). Geocomputation with R. CRC Press.↩︎