
Acquire and visualise OSM data with R

This workshop was originally presented at State of the Map 2021.

Introduction

As an interactive, open-source, high-level programming language designed for data science, R is a great fit for dealing with OpenStreetMap data and introducing beginners to reproducible scientific computing. In this workshop, you will learn how you can use a variety of R packages that are ideal to:

  • acquire relevant OSM data in a vector format
  • clean and process the data
  • display it in static and interactive visualisations
  • share a visualisation with others
  • export the data to other programs for further editing

Most data acquisition, cleaning, processing, visualising and exporting can be done by writing an R script. For further refinement of an output, other point-and-click programs (like QGIS) might be required, but storing the bulk of the process in a script allows us to easily reapply the same process to an updated dataset, share it with others so they can learn from it, and provide supporting evidence when publishing results.

Basic knowledge of R is recommended to follow along with this workshop.

If you want to run the commands on your own computer and learn by doing, please make sure you have a recent version of R installed (RStudio is recommended), along with the packages used below: osmdata, dplyr, sf and tmap.
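If needed, you can install all of them from CRAN in one go:

install.packages(c("osmdata", "dplyr", "sf", "tmap"))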

Outline

In this presentation, we focus on three main packages:

  • Acquire OSM data with osmdata
  • Massage data with sf and other packages
  • Visualise data with tmap and other packages

Some important shortcuts for RStudio

  • function or dataset help: press F1 with your cursor anywhere in a function name.
  • execute from script: Ctrl + Enter
  • assignment operator (<-): Alt + -
  • pipe (|> since R 4.1): Ctrl + Shift + M

Acquire data

We want to acquire data about libraries around Meanjin (Brisbane), in Australia. The format we choose is called “simple features”: a convenient spatial format in R, for vector data and associated attributes.

The osmdata package allows us to build Overpass queries to download the data we are interested in. The opq() function (for “overpass query”) can be used to define the bounding box (with Nominatim), and after the pipe operator |>, we can use add_osm_feature() to define the relevant tag.

# load the package
library(osmdata)
# build a query
query <- opq(bbox = "meanjin, australia") |> 
  add_osm_feature(key = "amenity", value = "library")

Now that the query is designed, we can send it to Overpass, get the data, and store it as a collection of sf objects with osmdata_sf():

libraries <- osmdata_sf(query)

Structure of the object

We end up with a list containing different kinds of vector data, because OSM contains libraries coded as both points and polygons.
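Printing the object in the console gives an overview of its components (the exact counts will depend on when you run the query):

libraries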

In this object, we can see that libraries in the Meanjin area are stored in OSM as both polygons and points. We can do operations on specific parts of the object, for example to count how many features each one contains:

nrow(libraries$osm_points)
## [1] 453
nrow(libraries$osm_polygons)
## [1] 42

Now, let’s see the attributes available for libraries coded as points:

names(libraries$osm_points)
##  [1] "osm_id"                 "name"                   "access"                
##  [4] "addr.city"              "addr.country"           "addr.full"             
##  [7] "addr.housenumber"       "addr.place"             "addr.postcode"         
## [10] "addr.state"             "addr.street"            "addr.suburb"           
## [13] "air_conditioning"       "alt_name"               "amenity"               
## [16] "attribution"            "contact.email"          "contact.phone"         
## [19] "contact.website"        "description"            "entrance"              
## [22] "internet_access"        "internet_access.fee"    "level"                 
## [25] "note"                   "opening_hours"          "operator"              
## [28] "operator.type"          "phone"                  "source"                
## [31] "source.date"            "source.url"             "source_ref"            
## [34] "toilets"                "toilets.wheelchair"     "website"               
## [37] "wheelchair"             "wheelchair.description" "geometry"

You can also explore the dataframe by viewing it:

View(libraries$osm_points)

You can find the OSM identifiers in the first column, osm_id. This is what you can use if you want to see the corresponding object on the OSM website. For example, the Indooroopilly Library can be found by appending its ID 262317826 to the URL https://www.openstreetmap.org/node/.
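You could even build these URLs directly in R; a minimal sketch, using a hypothetical helper that assumes the features are nodes:

# hypothetical helper: build the OSM website URL for a node ID
osm_node_url <- function(id) {
  paste0("https://www.openstreetmap.org/node/", id)
}
osm_node_url("262317826")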

453 features seems a bit high… Did you notice that there is a lot of missing data? Many points have no tags at all, not even amenity=library. It looks like many of these points are the vertices that make up polygons, rather than libraries stored as single points.

Cleaning up

We can remove the points that we are not interested in, with the package dplyr and its function filter():

library(dplyr)
library(sf) # for the filter.sf method
point_libs <- libraries$osm_points |> 
  filter(amenity == "library")
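
A quick count confirms that the cleanup worked:

nrow(point_libs)
## [1] 39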

39 libraries sounds more realistic!

From polygons to centroids

We have two kinds of vector data: points in point_libs, and polygons in libraries$osm_polygons. We want to process and visualise all libraries in one go, so we should convert everything to points.

The sf package provides many tools to process spatial vector data. One of them is st_centroid(), to get the centre point of a polygon.

library(sf)
centroid_libs <- libraries$osm_polygons |> 
  st_centroid()

Note: most function names in the sf package will start with st_, which stands for “space and time”. The plan is to integrate time features in the package.

Now that we have centroids, we can put them in the same object as the other points with dplyr’s bind_rows() function:

all_libs <- bind_rows(centroid_libs, point_libs)

bind_rows() is happy to deal with different numbers of columns and will insert NAs as needed. We end up with a total of 57 attribute columns.
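You can check the dimensions of the merged object yourself (a quick sketch; note that the geometry column counts as a column too):

# rows are features, columns are attributes plus the geometry column
dim(all_libs)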

You might have to investigate the validity of the data further, for example by finding out why some libraries don't have a name. It could be that the name is simply missing, or that the library tag is duplicated between a point and a polygon.

To see only the IDs of the features that don't have a name:

all_libs |>
  filter(is.na(name)) |> 
  pull(osm_id)
## [1] "47415319"   "146583077"  "330599902"  "7416188347"

Reproject the data

Currently, the Coordinate Reference System (CRS) in use is EPSG:4326 - WGS 84.

st_crs(all_libs)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

This geographic CRS is good for global data, but given that our data is localised, and that we might want to measure relevant distances, we should reproject the data to a local projected CRS. A good choice for the area around Brisbane/Meanjin is “EPSG:7856 - GDA2020 / MGA zone 56”.

# this is a first step that might not be needed,
# but will avoid an error in case of GDAL version mismatch
st_crs(all_libs) <- 4326 
all_libs <- st_transform(all_libs, crs = 7856)
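
We can confirm that the transformation worked by querying the EPSG code of the new CRS:

# should now report the projected CRS we requested
st_crs(all_libs)$epsg
## [1] 7856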

If you want to learn more about the static datum GDA2020 (for “Geocentric Datum of Australia 2020”), an upgrade from the previous, less precise GDA94, head to the ICSM website.

Visualise

The default plotting method for an sf object will create facets for the first few columns of attributes.

plot(all_libs)

If you just want to have a quick look at the spread of the features on a single plot, you can extract the geometry first:

st_geometry(all_libs) |>
  plot()

To put the points on a map, a great package is tmap, which can create interactive maps:

library(tmap)
tmap_mode("view")
tm_shape(all_libs) +
  tm_sf()

tm_sf() is useful to quickly visualise an sf object, which can actually contain a combination of points, lines and polygons.

It might be a good idea to indicate the area of study by also displaying the bounding polygon that was used to get the data.

# get a bounding box (we have to extract the multipolygon)
meanjin <- getbb("meanjin, australia", format_out = "sf_polygon")$multipolygon
# also fix the CRS, if needed
st_crs(meanjin) <- 4326 
meanjin <- st_transform(meanjin, crs = 7856)
# layer the polygon below the points
tm_shape(meanjin) +
  tm_borders() + # easy way to draw a polygon without a fill
tm_shape(all_libs) +
  tm_dots() # just for point geometries
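
Finally, coming back to the goals listed in the introduction, we can share the visualisation and export the data for further editing. A minimal sketch, assuming output file names of our choosing:

# store the map in an object so it can be saved
lib_map <- tm_shape(meanjin) +
  tm_borders() +
  tm_shape(all_libs) +
  tm_dots()
# save the interactive map as a standalone HTML file to share with others
tmap_save(lib_map, "libraries_map.html")
# export the combined point data to a GeoPackage, which QGIS can open
st_write(all_libs, "libraries.gpkg")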