This workshop was originally presented at State of the Map 2021.
As an interactive, open-source, high-level programming language designed for data science, R is a great fit for dealing with OpenStreetMap data and introducing beginners to reproducible scientific computing. In this workshop, you will learn how you can use a variety of R packages that are ideal to:
- acquire relevant OSM data in a vector format
- clean and process the data
- display it in static and interactive visualisations
- share a visualisation with others
- export the data to other programs for further editing
Most data acquisition, cleaning, processing, visualising and exporting can be done writing an R script. For further refinement of an output, other point-and-click programs (like QGIS) might be required, but storing the bulk of the process in a script allows us to easily reapply the same process on an updated dataset, share the process with others so they can learn from it, and provide supporting evidence when publishing results.
Basic knowledge of R is recommended to follow along with this workshop.
If you want to run the commands on your own computer and learn by doing, please make sure you have:
In this presentation, we focus on three main packages:
Some important shortcuts for RStudio
- function or dataset help: press F1 with your cursor anywhere in a function name.
- execute from script: Ctrl + Enter
- assignment operator (`<-`): Alt + -
- pipe (`|>`, since R 4.1): Ctrl + Shift + M
We want to acquire data about libraries around Meanjin (Brisbane), in Australia. The format we choose is called “simple features”: a convenient spatial format in R, for vector data and associated attributes.
The `osmdata` package allows us to build Overpass queries to download the data we are interested in. The `opq()` function (for "overpass query") can be used to define the bounding box (with Nominatim), and after the pipe operator `|>`, we can use `add_osm_feature()` to define the relevant tag.
```r
# load the package
library(osmdata)
# build a query
query <- opq(bbox = "meanjin, australia") |>
  add_osm_feature(key = "amenity", value = "library")
```
Now that the query is designed, we can send it to Overpass, get the data, and store it as an `sf` object:

```r
libraries <- osmdata_sf(query)
```
Structure of object
We end up with a list of different kinds of vector data, because OSM contains libraries coded as both points and polygons:
In this object, we can see that most libraries in the Meanjin area are stored in OSM as polygons and points. We can do operations on specific parts of the list, for example to see how many features there are in each element:
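These counts were likely produced by applying `nrow()` to each element of the list (a reconstruction, not necessarily the original code):

```r
# count the features stored as points and as polygons
nrow(libraries$osm_points)
nrow(libraries$osm_polygons)
```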
```
## 453
## 42
```
Now, let’s see the attributes available for libraries coded as points:
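A sketch of how to list them, assuming the base R `names()` function was used:

```r
# list the attribute columns of the point features
names(libraries$osm_points)
```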
```
## "osm_id"            "name"                "access"
## "addr.city"         "addr.country"        "addr.full"
## "addr.housenumber"  "addr.place"          "addr.postcode"
## "addr.state"        "addr.street"         "addr.suburb"
## "air_conditioning"  "alt_name"            "amenity"
## "attribution"       "contact.email"       "contact.phone"
## "contact.website"   "description"         "entrance"
## "internet_access"   "internet_access.fee" "level"
## "note"              "opening_hours"       "operator"
## "operator.type"     "phone"               "source"
## "source.date"       "source.url"          "source_ref"
## "toilets"           "toilets.wheelchair"  "website"
## "wheelchair"        "wheelchair.description" "geometry"
```
You can also explore the dataframe by viewing it:
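In RStudio, for example (an illustrative call, not part of the original script):

```r
# open the attribute table in RStudio's interactive viewer
View(libraries$osm_points)
```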
You can find the OSM identifiers in the first column, `osm_id`. This is what you can use if you want to see the corresponding object on the OSM website. For example, the Indooroopilly Library can be found by putting its ID `262317826` at the end of the URL.
453 features seems a bit high… Notice that there is a lot of missing data? Many points have no tags, not even `amenity=library`. It looks like many of these points are part of polygons, rather than libraries stored as single points.
We can remove the points that we are not interested in with the package `dplyr` and its function `filter()`:

```r
library(dplyr)
library(sf) # for the filter.sf method
point_libs <- libraries$osm_points |>
  filter(amenity == "library")
```
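You can verify the new count with `nrow()`:

```r
# how many point features are left after filtering?
nrow(point_libs)
```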
39 libraries sounds more realistic!
From polygons to centroids
We have two kinds of vector data: points in `point_libs`, and polygons in `libraries$osm_polygons`. We want to process and visualise all libraries in one go, so we should convert everything to points.
The `sf` package provides many tools to process spatial vector data. One of them is `st_centroid()`, which returns the centre point of a polygon.
```r
library(sf)
centroid_libs <- libraries$osm_polygons |>
  st_centroid()
```
Note: most function names in the `sf` package start with `st_`, which stands for "space and time". The plan is to eventually integrate time features into the package.
Now that we have centroids, we can put them in the same object as the other points with `bind_rows()`:

```r
all_libs <- bind_rows(centroid_libs, point_libs)
```
`bind_rows()` is happy to deal with different numbers of columns and will insert `NA`s as needed. We end up with a total of 57 attribute columns.
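You can check the dimensions of the combined object yourself, for example with `ncol()`:

```r
# number of columns in the combined sf object
ncol(all_libs)
```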
You might have to investigate the validity of the data further, for example finding out why some libraries don’t have a name. It could be that the name is simply missing, or that the library tag is duplicated between a point and a polygon.
To only see the ID of the features that don’t have a name:
```r
all_libs |>
  filter(is.na(name)) |>
  pull(osm_id)
```
```
## "47415319"  "146583077"  "330599902"  "7416188347"
```
Reproject the data
Currently, the Coordinate Reference System (CRS) in use is EPSG:4326 - WGS 84.
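The CRS of an `sf` object can be printed with `st_crs()`:

```r
# inspect the coordinate reference system
st_crs(all_libs)
```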
```
## Coordinate Reference System:
##   User input: EPSG:4326
##   wkt:
## GEOGCRS["WGS 84",
##     ENSEMBLE["World Geodetic System 1984 ensemble",
##         MEMBER["World Geodetic System 1984 (Transit)"],
##         MEMBER["World Geodetic System 1984 (G730)"],
##         MEMBER["World Geodetic System 1984 (G873)"],
##         MEMBER["World Geodetic System 1984 (G1150)"],
##         MEMBER["World Geodetic System 1984 (G1674)"],
##         MEMBER["World Geodetic System 1984 (G1762)"],
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ENSEMBLEACCURACY[2.0]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["Horizontal component of 3D system."],
##         AREA["World."],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
```
This Geographic Reference System is good for global data, but given that our data is localised, and that we might want to talk about relevant distances, we should reproject the data to a local Projected Reference System (PRS). A good PRS for around Brisbane/Meanjin is “EPSG:7856 - GDA2020 / MGA zone 56”.
```r
# this is a first step that might not be needed,
# but will avoid an error in case of GDAL version mismatch
st_crs(all_libs) <- 4326
all_libs <- st_transform(all_libs, crs = 7856)
```
If you want to learn more about the static datum GDA2020 (for “Geocentric Datum of Australia 2020”), an upgrade from the previous, less precise GDA94, head to the ICSM website.
The default plotting method for an `sf` object will create facets for the first few columns of attributes.
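For example, a quick call to base R's `plot()`:

```r
# plots small multiples, one facet per attribute column (up to a limit)
plot(all_libs)
```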
If you just want to have a quick look at the spread of the features on a single plot, you can extract the geometry first:
```r
st_geometry(all_libs) |>
  plot()
```
To put the points on a map, a great package is `tmap`, which can create interactive maps:

```r
library(tmap)
tmap_mode("view")
tm_shape(all_libs) +
  tm_sf()
```
`tm_sf()` is useful to quickly visualise an `sf` object, which can actually contain a combination of points, lines and polygons.
It might be a good idea to show the area of study by also showing the bounding polygon that was used to get the data.
```r
# get a bounding box (we have to extract the multipolygon)
meanjin <- getbb("meanjin, australia", format_out = "sf_polygon")$multipolygon
# also fix the CRS, if needed
st_crs(meanjin) <- 4326
meanjin <- st_transform(meanjin, crs = 7856)
# layer the polygon below the points
tm_shape(meanjin) +
  tm_borders() + # easy way to draw a polygon without a fill
tm_shape(all_libs) +
  tm_dots() # just for point geometries
```