11 min read

R data manipulation with RStudio and dplyr

What are we going to learn?

In this hands-on session, you will use R, RStudio and the dplyr package to transform your data.

Specifically, you will learn how to explore, filter, reorganise and process a table of data with the following verbs:

  • select(): pick variables
  • filter(): pick observations
  • arrange(): reorder observations
  • mutate(): create new variables
  • summarise(): collapse to a single summary
  • group_by(): change the scope of function

Gapminder data

You should already have the data ready from the introduction chapter, but if you need to import it again, run the following command (to read directly from the online CSV):

gapminder <- read.csv("https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv")

Basic dplyr verbs

The R package dplyr was developed by Hadley Wickham for data manipulation.

The book R for Data Science introduces the package as follows:

You are going to learn the five key dplyr functions that allow you to solve the vast majority of your data manipulation challenges:

  • Pick variables by their names with select()
  • Pick observations by their values with filter()
  • Reorder the rows with arrange()
  • Create new variables with functions of existing variables with mutate()
  • Collapse many values down to a single summary with summarise()

These can all be used in conjunction with group_by() which changes the scope of each function from operating on the entire dataset to operating on it group-by-group. These six functions provide the main verbs for a language of data manipulation.

To use the verbs to their full extent, we will need pipes and logical operators, which we will introduce as we go.

Let’s load the dplyr package to access its functions:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

You only need to install a package once (with install.packages()), but you need to reload it every time you start a new R session (with library()).

1. Pick variables with select()

select() allows us to pick variables (i.e. columns) from the dataset. For example, to only keep the data about year, country and GDP per capita:

gap_small <- select(gapminder, year, country, gdpPercap)

The first argument refers to the dataframe that is being transformed, and the following arguments are the columns you want to keep. Notice that it keeps the order you specified?

You can also rename columns in the same command:

gap_small <- select(gapminder, year, country, gdpPerPerson = gdpPercap)

Finally, if you have many variables but only want to remove a small number, it might be better to deselect instead of selecting. You can do that by using the - character in front of a variable name:

names(select(gapminder, -continent))
## [1] "country"   "year"      "pop"       "lifeExp"   "gdpPercap"

2. Pick observations with filter()

The filter() function allows use to pick observations depending on one or several conditions. But to be able to define these conditions, we need to learn about logical operators.

Logical operators allow us to compare things. Here are some of the most important ones:

  • ==: equal
  • !=: different
  • >: greater than
  • <: smaller than
  • >=: greater or equal
  • <=: smaller or equal

Remember: = is used to pass on a value to an argument, whereas == is used to check for equality.

You can compare any kind of data For example:

1 == 1
## [1] TRUE
1 == 2
## [1] FALSE
1 > 0
## [1] TRUE
"money" == "happiness"
## [1] FALSE

When R executes these commands, it answers TRUE of FALSE, as if asked a yes/no question. These TRUE and FALSE values are called logical values.

Note that we can compare a single value to many. For example, compare one value to five other:

1 == c(1, 2, 3, 1, 3)
## [1]  TRUE FALSE FALSE  TRUE FALSE

This kind of operation results in a logical vector with a logical value for each element. This is exactly what we will use to filter our rows.

For example, to filter the observations for Australia, we can use the following condition:

australia <- filter(gapminder, country == "Australia")
australia
##      country year      pop continent lifeExp gdpPercap
## 1  Australia 1952  8691212   Oceania  69.120  10039.60
## 2  Australia 1957  9712569   Oceania  70.330  10949.65
## 3  Australia 1962 10794968   Oceania  70.930  12217.23
## 4  Australia 1967 11872264   Oceania  71.100  14526.12
## 5  Australia 1972 13177000   Oceania  71.930  16788.63
## 6  Australia 1977 14074100   Oceania  73.490  18334.20
## 7  Australia 1982 15184200   Oceania  74.740  19477.01
## 8  Australia 1987 16257249   Oceania  76.320  21888.89
## 9  Australia 1992 17481977   Oceania  77.560  23424.77
## 10 Australia 1997 18565243   Oceania  78.830  26997.94
## 11 Australia 2002 19546792   Oceania  80.370  30687.75
## 12 Australia 2007 20434176   Oceania  81.235  34435.37

The function compares the value “Australia” to all the values in the country variable, and only keeps the rows that have TRUE as an answer.

Now, let’s filter the rows that have a life expectancy lifeExp greater than 81 years:

life81 <- filter(gapminder, lifeExp > 81)
dim(life81)
## [1] 7 6

3. Reorder observations with arrange()

arrange() will reorder our rows according to a variable, by default in ascending order:

arrange(life81, lifeExp)
##           country year       pop continent lifeExp gdpPercap
## 1       Australia 2007  20434176   Oceania  81.235  34435.37
## 2 Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 3     Switzerland 2007   7554661    Europe  81.701  37506.42
## 4         Iceland 2007    301931    Europe  81.757  36180.79
## 5           Japan 2002 127065841      Asia  82.000  28604.59
## 6 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 7           Japan 2007 127467972      Asia  82.603  31656.07

If we want to have a look at the entries with highest life expectancy first, we can use the desc() function (for “descending”):

arrange(life81, desc(lifeExp))
##           country year       pop continent lifeExp gdpPercap
## 1           Japan 2007 127467972      Asia  82.603  31656.07
## 2 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 3           Japan 2002 127065841      Asia  82.000  28604.59
## 4         Iceland 2007    301931    Europe  81.757  36180.79
## 5     Switzerland 2007   7554661    Europe  81.701  37506.42
## 6 Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 7       Australia 2007  20434176   Oceania  81.235  34435.37

We could also use the - shortcut, which only works for numerical data:

arrange(life81, -lifeExp)

The pipe operator

What if we wanted to get that result in one single command, without an intermediate life81 object?

We could nest the commands into each other, the first step as the first argument of the second step:

arrange(filter(gapminder, lifeExp > 81), -lifeExp)

… but this becomes very hard to read, very quickly. (Imagine with 3 steps or more!)

We can make our code more readable and avoid creating useless intermediate objects by piping commands into each other. The pipe operator %>% strings commands together, using the left side’s output as the first argument of the right side function.

For example, this command:

round(1.23, digits = 1)
## [1] 1.2

… is equivalent to:

1.23 %>% round(digits = 1)
## [1] 1.2

Here’s another example with the filter() verb:

gapminder %>%
  filter(country != "France")

… becomes:

filter(gapminder, country != "France")

To do what we did previously in one single command, using the pipe:

gapminder %>% 
  filter(lifeExp > 81) %>% 
  arrange(-lifeExp)
##           country year       pop continent lifeExp gdpPercap
## 1           Japan 2007 127467972      Asia  82.603  31656.07
## 2 Hong Kong China 2007   6980412      Asia  82.208  39724.98
## 3           Japan 2002 127065841      Asia  82.000  28604.59
## 4         Iceland 2007    301931    Europe  81.757  36180.79
## 5     Switzerland 2007   7554661    Europe  81.701  37506.42
## 6 Hong Kong China 2002   6762476      Asia  81.495  30209.02
## 7       Australia 2007  20434176   Oceania  81.235  34435.37

The pipe operator can be read as “then” and makes the code a lot more readable than when nesting functions into each other, and avoids the creation of several intermediate objects. It is also easier to trouble shoot as it makes it easy to execute the pipeline step by step.

From now on, we’ll use the pipe syntax as a default.

Note that this material uses the magrittr pipe. The magrittr package is the one that introduced the pipe operator to the R world, and dplyr automatically imports this useful operator when it is loaded. However, the pipe being such a widespread and popular concept in programming and data science, it ended up making it into Base R (the “native” pipe) in 2021 with the release of R 4.1, using a different operator: |>. You can switch your pipe shortcut to the native pipe in Tools > Global options > Code > Use native pipe operator.

Challenge 1 – a tiny dataset

Select the 2002 life expectancy observation for Eritrea (and remove the rest of the variables).

eritrea_2002 <- gapminder %>%
    select(year, country, lifeExp) %>%
    filter(country == "Eritrea", year == 2002)

4. Create new variables with mutate()

Have a look at what the verb mutate() can do with ?mutate.

Let’s see what the two following variables can be used for:

gapminder %>%
  select(gdpPercap, pop) %>%
  head()
##   gdpPercap      pop
## 1  779.4453  8425333
## 2  820.8530  9240934
## 3  853.1007 10267083
## 4  836.1971 11537966
## 5  739.9811 13079460
## 6  786.1134 14880372

How do you think we could combine them to add something new to our dataset?

Challenge 2 – mutate the GDP

Use mutate() to create a gdp variable.

Name your new dataset gap_gdp. When finished, dim(gap_gdp) should result in 1704 7.

Hint: use the * operator within mutate().

gap_gdp <- gapminder %>%
    mutate(gdp = gdpPercap * pop)
dim(gap_gdp)
## [1] 1704    7
head(gap_gdp)
##       country year      pop continent lifeExp gdpPercap         gdp
## 1 Afghanistan 1952  8425333      Asia  28.801  779.4453  6567086330
## 2 Afghanistan 1957  9240934      Asia  30.332  820.8530  7585448670
## 3 Afghanistan 1962 10267083      Asia  31.997  853.1007  8758855797
## 4 Afghanistan 1967 11537966      Asia  34.020  836.1971  9648014150
## 5 Afghanistan 1972 13079460      Asia  36.088  739.9811  9678553274
## 6 Afghanistan 1977 14880372      Asia  38.438  786.1134 11697659231

You can reuse a variable computed by ‘mutate()’ straight away. For example, we also want a more readable version of our new variable, in billion dollars:

gap_gdp <- gapminder %>%
    mutate(gdp = gdpPercap * pop,
           gdpBil = gdp / 1e9)

5. Collapse to a single value with summarise()

summarise() collapses many values down to a single summary. For example, to find the mean life expectancy for the whole dataset:

gapminder %>%
  summarise(meanLE = mean(lifeExp))
##     meanLE
## 1 59.47444

However, a single-value summary is not particularly interesting. summarise() becomes more powerful when used with group_by().

6. Change the scope with group_by()

group_by() changes the scope of the following function(s) from operating on the entire dataset to operating on it group-by-group.

See the effect of the grouping step:

gapminder %>%
    group_by(continent)
## # A tibble: 1,704 × 6
## # Groups:   continent [5]
##    country      year      pop continent lifeExp gdpPercap
##    <chr>       <int>    <dbl> <chr>       <dbl>     <dbl>
##  1 Afghanistan  1952  8425333 Asia         28.8      779.
##  2 Afghanistan  1957  9240934 Asia         30.3      821.
##  3 Afghanistan  1962 10267083 Asia         32.0      853.
##  4 Afghanistan  1967 11537966 Asia         34.0      836.
##  5 Afghanistan  1972 13079460 Asia         36.1      740.
##  6 Afghanistan  1977 14880372 Asia         38.4      786.
##  7 Afghanistan  1982 12881816 Asia         39.9      978.
##  8 Afghanistan  1987 13867957 Asia         40.8      852.
##  9 Afghanistan  1992 16317921 Asia         41.7      649.
## 10 Afghanistan  1997 22227415 Asia         41.8      635.
## # … with 1,694 more rows

The data in the cells is the same, the size of the object is the same. However, the dataframe was converted to a tibble, because a dataframe is not capable of storing grouping information.

Using the group_by() function before summarising makes things more interesting. For example, to find out the total population per continent in 2007, we can do the following:

gapminder %>% 
    filter(year == 2007) %>%
    group_by(continent) %>%
    summarise(pop = sum(pop))
## # A tibble: 5 × 2
##   continent        pop
##   <chr>          <dbl>
## 1 Africa     929539692
## 2 Americas   898871184
## 3 Asia      3811953827
## 4 Europe     586098529
## 5 Oceania     24549947

Challenge 3 – max life expectancy per country

Group by country, and find out the maximum life expectancy ever recorded for each one.

Hint: ?max

gapminder %>% 
    group_by(country) %>%
    summarise(maxLE = max(lifeExp))
## # A tibble: 142 × 2
##    country     maxLE
##    <chr>       <dbl>
##  1 Afghanistan  43.8
##  2 Albania      76.4
##  3 Algeria      72.3
##  4 Angola       42.7
##  5 Argentina    75.3
##  6 Australia    81.2
##  7 Austria      79.8
##  8 Bahrain      75.6
##  9 Bangladesh   64.1
## 10 Belgium      79.4
## # … with 132 more rows

More examples

Another example of a summary, with a different dataset that dplyr provides:

starwars %>%
  group_by(species) %>%
  summarise(
    n = n(), # this counts the number of rows in each group
    mass = mean(mass, na.rm = TRUE)
  ) %>%
  filter(n > 1) # the mean of a single value is not worth reporting
## # A tibble: 9 × 3
##   species      n  mass
##   <chr>    <int> <dbl>
## 1 Droid        6  69.8
## 2 Gungan       3  74  
## 3 Human       35  82.8
## 4 Kaminoan     2  88  
## 5 Mirialan     2  53.1
## 6 Twi'lek      2  55  
## 7 Wookiee      2 124  
## 8 Zabrak       2  80  
## 9 <NA>         4  48

What next?

More on dplyr:

For further R resources, look at our compilation of resources.