Data analysis with dplyr

About the rest of this tutorial

There are a million different ways to do things in R. This isn’t Python, where solutions on StackOverflow get ranked on how “Pythonic” they are. If there’s something you like about another workflow in R, there’s nothing stopping you from using it!

In this case, there are three main camps on analyzing dataframes in R:

“Base R” - “Base R” means using only functions and stuff built into your base R installation. No external packages or fancy stuff. The focus here is on stability from version to version - your code will never break from an update, but performance and usability aren’t always as great.
data.table - data.table is a dataframe manipulation package known to have very good performance.
“The tidyverse” - The “tidyverse” is a collection of packages that overhauls just about everything in R to use a consistent API. Has comparable performance with data.table.

For much of the rest of this tutorial, we’ll focus on doing things the “tidyverse” way (with a few exceptions). The biggest reasons is that everything follows a consistent API - everything in the tidyverse works well together. You can often guess how to use a new function because you’ve used others like it. It’s also got pretty great performance. When you use stuff from the tidyverse, you can be reasonably confident that someone has already taken a look at optimizing things to speed things along.

Logical indexing

So far, we’ve covered how to extract certain pieces of data via indexing. But what we’ve shown so far only works if we know the exact index of the data we want (vector[42], for example). There is a neat trick to extra certain pieces of data in R known as “logical indexing”.

Before we start, we need to know a little about comparing things.

== is the equality operator in R.

1 == 1

## [1] TRUE

! means “not”. Not TRUE is FALSE.

!TRUE

## [1] FALSE

Likewise we can check if something is not equal to something else with !=

TRUE != TRUE

## [1] FALSE

We can also make comparisons with the greater than > and less than < symbols. Pairing these with an equals sign means “greater than or equal to” (>=) or “less than or equal to” (<=).

4 < 5

## [1] TRUE

5 <= 5

## [1] TRUE

9 > 999

## [1] FALSE

TRUE >= FALSE

## [1] TRUE

The last example worked because TRUE and FALSE are equal to 1 and 0, respectively.

TRUE == 1

## [1] TRUE

FALSE == 1

## [1] FALSE

We can even compare strings:

"a" == "a"

## [1] TRUE

"a" != "b"

## [1] TRUE

This trick also works with vectors, returning TRUE or FALSE for every element in the vector.

example <- 1:7
example >= 4

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

another_example <- c("apple", "banana", "banana")
another_example == "banana"

## [1] FALSE  TRUE  TRUE

This trick is extremely useful for getting specific elements. Watch what happens when we index a vector using a set of boolean values. Using our example from above:

example

## [1] 1 2 3 4 5 6 7

greater_than_3 <- example > 3
greater_than_3

## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

example[greater_than_3]

## [1] 4 5 6 7

This can be turned into a one-liner by putting the boolean expression inside the square brackets.

example[example > 3]

## [1] 4 5 6 7

We can also get the elements which were not greater than 3 by adding an ! in front.

example[!example > 3]

## [1] 1 2 3

Exercise - Removing NAs from a dataset

Logical indexing is also a pretty neat trick for removing NAs from a vector. Many functions will refuse to work on data with NAs present. The is.na() function returns TRUE or FALSE depending on if a value is NA.

Using this info, make the following return a number as a result instead of NA:

ugly_data <- c(1, NA, 5, 7, NA, NA)
mean(ugly_data)

Exercise - The "na.rm" argument

Many functions have an "na.rm" argument used to ignore NA values. Does this work for mean() in the previous example?

Retrieving rows from dataframes

Let’s try this out on a bigger dataset. nycflights13 is an example dataset containing all outbound flights from NYC in 2013. You can get this dataset with install.packages("nycflights13").

Let’s take a look at the dataset and see what we’ve got.

library(nycflights13)
head(flights)  # shows the top few rows of a dataset

## # A tibble: 6 x 19
##    year month   day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
##   <int> <int> <int>  <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1  2013     1     1    517   515  2.00   830   819  11.0 UA     1545 N142…
## 2  2013     1     1    533   529  4.00   850   830  20.0 UA     1714 N242…
## 3  2013     1     1    542   540  2.00   923   850  33.0 AA     1141 N619…
## 4  2013     1     1    544   545 -1.00  1004  1022 -18.0 B6      725 N804…
## 5  2013     1     1    554   600 -6.00   812   837 -25.0 DL      461 N668…
## 6  2013     1     1    554   558 -4.00   740   728  12.0 UA     1696 N394…
## # ... with 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

str(flights)

## Classes 'tbl_df', 'tbl' and 'data.frame':    336776 obs. of  19 variables:
##  $ year          : int  2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
##  $ month         : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time      : int  517 533 542 544 554 554 555 557 557 558 ...
##  $ sched_dep_time: int  515 529 540 545 600 558 600 600 600 600 ...
##  $ dep_delay     : num  2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
##  $ arr_time      : int  830 850 923 1004 812 740 913 709 838 753 ...
##  $ sched_arr_time: int  819 830 850 1022 837 728 854 723 846 745 ...
##  $ arr_delay     : num  11 20 33 -18 -25 12 19 -14 -8 8 ...
##  $ carrier       : chr  "UA" "UA" "AA" "B6" ...
##  $ flight        : int  1545 1714 1141 725 461 1696 507 5708 79 301 ...
##  $ tailnum       : chr  "N14228" "N24211" "N619AA" "N804JB" ...
##  $ origin        : chr  "EWR" "LGA" "JFK" "JFK" ...
##  $ dest          : chr  "IAH" "IAH" "MIA" "BQN" ...
##  $ air_time      : num  227 227 160 183 116 150 158 53 140 138 ...
##  $ distance      : num  1400 1416 1089 1576 762 ...
##  $ hour          : num  5 5 5 5 6 5 6 6 6 6 ...
##  $ minute        : num  15 29 40 45 0 58 0 0 0 0 ...
##  $ time_hour     : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...

dim(flights)

## [1] 336776     19

A note about tbl_dfs

"flights" is an example of a "tibble" or "tbl_df". tbl_dfs are identical to dataframes for most purposes, but they print out differently (notice how we didnt't get all of the columns!).

class(flights)

## [1] "tbl_df"     "tbl"        "data.frame"

To force a `tbl_df` to print all columns, you can use `print(some_tbl_df, width=Inf)`

If we ever get annoyed with a `tbl_df`, we can turn it back into a dataframe with `as.data.frame()`.

class(as.data.frame(flights))

## [1] "data.frame"

The flights table clocks in at several hundred thousand rows. That’s a fair sized chunk of data. Nevertheless, our tricks from before work just the same.

Using the same technique from before, let’s retrieve all of the flights that went to Los Angeles (LAX).

rows_with_yvr <- flights$dest == "LAX"
flights[rows_with_yvr, ]

## # A tibble: 16,174 x 19
##     year month   day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     1     1    558     600 - 2.00   924    917   7.00 UA      194
##  2  2013     1     1    628     630 - 2.00  1016    947  29.0  UA     1665
##  3  2013     1     1    658     700 - 2.00  1027   1025   2.00 VX      399
##  4  2013     1     1    702     700   2.00  1058   1014  44.0  B6      671
##  5  2013     1     1    743     730  13.0   1107   1100   7.00 AA       33
##  6  2013     1     1    828     823   5.00  1150   1143   7.00 UA     1506
##  7  2013     1     1    829     830 - 1.00  1152   1200 - 8.00 UA      443
##  8  2013     1     1    856     900 - 4.00  1226   1220   6.00 AA        1
##  9  2013     1     1    859     900 - 1.00  1223   1225 - 2.00 VX      407
## 10  2013     1     1    921     900  21.0   1237   1227  10.0  DL      120
## # ... with 16,164 more rows, and 8 more variables: tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

# and the same, but in one line
result <- flights[flights$dest == "LAX", ]
# checking our work... we should only see "LAX" here
unique(result$dest)

## [1] "LAX"

# how many results did we get
nrow(result)

## [1] 16174

Breaking things apart, we look for all instances where the column dest was equal to “LAX”. We end up with a vector of whether or not “LAX” was found in each row. We can then use the square brackets to extract every row where the vector is true. Note the addition of a comma in our square brackets. flights has 2 dimensions, so our indexing needs to as well!

If we don’t add the comma, R gets upset:

flights[flights$dest]

Error: Length of logical index vector must be 1 or 19 (the number of rows), not 336776

One other issue - what happens if we want to grab the flights to either LAX or SEA (Seattle). Let’s try the following:

result <- flights[flights$dest == c("LAX", "SEA"), ]
unique(result$dest)

## [1] "LAX" "SEA"

nrow(result)

## [1] 10060

Though in both cases we got results corresponding to the cities we wanted, it looks like somethig went wrong. Before, we got 16174 results for just “LAX”. Now we only get 10060, and we even added an extra city worth of flights! So what’s happening here?

When R compares two vectors of different length, it “recycles” the shorter vector until it matches the length of the longer one!

Using a smaller example, this is what just happened:

long <- c(1, 1, 1, 2, 2, 2, 3)
short <- c(1, 2)
long == short

## Warning in long == short: longer object length is not a multiple of shorter
## object length

## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE

# what R is really doing behind the scenes
short_recycled <- c(1, 2, 1, 2, 1, 2, 1)
long == short_recycled

## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE FALSE

This is not what we want. We want to know if elements in the long vector were found “in” the shorter vector, not whether or not the two are equal at every point. Fortunately, there is a special %in% operator that does just that.

long %in% short

## [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE

# and using that to subset values
long[long %in% short]

## [1] 1 1 1 2 2 2

If we take the %in% operator and apply it to our issue, we get the correct number of rows.

res <- flights[flights$dest %in% c("SEA", "LAX"), ]
nrow(res)

## [1] 20097

# our results contain the same number of flights bound for LAX
nrow(res[flights$dest == "LAX", ])

## Warning: Length of logical index must be 1 or 20097, not 336776

## [1] 16174

Filtering rows with `dplyr`

Up to this point, we’ve done everything using base R. Our code has a lot of crazy symbols in it, and isn’t that readable for the average person. It’s also not that fun to type out.

Let’s try things the “tidyverse” way using dplyr (dplyr is a package that comes as part of the tidyverse package bundle).

To filter out a set of specific rows that match a condition, we use the filter() function. The syntax of this function is a bit unusual:

library(tidyverse)

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ readr   1.1.1
## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ forcats 0.2.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

results <- filter(flights, dest == "LAX")
nrow(results)

## [1] 16174

Notice how we just used dest all by itself. filter() is smart enough to figure out that dest is a column name in the flights dataframe.

We can also filter multiple things at once using the & (AND) and | (OR) operators. & checks if both conditions are true, | checks if just one condition is true:

TRUE & TRUE

## [1] TRUE

TRUE & FALSE

## [1] FALSE

TRUE | FALSE

## [1] TRUE

Using this in an example with filter() to fetch all the flights to LAX in February:

filter(flights, dest == "LAX" & month == 2)

## # A tibble: 1,030 x 19
##     year month   day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     2     1    554     601 - 7.00   920    918   2.00 UA     1030
##  2  2013     2     1    654     700 - 6.00  1032   1034 - 2.00 DL      763
##  3  2013     2     1    657     705 - 8.00  1027   1035 - 8.00 VX      399
##  4  2013     2     1    658     700 - 2.00  1018   1027 - 9.00 B6      671
##  5  2013     2     1    722     705  17.0   1040   1018  22.0  UA      298
##  6  2013     2     1    807     730  37.0   1134   1100  34.0  AA       33
##  7  2013     2     1    826     830 - 4.00  1206   1154  12.0  UA      112
##  8  2013     2     1    857     900 - 3.00  1225   1227 - 2.00 DL      120
##  9  2013     2     1    859     900 - 1.00  1251   1220  31.0  AA        1
## 10  2013     2     1    901     905 - 4.00  1230   1235 - 5.00 VX      407
## # ... with 1,020 more rows, and 8 more variables: tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Exercise - Filtering data

Let's do several more examples to make sure you're super comfortable with filtering data:

How many flights left before 6 AM?
How many flights went to Toronto (YYZ)? Is there anything weird about this dataset?
What is a typical flight time (air time) when traveling from New York to Chicago O'Hare (ORD)?

Using the “pipe”

The tidyverse heavily encourages the use of a special pipe (%>% operator). The pipe sends the output of the last command to the first argument of the next (probably will be a familiar concept for users of bash, the Linux shell). This is a great tool for making our analyses more readable (read: good).

Repeating an earlier example, we can retrieve the number of flights that went to LAX with:

# earlier example:
# nrow(filter(flights, dest == "LAX"))

flights %>% filter(dest == "LAX") %>% nrow

## [1] 16174

Our analysis now flows from left to right, instead of inside out. Makes things quite a bit more readable. Many people also put each step on a new line. That way if you want to exclude a step, you can just comment it out.

flights %>%
    filter(dest == "LAX") %>%
    nrow()

## [1] 16174

Controlling output

dplyr also has its own function for selecting columns: select(). To grab the certain columns from a dataframe, we supply their names to select() as arguments.

flights %>% select(flight, dest, air_time)

## # A tibble: 336,776 x 3
##    flight dest  air_time
##     <int> <chr>    <dbl>
##  1   1545 IAH      227  
##  2   1714 IAH      227  
##  3   1141 MIA      160  
##  4    725 BQN      183  
##  5    461 ATL      116  
##  6   1696 ORD      150  
##  7    507 FLL      158  
##  8   5708 IAD       53.0
##  9     79 MCO      140  
## 10    301 ORD      138  
## # ... with 336,766 more rows

We can also sort columns using arrange(). arrange() sorts a dataset by whatever column names you specify.

flights %>% arrange(sched_dep_time)

## # A tibble: 336,776 x 19
##     year month   day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
##    <int> <int> <int>  <int>   <int>  <dbl> <int>  <int>  <dbl> <chr> <int>
##  1  2013     7    27     NA     106  NA       NA    245  NA    US     1632
##  2  2013     1     2    458     500 - 2.00   703    650  13.0  US     1030
##  3  2013     1     3    458     500 - 2.00   650    650   0    US     1030
##  4  2013     1     4    456     500 - 4.00   631    650 -19.0  US     1030
##  5  2013     1     5    458     500 - 2.00   640    650 -10.0  US     1030
##  6  2013     1     6    458     500 - 2.00   718    650  28.0  US     1030
##  7  2013     1     7    454     500 - 6.00   637    648 -11.0  US     1117
##  8  2013     1     8    454     500 - 6.00   625    648 -23.0  US     1117
##  9  2013     1     9    457     500 - 3.00   647    648 - 1.00 US     1117
## 10  2013     1    10    450     500 -10.0    634    648 -14.0  US     1117
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

To sort in descending order, we can add the desc() function into the mix.

flights %>% arrange(desc(sched_dep_time))

## # A tibble: 336,776 x 19
##     year month   day dep_t… sched… dep_del… arr_… sche… arr_d… carr… flig…
##    <int> <int> <int>  <int>  <int>    <dbl> <int> <int>  <dbl> <chr> <int>
##  1  2013     1     1   2353   2359  -  6.00   425   445 - 20.0 B6      739
##  2  2013     1     1   2353   2359  -  6.00   418   442 - 24.0 B6      707
##  3  2013     1     1   2356   2359  -  3.00   425   437 - 12.0 B6      727
##  4  2013     1     2     42   2359    43.0    518   442   36.0 B6      707
##  5  2013     1     2   2351   2359  -  8.00   427   445 - 18.0 B6      739
##  6  2013     1     2   2354   2359  -  5.00   413   437 - 24.0 B6      727
##  7  2013     1     3     32   2359    33.0    504   442   22.0 B6      707
##  8  2013     1     3    235   2359   156      700   437  143   B6      727
##  9  2013     1     3   2349   2359  - 10.0    434   445 - 11.0 B6      739
## 10  2013     1     4     25   2359    26.0    505   442   23.0 B6      707
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

Data analysis

So far we’ve learned how to to rearrange and select parts of our data. What about actually analyzing it. The group_by() and summarize() functions, allow us to group by a certain column (say, city or airline), and then perform an operation on every group.

A simple example might be grouping by month and then summarizing by the number of flights (rows) in each group.

flights %>%
    group_by(month) %>%
    summarize(length(month))  # number of records in a group

## # A tibble: 12 x 2
##    month `length(month)`
##    <int>           <int>
##  1     1           27004
##  2     2           24951
##  3     3           28834
##  4     4           28330
##  5     5           28796
##  6     6           28243
##  7     7           29425
##  8     8           29327
##  9     9           27574
## 10    10           28889
## 11    11           27268
## 12    12           28135

We can also perform multiple “summarizations” at once and name our columns something informative.

flights %>%
    group_by(month) %>%
    summarize(num_flights=length(month),
              avg_flight_time=mean(air_time, na.rm=TRUE))

## # A tibble: 12 x 3
##    month num_flights avg_flight_time
##    <int>       <int>           <dbl>
##  1     1       27004             154
##  2     2       24951             151
##  3     3       28834             149
##  4     4       28330             153
##  5     5       28796             146
##  6     6       28243             150
##  7     7       29425             147
##  8     8       29327             148
##  9     9       27574             143
## 10    10       28889             149
## 11    11       27268             155
## 12    12       28135             163

We can also simply add on a column to a dataset with the mutate() function. This is the equivalent of cats$age <- c(1, 3, 4) like we did earlier.

colnames(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

new_flights <- flights %>%
    mutate(plane_speed = distance / air_time)
colnames(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

Exercise - Finding the worst airline

Which airline has the worst record in terms of delays?

To do this, group our data by carrier, get the average arrival delay for each group, then sort in descending order so that the worst offenders are at the top.

Exercise - Picking an analysis method

Get the maximum arrival delay in the dataset. You'll want to use the `max()` function. Did you need to use `dplyr`?

Putting dataframes together

In terms of some data, the flights table is actually incomplete! What if we wanted to match up the destination airport acronyms to their details (like airports’ full names)? This data is actually in another table: airports.

head(airports)

## # A tibble: 6 x 8
##   faa   name                             lat   lon   alt    tz dst   tzone
##   <chr> <chr>                          <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G   Lansdowne Airport               41.1 -80.6  1044 -5.00 A     Amer…
## 2 06A   Moton Field Municipal Airport   32.5 -85.7   264 -6.00 A     Amer…
## 3 06C   Schaumburg Regional             42.0 -88.1   801 -6.00 A     Amer…
## 4 06N   Randall Airport                 41.4 -74.4   523 -5.00 A     Amer…
## 5 09J   Jekyll Island Airport           31.1 -81.4    11 -5.00 A     Amer…
## 6 0A9   Elizabethton Municipal Airport  36.4 -82.2  1593 -5.00 A     Amer…

In order for this information to be useful to us, we need to match it up and “join” it to our flights table. This is a pretty complex operation in base R, but dplyr makes it relatively easy.

There are a lot of different types of joins that put together data in different ways. In this case, we’re going to do what’s called a “left join”: one table is on the left side, and we’ll keep all of its data. However, on the right side (the table we are joining), we’ll only match up and add each entry if there is a corresponding entry on the left side.

colnames(flights)

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

colnames(airports)

## [1] "faa"   "name"  "lat"   "lon"   "alt"   "tz"    "dst"   "tzone"

# join syntax:
# left_join(left_table, right_table, by=c("left_colname" = "right_colname"))
# the "by" argument controls which columns in each table are matched up
joined <- left_join(flights, airports, by=c("dest" = "faa"))
colnames(joined)  # joined now contain columns from both

##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"      "name"          
## [21] "lat"            "lon"            "alt"            "tz"            
## [25] "dst"            "tzone"

Let’s check our work. SEA should show up as Seattle-Tacoma International Airport. Note: we can use . as a placeholder to represent the entire object passed to the summarize function (instead of using just a column name, for instance).

joined %>%
    filter(dest == "SEA") %>%
    select(name) %>%
    head(n=1)

## # A tibble: 1 x 1
##   name               
##   <chr>              
## 1 Seattle Tacoma Intl

Looks like our join worked!

Exercise - Worst airline, part II

Find the name of the airline with the biggest arrival delays. You will need to join the `airlines` table to the `flights` table. A suggested workflow is shown below (feel free to reuse code from earlier).

Calculate the average arrival delays by airline.
Sort the result by average delay in descending order.
Find which columns match up between the `airlines` and `flights` tables. Remember, you can use `print(table_name, width=Inf)` to show all columns!
Join the `airlines` table to the `flights` table based upon their common column.
The top value is your answer

Exercise - Writing output

Write your results from the last problem to a file. Use the `write_csv()` to write the table to a csv file. You can use `?write_csv()` to look up how to use this function.

Next section