There are a million different ways to do things in R. This isn’t Python, where solutions on StackOverflow get ranked on how “Pythonic” they are. If there’s something you like about another workflow in R, there’s nothing stopping you from using it!
In this case, there are three main camps on analyzing dataframes in R:
“Base R” - “Base R” means using only functions and stuff built into your base R installation. No external packages or fancy stuff. The focus here is on stability from version to version - your code will never break from an update, but performance and usability aren’t always as great.
data.table
- data.table
is a dataframe manipulation package known to have very good performance.
“The tidyverse” - The “tidyverse” is a collection of packages that overhauls just about everything in R to use a consistent API. Has comparable performance with data.table
.
For much of the rest of this tutorial, we’ll focus on doing things the “tidyverse” way (with a few exceptions). The biggest reasons is that everything follows a consistent API - everything in the tidyverse works well together. You can often guess how to use a new function because you’ve used others like it. It’s also got pretty great performance. When you use stuff from the tidyverse, you can be reasonably confident that someone has already taken a look at optimizing things to speed things along.
So far, we’ve covered how to extract certain pieces of data via indexing. But what we’ve shown so far only works if we know the exact index of the data we want (vector[42]
, for example). There is a neat trick to extra certain pieces of data in R known as “logical indexing”.
Before we start, we need to know a little about comparing things.
==
is the equality operator in R.
1 == 1
## [1] TRUE
!
means “not”. Not TRUE
is FALSE
.
!TRUE
## [1] FALSE
Likewise we can check if something is not equal to something else with !=
TRUE != TRUE
## [1] FALSE
We can also make comparisons with the greater than >
and less than <
symbols. Pairing these with an equals sign means “greater than or equal to” (>=
) or “less than or equal to” (<=
).
4 < 5
## [1] TRUE
5 <= 5
## [1] TRUE
9 > 999
## [1] FALSE
TRUE >= FALSE
## [1] TRUE
The last example worked because TRUE
and FALSE
are equal to 1 and 0, respectively.
TRUE == 1
## [1] TRUE
FALSE == 1
## [1] FALSE
We can even compare strings:
"a" == "a"
## [1] TRUE
"a" != "b"
## [1] TRUE
This trick also works with vectors, returning TRUE
or FALSE
for every element in the vector.
example <- 1:7
example >= 4
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
another_example <- c("apple", "banana", "banana")
another_example == "banana"
## [1] FALSE TRUE TRUE
This trick is extremely useful for getting specific elements. Watch what happens when we index a vector using a set of boolean values. Using our example from above:
example
## [1] 1 2 3 4 5 6 7
greater_than_3 <- example > 3
greater_than_3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE
example[greater_than_3]
## [1] 4 5 6 7
This can be turned into a one-liner by putting the boolean expression inside the square brackets.
example[example > 3]
## [1] 4 5 6 7
We can also get the elements which were not greater than 3 by adding an !
in front.
example[!example > 3]
## [1] 1 2 3
Logical indexing is also a pretty neat trick for removing NAs from a vector. Many functions will refuse to work on data with NAs present. The is.na() function returns TRUE or FALSE depending on if a value is NA.
Using this info, make the following return a number as a result instead of NA:
ugly_data <- c(1, NA, 5, 7, NA, NA) mean(ugly_data)
Many functions have an "na.rm" argument used to ignore NA values. Does this work for mean() in the previous example?
Let’s try this out on a bigger dataset. nycflights13
is an example dataset containing all outbound flights from NYC in 2013. You can get this dataset with install.packages("nycflights13")
.
Let’s take a look at the dataset and see what we’ve got.
library(nycflights13)
head(flights) # shows the top few rows of a dataset
## # A tibble: 6 x 19
## year month day dep_t… sche… dep_… arr_… sche… arr_… carr… flig… tail…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int> <chr>
## 1 2013 1 1 517 515 2.00 830 819 11.0 UA 1545 N142…
## 2 2013 1 1 533 529 4.00 850 830 20.0 UA 1714 N242…
## 3 2013 1 1 542 540 2.00 923 850 33.0 AA 1141 N619…
## 4 2013 1 1 544 545 -1.00 1004 1022 -18.0 B6 725 N804…
## 5 2013 1 1 554 600 -6.00 812 837 -25.0 DL 461 N668…
## 6 2013 1 1 554 558 -4.00 740 728 12.0 UA 1696 N394…
## # ... with 7 more variables: origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
str(flights)
## Classes 'tbl_df', 'tbl' and 'data.frame': 336776 obs. of 19 variables:
## $ year : int 2013 2013 2013 2013 2013 2013 2013 2013 2013 2013 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ day : int 1 1 1 1 1 1 1 1 1 1 ...
## $ dep_time : int 517 533 542 544 554 554 555 557 557 558 ...
## $ sched_dep_time: int 515 529 540 545 600 558 600 600 600 600 ...
## $ dep_delay : num 2 4 2 -1 -6 -4 -5 -3 -3 -2 ...
## $ arr_time : int 830 850 923 1004 812 740 913 709 838 753 ...
## $ sched_arr_time: int 819 830 850 1022 837 728 854 723 846 745 ...
## $ arr_delay : num 11 20 33 -18 -25 12 19 -14 -8 8 ...
## $ carrier : chr "UA" "UA" "AA" "B6" ...
## $ flight : int 1545 1714 1141 725 461 1696 507 5708 79 301 ...
## $ tailnum : chr "N14228" "N24211" "N619AA" "N804JB" ...
## $ origin : chr "EWR" "LGA" "JFK" "JFK" ...
## $ dest : chr "IAH" "IAH" "MIA" "BQN" ...
## $ air_time : num 227 227 160 183 116 150 158 53 140 138 ...
## $ distance : num 1400 1416 1089 1576 762 ...
## $ hour : num 5 5 5 5 6 5 6 6 6 6 ...
## $ minute : num 15 29 40 45 0 58 0 0 0 0 ...
## $ time_hour : POSIXct, format: "2013-01-01 05:00:00" "2013-01-01 05:00:00" ...
dim(flights)
## [1] 336776 19
"flights" is an example of a "tibble" or "tbl_df". tbl_dfs are identical to dataframes for most purposes, but they print out differently (notice how we didnt't get all of the columns!).
class(flights)
## [1] "tbl_df" "tbl" "data.frame"
To force a `tbl_df` to print all columns, you can use `print(some_tbl_df, width=Inf)`
If we ever get annoyed with a `tbl_df`, we can turn it back into a dataframe with `as.data.frame()`.
class(as.data.frame(flights))
## [1] "data.frame"
The flights
table clocks in at several hundred thousand rows. That’s a fair sized chunk of data. Nevertheless, our tricks from before work just the same.
Using the same technique from before, let’s retrieve all of the flights that went to Los Angeles (LAX).
rows_with_yvr <- flights$dest == "LAX"
flights[rows_with_yvr, ]
## # A tibble: 16,174 x 19
## year month day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 558 600 - 2.00 924 917 7.00 UA 194
## 2 2013 1 1 628 630 - 2.00 1016 947 29.0 UA 1665
## 3 2013 1 1 658 700 - 2.00 1027 1025 2.00 VX 399
## 4 2013 1 1 702 700 2.00 1058 1014 44.0 B6 671
## 5 2013 1 1 743 730 13.0 1107 1100 7.00 AA 33
## 6 2013 1 1 828 823 5.00 1150 1143 7.00 UA 1506
## 7 2013 1 1 829 830 - 1.00 1152 1200 - 8.00 UA 443
## 8 2013 1 1 856 900 - 4.00 1226 1220 6.00 AA 1
## 9 2013 1 1 859 900 - 1.00 1223 1225 - 2.00 VX 407
## 10 2013 1 1 921 900 21.0 1237 1227 10.0 DL 120
## # ... with 16,164 more rows, and 8 more variables: tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
# and the same, but in one line
result <- flights[flights$dest == "LAX", ]
# checking our work... we should only see "LAX" here
unique(result$dest)
## [1] "LAX"
# how many results did we get
nrow(result)
## [1] 16174
Breaking things apart, we look for all instances where the column dest
was equal to “LAX”. We end up with a vector of whether or not “LAX” was found in each row. We can then use the square brackets to extract every row where the vector is true. Note the addition of a comma in our square brackets. flights
has 2 dimensions, so our indexing needs to as well!
If we don’t add the comma, R gets upset:
flights[flights$dest]
Error: Length of logical index vector must be 1 or 19 (the number of rows), not 336776
One other issue - what happens if we want to grab the flights to either LAX or SEA (Seattle). Let’s try the following:
result <- flights[flights$dest == c("LAX", "SEA"), ]
unique(result$dest)
## [1] "LAX" "SEA"
nrow(result)
## [1] 10060
Though in both cases we got results corresponding to the cities we wanted, it looks like somethig went wrong. Before, we got 16174 results for just “LAX”. Now we only get 10060, and we even added an extra city worth of flights! So what’s happening here?
When R compares two vectors of different length, it “recycles” the shorter vector until it matches the length of the longer one!
Using a smaller example, this is what just happened:
long <- c(1, 1, 1, 2, 2, 2, 3)
short <- c(1, 2)
long == short
## Warning in long == short: longer object length is not a multiple of shorter
## object length
## [1] TRUE FALSE TRUE TRUE FALSE TRUE FALSE
# what R is really doing behind the scenes
short_recycled <- c(1, 2, 1, 2, 1, 2, 1)
long == short_recycled
## [1] TRUE FALSE TRUE TRUE FALSE TRUE FALSE
This is not what we want. We want to know if elements in the long vector were found “in” the shorter vector, not whether or not the two are equal at every point. Fortunately, there is a special %in%
operator that does just that.
long %in% short
## [1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE
# and using that to subset values
long[long %in% short]
## [1] 1 1 1 2 2 2
If we take the %in%
operator and apply it to our issue, we get the correct number of rows.
res <- flights[flights$dest %in% c("SEA", "LAX"), ]
nrow(res)
## [1] 20097
# our results contain the same number of flights bound for LAX
nrow(res[flights$dest == "LAX", ])
## Warning: Length of logical index must be 1 or 20097, not 336776
## [1] 16174
dplyr
Up to this point, we’ve done everything using base R. Our code has a lot of crazy symbols in it, and isn’t that readable for the average person. It’s also not that fun to type out.
Let’s try things the “tidyverse” way using dplyr
(dplyr
is a package that comes as part of the tidyverse
package bundle).
To filter out a set of specific rows that match a condition, we use the filter()
function. The syntax of this function is a bit unusual:
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ readr 1.1.1
## ✔ tibble 1.4.1 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
results <- filter(flights, dest == "LAX")
nrow(results)
## [1] 16174
Notice how we just used dest
all by itself. filter()
is smart enough to figure out that dest
is a column name in the flights
dataframe.
We can also filter multiple things at once using the &
(AND) and |
(OR) operators. &
checks if both conditions are true, |
checks if just one condition is true:
TRUE & TRUE
## [1] TRUE
TRUE & FALSE
## [1] FALSE
TRUE | FALSE
## [1] TRUE
Using this in an example with filter()
to fetch all the flights to LAX in February:
filter(flights, dest == "LAX" & month == 2)
## # A tibble: 1,030 x 19
## year month day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 2 1 554 601 - 7.00 920 918 2.00 UA 1030
## 2 2013 2 1 654 700 - 6.00 1032 1034 - 2.00 DL 763
## 3 2013 2 1 657 705 - 8.00 1027 1035 - 8.00 VX 399
## 4 2013 2 1 658 700 - 2.00 1018 1027 - 9.00 B6 671
## 5 2013 2 1 722 705 17.0 1040 1018 22.0 UA 298
## 6 2013 2 1 807 730 37.0 1134 1100 34.0 AA 33
## 7 2013 2 1 826 830 - 4.00 1206 1154 12.0 UA 112
## 8 2013 2 1 857 900 - 3.00 1225 1227 - 2.00 DL 120
## 9 2013 2 1 859 900 - 1.00 1251 1220 31.0 AA 1
## 10 2013 2 1 901 905 - 4.00 1230 1235 - 5.00 VX 407
## # ... with 1,020 more rows, and 8 more variables: tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
Let's do several more examples to make sure you're super comfortable with filtering data:
The tidyverse heavily encourages the use of a special pipe (%>%
operator). The pipe sends the output of the last command to the first argument of the next (probably will be a familiar concept for users of bash, the Linux shell). This is a great tool for making our analyses more readable (read: good).
Repeating an earlier example, we can retrieve the number of flights that went to LAX with:
# earlier example:
# nrow(filter(flights, dest == "LAX"))
flights %>% filter(dest == "LAX") %>% nrow
## [1] 16174
Our analysis now flows from left to right, instead of inside out. Makes things quite a bit more readable. Many people also put each step on a new line. That way if you want to exclude a step, you can just comment it out.
flights %>%
filter(dest == "LAX") %>%
nrow()
## [1] 16174
dplyr
also has its own function for selecting columns: select()
. To grab the certain columns from a dataframe, we supply their names to select()
as arguments.
flights %>% select(flight, dest, air_time)
## # A tibble: 336,776 x 3
## flight dest air_time
## <int> <chr> <dbl>
## 1 1545 IAH 227
## 2 1714 IAH 227
## 3 1141 MIA 160
## 4 725 BQN 183
## 5 461 ATL 116
## 6 1696 ORD 150
## 7 507 FLL 158
## 8 5708 IAD 53.0
## 9 79 MCO 140
## 10 301 ORD 138
## # ... with 336,766 more rows
We can also sort columns using arrange()
. arrange()
sorts a dataset by whatever column names you specify.
flights %>% arrange(sched_dep_time)
## # A tibble: 336,776 x 19
## year month day dep_t… sched_… dep_d… arr_… sched… arr_d… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 7 27 NA 106 NA NA 245 NA US 1632
## 2 2013 1 2 458 500 - 2.00 703 650 13.0 US 1030
## 3 2013 1 3 458 500 - 2.00 650 650 0 US 1030
## 4 2013 1 4 456 500 - 4.00 631 650 -19.0 US 1030
## 5 2013 1 5 458 500 - 2.00 640 650 -10.0 US 1030
## 6 2013 1 6 458 500 - 2.00 718 650 28.0 US 1030
## 7 2013 1 7 454 500 - 6.00 637 648 -11.0 US 1117
## 8 2013 1 8 454 500 - 6.00 625 648 -23.0 US 1117
## 9 2013 1 9 457 500 - 3.00 647 648 - 1.00 US 1117
## 10 2013 1 10 450 500 -10.0 634 648 -14.0 US 1117
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
To sort in descending order, we can add the desc()
function into the mix.
flights %>% arrange(desc(sched_dep_time))
## # A tibble: 336,776 x 19
## year month day dep_t… sched… dep_del… arr_… sche… arr_d… carr… flig…
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr> <int>
## 1 2013 1 1 2353 2359 - 6.00 425 445 - 20.0 B6 739
## 2 2013 1 1 2353 2359 - 6.00 418 442 - 24.0 B6 707
## 3 2013 1 1 2356 2359 - 3.00 425 437 - 12.0 B6 727
## 4 2013 1 2 42 2359 43.0 518 442 36.0 B6 707
## 5 2013 1 2 2351 2359 - 8.00 427 445 - 18.0 B6 739
## 6 2013 1 2 2354 2359 - 5.00 413 437 - 24.0 B6 727
## 7 2013 1 3 32 2359 33.0 504 442 22.0 B6 707
## 8 2013 1 3 235 2359 156 700 437 143 B6 727
## 9 2013 1 3 2349 2359 - 10.0 434 445 - 11.0 B6 739
## 10 2013 1 4 25 2359 26.0 505 442 23.0 B6 707
## # ... with 336,766 more rows, and 8 more variables: tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
So far we’ve learned how to to rearrange and select parts of our data. What about actually analyzing it. The group_by()
and summarize()
functions, allow us to group by a certain column (say, city or airline), and then perform an operation on every group.
A simple example might be grouping by month and then summarizing by the number of flights (rows) in each group.
flights %>%
group_by(month) %>%
summarize(length(month)) # number of records in a group
## # A tibble: 12 x 2
## month `length(month)`
## <int> <int>
## 1 1 27004
## 2 2 24951
## 3 3 28834
## 4 4 28330
## 5 5 28796
## 6 6 28243
## 7 7 29425
## 8 8 29327
## 9 9 27574
## 10 10 28889
## 11 11 27268
## 12 12 28135
We can also perform multiple “summarizations” at once and name our columns something informative.
flights %>%
group_by(month) %>%
summarize(num_flights=length(month),
avg_flight_time=mean(air_time, na.rm=TRUE))
## # A tibble: 12 x 3
## month num_flights avg_flight_time
## <int> <int> <dbl>
## 1 1 27004 154
## 2 2 24951 151
## 3 3 28834 149
## 4 4 28330 153
## 5 5 28796 146
## 6 6 28243 150
## 7 7 29425 147
## 8 8 29327 148
## 9 9 27574 143
## 10 10 28889 149
## 11 11 27268 155
## 12 12 28135 163
We can also simply add on a column to a dataset with the mutate()
function. This is the equivalent of cats$age <- c(1, 3, 4)
like we did earlier.
colnames(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
new_flights <- flights %>%
mutate(plane_speed = distance / air_time)
colnames(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
Which airline has the worst record in terms of delays?
To do this, group our data by carrier, get the average arrival delay for each group, then sort in descending order so that the worst offenders are at the top.
Get the maximum arrival delay in the dataset. You'll want to use the `max()` function. Did you need to use `dplyr`?
In terms of some data, the flights
table is actually incomplete! What if we wanted to match up the destination airport acronyms to their details (like airports’ full names)? This data is actually in another table: airports
.
head(airports)
## # A tibble: 6 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <int> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5.00 A Amer…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6.00 A Amer…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6.00 A Amer…
## 4 06N Randall Airport 41.4 -74.4 523 -5.00 A Amer…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5.00 A Amer…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5.00 A Amer…
In order for this information to be useful to us, we need to match it up and “join” it to our flights table. This is a pretty complex operation in base R, but dplyr
makes it relatively easy.
There are a lot of different types of joins that put together data in different ways. In this case, we’re going to do what’s called a “left join”: one table is on the left side, and we’ll keep all of its data. However, on the right side (the table we are joining), we’ll only match up and add each entry if there is a corresponding entry on the left side.
colnames(flights)
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour"
colnames(airports)
## [1] "faa" "name" "lat" "lon" "alt" "tz" "dst" "tzone"
# join syntax:
# left_join(left_table, right_table, by=c("left_colname" = "right_colname"))
# the "by" argument controls which columns in each table are matched up
joined <- left_join(flights, airports, by=c("dest" = "faa"))
colnames(joined) # joined now contain columns from both
## [1] "year" "month" "day" "dep_time"
## [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
## [9] "arr_delay" "carrier" "flight" "tailnum"
## [13] "origin" "dest" "air_time" "distance"
## [17] "hour" "minute" "time_hour" "name"
## [21] "lat" "lon" "alt" "tz"
## [25] "dst" "tzone"
Let’s check our work. SEA should show up as Seattle-Tacoma International Airport. Note: we can use .
as a placeholder to represent the entire object passed to the summarize function (instead of using just a column name, for instance).
joined %>%
filter(dest == "SEA") %>%
select(name) %>%
head(n=1)
## # A tibble: 1 x 1
## name
## <chr>
## 1 Seattle Tacoma Intl
Looks like our join worked!
Find the name of the airline with the biggest arrival delays. You will need to join the `airlines` table to the `flights` table. A suggested workflow is shown below (feel free to reuse code from earlier).
Write your results from the last problem to a file. Use the `write_csv()` to write the table to a csv file. You can use `?write_csv()` to look up how to use this function.
© Jeff Stafford // https://jstaf.github.io/r-data-science/