Dataframes

Vectors and matrices are super cool. However they don’t address an important issue: holding multiple types of data and working with them at the same time. Dataframes are another special data structure that let’s you handle large amounts and different types of data together. Because of this, they are generally the tool-of-choice for doing analyses in R.

We are going to focus on using dataframes using the dplyr package. dplyr comes as part of the tidyverse package bundle, you can install it with install.packages("tidyverse"). It can take awhile to install this on Linux, so perhaps start the command in another window while we go through the non-dplyr parts.

A small example

In a text editor, create the following example CSV file. We’ll call it cats.csv.

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1

Once we’ve saved it in the same directory we’re working in, we can load it with read.csv().

cats <- read.csv('cats.csv')
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1

Whenever we import a dataset with multiple types of values, R will autodetect this and make the output a dataframe. Let’s verify this for ourselves:

class(cats)
## [1] "data.frame"

So, we’ve got a dataframe with multiple types of values. How do we work with it? Fortunately, everything we know about vectors also applies to dataframes.

Each column of a dataframe can be used as a vector. We use the $ operator to specify which column we want.

cats$weight + 34
## [1] 36.1 39.0 37.2
class(cats$weight)
## [1] "numeric"
cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby

We can also reassign columns as if they were variables. The cats$likes_string likely represents a set of boolean value, lets update that column to reflect this fact.

class(cats$likes_string)  # before
## [1] "integer"
cats$likes_string <- as.logical(cats$likes_string)
class(cats$likes_string)
## [1] "logical"

We can even add a column if we want!

cats$age <- c(1, 6, 4, 2.5)
Error in `$<-.data.frame`(`*tmp*`, age, value = c(1, 6, 4, 2.5)) : 
  replacement has 4 rows, data has 3

Notice how it won’t let us do that. The reason is that dataframes must have the same number of elements in every column. If each column only has 3 rows, we can’t add another column with 4 rows. Let’s try that again with the proper number of elements.

cats$age <- c(1, 6, 4)
cats
##     coat weight likes_string age
## 1 calico    2.1         TRUE   1
## 2  black    5.0        FALSE   6
## 3  tabby    3.2         TRUE   4

Note that we don’t have to call class() on every single column to figure out what they are. There are a number of useful summary functions to get information about our dataframe.

str() reports on the structure of your dataframe. It is an extremely useful function - use it on everything if you’ve loaded a dataset for the first time.

str(cats)
## 'data.frame':    3 obs. of  4 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2 1 3
##  $ weight      : num  2.1 5 3.2
##  $ likes_string: logi  TRUE FALSE TRUE
##  $ age         : num  1 6 4

As with matrices, we can use dim() to know how many rows and columns we’re working with.

dim(cats)
## [1] 3 4
nrow(cats)  # number of rows only
## [1] 3
ncol(cats)  # number of columns only
## [1] 4

Factors

When we ran str(cats), you might have noticed something weird. cats$coat is listed as a “factor”. A factor is a special type of data that’s almost a string.

It prints like a string (sort of):

cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby

It can be used like a string:

paste("The cat is", cats$coat)
## [1] "The cat is calico" "The cat is black"  "The cat is tabby"

But it’s not a string! The output of str(cats) gives us a clue to what’s actually happening behind-the-scenes.

str(cats)
## 'data.frame':    3 obs. of  4 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2 1 3
##  $ weight      : num  2.1 5 3.2
##  $ likes_string: logi  TRUE FALSE TRUE
##  $ age         : num  1 6 4

str() reports that the first values are 2, 1, 3 (and not text). Let’s use as.numeric() to reveal its true form!

as.numeric(cats$coat)
## [1] 2 1 3
cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby

A factor has two components, its levels and its values. Levels represent all possible values for a column. In this case, there’s only 3 possiblities: black, calico and tabby.

The actual values are 2, 1, and 3. Each value matches up to a specific level. So in our example, the first value is 2, which corresponds to the second level, calico. The second value is 1, which matches up with the first level, black.

Factors in R are a method of storing text information as one of several possible “levels”. R converts text to factors automatically when we import data, like from a CSV file. We’ve got several options here:

Convert the factor to a character vector ourselves:

cats$coat <- as.character(cats$coat)
class(cats$coat)
## [1] "character"

Tell R to simply not convert things to factors when we import it (as.is=TRUE is the R equivalent of “don’t touch my stuff!”):

new_cats <- read.csv('cats.csv', as.is=TRUE)
class(new_cats$coat)
## [1] "character"

Use the read_csv() function from the readr package. readr is part of the tidyverse and has a number of ways of reading/writing data with more sensible defaults.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
even_newer_cats <- read_csv('cats.csv')
## Parsed with column specification:
## cols(
##   coat = col_character(),
##   weight = col_double(),
##   likes_string = col_integer()
## )
class(even_newer_cats$coat)
## [1] "character"

Performance considerations

As you can see, factors can be kind of a pain to deal with. So why do they even exist? The short answer is that they are an effective way of optimizing memory usage.

To demonstrate this, we’ll examine the gapminder example dataset (install.packages("gapminder")).

library(gapminder)
head(gapminder)
## # A tibble: 6 x 6
##       country continent  year lifeExp      pop gdpPercap
##        <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan      Asia  1952  28.801  8425333  779.4453
## 2 Afghanistan      Asia  1957  30.332  9240934  820.8530
## 3 Afghanistan      Asia  1962  31.997 10267083  853.1007
## 4 Afghanistan      Asia  1967  34.020 11537966  836.1971
## 5 Afghanistan      Asia  1972  36.088 13079460  739.9811
## 6 Afghanistan      Asia  1977  38.438 14880372  786.1134
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

Notice how gapminder contains several columns with a lot of highly repetitive data. For instance, the continent column contains only 5 values:

unique(gapminder$continent)
## [1] Asia     Europe   Africa   Americas Oceania 
## Levels: Africa Americas Asia Europe Oceania

In this case, gapminder$continent has been stored as a factor. Let’s examine the amount of space used if this column was stored as a character vector vs. storing the data as a factor.

library(pryr)
## 
## Attaching package: 'pryr'
## The following objects are masked from 'package:purrr':
## 
##     compose, partial
object_size(gapminder$continent)
## 7.51 kB
object_size(as.character(gapminder$continent))
## 13.9 kB

The character version of gapminder$continent takes up almost twice as much space! Storing things as one of several possible integer values behind the scenes is a lot more efficient than storing the entire set of text for every single entry. Note that the amount of memory saved depends on the repetitiveness of the data. If a column has a lot of unique text values, converting it to a factor will likely be of little benefit.

The takeaway here is that if you ever find yourself working on a large dataset and memory usage becomes an issue, converting your most repetitive string columns to factors can be a useful way to save space. Otherwise, just use character vectors (less hidden “gotchas” that way).