Vectors and matrices are super cool. However, they don’t address an important issue: holding multiple types of data and working with them at the same time. Dataframes are another special data structure that lets you handle large amounts of data of different types together. Because of this, they are generally the tool of choice for doing analyses in R.

We are going to focus on working with dataframes using the dplyr package. dplyr comes as part of the tidyverse package bundle, which you can install with install.packages("tidyverse"). It can take a while to install on Linux, so perhaps start the command in another window while we go through the non-dplyr parts.

A small example

In a text editor, create the following example CSV file. We’ll call it cats.csv.

coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1
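
If you would rather not switch to a text editor, the same file can be written from R itself. This is only a sketch of one way to do it: the variable name csv_lines is made up for illustration, and the file is written to whatever the current working directory happens to be.

```r
# The header line plus the three example rows, one string per line of the file.
csv_lines <- c(
  "coat,weight,likes_string",
  "calico,2.1,1",
  "black,5.0,0",
  "tabby,3.2,1"
)

# Write the lines to cats.csv in the current working directory.
writeLines(csv_lines, "cats.csv")

# Check that the file exists and count its lines (header + 3 rows).
file.exists("cats.csv")
length(readLines("cats.csv"))
```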

Once we’ve saved it in the same directory we’re working in, we can load it with read.csv().

cats <- read.csv('cats.csv')
cats
##     coat weight likes_string
## 1 calico    2.1            1
## 2  black    5.0            0
## 3  tabby    3.2            1

read.csv() always returns a dataframe, and it detects the type of each column automatically. Let’s verify this for ourselves:

class(cats)
## [1] "data.frame"

So, we’ve got a dataframe with multiple types of values. How do we work with it? Fortunately, most of what we know about vectors also applies to the columns of a dataframe.

Each column of a dataframe can be used as a vector. We use the $ operator to specify which column we want.

cats$weight + 34
## [1] 36.1 39.0 37.2
class(cats$weight)
## [1] "numeric"
cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby
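
As a small sketch of the same idea, the snippet below rebuilds the example data by hand (so it runs without cats.csv) and shows that a column behaves like any other vector. The [[ ]] form is an alternative to $ that takes the column name as a string.

```r
# Rebuild the example data directly so this snippet runs on its own.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(1, 0, 1)
)

# $ and [[ ]] both pull out the same column vector.
identical(cats$weight, cats[["weight"]])

# Ordinary vector functions and comparisons work on a column.
mean(cats$weight)
max(cats$weight)
cats$weight > 3
```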

We can also reassign columns as if they were variables. The cats$likes_string column likely represents a set of boolean values, so let’s update that column to reflect this fact.

class(cats$likes_string)  # before
## [1] "integer"
cats$likes_string <- as.logical(cats$likes_string)
class(cats$likes_string)
## [1] "logical"
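
It is worth knowing how as.logical() treats different inputs before relying on it. The sketch below uses made-up vectors rather than the cats data: for numbers, zero becomes FALSE and anything else becomes TRUE, while text is only converted when it spells out a logical value.

```r
# Integers: 0 is FALSE, 1 is TRUE.
from_int <- as.logical(c(1L, 0L, 1L))
from_int

# Any non-zero number counts as TRUE, not just 1.
from_num <- as.logical(c(2, 0, -1))
from_num

# Text digits are NOT converted; they become NA.
from_digits <- as.logical(c("1", "0"))
from_digits

# Text that spells a logical value is converted.
from_words <- as.logical(c("TRUE", "false", "T"))
from_words
```

If a 0/1 column was read in as text, converting it with as.integer() first and then as.logical() avoids the NA values.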

We can even add a column if we want!

cats$age <- c(1, 6, 4, 2.5)
Error in `$<-.data.frame`(`*tmp*`, age, value = c(1, 6, 4, 2.5)) : 
  replacement has 4 rows, data has 3

Notice how it won’t let us do that. The reason is that dataframes must have the same number of elements in every column. If each column only has 3 rows, we can’t add another column with 4 rows. Let’s try that again with the proper number of elements.

cats$age <- c(1, 6, 4)
cats
##     coat weight likes_string age
## 1 calico    2.1         TRUE   1
## 2  black    5.0        FALSE   6
## 3  tabby    3.2         TRUE   4
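
One way to avoid the error is to compare lengths before assigning. The sketch below rebuilds a small version of the data by hand; the names new_age, species, and bad are made up for illustration. It also shows one exception to the rule: a single value is recycled to fill every row.

```r
# A small stand-in for the cats dataframe.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2)
)

# Only assign the new column when its length matches the number of rows.
new_age <- c(1, 6, 4)
if (length(new_age) == nrow(cats)) {
  cats$age <- new_age
}

# A single value is recycled, so every row gets "cat".
cats$species <- "cat"

# A vector of the wrong length raises an error, which we catch here.
result <- tryCatch(
  {
    cats$bad <- c(1, 6, 4, 2.5)
    "assigned"
  },
  error = function(e) "error"
)
result
cats
```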

Note that we don’t have to call class() on every single column to figure out what they are. There are a number of useful summary functions to get information about our dataframe.

str() reports on the structure of your dataframe. It is an extremely useful function - use it whenever you load a dataset for the first time.

str(cats)
## 'data.frame':    3 obs. of  4 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2 1 3
##  $ weight      : num  2.1 5 3.2
##  $ likes_string: logi  TRUE FALSE TRUE
##  $ age         : num  1 6 4
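
If you only want the class of each column, sapply() can apply class() to every column in one go, and summary() gives a quick statistical overview. This is a sketch on a hand-built copy of the data; stringsAsFactors = FALSE is set explicitly so that coat is a character column regardless of R version.

```r
# Hand-built copy of the example data.
cats <- data.frame(
  coat = c("calico", "black", "tabby"),
  weight = c(2.1, 5.0, 3.2),
  likes_string = c(TRUE, FALSE, TRUE),
  age = c(1, 6, 4),
  stringsAsFactors = FALSE
)

# Apply class() to each column; the result is a named character vector.
col_classes <- sapply(cats, class)
col_classes

# Per-column summary: ranges and quartiles for numbers, counts for logicals.
summary(cats)
```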

As with matrices, we can use dim() to know how many rows and columns we’re working with.

dim(cats)
## [1] 3 4
nrow(cats)  # number of rows only
## [1] 3
ncol(cats)  # number of columns only
## [1] 4

Factors

When we ran str(cats), you might have noticed something weird. cats$coat is listed as a “factor”. A factor is a special type of data that’s almost a string.

It prints like a string (sort of):

cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby

It can be used like a string:

paste("The cat is", cats$coat)
## [1] "The cat is calico" "The cat is black"  "The cat is tabby"

But it’s not a string! The output of str(cats) gives us a clue to what’s actually happening behind-the-scenes.

str(cats)
## 'data.frame':    3 obs. of  4 variables:
##  $ coat        : Factor w/ 3 levels "black","calico",..: 2 1 3
##  $ weight      : num  2.1 5 3.2
##  $ likes_string: logi  TRUE FALSE TRUE
##  $ age         : num  1 6 4

str() reports that the first values are 2, 1, 3 (and not text). Let’s use as.numeric() to reveal its true form!

as.numeric(cats$coat)
## [1] 2 1 3
cats$coat
## [1] calico black  tabby 
## Levels: black calico tabby

A factor has two components: its levels and its values. Levels represent all possible values for a column. In this case, there are only 3 possibilities: black, calico and tabby.

The actual values are 2, 1, and 3. Each value matches up to a specific level. So in our example, the first value is 2, which corresponds to the second level, calico. The second value is 1, which matches up with the first level, black.
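
The relationship between levels and values can be checked directly. The sketch below builds a standalone factor (the variable name coat is just for illustration) and uses the integer codes to look the text back up in the levels.

```r
# Build a factor from the three coat colours.
coat <- factor(c("calico", "black", "tabby"))

# Levels are sorted alphabetically by default.
levels(coat)

# The underlying integer codes: one per element, indexing into the levels.
codes <- as.integer(coat)
codes

# Indexing the levels by the codes recovers the original text.
levels(coat)[codes]
```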

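Factors in R are a method of storing text information as one of several possible “levels”. In versions of R before 4.0.0, read.csv() converts text columns to factors automatically when we import data; from R 4.0.0 onward the default is stringsAsFactors = FALSE, so text columns stay as character vectors unless we ask for factors. If we end up with factors we don’t want, we’ve got several options: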

Convert the factor to a character vector ourselves:

cats$coat <- as.character(cats$coat)
class(cats$coat)
## [1] "character"

Tell R to simply not convert things to factors when we import it (as.is=TRUE is the R equivalent of “don’t touch my stuff!”):

new_cats <- read.csv('cats.csv', as.is=TRUE)
class(new_cats$coat)
## [1] "character"
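
Because the default behaviour depends on the R version, it can be safer to state the choice explicitly with the stringsAsFactors argument. The sketch below passes the CSV contents through read.csv()’s text argument so that it runs without cats.csv on disk; the variable names are made up for illustration.

```r
# The same CSV contents as cats.csv, held in a string.
csv_text <- "coat,weight,likes_string
calico,2.1,1
black,5.0,0
tabby,3.2,1"

# Explicitly ask for factors.
cats_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(cats_factor$coat)

# Explicitly ask for plain character vectors.
cats_text <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(cats_text$coat)
```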

Use the read_csv() function from the readr package. readr is part of the tidyverse and has a number of ways of reading/writing data with more sensible defaults.

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ readr   1.1.1
## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
even_newer_cats <- read_csv('cats.csv')
## Parsed with column specification:
## cols(
##   coat = col_character(),
##   weight = col_double(),
##   likes_string = col_integer()
## )
class(even_newer_cats$coat)
## [1] "character"

© Jeff Stafford // https://jstaf.github.io/r-data-science/