QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page

Workshop 2: Loading and manipulating data

Developed by: Johanna Bradie, Vincent Fugère, Thomas Lamy

Summary: In this workshop, you will learn how to load, view, and manipulate your data in R. You will learn basic commands to inspect and visualize your data, and learn how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses in one simple touch of a key (or maybe two, or three…)! We will then introduce tidyr and dplyr, two powerful tools to manage and re-format your dataset, as well as apply simple or complex functions on subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages.

Link to new Rmarkdown presentation

Link to old Prezi presentation

Download the R script and data for this lesson:

  1. Creating an R project
  2. Writing a script
  3. Loading, exploring and saving data
  4. Learn to manipulate data frames with tidyr, dplyr, maggritr

What is this?

  1. Within RStudio, Projects make it easy to separate and keep your work organized.
  2. All files, scripts, documentation related to a specific project are bound together

Encourages reproducibility and easy sharing.

Use the Create project command (available in the Projects menu and the global toolbar)

One project = one folder

  • Datasets should be stored as comma separated files (.csv) in Data folder.
  • comma separated files (.csv) can be created from almost all applications (Excel, LibreOffice, GoogleDocs)
  • file → save as .csv

Choose file names wisely

  • Good:
    • rawDatasetAgo2017.csv
    • co2_concentrations_QB.csv
    • 01_figIntro.R
  • Bad:
    • final.csv (Uninformative!)
    • safnnejs.csv (Random!)
    • 1-4.csv (Avoid using numbers!)
    • Dont.separate.names.with.dots.csv (Can lead to reading file errors!)

Choose variable names wisely

  • Use short informative titles (i.e. “Time_1” not “First time measurement”)
    • Good: “Measurements”, “SpeciesNames”, “Site”
    • Bad: “a”, “3”, “supercomplicatedverylongname”
  • Column values must match their intended use

Things to consider with your data

  • No text in numeric columns
  • Do not include spaces!
  • NA (not available) can be used for missing values, and blank entries will automatically be replaced with NA
  • Name your variables informatively
  • Look for typos!
  • Avoid numeric values for data that do not have a numeric meaning (i.e. subject, replicate, treatment)
    • For example, if subjects are “1,2,3” change to “A,B,C” or “S1,S2,S3”
  • Use CONSISTENT formats for dates, numbers, metrics, etc.
  • Do not include notes, additional headings, or merged cells!
  • One variable per column!

Bad data examples

It is possible to do all your data preparation work within R. This has several benefits:

  • Saves time for large datasets
  • Keeps original data intact
  • Keeps track of the manipulation and transformation you did
  • Can switch between long and wide format data very easily (more on this later and in workshop 4)

An R script is a text file that contains all of the commands you will use. Once written and saved, your R script will allow you to make changes and re-run analyses with little effort.

To use a script, just highlight commands and press “Run” or press command-enter (Mac) or ctrl-enter (PC).

Creating an R script

Commands & Comments

Use the '# symbol' to denote comments in scripts. The '# symbol' tells R to ignore anything remaining on a given line of the script when running commands.

Since comments are ignored when running script, they allow you to leave yourself notes in your code or tell collaborators what you did. A script with comments is a good step towards reproducible science, and annotating someone's script is a good way to learn. Try to be as detailed as possible!

# This is a comment, not a command

It is recommended that you use comments to put a header at the beginning of your script with essential information: project name, author, date, version of R

## QCBS R Workshop ##
## Workshop 2 - Loading and manipulating data
## Author: Quebec Center for Biodiversity Science
## Date: Fall 2014
## R version 2.15.0

Section Heading

You can use four # signs in a row to create section headings to help organize your script. This allows you to move quickly between sections and hide sections. For example:

#### Housekeeping ####

RStudio displays a small arrow next to the line number where the section heading was created. If you click on the arrow, you will hide this section of the script.

You can also move quickly between sections using the drop-down menu at the bottom of the script window.

Housekeeping

The first command at the top of all scripts should be rm(list=ls()). This will clear R's memory, and will help prevent errors such as using old data that has been left in your workspace.

|
rm(list=ls())  # Clears R workspace
?rm
?ls

We can test this command by adding data to the workspace and seeing how rm(list=ls()) will remove it.

|
A<-"Test"     # Put some data into workspace, to see how rm(list=ls()) removes it
A <- "Test"   # Note that you can use a space before or after <-
A = "Test"    # <- or = can be used equally
 
# Note that it is best practice to use "<-" for assignment instead of "="
 
A
rm(list=ls())
A

Important Reminders

  1. R is ready for commands when you see the chevron '>' displayed in the terminal. If the chevron isn't displayed, it means you typed an incomplete command and R is waiting for more input. Press ESC to exit and get R ready for a new command.
  2. R is case sensitive. i.e. “A” is a different object than “a”
|
a<-10  
A<-5
a
A
 
rm(list=ls())  # Clears R workspace again

Working directory

R needs to know the directory where your data and files are stored in order to load them. You can see which directory you are currently working in by using the getwd() command.

|
getwd() # This commands shows the directory you are currently working in

When you load a script, R automatically sets the working directory to the folder containing the script.

To change working directories using the setwd() function, specify the working directory's path using a “/” to separate folders, subfolders and file names. You can also click Session > Set working directory > Choose directory…

Display the content of the working directory

The command dir() displays the content of the working directory.

|
dir() # This command shows the content of the directory you are currently working in

You can check:

  • Whether or not the file you plan to open is present in the current directory
  • The correct spelling of the file name (e.g. 'myfile.csv' instead of 'MyFile.csv')

Importing data

Use the read.csv() command to import data in R.

|
CO2 <- read.csv("co2_good.csv") # Creates an object called CO2 by loading data from a file called "co2_good.csv" 

This command specifies that you will be creating an R object named “CO2” by reading a csv file called “co2_good.csv”. This file must be located in your current working directory.

Recall that the question mark can be used to find out what arguments the function requires.

|
?read.csv # Use the question mark to pull up the help page for a command  

In the help file you will note that adding the argument header=TRUE tells R that the first line of the spreadsheet contains column names and not data.

|
CO2 <- read.csv("co2_good.csv", header = TRUE) 

NOTE: If your operating system or CSV editor is in French, you may need to use read.csv2() instead of read.csv()

Notice that RStudio now provides information on the CO2 data in your workspace. The workspace refers to all the objects that you create during an R session.

Looking at data

The CO2 dataset consists of repeated measurements of CO2 uptake from six plants from Quebec and six plants from Mississippi at several levels of ambient CO2 concentration. Half of the plants of each type were chilled overnight before the experiment began.

There are some common commands that are useful to look at imported data:

CO2 Look at the whole data frame
head(CO2) Look at the first few rows
tail(CO2) Look at the last few rows
names(CO2) Names of the columns in the data frame
attributes(CO2) Attributes of the data frame
dim(CO2) Dimensions of the data frame
ncol(CO2) Number of columns
nrow(CO2) Number of rows
summary(CO2) Summary statistics
str(CO2) Structure of the data frame

The str() command is very useful to check the data type/mode for each column (i.e. to check that all factors are factors, and numeric data is stored as an integer or numeric. There are many common problems:

  • Factors loaded as text (character) and vice versa
  • Factors including too many levels because of a typo
  • Numeric or integer data being loaded as character due to a typo (including a space or using a comma instead of a “.” for a decimal)

Exercise

Try to reload the data using:

|
CO2<-read.csv("co2_good.csv",header=FALSE)

Check the str() of CO2. What is wrong here? Reload the data with header=TRUE before continuing.

Reminder from workshop 1: Accessing data

Data within a data frame can be extracted by several means. Let's consider a data frame called mydata. Use square brackets to extract the content of a cell.

|
mydata[2,3] # extracts the content of row 2 / column 3

If column number is omitted, the whole row is extracted.

|
mydata[1,] # extracts the content of the first row

The squared brackets can also be used recursively

|
mydata[,1][2] # this extracts the second content of the first column

If row number is omitted, the whole column is extracted. Similarly, the $ sign followed by the corresponding header can be used.

|
mydata$Variable1 # extracts a specific column by its name ("Variable1")

Renaming variables

Variable names (i.e. column names) can be changed within R.

|
# First let's make a copy of the dataset to play with!
CO2copy <- CO2
# names() gives you the names of the variables present in the data frame 
names(CO2copy)
 
# Changing from English to French names (make sure you have the same levels!)
names(CO2copy) <- c("Plante","Categorie", "Traitement", "conc","absortion")

Creating new variables

New variables can be easily created and populated. For example, variables and strings can be concatenated together using the function paste().

|
# Let's create an unique id for our samples using the function paste()
# see ?paste and ?paste0
# Don't forget to use "" for strings 
CO2copy$uniqueID <- paste0(CO2copy$Plante,"_", CO2copy$Categorie, "_", CO2copy$Traitement)
 
# Observe the results 
head(CO2copy$uniqueID)

Creating new variables works for numbers and mathematical operations as well!

|
# Let's standardize our variable "absortion" to relative values 
CO2copy$absortionRel = CO2copy$absortion/max(CO2copy$absortion) # Changing to relative values 
 
# Observe the results 
head(CO2copy$absortionRel)

Subsetting data

There are many ways to subset a data frame.

|
# Let's keep working with our CO2copy data frame 
 
## Subsetting by variable name 
CO2copy[,c("Plante", "absortionRel")] # Selects only "Plante" and "absortionRel" columns. (Don't forget the ","!)
 
## Subsetting by row 
CO2copy[1:50,] # Subset data frame from rows from 1 to 50 
 
### Subsetting by matching with a factor level 
CO2copy[CO2copy$Traitement == "nonchilled",] # Select observations matching only the nonchilled Traitement. 
 
### Subsetting according to a numeric condition 
CO2copy[CO2copy$absortion >= 20, ] # Select observations with absortion higher or equal to 20  
 
### Conditions can be complimentary -The & (and) argument-
CO2copy[CO2copy$Traitement  == "nonchilled" & CO2copy$absortion >= 20, ]
 
# We are done playing with the dataset copy. Let's erase it.
CO2copy <- NULL

Go here to check all the logical operators you can use to subset a data frame in R

Data exploration

A good way to start your data exploration is to look at some basic statistics of your dataset using the summary() function.

|
summary(CO2) # Get summary statistics of your dataset

You can also use some other functions to calculate basic statistics about specific parts of your data frame, using mean(), sd(), hist(), and print().

|
# Calculate mean and standard deviation of the concentration, and assign them to new variables
meanConc <- mean(CO2$conc)
sdConc <- sd(CO2$conc)
 
# print() prints any given value to the R console
print(paste("the mean of concentration is:", meanConc))
print(paste("the standard deviation of concentration is:", sdConc))
 
# Let's plot a histogram to explore the distribution of "uptake"
hist(CO2$uptake)
 
# Increasing the number of bins to observe better the pattern 
hist(CO2$uptake, breaks = 40)

The function apply() can be used to apply a function to multiple columns of your data simultaneously. Use the ?apply command to get more information about apply().

|
?apply

To use apply, you have to specify three arguments. The first argument is the data you would like to apply the function to; the second argument is whether you would like to calculate based on rows (1) or columns (2) of your dataset; the third argument is the function you would like to apply. For example:

|
apply(CO2[,4:5], MARGIN = 2, FUN = mean) # Calculate mean of the two columns in the data frame that contain continuous data

Save your workspace

By saving your workspace, you can save the script and the objects currently loaded into R. If you save your workspace, you can reload all of the objects even after you use the rm(list=ls()) command to delete everything in the workspace.

Use save.image() to save the workplace:

|
save.image(file="co2_project_Data.RData") # Save workspace
 
rm(list=ls())  # Clears R workspace
 
load("co2_project_Data.RData") #Reload everything that was in your workspace
 
head(CO2) # Looking good! :)

Exporting data

If you want to save a data file that you have created or edited in R, you can do so using the write.csv() command. Note that the file will be written into the current working directory.

|
write.csv(CO2,file="co2_new.csv") # Save object CO2 to a file named co2_new.csv

Use your data CHALLENGE

Try to load, explore, plot and save your own data in R. Does it load properly? If not, try fixing it in R. Save your fixed data and then try opening it in Excel.

Data can be messy, there are compatibility issues. For example, sharing data from a Mac to Windows or between computers set up in different continents can lead to weird datasets.

Let's practice how to solve some common errors.

Fix a broken dataframe CHALLENGE

# Read co2_broken.csv file into R and find the problems

|
CO2<-read.csv("co2_broken.csv") # Overwrite CO2 object with broken CO2 data 
head(CO2) # Looks messy
CO2 # Indeed!
  • This is probably what your data or downloaded data looks like.
  • You can fix the data frame in R (or not…)
  • Give it a try before looking at the solution!
  • Work with your neighbours and have fun :)

Some useful functions:

  • ?read.csv
  • head()
  • str()
  • class()
  • unique()
  • levels()
  • which()
  • droplevels()

Note: For these functions you have to put the name of the data object in the parentheses (i.e. head(CO2)). Also remember that you can use “?” to look up help for a function (i.e. ?str).

HINT: There are 4 problems!

Answers:

Answer #1

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

Problem #1: The data appears to be lumped into one column

Solution:

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

Re-import the data, but specify the separation among entries. The sep argument tells R what character separates the values on each line of the file. Here, “TAB” was used instead of “,”.

|
CO2 <- read.csv("co2_broken.csv",sep = "")
?read.csv

Answer #2

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

Problem #2: The data does not start until the third line of the txt file, so you end up with notes on the file as the headings.

|
head(CO2) # The head() command allows you to see that the data has not been read in with the proper headings

Solution:

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

To fix this problem, you can tell R to skip the first two rows when reading in this file.

|
CO2<-read.csv("co2_broken.csv",sep = "",skip=2)  # By adding the skip argument into the read.csv function, R knows to skip the first two rows
head(CO2) # You can now see that the CO2 object has the appropriate headings

Answer #3

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

Problem #3: “conc” and “uptake” variables are considered factors instead of numbers, because there are comments/text in the numeric columns.

|
str(CO2) # The str() command shows you that both 'conc' and 'uptake' are labelled as factors
class(CO2$conc)
unique(CO2$conc) # By looking at the unique values in this column, you see that both columns contain "cannot_read_notes" 
unique(CO2$uptake) 
?unique

Solution:

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

|
?read.csv

|
CO2 <- read.csv("co2_broken.csv",sep = "",skip = 2,na.strings = c("NA","na","cannot_read_notes")) 

By identifying “cannot_read_notes” as NA data, R reads these columns properly. Remember that NA (capital!) stands for not available.

|
head(CO2)
str(CO2) # You can see that conc variable is now an integer and the uptake variable is now treated as numeric

Answer #4

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

Problem #4: There are only two treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.

|
str(CO2) # You can see that 4 levels are listed for Treatment
levels(CO2$Treatment)
unique(CO2$Treatment) # The 4 different treatments are "nonchilled", "nnchilled", "chilled", and "chiled"  

Solution:

Click to display ⇲

Click to hide ⇱

Click to hide ⇱

|
# You can use which() to find rows with the typo "nnchilled"
which(CO2$Treatment=="nnchilled") # Row number ten
# You can then correct the error using indexing:
CO2$Treatment[10] <- "nonchilled"
# Alternatively, doing it with a single command:
CO2$Treatment[which(CO2$Treatment=="nnchilled")] <- "nonchilled"
# Now doing the same for "chiled":
CO2$Treatment[which(CO2$Treatment=="chiled")] <- "chilled" 

Have we fixed the problem?

|
str(CO2)  # Structure still identifies 4 levels of the factor
unique(CO2$Treatment) # But, unique says that only two are used
CO2<-droplevels(CO2) # This command drops the unused levels from all factors in the data frame
str(CO2) # Fixed!

Using tidyr to reshape data frames

Why "tidy" your data?

Tidying allows you to manipulate the structure of your data while preserving all original information. Many functions in R require (or work better) with a data structure that isn't always easily readable by people.

In contrast to aggregation, which reduces many cells in the original data set to one cell in the new dataset, tidying preserves a one-to-one connection. Although aggregation can be done with many functions in R, the tidyr package allows you to both reshape and aggregate within a single syntax.

Install / Load the tidyr() package:

|
if(!require(tidyr)){install.packages("tidyr")}
library(tidyr)

Wide vs. long data

Wide format data has a separate column for each variable or each factor in your study. One row therefore can therefore include several different observations.

Long format data has a column stating the measured variable types and a column containing the values associated to those variables (each column is a variable, each row is one observation). This is considered “tidy” data because it is easily interpreted by most packages for visualization and analysis in R.

The format of your data depends on your specific needs, but some functions and packages such as dplyr, lm(), glm(), gam() require long format data. The ggplot2 package can use wide data format for some basic plotting, but more complex plots require the long format (example to come).

Additionally, long form data can more easily be aggregated and converted back into wide form data to provide summaries, or to check the balance of sampling designs.

We can use the tidyr package to to manipulate the structure of your data while preserving all original information, using the following functions:

  • 1. gather() our data (wide –> long)
  • 2. spread() our data (long –> wide)

Let's pretend you send out your field assistant to measure the diameter at breast height (DBH) and height of three tree species for you. The result is this “wide” data set.

|
> wide <- data.frame(Species = c("Oak", "Elm", "Ash"),
                          DBH = c(12, 20, 13),
                       Height = c(56, 85, 55))
> wide
  Species DBH Height
1     Oak  12     56
2     Elm  20     85
3     Ash  13     55

gather(): Making your data long

|
?gather

Most of the packages in the Hadleyverse will require long format data where each row is an entry and each column is a variable. Let's try to “gather” the this wide data using the gather() function in tidyr. gather() takes multiple columns, and gathers them into key-value pairs.

The function requires at least 3 arguments:

  • data: a data frame (e.g. “wide”)
  • key: name of the new column containing variable names (e.g. “Measurement”)
  • value: name of the new column containing variable values (e.g. “Value”)
  • : name or numeric index of the columns we wish to gather (e.g. “DBH” or “Height”)
|
# Gathering columns into rows
 
> long <- gather(wide, Measurement, Value, DBH, Height)
> long
  Species Measurement Value
1     Oak         DBH 12
2     Elm         DBH 20
3     Ash         DBH 13
4     Oak      Height 56
5     Elm      Height 85
6     Ash      Height 55

Let's try this with the C02 dataset. Here we might want to collapse the last two quantitative variables:

|
CO2.long <- gather(CO2, response, value, conc, uptake)
head(CO2)
head(CO2.long)
tail(CO2.long)

spread(): Making your data wide

spread() uses the same syntax as gather(). The function requires 3 arguments:

  • data: A data frame (e.g. “long”)
  • key: Name of the column containing variable names (e.g. “Measurement”)
  • value: Name of the column containing variable values (e.g. “Value”)
|
# Spreading rows into columns
> wide2 <- spread(long, Measurement, Value)
> wide2
  Species DBH Height
1     Ash  13     55
2     Elm  20     85
3     Oak  12     56

separate(): Separate two (or more) variables in a single column

Some times you might have really messy data that has two variables in one column. Thankfully the separate() function can (wait for it) separate the two variables into two columns.

The separate() function splits a columns by a character string separator. It requires 4 arguments:

  • data: A data frame (e.g. “long”)
  • col: Name of the column you wish to separate
  • into: Names of new variables to create
  • sep: Character which indicates where to separate

Let's create a really messy data set:

|
set.seed(8)
messy <- data.frame(id = 1:4,
                          trt = sample(rep(c('control', 'farm'), each = 2)),
               zooplankton.T1 = runif(4),
                      fish.T1 = runif(4),
               zooplankton.T2 = runif(4),
                      fish.T2 = runif(4))
messy

First, we want to convert this wide dataset to long format.

|
messy.long <- gather(messy, taxa, count, -id, -trt)
head(messy.long)

Then, we want to split those two sampling times (T1 & T2). The syntax we use here is to tell R separate(data, what column, into what, by what). The tricky part here is telling R where to separate the character string in your column entry using a regular expression to describe the character that separates them. Here, the string should be separated by the period “.”.

|
messy.long.sep <- separate(messy.long, taxa, into = c("species", "time"), sep = "\\.")
head(messy.long.sep) 

The argument sep = “\\.” tells R to splits the character string around the period (.). We cannot type directly “.” because it is a regular expression that matches any single character.

Recap: tidyr

tidyr is a package that reshapes the layout of data sets.

  • Convert from wide format to long format using gather()
  • Convert from long format to wide format using spread()
  • Split and merge columns with unite() and separate()

Here's cheat sheet to help you use tidyr and dplyr for more data wrangling: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

tidyr CHALLENGE

Using the airquality dataset, gather() all the columns (except Month and Day) into rows. Then spread() the resulting dataset to return the same data format as the original data.

|
?air.quality
data(airquality)

Solution


Intro to dplyr

The vision of the dplyr package is to simplify data manipulation by distilling all the common data manipulation tasks to a set of intuitive functions (or “verbs”). The result is a comprehensive set of tools that facilitates data manipulation, such as filtering rows, selecting specific columns, re-ordering rows, adding new columns and summarizing data.

In addition to ease of use, it is also an amazing package because:

  • it can crunch huge datasets wicked fast (written in Cpp)
  • it plays nice with the RStudio IDE and other packages in the Hadleyverse
  • it can interface with external databases and translate your R code into SQL queries
  • if Batman was an R package, he would be dplyr (mastering fear of data, adopting cool technologies)

Certain R base functions work similarly to dplyr functions, including: split(), subset(), apply(), sapply(), lapply(), tapply() and aggregate()

Let's install and load the dplyr package:

|
if(!require(dplyr)){install.packages("dplyr")}
library(dplyr)

The dplyr package is built around a core set of “verbs” (or functions). We will start with the following 4 verbs because these operations are ubiquitous in data manipulation:

  • select(): select columns from a data frame
  • filter(): filter rows according to defined criteria
  • arrange(): re-order data based on criteria (e.g. ascending, descending)
  • mutate(): create or transform values in a column

Select a subset of columns with ''select()''

The general syntax for this function is select(dataframe, column1, column2, …). Most dplyr functions will follow a similarly simple syntax. select() requires at least 2 arguments:

  • data: the dataset to manipulate
  • : column names, positions, or complex expressions (separated by commas)

For example:

|
select(data, column1, column2) # select columns 1 and 2
select(data, c(2:4,6) # select columns 2 to 4 and 6
select(data, -column1) # select all columns except column 1
select(data, start_with(x.)) # select all columns that start with "x."

Here are more examples of how to use select():

The airquality dataset contains several columns:

|
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

For example, suppose we are only interested in the variation of “Ozone” over time within the airquality dataset, then we can select the subset of required columns for further analysis:

|
> ozone <- select(airquality, Ozone, Month, Day)
> head(ozone)
  Ozone Month Day
1    41     5   1
2    36     5   2
3    12     5   3
4    18     5   4
5    NA     5   5
6    28     5   6

Select a subset of rows with ''filter()''

A common operation in data manipulation is the extraction of a subset based on specific conditions. The general syntax for this function is filter(dataframe, logical statement 1, logical statement 2, …).

Remember that logical statements provide a TRUE or FALSE answer. The filter() function retains all the data for which the statement is TRUE. This can also be applied on characters and factors. Here is a useful reminder of how logic works in R.

For example, in the airquality dataset, suppose we are interested in analyses that focus on the month of August during high temperature events:

|
> august <- filter(airquality, Month == 8, Temp >= 90)
> head(august)
  Ozone Solar.R Wind Temp Month Day
1    89     229 10.3   90     8   8
2   110     207  8.0   90     8   9
3    NA     222  8.6   92     8  10
4    76     203  9.7   97     8  28
5   118     225  2.3   94     8  29
6    84     237  6.3   96     8  30

Sorting rows with ''arrange()''

In data manipulation, we sometimes need to sort our data (e.g. numerically or alphabetically) for subsequent operations. A common example of this is a time series.

The arrange() function re-orders rows by one or multiple columns, using the following syntax: arrange(data, variable1, variable2, …).

By default, rows are sorted in ascending order. Note that we can also sort in descending order by placing the target column in desc() inside the arrange() function as follows: arrange(data, variable1, desc(variable2), …).

Example: Let's use the following code to create a scrambled version of the airquality dataset

|
> air_mess <- sample_frac(airquality, 1)
> head(air_mess)
    Ozone Solar.R Wind Temp Month Day
21      1       8  9.7   59     5  21
42     NA     259 10.9   93     6  11
151    14     191 14.3   75     9  28
108    22      71 10.3   77     8  16
8      19      99 13.8   59     5   8
104    44     192 11.5   86     8  12

Now, let's arrange the data frame back into chronological order, sorting by Month, and then by Day:

|
> air_chron <- arrange(air_mess, Month, Day)
> head(air_chron)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

Try to see the difference when we change the order of the target columns:

|
arrange(air_mess, Day, Month)

Create and populate columns with ''mutate()''

Besides subsetting or sorting your data frame, you will often require tools to transform your existing data or generate some additional data based on existing variables. We can use the function mutate() to compute and add new columns in your dataset.

The mutate() function follows this syntax: mutate(data, newVar1 = expression1, newVar2 = expression2, …).

Let's create a new column using mutate(). For example, suppose we would like to convert the temperature variable from degrees Fahrenheit to degrees Celsius:

|
> airquality_C <- mutate(airquality, Temp_C = (Temp-32)*(5/9))
> head(airquality_C)
  Ozone Solar.R Wind Temp Month Day   Temp_C
1    41     190  7.4   67     5   1 19.44444
2    36     118  8.0   72     5   2 22.22222
3    12     149 12.6   74     5   3 23.33333
4    18     313 11.5   62     5   4 16.66667
5    NA      NA 14.3   56     5   5 13.33333
6    28      NA 14.9   66     5   6 18.88889

Note that the syntax here is quite simple, but within a single call of the mutate() function, we can replace existing columns, we can create multiple new columns, and each new column can be created using newly created columns within the same function call.


The magrittr package brings a new and exciting tool to the table: a pipe operator. Pipe operators provide ways of linking functions together so that the output of a function flows into the input of next function in the chain. The syntax for the magrittr pipe operator is %>%. The magrittr pipe operator truly unleashes the full power and potential of dplyr, and we will be using it for the remainder of the workshop. First, let's install and load it:

|
if(!require(magrittr)){install.packages("magrittr")}
require(magrittr)

Using it is quite simple, and we will demonstrate that by combining some of the examples used above. Suppose we wanted to filter() rows to limit our analysis to the month of June, then convert the temperature variable to degrees Celsius. We can tackle this problem step by step, as before:

|
june_C <- mutate(filter(airquality, Month == 6), Temp_C = (Temp-32)*(5/9))

This code can be difficult to decipher because we start on the inside and work our way out. As we add more operations, the resulting code becomes increasingly illegible. Instead of wrapping each function one inside the other, we can accomplish these 2 operations by linking both functions together:

|
june_C <- airquality %>% 
    filter(Month == 6) %>%
    mutate(Temp_C = (Temp-32)*(5/9))

Notice that within each function, we have removed the first argument which specifies the dataset. Instead, we specify our dataset first, then “pipe” into the next function in the chain.

The advantages of this approach are that our code is less redundant and functions are executed in the same order we read and write them, which makes its easier and quicker to both translate our thoughts into code and read someone else's code and grasp what is being accomplished. As the complexity of your data manipulations increases, it becomes quickly apparent why this is a powerful and elegant approach to writing your dplyr code.

Quick tip: In RStudio we can insert this pipe quickly using the following hotkey: Ctrl (or Cmd for Mac) +Shift+M.

The dplyr verbs we have explored so far can be useful on their own, but they become especially powerful when we link them with each other using the pipe operator (%>%) and by applying them to groups of observations. The following functions allow us to split our data frame into distinct groups on which we can then perform operations individually, such as aggregating/summarising:

  • group_by(): group data frame by a factor for downstream commands (usually summarise)
  • summarise(): summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. min(), max(), mean(), etc…)

These verbs provide the needed backbone for the Split-Apply-Combine strategy that was initially implemented in the plyr package on which dplyr is built.

Let's demonstrate the use of these with an example using the airquality dataset. Suppose we are interested in the mean temperature and standard deviation within each month:

|
> month_sum <- airquality %>% 
      group_by(Month) %>% 
      summarise(mean_temp = mean(Temp),
                sd_temp = sd(Temp))
> month_sum
Source: local data frame [5 x 3]
 
  Month mean_temp  sd_temp
  (int)     (dbl)    (dbl)
1     5  65.54839 6.854870
2     6  79.10000 6.598589
3     7  83.90323 4.315513
4     8  83.96774 6.585256
5     9  76.90000 8.355671

Using the ChickWeight dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study. Employ dplyr verbs and the %>% operator.

|
?ChickWeight
data(ChickWeight)

Solution


Note that we can group the data frame using more than one factor, using the general syntax as follows: group_by(group1, group2, …)

Within group_by(), the multiple groups create a layered onion, and each subsequent single use of the summarise() function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on group2, the resulting data set would remain grouped by group1 for downstream operations.

Using the ChickWeight dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ dplyr verbs and the %>% operator. (Hint: first() and last() may be useful here.)

Solution


In addition to all the operations we have explored, dplyr also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other R packages:

  • left_join()
  • right_join()
  • inner_join()
  • anti_join()

These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs.