QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.

Workshop 2: Loading and manipulating data

Developed by: Johanna Bradie, Vincent Fugère, Thomas Lamy

Summary: In this workshop, you will learn how to load and view your data in R. You will learn basic commands to inspect and visualize your data, and how to fix errors that may have occurred while loading your data into R. In addition, you will learn how to write an R script, which is a text file that contains your R commands and allows you to rerun your analyses with a single keystroke (or maybe two, or three…)!

Link to associated Prezi: Prezi

Download the R script and data for this lesson:

  1. Writing a script
  2. Loading, exploring and saving data
  3. Fixing a broken data frame

An R script is a text file that contains all of the commands you will use. Once written and saved, your R script will allow you to make changes and re-run analyses with little effort.

To use a script, just highlight commands and press “Run” or press command-enter (Mac) or ctrl-enter (PC).

Commands & Comments

Use the # symbol to denote comments in scripts. The # symbol tells R to ignore everything that follows it on that line of the script when running commands.

Since comments are ignored when running a script, they allow you to leave yourself notes in your code or tell collaborators what you did. A script with comments is a good step towards reproducible science, and annotating someone's script is a good way to learn.

# This is a comment, not a command

It is recommended that you use comments to put a header at the beginning of your script with essential information: project name, author, date, version of R

## QCBS R Workshop ##
## Workshop 2 - Loading and manipulating data
## Author: Quebec Center for Biodiversity Science
## Date: Fall 2014
## R version 2.15.0
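
If you are unsure which version of R you are running, R can report it for you. The lines below are a minimal sketch using base R only; the object name r_version is just an illustration.

```r
# Store and display the version of R in use, so it can be copied into the header
r_version <- R.version.string
r_version

# sessionInfo() additionally lists the operating system and the loaded packages
sessionInfo()
```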

Section Heading

You can use four # signs in a row to create section headings to help organize your script. For example:

#### Housekeeping ####

RStudio displays a small arrow next to the line number where the section heading was created. If you click on the arrow, you will hide this section of the script.

Housekeeping

It is good practice to have a command at the top of your script to clear the R memory. This will help prevent errors such as using old data that has been left in your workspace. The command rm(list=ls()) will clear memory.

|
rm(list=ls())  # Clears R workspace
?rm
?ls

We can test this command by adding data to the workspace and seeing how rm(list=ls()) will remove it.

|
A<-"Test"     # Put some data into workspace, to see how rm(list=ls()) removes it
A <- "Test"   # Note that you can use a space before or after <-
A = "Test"    # <- or = can be used equally
 
#Note that it is best practice to use "<-" for assignment instead of "="
 
A
rm(list=ls())
A

Important Reminders

  1. R is ready for commands when you see the chevron '>' displayed in the console. If the chevron isn't displayed (R shows a '+' instead), it means you typed an incomplete command and R is waiting for more input. Press ESC to exit and get R ready for a new command.
  2. R is case sensitive, e.g. “A” is a different object than “a”.
|
a<-10  
A<-5
a
A
 
rm(list=ls())  # Clears R workspace again

Working Directory

R needs to know the directory where your data and files are stored in order to load them. You can see which directory you are currently working in by using the getwd() command.

|
getwd() # This command shows the directory you are currently working in

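If you start RStudio by opening a script file, the working directory is usually set to the folder containing that script. Opening a script in a session that is already running does not change the working directory, so it is worth checking it with getwd().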

To specify a path, use a “/” to separate folders, subfolders and file names.

There are several ways you can set the working directory:

  • You can simply type the full path of the directory in the parentheses of the command setwd(). For example:
|
setwd('/Users/vincentfugere/Desktop/QCBS_R_Workshop2')  # Mac Example 
setwd('C:/Users/Johanna/Documents/PhD/R_Workshop2')   # Windows Example
# Note that these paths will NOT work on your computer!
  • On Windows, you can use choose.dir() to get a pop up to navigate to the appropriate directory.
|
setwd(choose.dir())  # choose.dir() is only available on Windows; it will not work on a Mac.
  • In RStudio, you can click on Session / Set Working Directory / Choose Directory
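
The options above can be combined in a script so that the working directory is changed and later restored. This is a minimal sketch: tempdir() stands in for your own project folder, and the object names old_wd, project_dir and data_path are illustrative.

```r
# Remember the current working directory so it can be restored later
old_wd <- getwd()

# tempdir() stands in for your project folder in this sketch
project_dir <- tempdir()
setwd(project_dir)
getwd()

# file.path() joins folder and file names with "/" on every operating system
data_path <- file.path(project_dir, "CO2_good.csv")
data_path

# Return to the original working directory
setwd(old_wd)
```

Storing the old directory first means the script leaves your session where it found it.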

Display The Content Of The Working Directory

The command dir() displays the content of the working directory.

|
dir() # This command shows the content of the directory you are currently working in

You can check:

  • Whether or not the file you plan to open is present in the current directory
  • The correct spelling of the file name (e.g. 'myfile.csv' instead of 'MyFile.csv')
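
Both checks can also be done with commands. The sketch below creates an empty file in a temporary folder (the folder and file names are made up for the example) and then looks for it; dir() is an alias of list.files().

```r
# Create an empty example file in a temporary folder (hypothetical names)
example_dir <- file.path(tempdir(), "wd_example")
dir.create(example_dir, showWarnings = FALSE)
file.create(file.path(example_dir, "myfile.csv"))

# List only the csv files in that folder
csv_files <- list.files(example_dir, pattern = "\\.csv$")
csv_files

# Compare names exactly: %in% is case sensitive
"myfile.csv" %in% csv_files   # TRUE
"MyFile.csv" %in% csv_files   # FALSE
```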

Importing data

Use the read.csv() command to import data into R.

|
CO2<-read.csv("CO2_good.csv") # Creates an object called CO2 by loading data from a file called "CO2_good.csv" 

This command specifies that you will be creating an R object named “CO2” by reading a csv file called “CO2_good.csv”. This file must be located in your current working directory.

Alternatively, you can choose the file to load interactively using the file.choose() command.

|
CO2<-read.csv(file.choose()) 

Recall that the question mark can be used to pull up the help page for a command.

|
?read.csv # Use the question mark to pull up the help page for a command  

In the help file you will note that the argument header=TRUE tells R that the first line of the spreadsheet contains column names and not data. This is the default for read.csv(), so writing it out simply makes the choice explicit.

|
CO2<-read.csv("CO2_good.csv", header = TRUE) 

NOTE: If you have a French operating system or CSV editor, your file may use semicolons to separate values and commas as decimal marks. In that case, you may need to use read.csv2() instead of read.csv().
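
To see what read.csv2() expects, here is a small sketch that writes a semicolon-separated file with decimal commas to a temporary location and reads it back. The values and the object names euro_file and euro_data are made up for the illustration.

```r
# Write a small semicolon-separated file with decimal commas (made-up values)
euro_file <- tempfile(fileext = ".csv")
writeLines(c("conc;uptake", "95;16,0", "175;30,4"), euro_file)

# read.csv2() assumes sep = ";" and dec = ","
euro_data <- read.csv2(euro_file)
str(euro_data)   # both columns are read as numbers
```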

Looking at Data

The CO2 dataset consists of repeated measurements of CO2 uptake from six plants from Quebec and six plants from Mississippi at several levels of ambient CO2 concentration. Half of the plants of each type were chilled overnight before the experiment began.

There are some common commands that are useful to look at imported data:

Command            Description
CO2                Look at the whole data frame
head(CO2)          Look at the first few rows
names(CO2)         Names of the columns in the data frame
attributes(CO2)    Attributes of the data frame
ncol(CO2)          Number of columns
nrow(CO2)          Number of rows
summary(CO2)       Summary statistics
str(CO2)           Structure of the data frame

The str() command is very useful to check the data type/mode of each column (i.e. to check that all factors are factors, and that numeric data is stored as integer or numeric). There are many common problems:

  • Factors loaded as text (character) and vice versa
  • Factors including too many levels because of a typo
  • Numeric or integer data being loaded as character due to a typo (including a space or using a comma instead of a “.” for a decimal)
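
The last problem can be tracked down with a few commands. The sketch below uses a made-up file in which one uptake value contains the letter O instead of a zero; the file contents and object names are assumptions for the illustration, not part of the workshop data.

```r
# Write a small file with a typo in a numeric column ("3O.4" uses the letter O)
typo_file <- tempfile(fileext = ".csv")
writeLines(c("plant,uptake", "Qn1,16.0", "Qn2,3O.4", "Qn3,34.8"), typo_file)

typo_data <- read.csv(typo_file)
str(typo_data)   # uptake is not numeric because of the typo

# Convert to numbers; entries that are not valid numbers become NA
uptake_num <- suppressWarnings(as.numeric(as.character(typo_data$uptake)))

# Rows where the conversion failed point to the typo
bad_rows <- which(is.na(uptake_num))
typo_data[bad_rows, ]
```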

Exercise

Try to reload the data using:

|
CO2<-read.csv("CO2_good.csv",header=FALSE)

Check the str() of CO2. What is wrong here? Reload the data with header=TRUE before continuing.

Reminder from workshop 1: Accessing Data

Data within a data frame can be extracted by several means. Let's consider a data frame called mydata. Use square brackets to extract the content of a cell.

|
mydata[2,3] # extracts the content of row 2 / column 3

If column number is omitted, the whole row is extracted.

|
mydata[1,] # extracts the content of the first row

If row number is omitted, the whole column is extracted. Similarly, the $ sign followed by the corresponding header can be used.

|
mydata[,1] # extracts the content of the first column
mydata$header # extracts the content of the column which has the corresponding header
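
If you do not have a data frame at hand, the sketch below builds a small one to try these commands on. The values are made up, and mydata is the placeholder name used above.

```r
# A small made-up data frame to practise indexing on
mydata <- data.frame(
  plant  = c("Qn1", "Qn2", "Mc1"),
  conc   = c(95, 175, 250),
  uptake = c(16.0, 30.4, 10.6)
)

mydata[2, 3]         # row 2, column 3: 30.4
mydata[1, ]          # the whole first row
mydata[, 1]          # the whole first column
mydata$uptake        # the column named "uptake"
mydata[, "uptake"]   # the same column, selected by name inside the brackets
```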

Data Exploration

It can be very useful to plot all variable combinations when you are examining your data.

|
plot(CO2) # Plot of all variable combinations

Do you want to see if one of your variables is normally distributed? Use the hist() command.

|
hist(CO2$uptake) # The $ is used to extract a specific column from a data frame by name.

There are many built in functions in R that can be used to obtain information about your data. Two commonly used functions are mean() and sd().

|
conc_mean<-mean(CO2$conc) # Calculate mean of the "conc" column of the "CO2" object. Save as "conc_mean"
conc_mean # Display object "conc_mean"
 
conc_sd<-sd(CO2$conc) # Calculate sd of "conc" column and save as "conc_sd"
conc_sd
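
Both functions return NA as soon as the data contains a missing value. The sketch below, using made-up values, shows how the na.rm argument tells them to ignore missing entries.

```r
# Made-up values with one missing entry
uptake_example <- c(16.0, NA, 34.8)

mean(uptake_example)                 # NA, because one value is missing
mean(uptake_example, na.rm = TRUE)   # 25.4, computed from the two known values
sd(uptake_example, na.rm = TRUE)     # standard deviation of the known values
```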

The function apply() can be used to apply a function to multiple columns of your data simultaneously. Use the ?apply command to get more information about apply().

|
?apply

To use apply(), you have to specify three arguments. The first argument is the data you would like to apply the function to; the second argument is whether you would like to calculate based on columns (2) or rows (1) of the data; the third argument is the function you would like to apply. For example:

|
apply(CO2[,4:5], MARGIN = 2, FUN = mean) # Calculate mean of the two columns in the data frame that contain continuous data
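
The sketch below repeats this calculation on the CO2 dataset that ships with R, assuming it has the same conc and uptake columns as the workshop file, and selects the columns by name rather than by position. The object names co2_builtin and col_means are illustrative.

```r
# Use the CO2 dataset that ships with R (assumed to match the workshop file)
co2_builtin <- as.data.frame(datasets::CO2)

# Column-wise means with apply(), selecting the numeric columns by name
col_means <- apply(co2_builtin[, c("conc", "uptake")], MARGIN = 2, FUN = mean)
col_means

# colMeans() is a shortcut for this particular calculation
colMeans(co2_builtin[, c("conc", "uptake")])

# Any function can be supplied, e.g. the standard deviation of each column
apply(co2_builtin[, c("conc", "uptake")], MARGIN = 2, FUN = sd)
```

Selecting columns by name keeps the command correct even if the column order changes.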

Save your workspace

By saving your workspace, you save the objects currently loaded into R (the script itself is a separate text file that you save as usual). If you save your workspace, you can reload all of the objects even after you use the rm(list=ls()) command to delete everything in the workspace.

Use save.image() to save the workspace:

|
save.image(file="CO2_project_Data.RData") # Save workspace
 
rm(list=ls())  # Clears R workspace
 
load("CO2_project_Data.RData") #Reload everything that was in your workspace
 
head(CO2) # Looking good :)
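
If you only want to keep a few objects rather than the whole workspace, save() takes the names of the objects to store. This is a minimal sketch with a made-up object and a temporary file; it is an alternative to save.image(), not a replacement for the example above.

```r
# A made-up object to store
conc_values <- c(95, 175, 250)

# save() writes only the objects that are named, here to a temporary file
rdata_file <- tempfile(fileext = ".RData")
save(conc_values, file = rdata_file)

# Remove the object, then bring it back with load()
rm(conc_values)
load(rdata_file)
conc_values
```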

Exporting data

If you want to save a data file that you have created or edited in R, you can do so using the write.csv() command. Note that the file will be written into the current working directory.

|
write.csv(CO2,file="CO2_new.csv") # Save object CO2 to a file named CO2_new.csv
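
By default, write.csv() also writes the row names, which come back as an extra first column when the file is re-imported. The sketch below shows the difference using the first rows of the CO2 dataset that ships with R; temporary files are used so nothing is written to your working directory.

```r
# First rows of the CO2 dataset that ships with R
co2_head <- head(as.data.frame(datasets::CO2))

# Default: row names are written and return as an extra column named "X"
with_rownames <- tempfile(fileext = ".csv")
write.csv(co2_head, file = with_rownames)
names(read.csv(with_rownames))

# row.names = FALSE writes only the data columns
without_rownames <- tempfile(fileext = ".csv")
write.csv(co2_head, file = without_rownames, row.names = FALSE)
names(read.csv(without_rownames))
```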

Preparing data for R

  • When preparing files for R, you should save them as .csv files.
  • Almost all applications (Excel, GoogleDocs, LibreOffice, etc) can save a file as a csv (comma separated values)
  • Use short informative titles (e.g. “Time_1” not “First time measurement”)
  • Column values must match their intended use.
  • No text in numeric columns, including spaces
  • NA can be used for missing values
  • Avoid numeric values for data that does not have a numeric meaning (e.g. subject, replicate, treatment)
    • For example, if subjects are “1,2,3” change to “A,B,C” or “S1,S2,S3”
  • Do not include notes, additional headings, or merged cells!

It is possible to do all data preparation in R, which has several benefits.

Use your data

Challenge

Try to load, explore, plot and save your own data in R. Does it load properly? If not, try fixing it in R. Save your fixed data and then try opening it in Excel.

Harder Challenge

Read a broken CO2 csv file into R and find the problems.

|
CO2<-read.csv("CO2_broken.csv") # Overwrite CO2 object with broken CO2 data 
head(CO2) # Looks messy
CO2 # Indeed!
  • This is probably what your data or downloaded data looks like.
  • Fix it in R (or not)
  • Give it a try before looking at the solution!
  • Work with your neighbours and have fun :)

Some useful functions:

  • ?read.csv
  • head()
  • str()
  • class()
  • unique()
  • levels()
  • which()
  • droplevels()

Note: For these functions you have to put the name of the data object in the parentheses (e.g. head(CO2)). Also remember that you can use “?” to look up help for a function (e.g. ?str).

HINT: There are 4 problems!

Answers:

Answer #1

Problem #1: The data appears to be lumped into one column

Solution:

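Re-import the data, but specify the separator between entries. The sep argument tells R what character separates the values on each line of the file. Here, tabs were used instead of commas. Setting sep = "" tells R to split on any white space (spaces or tabs); sep = "\t" would specify a tab explicitly.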

|
CO2 <- read.csv("CO2_broken.csv",sep = "")
?read.csv
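
The effect of the sep argument can be reproduced on a small tab-separated file. The file contents and object names below are made up for the illustration.

```r
# Write a small tab-separated file (made-up values)
tab_file <- tempfile(fileext = ".csv")
writeLines(c("Plant\tconc\tuptake",
             "Qn1\t95\t16.0",
             "Qn1\t175\t30.4"), tab_file)

# With the default separator (a comma), everything lands in one column
lumped <- read.csv(tab_file)
ncol(lumped)      # 1

# Naming the tab as separator splits the values into three columns
separated <- read.csv(tab_file, sep = "\t")
ncol(separated)   # 3
```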

Answer #2

Problem #2: The data does not start until the third line of the file, so you end up with the notes at the top of the file as the headings.

|
head(CO2) # The head() command allows you to see that the data has not been read in with the proper headings

Solution:

To fix this problem, you can tell R to skip the first two lines when reading in this file.

|
CO2<-read.csv("CO2_broken.csv",sep = "",skip=2)  # By adding the skip argument to the read.csv function, R knows to skip the first two lines
head(CO2) # You can now see that the CO2 object has the appropriate headings
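
The same idea on a small made-up file: two lines of notes sit above the real column names, and skip = 2 tells R to start reading at the third line. The file contents and object names are assumptions for the illustration.

```r
# Write a tab-separated file with two lines of notes above the header
notes_file <- tempfile(fileext = ".csv")
writeLines(c("NOTE: made-up comment line 1",
             "NOTE: made-up comment line 2",
             "Plant\tconc\tuptake",
             "Qn1\t95\t16.0",
             "Qn1\t175\t30.4"), notes_file)

# Skip the two note lines so that the third line provides the column names
skipped <- read.csv(notes_file, sep = "\t", skip = 2)
names(skipped)
head(skipped)
```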

Answer #3

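Problem #3: “conc” and “uptake” variables are considered factors (or character strings in R 4.0.0 and later, where read.csv() no longer converts text to factors by default) instead of numbers, because there are comments/text in the numeric columns.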

|
str(CO2) # The str() command shows you that both 'conc' and 'uptake' are labelled as factors
class(CO2$conc)
unique(CO2$conc) # By looking at the unique values in this column, you see that both columns contain "cannot_read_notes" 
unique(CO2$uptake) 
?unique

Solution:

|
CO2 <- read.csv("CO2_broken.csv",sep = "",skip = 2,na.strings = c("NA","na","cannot_read_notes")) 

By identifying “cannot_read_notes” as NA data, R reads these columns properly. Remember that NA stands for not available.

|
head(CO2)
str(CO2) # You can see that conc variable is now an integer and the uptake variable is now treated as numeric
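
A small made-up file shows the effect of na.strings in isolation. The file contents and the object names raw_data and clean_data are assumptions for the illustration.

```r
# Write a small file with text entries in the numeric columns (made-up values)
na_file <- tempfile(fileext = ".csv")
writeLines(c("conc,uptake",
             "95,16.0",
             "cannot_read_notes,30.4",
             "250,na"), na_file)

# Without na.strings, the text entries prevent both columns from being numeric
raw_data <- read.csv(na_file)
sapply(raw_data, is.numeric)

# Declaring the text entries as missing values lets R read the columns as numbers
clean_data <- read.csv(na_file, na.strings = c("NA", "na", "cannot_read_notes"))
sapply(clean_data, is.numeric)
colSums(is.na(clean_data))   # one missing value in each column
```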

Answer #4

Problem #4: There are only two treatments (chilled and nonchilled) but there are spelling errors causing it to look like 4 different treatments.

|
str(CO2) # You can see that 4 levels are listed for Treatment
levels(CO2$Treatment)
unique(CO2$Treatment) # The 4 different treatments are "nonchilled", "nnchilled", "chilled", and "chiled"  

Solution:

|
# You can use which() to find rows with the typo "nnchilled"
which(CO2$Treatment=="nnchilled") # Row number ten
# You can then correct the error using indexing:
CO2$Treatment[10] <- "nonchilled"
# Alternatively, doing it with a single command:
CO2$Treatment[which(CO2$Treatment=="nnchilled")] <- "nonchilled"
# Now doing the same for "chiled":
CO2$Treatment[which(CO2$Treatment=="chiled")] <- "chilled" 

Have we fixed the problem?

|
str(CO2)  # Structure still identifies 4 levels of the factor
unique(CO2$Treatment) # But, unique says that only two are used
CO2<-droplevels(CO2) # This command drops the unused levels from all factors in the data frame
str(CO2) # Fixed!
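
The same steps can be tried on a small stand-alone factor. This is a sketch with made-up values. Note that in R 4.0.0 and later, read.csv() reads text columns as character rather than factor by default, so levels() and droplevels() only behave as shown above if the column is a factor (for example when the file is read with stringsAsFactors = TRUE).

```r
# A made-up factor containing the same two typos
treatment <- factor(c("nonchilled", "nnchilled", "chilled", "chiled", "chilled"))
nlevels(treatment)   # 4 levels because of the typos

# Replace the misspelled entries with the correct, already existing levels
treatment[treatment == "nnchilled"] <- "nonchilled"
treatment[treatment == "chiled"] <- "chilled"

# The misspelled levels are no longer used but are still listed
nlevels(treatment)   # still 4

# droplevels() removes the unused levels
treatment <- droplevels(treatment)
levels(treatment)    # "chilled" "nonchilled"
```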