QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS, both for QCBS members and for the wider community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.

Workshop 5: Programming in R

Developed by: Johanna Bradie, Sylvain Christin, Ben Haller, Guillaume Larocque

Summary: This workshop focuses on basic programming in R. You will learn how to use control flow structures (for loops, if, while) to avoid code repetition, facilitate organization, and run simulations. You will also learn to write your own functions, along with tips for programming efficiently. The last part of the workshop discusses packages that are not covered elsewhere in this workshop series, but that may be of interest to participants.

Link to associated Prezi: Prezi

Download the R script for this lesson:

  1. Control flow
  2. Writing functions in R
  3. Speeding up your code
  4. Useful packages for biologists

A program is a sequence of instructions that tells your computer how to do a given task. You can compare a program to a recipe that tells you how to cook your dinner. In this analogy, you are the computer, the recipe is the program that tells you how to prepare your dinner, and preparing your dinner is the task the program performs.

Programs exist to make your life simpler! Well-structured programs tend to reduce the complexity and duration of the task at hand, are clearer to follow, and will likely increase your productivity.

The ordered structure in which the instructions of a program are executed is defined as control flow.

The main elements of control flow are:

- SELECTION: Executing statements depending on a condition

and

- ITERATION: Executing statements multiple times


Throughout this workshop, we will be making use of logical operators. Note that to test whether something is equal to something else, you must use “==” rather than a single “=”.

== equal to
!= not equal to
!x not x
< less than
<= less than or equal to
> greater than
>= greater than or equal to
x & y x AND y
x|y x OR y
isTRUE(x) tests whether x is TRUE
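
For example, here is a quick illustration of these operators in action (a minimal sketch using two arbitrary numbers):

x <- 4
y <- 3
x == 4             # TRUE
x != y             # TRUE
x <= 3             # FALSE
(x > 2) & (y > 2)  # TRUE: both conditions hold
(x > 5) | (y > 2)  # TRUE: at least one condition holds
isTRUE(x > y)      # TRUE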

Decision making: if and if/else statements

Decision making is an essential part of programming. This can be done in R using the if and if … else statements. They will be very useful for:

  • checking for problems or violations of assumptions
  • treating different rows of your data frame differently
  • testing for the existence of a file or variable

Syntax

if (condition) {
  expression      # The expression can be any command that you would like R to perform.
}     
 
if (condition) {
  expression 
} else {
  expression
}

For example,

if ((2+2) == 4) {
  print("Arithmetic works.") 
}
 
if ((2+1) == 4) {
  print("Arithmetic works.") 
}

Curly brackets { } are used so that R knows to expect more input. When using brackets, R waits to evaluate the command until the brackets have been closed. If the curly brackets are not used, R may not behave as you are expecting. For example, try:

if ((2 + 1) == 4) print("Arithmetic works.")
else print("Houston, we have a problem.")

The else statement does not work because R evaluates the first line without knowing your command is incomplete.

Instead, use:

if ((2 + 2) == 4) {
  print("Arithmetic works.")  # R does not evaluate this expression yet because the bracket isn't closed.
} else {
  print("Houston, we have a problem.")
}  # Since all brackets are now closed, R will evaluate the commands.

If you want to check each element of a vector one by one, without using indices or a loop, you can use the ifelse() function.

For example,

a <- 1:10
ifelse(a > 5, "yes", "no")

You can also use the ifelse() function within another function to carry out an operation only under certain conditions:

For example,

a <- (-4):5
sqrt(ifelse(a >= 0, a, NA))

Exercise 1

Paws <- "cat"
Scruffy <- "dog"
Sassy <- "cat"
animals <- c(Paws, Scruffy, Sassy)

1. Use an if statement to print “meow” if Paws is a “cat”.

Exercise 1.1 : Answer

2. Use an if/else statement to print “woof” if you supply an object that is a “dog” and “meow” if it is not. Try it out with Paws and Scruffy.

Exercise 1.2 : Answer

3. Use the ifelse() function to display “woof” for animals that are dogs and “meow” for animals that are cats.

Exercise 1.3 : Answer


Loops

Loops are good for:

  • doing something for every element of an object
  • doing something until the processed data runs out
  • doing something for every file in a folder
  • doing something that can fail, until it succeeds
  • iterating a calculation until it converges

for loops

The for loop is the most common type of loop. Use a for loop to execute a block of code a known number of times.

Syntax

for (variable in sequence) {
  expression
}

Each execution of the series of commands in a loop is known as an iteration.

For example:

for (i in 1:5) {
  print(i)
}

In this example, our sequence has 5 elements (1, 2, 3, 4, 5), so R evaluates the expression 5 times. The variable i is specific to our loop and is only accessible inside it. Its value takes each element of the sequence in turn: in the first iteration, R replaces each instance of i with 1; in the second iteration i is replaced with 2, and so on.

The letter 'i' can be replaced with any variable name and the sequence can be almost anything, even a list of vectors.

Try:

for (m in 4:10) {
  print(m * 2) 
}
 
for (a in c("Hello", "R", "Programmers")) {
  print(a) 
}
 
for (z in 1:30) {
  a <- rnorm(n = 1, mean = 5, sd = 2) # draw a value from a normal distribution with mean 5 and standard deviation 2
  print(a)
}
 
elements <- list(1:3, 4:10)
for (element in elements) {
  print(element)
}

Loops are often used to iterate over a dataset. We will use loops to perform operations on the CO2 dataset, which is built into R. The CO2 dataset contains concentration and uptake values for plants located in Quebec and Mississippi, from an experiment on the cold tolerance of the grass species Echinochloa crus-galli. Half the plants of each type were chilled overnight before the experiment was conducted. Note that this is the dataset we used in workshop 2. The code below provides a couple of examples of how loops can be used.

data(CO2)
for (i in 1:length(CO2[,1])) { # for each row in the CO2 dataset
  print(CO2$conc[i]) #print the CO2 concentration
}
 
for (i in 1:length(CO2[,1])) { # for each row in the CO2 dataset
  if(CO2$Type[i] == "Quebec") { # if the type is "Quebec"
    print(CO2$conc[i]) # print the CO2 concentration
  }
}
 
# Tip 1 : to get the number of rows of a data frame, we can also use the function nrow
for (i in 1:nrow(CO2)) { # for each row in the CO2 dataset
  print(CO2$conc[i]) #print the CO2 concentration
}
 
# Tip 2 : If we want to perform operations on only the elements of one column, we can directly
# iterate over it.
for (i in CO2$conc) { # for every element of the concentration column of the CO2 dataset
  print(i) # print the ith element
}

The expression part of the loop can be almost anything and is usually a compound statement containing many commands.

for (i in 4:5) { # for i in 4 to 5
  print(colnames(CO2)[i])  
  print(mean(CO2[,i])) # print the mean of that column from the CO2 dataset
}

while loops and repeat loops

while loops and repeat loops operate similarly to for loops. Once you understand how for loops work, you should be able to use any type of loop. You will see some examples of while loops and repeat loops in the next section. Note that in many cases the same task can be accomplished in several different ways, using for loops, while loops, or repeat loops, as will be demonstrated below.
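
As a quick preview, here is the basic syntax of both (a minimal sketch; as with for loops, the body can contain any expressions):

while (condition) {
  expression  # repeated as long as the condition is TRUE
}

repeat {
  expression
  if (condition) break  # a repeat loop only stops when it reaches a break
}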

Nested loops

In some cases, you may want to use nested loops to accomplish a task. When using nested loops, it is important to use different variables as counters for each of your loops (here we used i and n).

for (i in 1:5) {
  for (n in 1:5) {
    print (i*n)
  }
}

The apply() family

To avoid the problem of growing objects in loops (discussed later) and to facilitate applying functions to objects like dataframes, R offers what we will call the apply functions (because they all have “apply” in their name…). This is a group of functions that execute another function on a given object; they differ only in the type of object the function is applied to and the type of value returned.

The …apply() functions are not always the best choice performance-wise, as some of them simply hide a for loop written in R. However, the ease of use they provide can greatly reduce programming time.

One of the most popular is simply apply(), which executes a function on the rows or columns of a dataframe or a matrix. This function takes three main arguments, plus any extra arguments to pass on to that function:

  • the object on which we want to apply the function
  • the margin or the subscript on which we want to apply the function. 1 is for rows, 2 is for columns
  • the function to apply
  • any additional arguments to be passed on to that function (optional)

All apply functions work on the same model. Performance-wise, the most interesting are probably lapply() and vapply(), since their looping is done in compiled C code. lapply() returns a list of the same length as the original object. vapply() lets you specify the format of the value returned by your function, which can be a vector or an array. Other members of the family include sapply(), which works like lapply() but tries to simplify the output to the most elementary data structure possible; mapply(), which applies a function element-wise over several vectors of arguments, effectively vectorising a function that does not normally accept vectors; and tapply(), which is useful when we need to break a vector into groups defined by some classifying factor, compute a function on the subsets, and return the results in a convenient form (you can even specify multiple factors as the grouping variable). Short sapply() and mapply() examples follow the code block below.

Note that this type of iteration relies on vectorisation; we will come back to that later.

df <- data.frame(1:100, 101:200)
# Sum on rows
apply(df, 1, sum)
# Mean on columns
apply(df, 2, mean)
# we can also supply additional arguments to the function
apply(df, 2, mean, na.rm=TRUE)
# we can also define a function directly. The first argument is always what 
# we iterate on. Here each row is treated as a vector of numbers, as we can see with
# the str() function
apply(df, 1, function(x){str(x)})
# We can also add other arguments
apply(df, 1, function(x, y){x[2] - x[1] + y}, y=5)
# We add a column with a categorical variable to use as a factor to group the data
df[,3]<-sample(c("A","B","C"),nrow(df),replace=T)
# We apply the mean() function on column 1 for each group from column 3
tapply(df[,1],df[,3],mean)
a <- list(1:100, 101:200)
# apply mean to each element of the list
lapply(a, mean)  # we get a list as a result
unlist(lapply(a, mean)) # use unlist to get a vector instead
vapply(a, mean, 0) # the result of mean is a single number, we tell vapply our result will be a number
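
For completeness, here are short examples of sapply() and mapply(), mentioned above (a minimal sketch reusing the list a from the previous block):

sapply(a, mean)        # like lapply(), but the result is simplified to a numeric vector
mapply(rep, 1:3, 3:1)  # applies rep() element-wise: rep(1, 3), rep(2, 2), rep(3, 1)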

Exercise 2

1. You have realized that your tool for measuring uptake was not calibrated properly at Quebec sites and all measurements are 2 units higher than they should be. Use a loop to correct these measurements for all Quebec sites.

2. Use a vectorisation-based method to calculate the mean CO2-uptake in both areas.

Exercise 2 : Answer


Make sure you reload the data so that we are working with the raw data for the rest of the exercise:

data(CO2)


Loop Modifications

Normally, loops iterate over and over until they finish. To change this behaviour, you can use break, which exits the loop entirely, or next, which stops the current iteration and jumps to the next one.

For example,

# Print the CO2 concentrations for "chilled" treatments and keep count of how many replications there were.
 
count <- 0 # count is being set at zero so that we can edit this object in the loop to keep track of how many iterations were performed
 
for (i in 1:length(CO2[,1])) {
  if (CO2$Treatment[i] == "nonchilled") next #Skip to next iteration if treatment is nonchilled
  count <- count + 1
  print(CO2$conc[i])
}
print(count) # The count and print command were performed 42 times.
 
# This could be equivalently written using a repeat loop: 
 
count <- 0
i <- 0
repeat {
      i <- i + 1
      if (CO2$Treatment[i] == "nonchilled") next  # skip this loop
      count <- count + 1
      print(CO2$conc[i])
      if (i == length(CO2[,1])) break     # stop looping
    }  
 
print(count) 
 
### This could also be written using a while loop: 
 
i <- 0
count <- 0
while (i < length(CO2[,1]))
{
  i <- i + 1
  if (CO2$Treatment[i] == "nonchilled") next  # skip this loop
  count <- count + 1
  print(CO2$conc[i])
}
print(count) 

Exercise 3

1. You have realized that your tool for measuring concentration didn't work properly. At Mississippi sites, concentrations less than 300 were measured correctly but concentrations >= 300 were overestimated by 20 units. Use a loop to correct these measurements for all Mississippi sites.

Exercise 3 : Answer


Make sure you reload the data so that we are working with the raw data for the rest of the exercise:

data(CO2)

Using flow control to make a complex plot

The idea here is that we have a dataset we want to plot, with concentration and uptake values, but each point has a type (Quebec or Mississippi) and a treatment (“chilled” or “nonchilled”) and we want to plot the points differently for these cases.

You can read more about mathematical typesetting with ?plotmath, and more about the way that different colors, sizes, rotations, etc. are used in ?par.

head(CO2) # Look at the dataset
unique(CO2$Type) 
unique(CO2$Treatment)
 
# plot the dataset, showing each type and treatment as a different colour
 
plot(x=CO2$conc, y=CO2$uptake, type="n", cex.lab=1.4, xlab="CO2 concentration", ylab="CO2 uptake") # Type "n" tells R to not actually plot the points.
 
for (i in 1:length(CO2[,1])) {
  if (CO2$Type[i] == "Quebec" & CO2$Treatment[i] == "nonchilled") {
    points(CO2$conc[i], CO2$uptake[i], col="red",type="p")
  }
  if (CO2$Type[i] == "Quebec" & CO2$Treatment[i] == "chilled") {
    points(CO2$conc[i], CO2$uptake[i], col="blue")
  }
  if (CO2$Type[i] == "Mississippi" & CO2$Treatment[i] == "nonchilled") {
    points(CO2$conc[i], CO2$uptake[i], col="orange")
  }
  if (CO2$Type[i] == "Mississippi" & CO2$Treatment[i] == "chilled") {
    points(CO2$conc[i], CO2$uptake[i], col="green")
  }
}

Note that there are other ways to create a complex plot. ggplot2, covered in workshop 3, is a useful package for creating complex plots.
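
For comparison, here is a minimal sketch of a similar plot with ggplot2 (assuming the package is installed; the exact aesthetics are up to you):

library(ggplot2)
ggplot(CO2, aes(x = conc, y = uptake, colour = interaction(Type, Treatment))) +
  geom_point() +
  labs(x = "CO2 concentration", y = "CO2 uptake")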


Exercise 4

1. Generate a plot showing concentration versus uptake, where each plant is shown using a different colour of point. Bonus points for doing it with nested loops!

Exercise 4 : Answer


Why write functions?

Much of the heavy lifting in R is done by functions. They are useful for:

  • performing a task repeatedly, but configurably
  • making your code more readable
  • making your code easier to modify and maintain
  • sharing code between different analyses
  • sharing code with other people
  • modifying R’s built-in functionality

But what exactly is a function? A function is essentially a black box that transforms data. It takes input values, called arguments, processes them with R code, and can optionally give back a return value.


How to write functions?

Here is the basic syntax of a function:

function_name <- function(argument1, argument2, ...) {
  expression...  # What we want the function to do
  return(value)  # Optional. Use this if you want to access the result of your function later.
}

Arguments

Arguments are the input values of your function: the information your function needs in order to work correctly. A function can have any number of arguments, from zero to many. Technically, arguments are variables like any others, so you can use them as such. The only difference is that they are available only inside your function, and their values are determined when the function is called.

Let's start with a really basic function, print_number, that takes a number as its argument and prints it.

print_number <- function(number) {
  print(number)
}

Now to use it we call it like any other function.

print_number(2)
print_number(231)

We can use more than one argument. For example, let's create a function that takes number1, adds it to number2, multiplies the result by number3 and then prints the result.

operations <- function(number1, number2, number3) {
  result <- (number1 + number2) * number3
  print(result)
}
 
operations(1, 2, 3)
operations(17, 23, 2)

The expression part of our function (its body) can be virtually anything: single R statements, loops, if/else conditions, and so on.


Exercise 5

Using what you learned previously on flow control, create a function print_animal that takes an animal as argument and gives the following results :

Scruffy <- "dog"
Paws <- "cat"
 
print_animal(Scruffy)
[1] "woof"
 
print_animal(Paws)
[1] "meow"

Exercise 5 : Answer


Arguments can also be optional and be given a default value. This is useful when a function is often used with the same settings: it saves writing out every argument each time, while keeping the flexibility to change them if needed.

operations <- function(number1, number2, number3=3) {
  result <- (number1 + number2) * number3
  print(result)
}
 
operations(1, 2, 3) # becomes equivalent to
operations(1, 2)
operations(1, 2, 2) # we can still change the value of number3 if needed

R also provides a special argument, “...”. It tells R that your function accepts an indefinite number of arguments. This is useful for two main things:

  • Pass on arguments to another function used inside your function. This lets you use all the arguments of that other function without having to define them yourself when creating yours. Let's adapt our previous example where we plot the CO2 uptake against the concentration, this time with 2 different colours depending on the region. Parameters to plot() and points() will be passed on via “…”.
plot.CO2 <- function(CO2, ...) {
  plot(x=CO2$conc, y=CO2$uptake, type="n", ...)  # We use ... to pass on arguments to plot(). 
 
  for (i in 1:length(CO2[,1])){
     if (CO2$Type[i] == "Quebec") {
       points(CO2$conc[i], CO2$uptake[i], col="red", type="p", ...) # same for points
     } else if (CO2$Type[i] == "Mississippi") {
       points(CO2$conc[i], CO2$uptake[i], col="blue", type="p", ...) # same for points()
     }
  }
}
 
plot.CO2(CO2, cex.lab=1.4, xlab="CO2 concentration", ylab="CO2 uptake")
plot.CO2(CO2, cex.lab=1.4, xlab="CO2 concentration", ylab="CO2 uptake", pch=20)
NOTE: To avoid any confusion when using “…”, the arguments passed on through it should be given by name.
  • Allow the user to input an indefinite number of arguments. The value of each argument then has to be checked manually, which is done by turning “…” into a list and iterating over it. Let's create a sum function that takes an indefinite number of arguments.
sum2 <- function(...){
  args <- list(...)
  result <- 0
  for (i in args)  {
    result <- result + i
  }
  return (result)
}
 
sum2(2, 3)
sum2(2, 4, 5, 7688, 1)

Return value

As shown in the previous example, if we want to save the result of our function and use it later, we have to return it at the end using return(). Note that a function can only give back a single return value; if you want to return more than one object, use a composite object such as a list or a dataframe. Also note that the function ends as soon as it reaches a return() call.

returntest <- function(a, b) {
  return (a) # The function exits here
  a <- a + b # Not interpreted
  return (a + b) # Not interpreted
}
 
returntest(2, 3) # R will by default print the return value of your function
c <- returntest(2, 3) # to save it, don't forget to assign it to another variable
c

Exercise 6

Using what you learned so far on functions and control flow, create a function bigsum that takes two arguments a and b and :

  • returns 0 if the sum of a and b is strictly less than 50
  • returns the sum of a and b otherwise

Exercise 6 : Answer


Accessibility of variables

When working with flow control structures and functions, it is essential to always keep in mind where your variables are defined and whether they are accessible. Here are some tips to keep in mind:

  • Variables defined inside a function are not accessible outside
  • Variables defined outside a function are accessible inside it. However, it is NEVER a good idea to rely on them, as your function may stop working if the variable no longer exists
rm(list=ls()) # first let's remove everything to avoid any confusion
 
var1 <- 3     # var1 is defined outside our function
vartest <- function() {
  a <- 4      # a is defined inside
  print(a)    # print a
  print(var1) # print var1
}
a             # print a. It doesn't work, a can be seen only inside the function
vartest()     # calling vartest() will print a and var1 
rm(var1)      # remove var1
vartest()     # calling the function again doesn't work anymore

Instead, use arguments! Inside a function, argument names take precedence over other variable names.

var1 <- 3     # var1 is defined outside our function
vartest <- function(var1) {
  print(var1) # print var1
}
vartest(8)    # Inside our function var1 is now our argument and takes its value
var1          # var1 still has the same value

Be very careful when creating variables inside a conditional statement: the variable may never be created, which will cause errors.

a <- 3
if (a > 5) {
  b <- 2
} 
a + b    # Error! b doesn't exist

It is usually good practice to define such variables outside the conditions and then modify their values inside, to avoid this problem.

a <- 3
b <- 0
if (a > 5) {
  b <- 2
} 
a + b

Good practices

Here are some programming tips that can make your life easier, improve readability, and make sharing and reusing your code a lot less painful. Code that is clear to read takes less time to understand, so the effort is never wasted.

Keep your code clean and tidy

One of the things that helps most when reading code is nice formatting: well-spaced and well-indented code. Some programming standards exist to help achieve consistency, but it usually comes down to personal preference. Here are some tips that can help:

  • Use spaces between and after your operators
  • Consistently use the same assignment operator. `<-` is often preferred; `=` is acceptable, but don't switch back and forth between the two
  • Use brackets with flow control statements, even for a single line. Each statement inside brackets should be indented by two spaces. The closing bracket usually sits by itself on a separate line, except when followed by an else statement. This greatly helps when trying to see where you are, especially with many nested loops and conditions.
  • Define each variable on its own line

Here is some hard to read code

a<-4;b=3
if(a<b){
if(a==0)print("a zero") } else {
if(b==0){print("b zero")} else print(b)}

Here is a little easier-to-read version. It takes more space but it is easier to see the flow of the code.

a <- 4
b <- 3
if(a < b){
  if(a == 0) {
    print("a zero")
  }
} else {
  if(b == 0){
    print("b zero")
  } else {
    print(b)
  }
}

Several coding style guides can be found on the internet. Here is one example that encourages clear code: https://google-styleguide.googlecode.com/svn/trunk/Rguide.xml

Use functions whenever possible

Now that you know how to create functions, try to use them whenever you can. Every time you see a portion of code repeated more than twice in your script, you should be thinking “Hmmm… wouldn't it be better to write a function instead?”. If only part of the code changes between repetitions, think of ways to pass that part in as function arguments instead. This reduces the number of errors introduced by copying and pasting, and the time needed to correct them.

Let's modify the example from exercise 3 and suppose that all CO2 uptake measurements from Mississippi were overestimated by 20 and those from Quebec underestimated by 50. We could write this:

for (i in 1:length(CO2[,1])) {
  if(CO2$Type[i] == "Mississippi") {
    CO2$conc[i] <- CO2$conc[i] - 20 
  }
}
for (i in 1:length(CO2[,1])) {
  if(CO2$Type[i] == "Quebec") {
    CO2$conc[i] <- CO2$conc[i] + 50 
  }
}

Or we could do this instead.

recalibrate <- function(CO2, type, bias) {
  for (i in 1:nrow(CO2)) {
    if(CO2$Type[i] == type) {
      CO2$conc[i] <- CO2$conc[i] + bias 
    }
  }
  # we have to return our new dataset because the original is not modified
  return (CO2)
}
newCO2 <- recalibrate(CO2, "Mississippi", -20)
# Note that we recalibrate our newCO2 dataset here because the original CO2 is not modified
newCO2 <- recalibrate(newCO2, "Quebec", +50)

And now we realize that what we modified was not the uptake but the concentration… We have to change every occurrence of CO2$conc[i] to CO2$uptake[i]. In the first version that means changing it 4 times, but only 2 times in the second! (At this point you might think this is not much and a simple search and replace would do, and you might be right, but this is just a small example. Imagine you had 10 locations instead of 2. A good programmer is a lazy one. Also, admit it, it looks way cooler with a function…)

Give meaningful variable and function names

This helps you see at first glance what does what. Be extra careful when choosing argument names when creating a function, since they are what users will see. However, it is also good to choose short names to avoid having to type them all the time and making typos, so a good balance should be found.

Here is what our previous example could look like with vague names. It now requires a little more work to understand what this function does.

rc <- function(c, t, b) {
  for (i in 1:nrow(c)) {
    if(c$Type[i] == t) {
      c$uptake[i] <- c$uptake[i] + b 
    }
  }
  return (c)
}

Comments

Even with meaningful names, it's never a bad thing to add comments describing everything your code does, be it the purpose of a function, how to use its arguments, or a detailed step-by-step description of what it does.

## Recalibrates the CO2 dataset by modifying the CO2 uptake values
## by a fixed amount depending on the region of sampling
# Arguments
# CO2: the CO2 dataset
# type: the type that needs to be recalibrated. Values: "Mississippi" or "Quebec"
# bias: the amount to add to the uptake. Use negative values for overestimations 
recalibrate <- function(CO2, type, bias) {
  for (i in 1:nrow(CO2)) {
    if(CO2$Type[i] == type) {
      CO2$uptake[i] <- CO2$uptake[i] + bias 
    }
  }
  # we have to return our new dataset because the original is not modified
  return (CO2)
}

Here are some programming tips to help you program more efficiently in R and achieve better performance and faster code. Note that before optimizing your code, you should always make sure it works: slow working code is always better than a fast but broken implementation. Also, sometimes optimizing is just not the way to go. If you spend 2 hours rewriting code to gain a few seconds, it might just not be worth it…

Before we start : profiling our code

If we want to optimize our code, we will need to know how much time each task takes to perform.

The simplest way to do that is to use the function system.time(expression).

system.time({
  a <- 0
  for (i in 1:1000) {
    a <- a + i
  }
})

Note that most of the time R is really fast, and you may need a fairly heavy computation before any time gets recorded at all. It is usually recommended to replicate the task you want to profile several times, or to work with really big datasets.

system.time(replicate(1000, {
  a <- 0
  for (i in 1:1000) {
    a <-  a + i
  }
}))

Another simple and useful tool is the function Rprof(). Its main advantage is that it saves information about the time spent in each function to a file that you can inspect later. Here's how to use it.

Rprof("profile.txt")  # you can change profile.txt by the filename you want
for (i in 1:1000) {
    a <- 0
    for (i in 1:1000) {
      a <-  a + i
    }
  }
Rprof()               # This ends the profiling
summaryRprof("profile.txt")  # Use the filename previously recorded to display the summary of the tasks

Finally, if you want to compare the efficiency of several functions, a very good tool is the microbenchmark package.

install.packages("microbenchmark")
library(microbenchmark)
 
f1 <- function() {
  a <- 0
  for (i in 1:1000) {
    a <-  a + i
  }
}
 
microbenchmark(f1(), times=1000) # the times argument lets us choose how many iterations to run

First step : thinking a bit!

When you look at your code, you will often realize that there are simpler, much more efficient ways to do what you want, and that some operations can easily be removed for added speed.

For example, let's create a function that takes a number a. We will add a to every number from 1 to 100, and if a is less than 5, then we will add 2*a instead. Then we will calculate the sum of all the elements of the sequence.

Here's a way to do it

f2 <- function(a) {
  # initialize our result
  result <- 0
  # iterate on the sequence from 1 to 100
  for (i in 1:100) {
    if (a < 5) {
      # a is < 5, we add 2*a to the sequence element. We save it in result 
      result <- result + i + (2*a)
    } else {
      # a is >= 5, we add only a 
      result <- result + i + a
    }
  }
  return(result)
}
f2(4)

This works and does what we want. However, there are a lot of unnecessary steps in our code. For example, we don't really need to evaluate the condition on each iteration, since its result is always the same. So let's take it out of the loop:

f3 <- function(a) {
  # initialize our result
  result <- 0
 
  # Check if a < 5. If true, a becomes 2*a
  if (a < 5) {
   a <- 2 * a
  } 
  # We don't even need an else here since a remains the same otherwise
 
  # iterate on the sequence from 1 to 100
  for (i in 1:100) {
      result <- result + i + a 
  }
  return(result)
}
 
f3(4)
microbenchmark(f2(4), 
               f3(4), times=1000)

This is just a simple modification, but here we sped up our code by almost 40% (results may vary between computers). Moreover, our code is easier to read and understand. Sometimes we can gain speed and cleaner code just by thinking a little more about where our conditional statements go and what they test.

But, using the strengths of R, we can do even better!

f4 <- function(a, n = 100) {
  result <- 0
 
  if (a < 5) {
    a <- 2 * a
  } 
  result <- sum(1:n + a)
  return(result)
}
 
f4(4)
microbenchmark(f3(4), 
               f4(4), times=1000)

Wow, here the improvement is even greater… But how exactly did this happen? This leads us to our next point.

Vectorisation

This part is a reminder of what you probably already learned in the first workshops. However, it is often overlooked, and the poor performance of R code can frequently be attributed to a lack of vectorisation. R is designed to work with vectors, and many R functions are optimized for it. To understand why, it helps to know how R works at a low level. R is an interpreted language: when you execute R code, you are in reality sending it to functions programmed in another language (C). This slows execution, since the R code must first be interpreted and then passed on to those functions; when you write a loop in R, every iteration has to be decoded and passed on. Vectorised functions, on the other hand, operate directly on a whole vector. Essentially they also perform a loop over your vector, but the big difference is that the loop runs in C, not in R, and is therefore much faster. sum() is one of these functions. One of the biggest challenges with R is learning to think and program with vectors rather than with single elements. For example, most of the basic operations can be done on vectors.

v1 <- 1:5
v2 <- 2:6
v3 <- 1:3
v1 + 2      # Addition on a vector : adds 2 to all elements
v1 + v2     # Adds each element of v2 to the corresponding element of v1
v1 + v3     # v1 and v3 are not the same length, so v3 is recycled from its start (with a warning)
sum(v1)     # Adds all elements of v1 together
sum(v1, v2) # Sums all elements of v1 and v2 
mean(v1)    # Average of elements in v1
mean(c(v1, v2)) # Average of elements of v1 and v2. Unlike sum, we have to combine them beforehand

Subsetting

To vectorise efficiently, it is also important to be able to extract values from our data quickly. To apply a treatment to specific elements of a vector or a dataframe, R offers subsetting tools that can sometimes be much more efficient and easier to write than a mix of loops and conditions.

Subsetting is done via the [ operator (and $ for dataframes). We can put our conditions directly inside the [ ] to quickly extract values from our data. We can also use the function which(), which returns the indices of the elements that match a condition.

v1 <- 1:10
v1[7]      # Extracts the 7th value
v1[v1 > 5] # Extracts values > 5 only
v1[which(v1 > 5)]  # same as before

In dataframes, $ lets us access a column by name. We can also do this by providing the column name directly inside [ , ].

CO2 <- read.csv("co2_good.csv")  # alternatively, use the built-in dataset with data(CO2)
CO2$Type  # Prints columns Type
CO2[, "Type"] # Same as above
CO2[CO2$Type == "Quebec", ] #Extracts all rows of the CO2 dataset where the Type is "Quebec" 

Exercise 7

Create a new function recalibrate2() that rewrites the recalibrate() function seen earlier using subsetting and vectorisation techniques. The new function should be no longer than 3 lines.
Reminder:

recalibrate <- function(CO2, type, bias) {
  for (i in 1:nrow(CO2)) {
    if(CO2$Type[i] == type) {
      CO2$uptake[i] <- CO2$uptake[i] + bias 
    }
  }
  return (CO2)
}

Exercise 7 : Answer


Growing objects

Vectorising is good, but sometimes it is hard to do, and it might take you more time than just writing a simple loop. Sometimes you just need loops, and you should use them when appropriate. But when you do, one thing to be aware of if you want decent performance is growing objects, that is, objects that get bigger and bigger with each iteration. Let's illustrate this with a simple function that iterates over a sequence and builds a vector from it. We will compare two ways of doing this: growing our object, and preallocating the result and simply modifying it.

growing <- function(n) {
  # declare our result
  result <- NULL
  for (i in 1:n) {
    # create our result by growing our object
    result <- c(result, i)
  }
  return(result)
}
 
growing2 <- function(n) {
  # declare our result : here we create a vector of length n with 0 in it
  result <- numeric(n)
  for (i in 1:n) {
    # now we just modify our value instead of recreating the vector
    result[i] <- i
  }
  return(result)
}

Now let's compare their speeds

system.time({
  growing(10000)
})
system.time({
  growing2(10000)
})

With a vector of 10000 elements, the two approaches are still comparable and take under a second. Now let's use 50000 elements.

system.time({
  growing(50000)
})
system.time({
  growing2(50000)
})

By multiplying the size by only 5, it suddenly takes several seconds just to create a single vector, while filling the preallocated vector is still almost instantaneous. What happened? When you call a function, arguments are first copied before being passed on to the function. So each time you write result <- c(result, i), result is copied before being passed to c(). As result grows with each iteration, copying it takes more and more time: the bigger the final object, the longer it takes. This is why it is always better to create your result object before the loop if you already know what size it will be.

This is especially true when working with dataframes and functions such as rbind() and cbind(). Unfortunately, preallocating dataframes does not work that well, and there is a better way, albeit a little more complicated: store each row (or column) in a preallocated list first, and then call rbind() (or cbind()) on all the elements via the function do.call(). do.call() executes a named function on a list of arguments. This way rbind() is called only once, at the end, which removes the problem of copying the growing object each time it is called.

growingdf <- function(n, row) {
  # preallocate our dataframe
  df <- data.frame(numeric(n), character(n), stringsAsFactors=FALSE)
  for (i in 1:n) {
    # replace the ith row with row
    df[i,] <- row 
  }
  return(df)
}
 
growingdf2 <- function(n, row) {
  # this is the way to allocate a list with n elements
  df <- vector("list", n)
  for (i in 1:n) {
    # put row in the ith element
    df[[i]] <- row 
  }
  return(do.call(rbind, df))
}
 
# store our row in a list since we have different types
row <- list(1, "Hello World")
microbenchmark(growingdf(5000, row),
               growingdf2(5000, row),
               times=10)

4. Quick introduction to useful packages in R

knitr is a package that can be used to generate dynamic reports or web pages from R code. The code is evaluated at the moment the report is generated.

Code can easily be written in RStudio using the Markdown language:

Example Markdown code

View the resulting web page.
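
As an illustration, a minimal R Markdown snippet might look like this (a hypothetical example; the chunk between the ```{r} markers is evaluated when the report is knitted):

Some introductory text written in Markdown, then an R code chunk:

```{r}
data(CO2)
mean(CO2$uptake)  # this result appears in the rendered report
```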

The magrittr package provides a mechanism for chaining commands with a new forward-pipe operator, %>%.

library(magrittr)
iris %>%
  subset(Sepal.Length > mean(Sepal.Length)) %$%
  cor(Sepal.Length, Sepal.Width)

data.table is a very useful package that can facilitate and improve the efficiency of certain operations in R. Data tables behave much like data frames; you can even create them from data frames.

Introduction to Data table (PDF)

install.packages('data.table')
library(data.table)

Generate a very long data frame with one column of letters and one column of random numbers.

mydf<-data.frame(a=rep(LETTERS,each=1e5),b=rnorm(26*1e5))

Convert the data frame to a data table format.

mydt<-data.table(mydf)

Each data table can be assigned a key, which is one (or more) of the columns from the table. The key defines the basis for the organization and sorting of the table.

setkey(mydt,a)

Once the key is set, we can return all rows where column a (the key) is equal to "F".

mydt['F']

Compute the mean value of column b for each letter in column a.

mydt[,mean(b),by=a]

Let's compare the performance of data.table with other methods for achieving the same result.

system.time(t1<-mydt[,mean(b),by=a])

With tapply()

system.time(t2<-tapply(mydf$b,mydf$a,mean))

With reshape2

NOTE: plyr and reshape2 were covered in Workshop 3.

library(reshape2)
meltdf<-melt(mydf)
system.time(t3<-dcast(meltdf,a~variable,mean))

With plyr, a set of tools to split data into homogeneous pieces, apply a function to each piece, and combine the results back together.

library(plyr)
system.time(t4<-ddply(mydf,.(a),summarize,mean(b)))

With dplyr, a newer and faster version of plyr, designed to work on data frames.

library(dplyr)
ti1<-proc.time()
groups <- group_by(mydf, a)
t5 <- summarise(groups, total = mean(b))
eltime<-proc.time()-ti1

With sqldf. This package lets you write Structured Query Language (SQL) commands to perform queries on data frames.

library(sqldf)
system.time(t6<-sqldf('SELECT a, avg(b) FROM mydf GROUP BY a'))

With a basic FOR loop

ti1<-proc.time()
# Initialize an empty data frame with two columns and 26 rows
t7<-data.frame(letter=unique(mydf$a),mean=rep(0,26))
for (i in t7$letter) {
  t7[t7$letter==i,2]=mean(mydf[mydf$a==i,2])
}
eltime<-proc.time()-ti1
eltime

With a parallelized FOR loop. Use the foreach and doMC packages to run sections of code in parallel on computers with multiple cores. This is particularly suited to speeding up calculations involving FOR loops in which every iteration can be run independently of the others. Note that the doMC package may not work on computers running Windows; it should work on Linux or Mac OS X.

library(foreach)
library(doMC)
registerDoMC(4) #Four-core processor
ti1<-proc.time()
t8<-data.frame(letter=unique(mydf$a),mean=rep(0,26))
t8[,2] <- foreach(i=t8$letter, .combine='c') %dopar% {
 mean(mydf[mydf$a==i,2])
}
eltime<-proc.time()-ti1
eltime

The RgoogleMaps package makes it very simple to display Google Maps or Google satellite images in R, centered and zoomed on a location of your choice. You can also relatively easily overlay spatial data from your R workspace on the map (see below). The getGeoCode() function transforms a text search for a postal code or place name into latitude/longitude coordinates using Google web services.

library(RgoogleMaps)
myhome=getGeoCode('McGill University, Montreal');
mymap<-GetMap(center=myhome, zoom=14)
PlotOnStaticMap(mymap,lat=myhome['lat'],lon=myhome['lon'],cex=5,pch=10,lwd=3,col=c('red'));

The rOpenSci project supports the development of a number of R packages that facilitate access to online data sources. Among them is taxize, which can be used to get taxonomic information from many different databases, including taxonomic synonyms, hierarchies, common names, and more, from a dozen different sources.

library(taxize)
spp<-tax_name(query=c("american beaver"),get="species")
fam<-tax_name(query=c("american beaver"),get="family")
correctname <- tnrs(c("fraxinus americanus"))
cla<-classification("acer rubrum", db = 'itis')

Another useful package from rOpenSci is spocc, which can be used to search for species occurrence data from a number of sources, including the Global Biodiversity Information Facility (GBIF), a worldwide database containing hundreds of millions of species occurrences from collections and field data.

library(spocc)
occ_data <- occ(query = 'Acer nigrum', from = 'gbif')
mapggplot(occ_data)

Combine spocc and RgoogleMaps

occ_data <- occ(query = 'Puma concolor', from = 'gbif')
occ_data_df=occ2df(occ_data)
occ_data_df<-subset(occ_data_df,!is.na(latitude) & latitude!=0)
mymap<-GetMap(center=c(mean(occ_data_df$latitude),mean(occ_data_df$longitude)), zoom=2)
PlotOnStaticMap(mymap,lat=occ_data_df$latitude,lon=occ_data_df$longitude,cex=1,pch=16,lwd=3,col=c('red'));

The geonames package connects R to GeoNames.org, an online database of place names and toponyms.

library(geonames)
options(geonamesUsername="glaroc")
# Retrieve place names which contain the name "Mont Saint-Hilaire"
res<-GNsearch(q="Mont Saint-Hilaire")
res[,c('toponymName','fclName')]
# Extract all cities within a bounding box defined by the coordinates of the four corners.
dc<-GNcities(45.4, -73.55, 45.7, -73.6, lang = "en", maxRows = 10)
dc[,c('toponymName')]