Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
r_workshop5 [2017/11/29 09:21]
ph.pereirabraga
r_workshop5 [2018/11/14 16:37] (current)
katherinehebert [Pay attention to the next tips!]
Line 13: Line 13:
 **Summary:​** This workshop focuses on basic programming in R.  In this workshop, you will learn how to use control flow (for loops, if, while) methods to prevent code repetition, facilitate organization and run simulations. ​ In addition, you will learn to write your own functions, and tips to program efficiently. The last part of the workshop will discuss packages that will not covered elsewhere in this workshop series, but that may be of interest to participants. **Summary:​** This workshop focuses on basic programming in R.  In this workshop, you will learn how to use control flow (for loops, if, while) methods to prevent code repetition, facilitate organization and run simulations. ​ In addition, you will learn to write your own functions, and tips to program efficiently. The last part of the workshop will discuss packages that will not covered elsewhere in this workshop series, but that may be of interest to participants.
  
-Link to associated Prezi: [[https://​prezi.com/​xuu2rphp5wg4/​|Prezi]]+**Link to new [[https://​qcbsrworkshops.github.io/​Workshops/​workshop05/​workshop05-en/​workshop05-en.html|Rmarkdown presentation]]** 
 + 
 +Link to old [[https://​prezi.com/​xuu2rphp5wg4/​|Prezi ​presentation]]
  
 Download the R script for this lesson: ​ Download the R script for this lesson: ​
Line 21: Line 23:
 ===== Learning Objectives ===== ===== Learning Objectives =====
  
-  - Control ​flow +  - Learning what is **control ​flow;** 
-  - Writing functions in R+  - Writing ​your first functions in R
   - Speeding up your code   - Speeding up your code
-  - Useful packages for biologists+  - Useful ​packages for biologists
  
-===== 1. Control Flow =====+===== Control Flow =====
  
-A program is a sequence of instructions that tells your computer how it should do a given task. You can compare ​program to a recipe that tells you how to cook your dinner. In this example, you are the computer, your recipe is the program that tells you how to do your dinner, and your dinner is the task your program is doing.+Program flow control ​can be simply defined as the order in which a program is executed.
  
-Programs exist to make your life simple! Whenever ​programs ​are well-structured,​ they tend to decrease the complexity and time of the task at hand, they are clear and will likely increase your productivity.+**Why is it advantageous ​to have structured ​programs?**
  
-The ordered structure in which the instructions ​of a program are executed is defined as **control flow**.+- It decreases ​the complexity and time of the task at hand;
  
-The main elements of control flow are:+- This logical structure also means that the code has increased clarity;
  
-**SELECTION**:​ Executing statements depending ​on a condition+It also means that many programmers can work on one program.
  
-and +**This means increased productivity** 
 +----
  
-- **ITERATION**:​ Executing statements multiple times+Flowcharts can be used to plan programs and represent their structure
  
 +{{ :​flowchart1.png?​700 |}}
  
-----+===== Representing structure =====
  
 +The two basic building blocks of codes are the following:
  
-Throughout this workshop, we will be making use of **logical operators**.   +**Selection** 
-Note that to test whether something is equal to something else, you must use a "​=="​.+
  
-| == | equal to | +Program'​s execution determined by statements
-| != | not equal to | +
-| !x | not x | +
-| < | less than | +
-| < = | less than or equal to | +
-| > | greater than | +
-| >= | greater than or equal to | +
-| x & y | x AND y | +
-| x%%|%%y| x OR y | +
-| isTRUE(x) | test if X is true |+
  
 +<code rsplus>
 +if
 +if else
 +</​code>​
  
-==== Decision making: if and if/else statements ==== 
  
-Decision making is an essential part of programming. This can be done in R using the ''​if''​ and ''​if ... else''​ statements. They will be very useful for: +**Iteration**
-  ​checking for problems or violations of assumptions +
-  ​treating different rows of your data frame differently  +
-  ​testing for the existence of a file or variable+
  
 +Repetition, where the statement will **loop** until a criteria is met
  
-**Syntax**+<code rsplus>​ 
 +for 
 +while 
 +repeat 
 +</​code>​
  
-<code rsplus>​ +===== Decision making =====
-if (condition) { +
-  expression ​     # The expression can be any command that you would like R to perform. +
-}     +
  
-if (condition) ​+''​if''​ **statement** 
-  expression  + 
-} else {+<code rsplus>​ 
 +if(condition) {
   expression   expression
 } }
Line 84: Line 82:
  
  
 +{{ :if.png?300 |}}
  
  
-For example, ​+''​if else''​ **statement** 
 + 
  
 <code rsplus> <code rsplus>
-if ((2+2) == 4) { +if(condition) { 
-  ​print("​Arithmetic works."​) ​ +  ​expression 1 
-} +else 
- +  ​expression 2
-if ((2+1) == 4) +
-  ​print("​Arithmetic works."​) ​+
 } }
 </​code>​ </​code>​
  
-Curly brackets ​} are used so that R knows to expect more input. ​ When using brackets, R waits to evaluate the command until the brackets have been closed. ​ If the curly brackets are not used, R may not behave as you are expecting. ​ For example, try:+{:ifelse.png?​300 |}}
  
-<code rsplus>​ +**What ​if you want to test more than one condition?​**
-if ((2 + 1) == 4) print("​Arithmetic works."​) +
-else print("​Houston,​ we have a problem."​) +
-</​code>​+
  
-The ''​else'' ​statement does not work because R evaluates the first line without knowing your command is incomplete. +  * ''​if''​ and ''​if ​else'' ​test a single condition 
- +  * You can also use ''​ifelse''​ function to: 
-Instead, ​use:+    * test a vector of conditions;​ 
 +    * apply a function only under certain conditions.
  
-<code rsplus> 
-if ((2 + 2) == 4) { 
-  print("​Arithmetic works."​) ​ # R does not evaluate this expression yet because the bracket isn't closed. 
-} else { 
-  print("​Houston,​ we have a problem."​) 
-}  # Since all brackets are now closed, R will evaluate the commands. 
-</​code>​ 
-If you want to read in a vector and check each element of the vector one by one without using indices or a loop, you can use the ''​ifelse()''​ function. ​ 
- 
-For example, 
 <code rsplus> <code rsplus>
 a <- 1:10 a <- 1:10
 ifelse(a > 5, "​yes",​ "​no"​) ifelse(a > 5, "​yes",​ "​no"​)
 +
 +a <- (-4):5
 +sqrt(ifelse(a >= 0, a, NA))
 </​code>​ </​code>​
  
-You can also use the ''​ifelse()''​ function within another function to carry out an operation only under certain conditions: 
  
 +**Nested** ''​if else''​ **statement**
  
-For example, 
 <code rsplus> <code rsplus>
-a <- (-4):5 +if (test_expression1{ 
-sqrt(ifelse(a >= 0, a, NA))+statement1 
 +} else if (test_expression2) { 
 +statement2 
 +} else if (test_expression3
 +statement3 
 +} else { 
 +statement4 
 +}
 </​code>​ </​code>​
  
 +{{ :​nested_ifelse.png |}}
 +
 +==== Challenge 1 ==== 
  
------ 
-**Exercise 1** 
  
 <code rsplus> <code rsplus>
Line 144: Line 140:
 </​code>​ </​code>​
  
-1. Use an ''​if''​ statement to print "meow" ​if Paws is a "cat".+1.Use an ''​if''​ statement to print meow” if Paws is a cat”. 
 + 
 +++++ Challenge 1 : Answer| 
 +2.- Use an ''​if else''​ statement to print “woof” if you supply an object that is a “dog” and “meow” if it is not. Try it out with Paws and Scruffy. 
 + 
 +3.- Use the ''​ifelse''​ function to display “woof” for animals that are dogs and “meow” for animals that are cats.
  
-++++ Exercise 1.1 : Answer| 
-<code rsplus> 
-if(Paws == "​cat"​) { 
-  print("​meow"​) 
-} 
-</​code>​ 
 ++++ ++++
  
-2. Use an ''​if''/''​else''​ statement to print "​woof"​ if you supply an object that is a "​dog"​ and "​meow"​ if it is not.  Try it out with Paws and Scruffy.+==== Beware of R’s expression parsing! ==== 
  
 +Use curly brackets ''​{}''​ so that R knows to expect more input. Try:
  
-++++ Exercise 1.2 : Answer| 
 <code rsplus> <code rsplus>
-if(Scruffy == "​dog"​+if (2+2) == print("​Arithmetic works.") 
-  print("​woof"​)  +else print("​Houston, we have a problem.")
-} else { +
-  print("​meow"​) +
-+
- +
-if(Paws ​== "​dog"​) { +
-  ​print("​woof")  +
-else +
-  ​print("​meow") +
-}+
 </​code>​ </​code>​
-++++ 
  
-3. Use the ''​ifelse()''​ function to display "​woof"​ for animals ​that are dogs and "​meow"​ for animals that are cats.+//This doesn'​t work because R evaluates ​the first line and doesn't know that you are going to use an else statement//​ 
 + 
 +Instead use:
  
-++++ Exercise 1.3 : Answer| 
 <code rsplus> <code rsplus>
-ifelse(animals ​== "dog", "woof", ​"meow")+if (2+2 == 4) { #<< 
 +  print("Arithmetic works."
 +} else { #<< 
 +  print("Houstonwe have a problem.") 
 +} #<<
 </​code>​ </​code>​
-++++ + 
------ +==== Remember the logical operators ====  
-==== Loops ====+ 
 +== | equal to | 
 +| !| not equal to | 
 +| !x | not x | 
 +| < | less than | 
 +| < | less than or equal to | 
 +| > | greater than | 
 +| >| greater than or equal to | 
 +| x & y | x AND y | 
 +| x%%|%%y| x OR y | 
 +| isTRUE(x) | test if X is true | 
 + 
 + 
 +==== Iteration ====  
 + 
 +Every time some operations have to be repeated, a loop may come in handy
  
 Loops are good for: Loops are good for:
 +
   * doing something for every element of an object   * doing something for every element of an object
   * doing something until the processed data runs out   * doing something until the processed data runs out
Line 190: Line 196:
   * iterating a calculation until it converges   * iterating a calculation until it converges
  
-===for ​loops===+==== for loop ==== 
  
-The ''​for'' ​loop is the most common type of loop.  Use a ''​for''​ loop to execute a block of code a known number of times.+for loop works in the following way:
  
-**Syntax** 
 <code rsplus> <code rsplus>
-for (variable ​in sequence) { +for(val in sequence) { 
-  ​expression +  ​statement 
-}+  }
 </​code>​ </​code>​
  
-Each time the series of commands in a loop are executed, it is known as an iteration +{{ :forloop.png?300 |}}
  
-For example: 
-<code rsplus> 
-for (i in 1:5) { 
-  print(i) 
-} 
-</​code>​ 
- 
-In this example, our sequence has 5 elements (1, 2, 3, 4, 5). R will then evaluate the expression 5 times. The variable ''​i''​ is specific to our loop and will be only accessible inside it. Its value will be each of the element of our sequence. Therefore, in the first iteration, R will replace each instance of ''​i''​ with 1.  In the second iteration ''​i''​ would be replaced with 2, and so on.  ​ 
  
-The letter ​'i' ​can be replaced with any variable name and the sequence can be almost anything, even a list of vectors. +The letter i can be replaced with any variable name and the sequence can be almost anything, even a list of vectors.
- +
-Try:+
  
 <code rsplus> <code rsplus>
-for (m in 4:10) { +# Try the commands below and see what happens:
-  print(m * 2)  +
-}+
  
 for (a in c("​Hello",​ "​R",​ "​Programmers"​)) { for (a in c("​Hello",​ "​R",​ "​Programmers"​)) {
-  print(a) ​+  print(a)
 } }
  
 for (z in 1:30) { for (z in 1:30) {
-  a <- rnorm(n = 1, mean = 5, sd = 2) # draw a value from a normal distribution with mean 5 and standard deviation 2+  a <- rnorm(n = 1, mean = 5, sd = 2)
   print(a)   print(a)
 } }
Line 236: Line 229:
 </​code>​ </​code>​
  
-Loops are often used to loop over a dataset. ​ We will use loops to perform functions on the CO2 dataset which is built in to R. The CO2 dataset contains concentration and uptake values for plants located in Quebec and Mississippi for an experiment on the cold tolerance of the grass species //​Echinochloa crus-galli//​. Half the plants of each type were chilled overnight before the experiment was conducted. Note that this is the dataset we used in workshop 2.  The code below provides a couple of examples of how loops can be used. 
  
 +In the example below, R would evaluate the expression 5 times:
  
 <code rsplus> <code rsplus>
-data(CO2)+for(i in 1:5) { 
 +  expression 
 +
 +</​code>​ 
 + 
 +In the example, every instance of m is being replaced by each number between 1:10, until it reaches the last element of the sequence. 
 + 
 + 
 +<code rsplus>​ 
 +for(m in 1:10) { 
 +  print(m*2) 
 +
 +</​code>​ 
 + 
 +<code rsplus>​ 
 +for(m in 1:5) { 
 +  print(m*2) 
 +
 +</​code>​ 
 + 
 +<code rsplus>​ 
 +for(m in 6:10) { 
 +  print(m*2) 
 +
 +</​code>​ 
 + 
 + 
 +<code rsplus>​ 
 +x <- c(2,​5,​3,​9,​6) 
 +count <- 0 
 +for (val in x) { 
 +  if(val %% 2 == 0) { 
 +    count = count+1  
 +  } 
 +
 +print(count) 
 +</​code>​ 
 + 
 +{{ :​forexample.png?​600 |}} 
 + 
 +For loops are often used to loop over a dataset. We will use loops to perform functions on the CO2 dataset which is built in R. 
 + 
 +<code rsplus>​ 
 +data(CO2) ​# This loads the built in dataset 
 for (i in 1:​length(CO2[,​1])) { # for each row in the CO2 dataset for (i in 1:​length(CO2[,​1])) { # for each row in the CO2 dataset
-  print(CO2$conc[i]) #print the CO2 concentration+  print(CO2$conc[i]) # print the CO2 concentration
 } }
  
 for (i in 1:​length(CO2[,​1])) { # for each row in the CO2 dataset for (i in 1:​length(CO2[,​1])) { # for each row in the CO2 dataset
   if(CO2$Type[i] == "​Quebec"​) { # if the type is "​Quebec"​   if(CO2$Type[i] == "​Quebec"​) { # if the type is "​Quebec"​
-    print(CO2$conc[i]) #print the CO2 concentration ​} +    print(CO2$conc[i]) # print the CO2 concentration  
-  }+    }
 } }
 +</​code>​
  
-Tip 1 : to get the number of rows of a data frame, we can also use the function nrow+**Tip 1.** To loop over the number of rows of a data frame, we can use the function ​''​nrow()''​ 
 + 
 +<code rsplus>
 for (i in 1:​nrow(CO2)) { # for each row in the CO2 dataset for (i in 1:​nrow(CO2)) { # for each row in the CO2 dataset
-  print(CO2$conc[i]) #print the CO2 concentration+  print(CO2$conc[i]) # print the CO2 concentration
 } }
 +</​code>​
  
-Tip 2 If we want to perform operations on only the elements of one column, we can directly +**Tip 2.** If we want to perform operations on the elements of one column, we can directly iterate over it 
-iterate over it. + 
-for (in CO2$conc) { # for every element ​of the concentration ​column of the CO2 dataset +<code rsplus> 
-  print(i) # print the ith element+for (in 1:CO2$conc) { # for each row of the column ​"​conc"​of the CO2 df 
 +  print(p) # print the p-th element
 } }
 </​code>​ </​code>​
  
-The expression ​part of the loop can be almost anything and is usually a compound statement containing many commands.+**Tip 3.** The expression ​within ​the loop can be almost anything and is usually a compound statement containing many commands.
  
 <code rsplus> <code rsplus>
Line 272: Line 314:
 </​code>​ </​code>​
  
-===while loops and repeat ​loops ===+==== for loops within for loops ====
  
-''​while''​ loops and ''​repeat''​ loops operate similarly to ''​for''​ loops. ​ Once you understand how ''​for''​ loops work, you should be able to use any type of loop.  You will see some examples of ''​while''​ loops and ''​repeat''​ loops in the next section. ​ Note that in many cases you can accomplish the same task many different ways-- either using for loops, while loops, or repeat loops as will be demonstrated below. ​  +In some cases, you may want to use nested loops to accomplish a task. When using nested loops, it is important to use different variables as counters for each of your loops. Here we used i and n
- +
-=== Nested loops === +
- +
-In some cases, you may want to use nested loops to accomplish a task.  When using nested loops, it is important to use different variables as counters for each of your loops (here we used i and n).  ​+
  
 <code rsplus> <code rsplus>
-for (i in 1:5) { +for (i in 1:3) { 
-  for (n in 1:5) {+  for (n in 1:3) {
     print (i*n)     print (i*n)
   }   }
 } }
 +</​code>​
  
 +==== Getting good: using the ''​apply()''​ family ==== 
 +
 +R disposes of the ''​apply()''​ function family, which consists of vectorized functions that aim at minimizing your need to explicitly create loops.
 +
 +
 +''​apply()''​ can be used to apply functions to a matrix
 +
 +<code rsplus>
 +(height <- matrix(c(1:​10,​ 21:​30), ​
 +                 nrow = 5, 
 +                 ncol = 4))
 +}
 +apply(X = height, ​
 +      MARGIN = 1, 
 +      FUN = mean)    ​
 +?​apply ​      
 </​code>​ </​code>​
  
 +==== apply() ==== 
  
-=== The apply() family === 
  
-To prevent the problem of growing objects in loops and to facilitate the application of functions on objects like dataframes, R offers us what we will call the apply functions ​(because they all have "​apply"​ in their name...). It is group of functions that will execute another ​function ​on a given object type. Their use only differ depending on the type of object the function is applied ​to or the type of the return value.+''​lapply()''​ applies ​a function to every element ​of a **list**.
  
-The ''​...apply()''​ functions are not always the best choice performance wise as they will usually hide a ''​for''​ loop written in R in their code. Howeverthey can greatly reduce the programming time needed by the ease-of-use they provide.+It may be used for other objects like dataframeslists or vectors.
  
-One of the most popular ​is simply ''​apply()''​ that executes a function on the rows or columns ​of a dataframe or a matrix. This function takes 3 main arguments:​ +The output returned ​is a **list** ​(explaining the “l” in lapplyand has the same number ​of elements as the object ​passed ​to it.
-  * the object ​on which we want to apply the function +
-  * the margin or the subscript on which we want to apply the function1 is for rows, 2 is for columns +
-  * the function to apply +
-  * the eventual arguments to the function supplied+
  
-All apply functions work on the same model. Performance wise, the most interesting are probably ''​lapply()''​ and ''​vapply()''​ since they are primitive function written in C. ''​lapply()''​ returns a list of the same length as the original object. ''​vapply()''​ allows you to specify the format of the value returned by your function. It can be a vector or an array. Other functions include ''​sapply()''​which works as ''​lapply()''​but it tries to simplify the output to the most elementary data structure that is possible, ''​mapply()''​which vectorize arguments to a function that is not usually accepting vectors as arguments, and ''​tapply()'',​ which is useful when we need to break up a vector into groups defined by some classifying factor, compute a function on the subsets, and return the results in a convenient form. You can even specify multiple factors as the grouping variable.+<code rsplus>​ 
 +SimulatedData <- list( 
 +  SimpleSequence = 1:4,  
 +  Norm10 = rnorm(10), 
 +  Norm20 = rnorm(20, 1), 
 +  Norm100 = rnorm(1005)
 +</​code>​
  
-Note that this type of iteration uses vectorisationwe will talk about it later.+<code rsplus>​ 
 +# Apply mean to each element  
 +## of the list  
 +lapply(SimulatedDatamean) 
 + 
 +SimulatedData <- list(SimpleSequence = 1:4,  
 +             ​Norm10 = rnorm(10),  
 +             ​Norm20 = rnorm(20, 1),  
 +             ​Norm100 = rnorm(100, 5)) 
 + 
 +# Apply mean to each element of the list  
 +lapply(SimulatedData,​ mean) 
 +</​code>​ 
 + 
 +==== sapply() ==== 
 + 
 +''​sapply()''​ is a ‘wrapper’ function for ''​lapply()'',​ but returns a simplified output as a vector, instead of a list. 
 + 
 +The output returned is a list (explaining the “l” in lapply) and has the same number of elements as the object passed to it.
  
 <code rsplus> <code rsplus>
-df <- data.frame(1:100101:200+SimulatedData ​<- list(SimpleSequence = 1:4 
-# Sum on rows +             ​Norm10 = rnorm(10) 
-apply(df, 1, sum) +             Norm20 = rnorm(20, 1),  
-# Mean on columns +             Norm100 = rnorm(1005)
-apply(df2, mean+ 
-we can also supply additionnal arguments ​to the function +Apply mean to each element of the list  
-apply(df, 2, mean, na.rm=TRUE+sapply(SimulatedData, mean
-# we can also define a function directly. The first argument is always what  +</​code>​ 
-# we iterate on. Here each row is treated ​as a vector ​of numbers, as we can see with + 
-# the str() function +==== mapply(====  
-apply(df1function(x){str(x)}) + 
-# We can also add other arguments +''​mapply()''​ works as a multivariate version ​of ''​sapply()''​. 
-apply(df1function(xy){x[2] - x[1] + y}y=5+ 
-# We add a column with a categorical variable to use as a factor to group the data +It will apply a given function to the first element of each argument firstfollowed by the second elementand so on. For example: 
-df[,3]<-sample(c("​A"​,"​B"​,"​C"​),nrow(df),replace=T+ 
-We apply the mean() function on column 1 for each group from column 3 +<code rsplus>​ 
-tapply(df[,1],df[,3],mean)+lilySeeds <- c(8065892321
 +poppySeeds ​<- c(2035117779) 
 +  ​ 
 +Output 
 +mapply(sumlilySeedspoppySeeds)
 </​code> ​ </​code> ​
 +
 +
 +==== tapply() ==== 
 +
 +''​tapply()''​ is used to apply a function over subsets of a vector.
 +
 +It is primarily used when the dataset contains dataset contains different groups (i.e. levels/​factors) and we want to apply a function to each of these groups. ​
  
 <code rsplus> <code rsplus>
-a <- list(1:100, 101:200+head(mtcars
-# apply mean to each element of the list + 
-lapply(a, mean)  ​we get a list as a result +# get the mean hp by cylinder groups 
-unlist(lapply(a, ​mean)) # use unlist to get a vector instead +tapply(mtcars$hpmtcars$cyl, FUN = mean) 
-vapply(a, mean, 0# the result of mean is a single number, we tell vapply our result will be a number +</​code>​ 
-</​code> ​ + 
 +==== Challenge 2 ====  
 + 
 + 
 +You have realized that your tool for measuring uptake was not calibrated properly at Quebec sites and all measurements are 2 units higher than they should be.
  
 +1.- Use a loop to correct these measurements for all Quebec sites.
  
------ +2.Use a vectorisation-based method to calculate the mean CO2-uptake in both areas.
-**Exercise 2**+
  
-1. You have realized that your tool for measuring uptake was not calibrated properly at Quebec sites and all measurements are 2 units higher than they should be.  Use a loop to correct these measurements for all Quebec sites. 
  
-2. Use a vectorisation-based method to calculate the mean CO2-uptake in both areas. ​+++++ Challenge ​: Answer|
  
-++++ Exercise 2 Answer|+1.-Using for and if to correct the measurements:
 <code rsplus> <code rsplus>
 for (i in 1:​length(CO2[,​1])) { for (i in 1:​length(CO2[,​1])) {
Line 349: Line 438:
   }   }
 } }
-tapply(CO2$uptake,​CO2$Type,​mean)+</​code>​  
 + 
 +2.- Using tapply() to calculate the mean for each group: 
 + 
 +<​code>​ 
 +tapply(CO2$uptake,​ CO2$Type, mean)
 </​code>​ </​code>​
 ++++ ++++
------+==== Modifying iterations ==== 
  
-Make sure you reload the data so that we are working with the raw data for the rest of the exercise:  +Normally, loops iterate over and over until they finish.
-<file rsplus>​ +
-data(CO2) +
-</​file>​ +
-\\ +
-==== Loop Modifications ====+
  
-Normally, loops iterate over and over until they finish. ​ To change this behavior, ​you can use ''​break''​ which breaks out of the loops execution entirely, or ''​next'',​ which stops executing the current iteration and jumps to the next iteration.+Sometimes ​you may be interested in breaking this behaviour.
  
-For example,+For example, ​you may want to tell R to stop executing the iteration when it reaches a given element or condition.
  
-<file rsplus>+You may also want R to jump certain elements when certain conditions are met.
  
-# Print the CO2 concentrations for "​chilled"​ treatments ​and keep count of how many replications there were.+For this, we will introduce ''​break'',​ ''​next'' ​and ''​while''​.
  
-count <- 0 # count is being set at zero so that we can edit this object in the loop to keep track of how many iterations ​were performed+==== Modifying ​iterations: break ====
  
-for (i in 1:length(CO2[,1])) { +<code rsplus>  
-  if (CO2$Treatment[i] == "​nonchilled"​) next #Skip to next iteration if treatment is nonchilled+for(val in x) { 
 +  if(condition) { break } 
 +  statement 
 +
 +</​code>​  
 + 
 +{{ :break.png |}} 
 + 
 + 
 +==== Modifying iterations: next ====  
 + 
 +<code rsplus>  
 +for(val in x) { 
 +  if(condition) { next } 
 +  statement 
 +
 +</​code>​  
 + 
 +{{ :next.png |}} 
 + 
 +Print the $CO_{2}$ concentrations for "​chilled"​ treatments and keep count of how many replications were done. 
 +<code rsplus>​ 
 +count <- 0 
 +  
 +for (i in 1:nrow(CO2)) { 
 +  if (CO2$Treatment[i] == "​nonchilled"​) next  
 +  ​# Skip to next iteration if treatment is nonchilled
   count <- count + 1   count <- count + 1
-  ​print(CO2$conc[i])+#  ​print(CO2$conc[i])
 } }
 print(count) # The count and print command were performed 42 times. print(count) # The count and print command were performed 42 times.
 +</​code>​
 +
 +<code rsplus> ​
 +sum(CO2$Treatment == "​nonchilled"​)
 +</​code> ​
 +
 +==== Modifying iterations: break ==== 
  
-This could be equivalently written using a repeat loop+This could be equivalently written using a repeat loop and break
  
 +<code rsplus> ​
 count <- 0 count <- 0
 i <- 0 i <- 0
Line 387: Line 510:
       count <- count + 1       count <- count + 1
       print(CO2$conc[i])       print(CO2$conc[i])
-      if (i == length(CO2[,1])) break     # stop looping +      if (i == nrow(CO2)) break     # stop looping 
-    }   +    } 
 print(count) ​ print(count) ​
 +</​code> ​
  
-### This could also be written using a while loop+==== Modifying iterationswhile ====
  
 +This could also be written using a ''​while''​ loop:
 +<code rsplus> ​
 i <- 0 i <- 0
 count <- 0 count <- 0
-while (i < length(CO2[,1]))+while (i < nrow(CO2))
 { {
   i <- i + 1   i <- i + 1
Line 404: Line 529:
 } }
 print(count) ​ print(count) ​
-</file>+</code
  
------ +==== Challenge ​====
-**Exercise ​3**+
  
-1. You have realized that your tool for measuring concentration ​didn'​t ​work properly. ​ At Mississippi sites, concentrations less than 300 were measured correctly but concentrations ​>= 300 were overestimated by 20 units.  Use a loop to correct these measurements for all Mississippi sites.+You have realized that your tool for measuring concentration ​did not work properly. 
 + 
 +At Mississippi sites, concentrations less than 300 were measured correctlybut concentrations ​equal or higher than 300 were overestimated by 20 units
 + 
 +Your mission is to use a loop to correct these measurements for all Mississippi sites. 
 + 
 +Tip. Make sure you reload the data so that we are working with the raw data for the rest of the exercise: 
 + 
 +<code rsplus>  
 +data(CO2) 
 +</​code> ​
  
-++++ Exercise ​3 : Answer| +++++ Challenge ​3: Answer|  
-<file rsplus>​ + 
-for (i in 1:length(CO2[,1])) {+<code rsplus>​ 
 +for (i in 1:nrow(CO2)) {
   if(CO2$Type[i] == "​Mississippi"​) {   if(CO2$Type[i] == "​Mississippi"​) {
     if(CO2$conc[i] < 300) next      if(CO2$conc[i] < 300) next 
Line 419: Line 554:
   }   }
 } }
 +</​code> ​
 +
 +*Note: We could also have written it in this way, which is more concise and clear*
 +
 +<code rsplus> ​
  
-# Note : We could also have written it that way, which is more concise and clear 
 for (i in 1:​nrow(CO2)) { for (i in 1:​nrow(CO2)) {
   if(CO2$Type[i] == "​Mississippi"​ && CO2$conc[i] >= 300) {   if(CO2$Type[i] == "​Mississippi"​ && CO2$conc[i] >= 300) {
Line 426: Line 565:
   }   }
 } }
 +</​code> ​
  
-</​file>​ 
 ++++ ++++
------ 
- 
-Make sure you reload the data so that we are working with the raw data for the rest of the exercise: ​ 
-<file rsplus> 
-data(CO2) 
-</​file>​ 
  
 +==== Edit a plot using for and if ====
  
-**Using flow control to make a complex ​plot**+Let'​s ​plot uptake vs concentration with points of different colors according to their type (Quebec or Mississippi) and treatment (chilled or nonchilled).
  
-The idea here is that we have a dataset we want to plot, with concentration and uptake values, but each point has a type (Quebec or Mississippi) and a treatment ("​chilled"​ or "​nonchilled"​) and we want to plot the points differently for these cases.  ​ 
  
-You can read more about mathematical typesetting with ''?​plotmath'',​ and more about the way that different colors, sizes, rotations, etc. are used in ?par. +<code rsplus>  
- +plot(x = CO2$conc, y = CO2$uptake, type = "​n",​ cex.lab=1.4, ​ 
-<file rsplus> +     xlab = "CO2 concentration",​ ylab = "CO2 uptake"​) ​ 
- +# Type "​n"​ tells R to not actually plot the points.
-head(CO2) # Look at the dataset +
-unique(CO2$Type)  +
-unique(CO2$Treatment) +
- +
-# plot the dataset, showing each type and treatment as a different colour +
- +
-plot(x=CO2$conc,​ y=CO2$uptake,​ type="​n",​ cex.lab=1.4,​ xlab="​CO2 concentration",​ ylab="​CO2 uptake"​) # Type "​n"​ tells R to not actually plot the points. +
-     +
 for (i in 1:​length(CO2[,​1])) { for (i in 1:​length(CO2[,​1])) {
   if (CO2$Type[i] == "​Quebec"​ & CO2$Treatment[i] == "​nonchilled"​) {   if (CO2$Type[i] == "​Quebec"​ & CO2$Treatment[i] == "​nonchilled"​) {
-    points(CO2$conc[i],​ CO2$uptake[i],​ col="​red",​type="​p")+    points(CO2$conc[i],​ CO2$uptake[i],​ col = "​red"​)
   }   }
   if (CO2$Type[i] == "​Quebec"​ & CO2$Treatment[i] == "​chilled"​) {   if (CO2$Type[i] == "​Quebec"​ & CO2$Treatment[i] == "​chilled"​) {
-    points(CO2$conc[i],​ CO2$uptake[i],​ col="​blue"​)+    points(CO2$conc[i],​ CO2$uptake[i],​ col = "​blue"​)
   }   }
   if (CO2$Type[i] == "​Mississippi"​ & CO2$Treatment[i] == "​nonchilled"​) {   if (CO2$Type[i] == "​Mississippi"​ & CO2$Treatment[i] == "​nonchilled"​) {
-    points(CO2$conc[i],​ CO2$uptake[i],​ col="​orange"​)+    points(CO2$conc[i],​ CO2$uptake[i],​ col = "​orange"​)
   }   }
   if (CO2$Type[i] == "​Mississippi"​ & CO2$Treatment[i] == "​chilled"​) {   if (CO2$Type[i] == "​Mississippi"​ & CO2$Treatment[i] == "​chilled"​) {
-    points(CO2$conc[i],​ CO2$uptake[i],​ col="​green"​)+    points(CO2$conc[i],​ CO2$uptake[i],​ col = "​green"​)
   }   }
 } }
 +</​code> ​
  
-</​file>​ 
  
-Note, that there are other ways to create a complex plot.  ggplot2 is a useful package for creating complex plots that was covered in workshop 3.+==== Challenge 4 ==== 
  
------ 
-**Exercise 4** 
  
-1. Generate a plot of showing concentration versus uptake where each plant is shown using a different ​colour ​point. ​ Bonus points for doing it with nested loops!+Generate a plot of showing concentration versus uptake where each plant is shown using a different point.
  
-++++ Exercise 4 : Answer| +**Bonus ​points** for doing it with nested loops!
-<file rsplus>​ +
-plot(x=CO2$conc,​ y=CO2$uptake,​ type="​n",​ cex.lab=1.4,​xlab="​CO2 concentration",​ ylab="​CO2 uptake"​) # Type "​n"​ tells R to not actually plot the points.+
  
 +
 +++++ Challenge 4 Answer | 
 +
 +<code rsplus>
 +plot(x = CO2$conc, y = CO2$uptake, type = "​n",​ cex.lab=1.4,​ xlab = "CO2 concentration",​ ylab = "CO2 uptake"​) # Type "​n"​ tells R to not actually plot the points.
 plants <- unique(CO2$Plant) plants <- unique(CO2$Plant)
  
-for (i in 1:​length(CO2[,​1])){+for (i in 1:​nrow(CO2)){ for (p in 1:length(plants)) { if (CO2$Plant[i] == plants[p]) { points(CO2$conc[i], 
 + ​CO2$uptake[i],  
 +col = p) } } } 
 +</​code>​ 
 + 
 +<code rsplus>  
 +plot(x = CO2$conc, y = CO2$uptake, type = "​n",​ cex.lab = 1.4,  
 +     xlab = "CO2 Concentration",​ ylab = "CO2 Uptake"​)  
 + 
 +plants <- unique(CO2$Plant) 
 +  
 +for (i in 1:nrow(CO2)){
   for (p in 1:​length(plants)) {    for (p in 1:​length(plants)) { 
     if (CO2$Plant[i] == plants[p]) {     if (CO2$Plant[i] == plants[p]) {
-      points(CO2$conc[i],​ CO2$uptake[i],​ col=p, type="​p"​)+      points(CO2$conc[i],​ CO2$uptake[i],​ col = p)
     }     }
   }   }
 } }
-</file>+</code
 ++++ ++++
------ 
  
-===== 2. Writing ​Functions =====+==== Writing ​functions ​====
  
-==== Why write functions? ​====+** Why write functions? ​**
  
 Much of the heavy lifting in R is done by functions. They are useful for: Much of the heavy lifting in R is done by functions. They are useful for:
-  * performing a task repeatedly, but configurably 
-  * making your code more readable 
-  * make your code easier to modify and maintain 
-  * sharing code between different analyses 
-  * sharing code with other people 
-  * modifying R’s built-in functionality 
  
 +- Performing a task repeatedly, but configurably;​
  
-But what exactly is a function? A function is essentially a black box that transforms data. It takes variables as entry called arguments - use R code to process them and then can optionally give back a return value.+Making your code more readable;
  
-{{::​fonction_schema_eng2.png|200}}+- Make your code easier to modify and maintain;
  
-==== How to write functions? ====+- Sharing code between different analyses;
  
-Here is the basic syntax of a function:+- Sharing code with other people;
  
-<code rsplus>+- Modifying R’s built-in functionality. 
 + 
 +=== What is a function? ==== 
 + 
 +=== Syntax of a function ===  
 +<code rsplus> ​
 function_name <- function(argument1,​ argument2, ...) { function_name <- function(argument1,​ argument2, ...) {
   expression... ​ # What we want the function to do   expression... ​ # What we want the function to do
-  return(value) ​ # Optional. If you want to access to the result of your function+  return(value) ​ # Optional
 } }
-</​code>​+</​code> ​
  
-=== Arguments ===+=== Arguments ​of a function ​=== 
  
-Arguments are the entry values of your function. They are the information your function needs to be able to perform correctly. A function can have between 0 and an infinity of arguments. +<code rsplus>  
-From a technical standpoint, arguments are variables like any other so you can use them as such. The only difference is that they are available only inside your function. Their value will be determined at the moment your function will be called. +function_name ​<- function(argument1, argument2, ...) { 
- +  ​expression... ​  
-Let's start with a really basic function we will call **print_number** that will take as argument a number and will print it. +  return(value
- +
-<code rsplus>​ +
-print_number ​<- function(number) { +
-  ​print(number)+
 } }
 </​code> ​ </​code> ​
  
-Now to use it we call it like any other function.+Arguments are the entry values of your function ​and will have the information your function needs to be able to perform correctly.
  
-<code rsplus>​ +A function can have between 0 and an infinity of arguments. See the following example: ​
-print_number(2) +
-print_number(231) +
-</​code>​+
  
-We can use more than one arguments. For example let's create a function that take a number1, add it to a number2, multiply the result by a number3 and then prints the result. +<code rsplus> ​
- +
-<code rsplus>+
 operations <- function(number1,​ number2, number3) { operations <- function(number1,​ number2, number3) {
   result <- (number1 + number2) * number3   result <- (number1 + number2) * number3
   print(result)   print(result)
 } }
- 
 operations(1,​ 2, 3) operations(1,​ 2, 3)
-operations(17,​ 23, 2) 
 </​code> ​ </​code> ​
  
-The expression part of our function - its body - can be virtually anything. It can be single R statements, loops, ''​if''/''​else''​ conditions etc.+==== Challenge 5 ====
  
------ +Using what you learned previously on flow control, create a function print_animal that takes an animal as argument and gives the following results:
-**Exercise 5**+
  
-Using what you learned previously on flow control, create a function **print_animal** that takes an animal as argument and gives the following results :+<code rsplus> ​
  
-<code rsplus>+print_animal ​<- function(animal) { 
 +  if (animal == "​dog"​) { 
 +    print("​woof"​) 
 +  } else if (animal == "​cat"​) { 
 +    print("​meow"​) 
 +  } 
 +}
 Scruffy <- "​dog"​ Scruffy <- "​dog"​
 Paws <- "​cat"​ Paws <- "​cat"​
 + 
 print_animal(Scruffy) print_animal(Scruffy)
-[1] "​woof"​ + 
 print_animal(Paws) print_animal(Paws)
-[1] "​meow"​ +</​code> ​ 
-</​code>​+ 
 +++++ Challenge 5: Answer |
  
-++++ Exercise 5 : Answer| +<code rsplus> ​
-<file rsplus| Exercise5 Answer>+
 print_animal <- function(animal) { print_animal <- function(animal) {
   if (animal == "​dog"​) {   if (animal == "​dog"​) {
Line 581: Line 710:
   }   }
 } }
-</file>+</code
 ++++ ++++
------ 
  
-Arguments can also be optional and be provided with a default value. This is useful when using a function ​often with the same settings ​as it prevents the need from writing all the arguments all the time but still provides the flexibility ​to be able to change ​it if needed.+==== Default values in a function ==== 
 + 
 +Arguments can also be optional and be provided with a default value. 
 + 
 +This is useful when using a function with the same settingsbut still provides the flexibility to change ​its values, ​if needed.
  
 <code rsplus> <code rsplus>
-operations <- function(number1,​ number2, number3=3) {+operations <- function(number1,​ number2, number3 = 3) {
   result <- (number1 + number2) * number3   result <- (number1 + number2) * number3
   print(result)   print(result)
 } }
- +  
-operations(1,​ 2, 3) # becomes ​equivalent to+operations(1,​ 2, 3) # is equivalent to
 operations(1,​ 2) operations(1,​ 2)
 operations(1,​ 2, 2) # we can still change the value of number3 if needed operations(1,​ 2, 2) # we can still change the value of number3 if needed
 +</​code> ​
  
-</​code>​ +==== Argument ''​...'' ​==== 
- +
-R also provides a special argument "**...**". It allows you to tell R that your function will accept an indefinite number of arguments. This is useful for two main things: +
- +
-  * Pass on arguments to another function used inside your function. This allows you to be able to use all the arguments of other functions without having to define them when creating yours. Let's create a function from our previous example where we plot the CO2 uptake based on the concentration. Here we will plot with 2 different colors depending on the region. Parameters to ''​plot()''​ and ''​points()''​ will be passed on via "​..."​. +
- +
-<code rsplus>+
  
-plot.CO2 <- function(CO2, ​...) { +The special argument ''​...''​ allows you to pass on arguments to another function used inside your functionHere we use ''​...'' ​to pass on arguments to ''​plot()''​ and ''​points()'' ​
-  plot(x=CO2$conc,​ y=CO2$uptake,​ type="​n",​ ...)  # We use ... to pass on arguments to plot()+
  
 +<code rsplus> ​
 +plot.CO2 <- function(CO2,​ ...) { #<<
 +  plot(x=CO2$conc,​ y=CO2$uptake,​ type="​n",​ ...) #<<
   for (i in 1:​length(CO2[,​1])){   for (i in 1:​length(CO2[,​1])){
      if (CO2$Type[i] == "​Quebec"​) {      if (CO2$Type[i] == "​Quebec"​) {
-       ​points(CO2$conc[i],​ CO2$uptake[i],​ col="​red",​ type="​p",​ ...) # same for points+       ​points(CO2$conc[i],​ CO2$uptake[i],​ col = "​red",​ type = "​p",​ ...) #<<
      } else if (CO2$Type[i] == "​Mississippi"​) {      } else if (CO2$Type[i] == "​Mississippi"​) {
-       ​points(CO2$conc[i],​ CO2$uptake[i],​ col="​blue",​ type="​p",​ ...) # same for points()+       ​points(CO2$conc[i],​ CO2$uptake[i],​ col = "​blue",​ type = "​p",​ ...) #<<
      }      }
   }   }
 } }
 +plot.CO2(CO2,​ cex.lab=1.2,​ xlab="​CO2 concentration",​ ylab="​CO2 uptake"​) #<<
 +plot.CO2(CO2,​ cex.lab=1.2,​ xlab="​CO2 concentration",​ ylab="​CO2 uptake",​ #<<
 +         ​pch=20) #<<
 +</​code> ​
  
-plot.CO2(CO2, cex.lab=1.4, xlab="​CO2 concentration",​ ylab="​CO2 uptake"​) +The special argument ''​...''​ allows you to input an indefinite number of arguments.
-plot.CO2(CO2, cex.lab=1.4,​ xlab="​CO2 concentration",​ ylab="​CO2 uptake",​ pch=20)+
  
-</​code>​ +<code rsplus>  
- +sum2 <- function(...){ ​#<< 
-> NOTE : It is really important to note that while using "​...",​ to avoid any confusion, we will have to pass arguments by name. +  args <- list(...) ​#<<
- +
-  * Allow the user to input an indefinite number of arguments. The value of each argument will then have to be checked manually. This is done by transforming "​..."​ as a list and iterating over it. Let's create a sum function that takes an indefinite number of arguments. +
- +
-<code rsplus>​ +
-sum2 <- function(...){ +
-  args <- list(...)+
   result <- 0   result <- 0
   for (i in args)  {   for (i in args)  {
Line 635: Line 761:
   return (result)   return (result)
 } }
 + 
 +sum2(2, 3) #<<
 +sum2(2, 4, 5, 7688, 1) #<<
 +</​code> ​
  
-sum2(2, 3) +==== Return values ====
-sum2(2, 4, 5, 7688, 1) +
-</​code>​+
  
-=== Return ​value ===+The last expression evaluated in a ''​function''​ becomes the return ​value.
  
-As shown in the previous example, ​if we want to be able to save the result of our function ​and be able to use it later, we have to return it at the end using ''​return()''​. It is important to note that only one return value can be given by a function. If you want to return more than one object, you will have to use composite object such as lists or dataframes. +<code rsplus>  
-Also, it is important to note that the function ends once it reaches the ''​return()'' ​keyword+myfun <- function(x) { 
 +  ​if (x < 10) { 
 +    0 
 +  } else { 
 +    10 
 +  } 
 +
 + 
 +myfun(5) 
 +myfun(15) 
 +</​code>​  
 + 
 +''​function()''​ itself returns ​the last evaluated value even without including ​''​return()'' ​function. 
 + 
 +It can be useful ​to explicitly ​''​return()'' ​if the routine should end early, jump out of the function and return a value.
  
 <code rsplus> <code rsplus>
 +simplefun1 <- function(x) {
 +  if (x<​0) ​
 +  return(x)
 +}
 +</​code> ​
 +
 +Functions can return only a single object (and text). But this is not a limitation because you can return a list containing any number of objects.
  
-returntest ​<- function(ab) { +<code rsplus>  
-  ​return (a) # The function exits here +simplefun2 ​<- function(xy) { 
-  a <- b # Not interpreted +  ​<-  
-  return (a + b# Not interpreted+  return(list("​result"​ = z, 
 +              "​x"​ = x, 
 +              "​y"​ = y))
 } }
 +</​code> ​
  
-returntest(2,​ 3) # R will by default print the return value of your function +<code rsplus>​ 
-<- returntest(23# to save it, don't forget to assign it to another variable +simplefun2(12
-c +</​code> ​ 
-</​code>​+ 
 +==== Challenge 6 ====  
 + 
 +Using what you have just learned on functions and control flow, create a function named ''​bigsum''​ that takes two arguments ''​a''​ and ''​b''​ and:
  
------ +Returns 0 if the sum of ''​a''​ and ''​b''​ is strictly less than 50; 
-**Exercise 6**+Else, returns the sum of ''​a''​ and ''​b''​
  
-Using what you learned so far on functions and control flow, create a function **bigsum** that takes two arguments **a** and **b** and : +++++ Challenge 6 Answer | 
-  * returns 0 if the sum of a and b is strictly less than 50 +
-  * returns the sum of a and b otherwise ​+
  
-++++ Exercise 6 : Answer| +Answer ​1:  
-<file rsplus| Exercise6 Answer>+          
 +<code rsplus> ​
 bigsum <- function(a, b) { bigsum <- function(a, b) {
   result <- a + b   result <- a + b
Line 675: Line 829:
   }   }
 } }
-</file>+</code>  
 + 
 +Answer 2:   
 + 
 +<code rsplus>​ 
 +bigsum <- function(a, b) { 
 +  result <- a + b 
 +  if (result < 50) { 
 +  0 
 +  } else { 
 +  result 
 +  } 
 +
 +</code
 ++++ ++++
------ 
-==== Accessibility of variables ==== 
  
-When working with flow control structures and functions, it is essential to always keep in mind where your variables ​are and whether they are defined and accessible.  +==== Accessibility of variables==== 
-Here are some tips to keep in mind+
  
-  * Variables defined inside a function ​are not accessible outside +It is essential to always keep in mind where your variables are, and whether they are defined ​and accessible:
-  * Variables ​defined ​outside a function are accessible ​inside. But it is **NEVER** a good idea to use them inside as your function might not function anymore if the variable doesn'​t exist+
  
-<code rsplus>+Variables defined *inside* a function are not accessible outside of it!
  
-rm(list=ls()) # first let's remove everything to avoid any confusion+Variables defined *outside* a function are accessible inside. But it is NEVER a good idea, as your function will not function if the outside variable is erased. ​
  
 +<code rsplus> ​
 var1 <- 3     # var1 is defined outside our function var1 <- 3     # var1 is defined outside our function
 vartest <- function() { vartest <- function() {
-  a <- 4      # a is defined inside +  a <- 4      # 'a' ​is defined inside 
-  print(a) ​   # print a+  print(a) ​   # print 'a'
   print(var1) # print var1   print(var1) # print var1
 } }
-a             # print a. It doesn't work, can be seen only inside the function+ 
 +a             # ​we cannot ​print 'a' as it exists ​only inside the function 
 vartest() ​    # calling vartest() will print a and var1  vartest() ​    # calling vartest() will print a and var1 
 +
 rm(var1) ​     # remove var1 rm(var1) ​     # remove var1
 vartest() ​    # calling the function again doesn'​t work anymore vartest() ​    # calling the function again doesn'​t work anymore
 +</​code>​
  
-</​code> ​+Use arguments then!
  
-Insteaduse arguments!! Inside ​a function, arguments names will take over other variable names.+Alsoinside ​a function, arguments names will take over other variable names.
  
-<code rsplus>+<code rsplus> ​
 var1 <- 3     # var1 is defined outside our function var1 <- 3     # var1 is defined outside our function
 vartest <- function(var1) { vartest <- function(var1) {
   print(var1) # print var1   print(var1) # print var1
 } }
 +
 vartest(8) ​   # Inside our function var1 is now our argument and takes its value vartest(8) ​   # Inside our function var1 is now our argument and takes its value
-var1          # var1 still has the same value+ 
 +var1
 </​code> ​ </​code> ​
  
-Be very careful when creating variables inside a conditionnal statement as the variable ​can never be created and cause errors+Be very careful when creating variables inside a conditionnal statement as the variable ​may never have been created and cause (sometimes unperceptible) ​errors
  
-<code rsplus>​ +It is good practice to define variables outside the conditions and then modify their values ​to avoid any problem ​ 
-a <- 3 +
-if (a > 5) { +
-  b <- 2 +
-}  +
-a + b    # Error! b doesn'​t exist +
-</​code>​ +
- +
-Usually it is good practice to define variables outside the conditions and then modify their value to avoid any problem+
  
 <code rsplus> <code rsplus>
 a <- 3 a <- 3
-b <- 0 +if (a > 5) { 
-if (a > 5) {+
   b <- 2   b <- 2
  
 +
 a + b a + b
-</​code>​ +</​code> ​
-===== Good practices and how to speed up your code =====+
  
-==== Good practices ====+If you had ''​b''​ already assigned in your environment,​ with a different value, you could have had a *bigger* problem!
  
-Here are some programming tips that can make your life easier, help achieve greater readability ​and makes sharing and reusing your code a lot less painful. Having ​clear to read code will reduce the time you'll spend to understand it so it's never time lost.+No error would have been shown and ''​+ b'' ​would have meant another thing!
  
-=== Keep a clean and nice code ===+==== Good programming practices ==== 
  
-One thing that often helps the most when reading ​programming ​code, is to have a nicely formatted code, well spaced and well indented code. Some programming stardards exist to help achieve a great consistency but usually it ends up to one's personnal preferences. Here are some tips that can help: +Why should I care about programming ​practices?
-  * Use spaces between and after your operators +
-  * Use consistentely the same assignation operator. `<-` is often preferred, `=` is ok but don't switch all the time between the two +
-  * Use brackets when using flow control statements, even if it's for one line. Each statement inside brackets should be indented by two spaces. The closing brackets would usually be all by themselves on a seperate line, except when preceding an else statement. This helps greatly when trying to identify where we are, especially with a lot of nested loops/​conditions. +
-  * Define each variable on its own line +
  
-Here is some hard to read code+- To make your life easier; 
 + 
 +- To achieve greater readability and makes sharing and reusing your code a lot less painful; 
 + 
 +- To reduce the time you will spend to understand your code. 
 + 
 +==== Pay attention to the next tips! ==== 
 + 
 +** Keep a clean and nice code ** 
 + 
 +Proper indentation and spacing ​is the first step to get an easy to read code
 + 
 +  * Use spaces between and after you operators 
 +  * Use consistently the same assignation operator. <- is often preferred. = is OK, but do not switch all the time between the two 
 +  * Use brackets when using flow control statements:​ 
 +    * Inside brackets, indent by *at least* two spaces; 
 +    * Put closing brackets on a separate line, except when preceding an `else` statement.  
 +  * Define each variable on its own line. 
 + 
 +This code is not spaced, and therefore hard to read. All brackets are badly aligned, and it looks "​messy"​. ​
 <code rsplus> <code rsplus>
 a<-4;b=3 a<-4;b=3
 if(a<b){ if(a<b){
-if(a==0)print("​a zero") } else { +if(a==0)print("​a zero"​)}else{ 
-if(b==0){print("​b zero"​)} else print(b)} +if(b==0){print("​b zero"​)}else print(b) 
-</​code>​+</​code> ​
  
-Here is a little easier-to-read version. It takes more space but it is easier to see the flow of the code.+This looks more organized, no?
  
 <code rsplus> <code rsplus>
Line 772: Line 948:
   }   }
 } }
-</​code>​+</​code> ​
  
-There are some coding styles that can be found around the internet. Here is an example that provides a clear code : [[https://​google.github.io/​styleguide/​Rguide.xml|https://​google.github.io/​styleguide/​Rguide.xml]] 
  
-=== Use functions ​whenever possible ​===+==== Use functions ​to simplify your code ==== 
  
-Now that you know how to create functions, try to use them everytime you can. Everytime you see some portion of code that is repeated more than two times in your script, you should be thinking "​Hmmm... would it not be better to write a function ​instead?"​. If only a part of the code change, try thinking of ways to insert them as arguments inside a function instead. This would reduce the number of errors done by copying/​pasting and the time needed to correct them.+Write your own function:
  
-Let's modify the example from exercise 3 and suppose that all CO2 uptake from Mississipi were overestimated by 20 and Quebec underestimated by 50We could write this+1.- When portion of the code is repeated more than twice in your script;
  
-<code rsplus>+2.- If only a part of the code changes and includes options for different arguments. 
 + 
 +This would also reduce the number of potential errors done by copy-pasting,​ and the time needed to correct them. 
 + 
 +------------- 
 + 
 +Let's modify the example from Challenge #3 and suppose that all $CO_2$ uptake from Mississipi plants was overestimated by 20 and Quebec underestimated by 50. We could write this:  
 + 
 +<code rsplus> ​
 for (i in 1:​length(CO2[,​1])) { for (i in 1:​length(CO2[,​1])) {
   if(CO2$Type[i] == "​Mississippi"​) {   if(CO2$Type[i] == "​Mississippi"​) {
Line 793: Line 976:
   }   }
 } }
-</​code>​+</​code> ​
  
-Or we could do this instead.+Or this
  
-<code rsplus>​ +<code rsplus>  
-recalibrate <- function(CO2,​ type, bias) {+recalibrate <- function(CO2,​ type, bias){
   for (i in 1:​nrow(CO2)) {   for (i in 1:​nrow(CO2)) {
     if(CO2$Type[i] == type) {     if(CO2$Type[i] == type) {
Line 804: Line 987:
     }     }
   }   }
-  ​# we have to return our new dataset because the original is not modified +  return(CO2)
-  ​return (CO2)+
 } }
 +</​code> ​
 +
 +<code rsplus>
 newCO2 <- recalibrate(CO2,​ "​Mississipi",​ -20) newCO2 <- recalibrate(CO2,​ "​Mississipi",​ -20)
-# Note that we recalibrate our newCO2 dataset here because the original CO2 is not modified 
 newCO2 <- recalibrate(newCO2,​ "​Quebec",​ +50) newCO2 <- recalibrate(newCO2,​ "​Quebec",​ +50)
 +</​code> ​
  
-</​code>​+==== Use meaningful names for functions ====
  
-And nowwe realize that what we modified was not the uptake ​but the concentration... We now have to change all occurences of //​CO2\$conc[i]//​ by //​CO2\$uptake[i]//​. In the first case, it means we have to change it 4 times, and only 2 times in the second one! (At this point, you might think that this is not much and you can do it with a simple search/​replace and you might be right but this is just a simple example! Imagine you had 10 locations instead of 2. A good programmer is a lazy one. Also, admit it, it looks way cooler with a function...) +Same function as before, but with vague names
- +
  
-=== Give meaningful variable and function names === +<code rsplus> ​
- +
-This help to see at first glance what does what. Be extra careful when choosing your argument names when creating a function since it's what users will see. However it is also good to choose short names to avoid having to type them all the time and making typos so a good balance should be reached. +
- +
-Here is what our previous example could look like with vague names. It now requires a little more work to understand what this function does. +
- +
-<code rsplus>+
 rc <- function(c, t, b) { rc <- function(c, t, b) {
   for (i in 1:nrow(c)) {   for (i in 1:nrow(c)) {
Line 831: Line 1009:
   return (c)   return (c)
 } }
-</​code>​ 
- 
-=== Comments === 
- 
-Even with meaningful names, it's never a bad thing to add comment to describe everything your code does, be it the purpose of a function, how to use its arguments or a detailed step by step of the function. ​ 
- 
-<code rsplus> 
-## recalibrates the CO2 dataset by modifying the CO2 uptake concentration 
-## by a fixed amount depending on the region of sampling 
-# Arguments 
-# CO2: the CO2 dataset 
-# type: the type that need to be recalibrated. Values: "​Mississippi"​ or "​Quebec"​ 
-# bias: the amount to add to the concentration uptake. Use negative values for overestimations ​ 
-recalibrate <- function(CO2,​ type, bias) { 
-  for (i in 1:​nrow(CO2)) { 
-    if(CO2$Type[i] == type) { 
-      CO2$uptake[i] <- CO2$uptake[i] + bias  
-    } 
-  } 
-  # we have to return our new dataset because the original is not modified 
-  return (CO2) 
-} 
- 
-</​code>​ 
- 
-===== 3. Speeding up your code ===== 
- 
-Here are some programming tips to program more efficiently with R and achieve greater performance and faster code. Note that before optimizing your code, you should always make sure to have a working code beforehand. A slow working code is always better than a fast broken implementation. Also, sometimes, optimizing is just not the way to go. If you spend 2 hours rewritting code to gain a few seconds, then it might just not be worth it... 
- 
-==== Before we start : profiling our code ==== 
- 
-If we want to optimize our code, we will need to know how much time each task takes to perform. ​ 
- 
-The simplest way to do that is to use the function ''​system.time(//​expression//​)''​ 
- 
-<code rsplus> 
-system.time({ 
-a <- 0 
-  for (i in 1:1000) { 
-    a <-  a + i 
-  } 
-}) 
-</​code>​ 
- 
-Note that most of the time R works really fast and you will need to have some heavy computing to do or your time might not even be recorded. It is usually recommended to perform iterations of the task you want to profile or work with really big datasets. 
- 
-<code rsplus> 
-system.time(replicate(1000,​ { 
-  a <- 0 
-  for (i in 1:1000) { 
-    a <-  a + i 
-  } 
-})) 
-</​code>​ 
- 
-One other simple and useful tool is the function ''​Rprof()''​. The main advantage of ''​Rprof()''​ is that it saves information about time spent in each function in a file that you can access later. Here's how to use it. 
- 
-<code rsplus> 
-Rprof("​profile.txt"​) ​ # you can change profile.txt by the filename you want 
-for (i in 1:1000) { 
-    a <- 0 
-    for (i in 1:1000) { 
-      a <-  a + i 
-    } 
-  } 
-Rprof() ​              # This ends the profiling 
-summaryRprof("​profile.txt"​) ​ # Use the filename previously recorded to display the summary of the tasks 
-</​code>​ 
- 
-Finally, if you want to compare the efficiency or several functions, a very good tool is the package microbenchmark 
- 
-<code rsplus> 
-install.packages("​microbenchmark"​) 
-library(microbenchmark) 
- 
-f1 <- function() { 
-  a <- 0 
-  for (i in 1:1000) { 
-    a <-  a + i 
-  } 
-} 
- 
-microbenchmark(f1(),​ times=1000) # the argument times allow us to determine how many iterations we want 
-</​code>​ 
- 
-==== First step : thinking a bit! ==== 
- 
-When you look at your code, often you will realize that there are simpler, way more efficient ways to do what you want and that some operations can be easily removed for added speed. 
- 
-For example, let's create a function that takes a number **a**. We will add **a** to every number from 1 to 100, and if **a** is less than 5, then we will add 2*a instead. Then we will calculate the sum of all the elements of the sequence. 
- 
-Here's a way to do it 
- 
-<code rsplus> 
-f2 <- function(a) { 
-  # initialize our result 
-  result <- 0 
-  # iterate on the sequence from 1 to 100 
-  for (i in 1:100) { 
-    if (a < 5) { 
-      # a is < 5, we add 2*a to the sequence element. We save it in result ​ 
-      result <- result + i + (2*a) 
-    } else { 
-      # a is >= 5, we add only a  
-      result <- result + i + a 
-    } 
-  } 
-  return(result) 
-} 
-f2(4) 
 </​code> ​ </​code> ​
  
-This is ok and does what we want. However we have a lot of useless steps in our code. For example, we don't really need to perform our condition on each iteration, as the result will always be the same. +==== Use comments ==== 
-So let's take it out of the loop+
  
-<code rsplus>​ +Add comment to describe what your code doeshow to use its arguments or detailed step-by-step description ​the function.
-f3 <- function(a) { +
-  # initialize our result +
-  result <- 0 +
-   +
-  # Check if a < 5. If true, a becomes 2*a +
-  if (a < 5) { +
-   a <2 * a +
-  }  +
-  # We don't even need an else here since a remains ​the same otherwise +
-   +
-  # iterate on the sequence from 1 to n +
-  for (i in 1:100) { +
-      result <- result + i + a  +
-  } +
-  return(result) +
-+
- +
-f3(4) +
-microbenchmark(f2(4),​  +
-               ​f3(4),​ times=1000) +
- +
-</​code>​  +
- +
-This is just a simple modification but here we sped up our code by almost 40% (results may vary depending on computers)Moreover, our code is easier to read and to understand. Sometimes, we can gain speed and easier code just by thinking a little more of where our conditional statements can be and what they test. +
- +
-But, using the strengths of R, we can do even better!+
  
 <code rsplus> <code rsplus>
-f4 <- function(a, n) { 
-  result <- 0 
-  ​ 
-  if (a < 5) { 
-    a <- a + 1 
-  }  
-  result <- sum(1:n + a) 
-  return(result) 
-} 
  
-f4(4) +# Recalibrates the CO2 dataset by modifying the CO2 uptake concentration 
-microbenchmark(f3(4),​  +# by a fixed amount depending on the region of sampling
-               ​f4(4),​ times=1000)+
  
-</​code>​+# Arguments 
 +# CO2: the CO2 dataset 
 +# type: the type ("​Mississippi"​ or "​Quebec"​) that need to be recalibrated.  
 +# bias: the amount to add or remove to the concentration uptake
  
-Wow, here the improvement is much more efficient... But how exactly did this happen? This leads us to our next point 
- 
-==== Vectorisation ==== 
- 
-This part is a reminder of what you probably already learned in the first workshops. However, this is usually overlooked and forgotten and the bad performance of R code can often be attributed to bad vectorisation. 
-R is meant to function with vectors and many functions in R are optimized for vectorisation. To understand that, it is first important to understand how R works at a low level. R is an interpreted language, which means that when you execute R code, you are in reality sending your code to functions programmed in another language (the C language). This slows down the execution of your programs since first you have to interpret the R code then pass it on to other functions. When you create a loop in R, you have to decode every iteration and pass it on. 
-Vectorised functions on the other hand are functions that perform operations directly on a vector. Essentially,​ they also perform a loop on your vector, but the big difference is that they do it in C, not R and are therefore much faster. ''​sum()''​ is one of these functions. ​ 
-One of the biggest challenge with R is learning to think and program with vectors and not with single elements. For example, most of the basic operations can be done on vectors. 
- 
-<code rsplus> 
-v1 <- 1:5 
-v2 <- 2:6 
-v3 <- 1:3 
-v1 + 2      # Addition on a vector : adds 2 to all elements 
-v1 + v2     # Adds each element of v2 to v 
-v1 + v3     # v1 and v3 are not the same length, then we add from the start of v3 again 
-sum(v1) ​    # Adds all elements of v1 together 
-sum(v1, v2) # Sums all elements of v1 and v2  
-mean(v1) ​   # Average of elements in v1 
-mean(c(v1, v2)) # Average of elements of v1 and v2. Unlike sum, we have to combine them beforehand 
-</​code>​ 
- 
-=== Subsetting === 
- 
-To vectorise efficiently,​ it is also important to be able to extract values from our data quickly. 
-To apply treatment on specific elements of a vector or a dataframe, R offers a subsetting tool that can be sometimes way more efficient and easy to write than a mix of loops and conditions. 
- 
-Subsetting is done via the ''​[''​ and the ''​$''​ (for a dataframe) operators. We can insert directly our conditions inside the ''​[]''​ part of the subsetting to quickly extract values from our data. We can also use the function ''​which()''​ to test for a condition. ''​which()''​ returns the indexes of the elements that matches the condition. ​ 
- 
-<code rsplus> 
-v1 <- 1:10 
-v1[7]      # Extracts the 7th value 
-v1[v1 > 5] # Extracts values > 5 only 
-v1[which(v1 > 5)]  # same as before 
-</​code>​ 
- 
-In dataframes, **$** allows to access a column by name. We can also do this by providing the name of the column directly 
- 
-<code rsplus> 
-CO2 <- read.csv("​co2_good.csv"​) 
-CO2$Type ​ # Prints columns Type 
-CO2[, "​Type"​] # Same as above 
-CO2[CO2$Type == "​Quebec",​ ] #Extracts all rows of the CO2 dataset where the Type is "​Quebec" ​ 
-</​code>​ 
- 
- 
------ 
-**Exercise 7** 
- 
-Create a new function ''​recalibrate2()''​ that is a rewrite the function recalibrate seen earlier using subsetting and vectorisation techniques. The new function should not be longer than 3 lines.\\ ​ 
-Reminder: 
- 
-<code rsplus> 
 recalibrate <- function(CO2,​ type, bias) { recalibrate <- function(CO2,​ type, bias) {
   for (i in 1:​nrow(CO2)) {   for (i in 1:​nrow(CO2)) {
Line 1050: Line 1031:
     }     }
   }   }
-  return (CO2)+  return(CO2)
 } }
-</​code>​ 
  
-++++ Exercise 7 : Answer| 
-<code rsplus> 
-recalibrate2 <- function(CO2,​ type, bias) { 
-  ​ 
-  # First get the indexes of the data with the good type 
-  # Thinking tip : since we use the indexes twice below, instead of using which() twice, let's 
-  # do it only once and save the result! 
-  idx <- which(CO2$Type == type) 
-  ​ 
-  # Modify only the data concerned using indexes. 
-  CO2$uptake[idx] <- CO2$uptake[idx] + bias 
-  return (CO2) 
-} 
- 
-# Check the results are the same 
-all.equal(recalibrate(CO2,​ "​Quebec",​ 20), recalibrate2(CO2,​ "​Quebec",​ 20)) 
- 
-# Check that this is indeed way faster 
-microbenchmark(recalibrate(CO2,​ "​Quebec",​ 20), 
-               ​recalibrate2(CO2,​ "​Quebec",​ 20)) 
 </​code>​ </​code>​
-++++ 
------ 
  
-==== Growing objects ==== +==== Thank you for attending ​this workshop! ====
- +
-Vectorising is good, but sometimes it is hard to do and it might take you more time to do than just do a simple loop. Sometimes you just need loops and you should use them when you can. But when using them, one thing you should be aware of if you want decent performance are growing objects. That is objects that are getting bigger and bigger with each iteration.  +
-Let's illustrate this really simply by creating a function that iterates over a sequence and create a vector with it. We will compare 2 ways of doing that : by growing our object and by preallocating our result and just modify it.  +
- +
-<code rsplus>​ +
-growing <- function(n) { +
-  # declare our result +
-  result <- NULL +
-  ​for (i in 1:n) { +
-    # create our result by growing our object +
-    result <- c(result, i) +
-  } +
-  return(result) +
-+
- +
-growing2 <- function(n) { +
-  # declare our result : here we create a vector of length n with 0 in it +
-  result <- numeric(n) +
-  for (i in 1:n) { +
-    # now we just modify our value instead of recreating the vector +
-    result[i] <- i +
-  } +
-  return(result) +
-+
-</​code>​ +
- +
-Now let's compare their speeds +
- +
-<code rsplus>​ +
-system.time({ +
-  growing(10000) +
-}) +
-system.time({ +
-  growing2(10000) +
-}) +
-</​code>​ +
- +
-With a vector of 10000 elements, speeds are still comparable and under the second. Now let's just use 50000 elements +
- +
-<code rsplus>​ +
-system.time({ +
-  growing(50000) +
-}) +
-system.time({ +
-  growing2(50000) +
-}) +
-</​code>​ +
- +
-By multiplying by only 5, it suddenly takes us several seconds just to create a single vector while modifying ​ preallocated vector is still almost instantaneous. What happened here? The reason is that when you call a function, arguments are first copied before being passed on to your function. So when you write ''​result <- c(result, i)'' ​ each time, **result** is copied before being passed on to ''​c()''​. As result grows which each iteration, in each iteration it takes more and more time to copy it. The bigger the end object, the more time it will take. This is why it is always better to create your result object before your loop if you already know what size it will be. +
- +
-This is especially valid when working with dataframes and functions such as ''​rbind()''​ and ''​cbind()''​. +
-Unfortunately,​ preallocating dataframes does not work that well and there is a better way, albeit a little more complicated. It is to store each row (or column) in a preallocated list first and then call ''​rbind()''​ (or ''​cbind()''​)on all elements via the function ''​do.call()''​. The function ''​do.call()''​ allows you to execute a named function on a list of arguments. This way ''​rbind()''​ is called only once, at the end, which removes the problem of copying the growing object each time we call it. +
- +
-<code rsplus>​ +
- +
-growingdf <- function(n, row) { +
-  # preallocate our dataframe +
-  df <- data.frame(numeric(n),​ character(n),​ stringsAsFactors=FALSE) +
-  for (i in 1:n) { +
-    # replace the ith row with row +
-    df[i,] <- row  +
-  } +
-  return(df) +
-+
- +
-growingdf2 <- function(n, row) { +
-  # this is the way to allocate a list with n elements +
-  df <- vector("​list",​ n) +
-  for (i in 1:n) { +
-    # put row in the ith element +
-    df[[i]] <- row  +
-  } +
-  return(do.call(rbind,​ df)) +
-+
- +
-# store our row in a list since we have different types +
-row <- list(1, "Hello World"​) +
-microbenchmark(growingdf(5000,​ row), +
-               ​growingdf2(5000,​ row), +
-               ​times=10) +
-</​code>​ +
- +
- +
- +
-======= 4. Quick introduction to useful packages in R ======= +
- +
- +
-===== Knitr ===== +
-Knitr is a package that can be used to generate dynamic reports or web pages from R code. The code is evaluated at the moment the report is generated.  +
- +
-Code can be easily written in RStudio use the Markdown language. :  +
- +
-++++ Example Markdown code | +
-<code rsplus>​ +
---- +
-title: "​Interesting packages for R / QCBS Workshop on R programming"​ +
-output: html_document +
---- +
- +
-### Data table +
-A package to facilitate and to improve the efficiency of certain operations in R. +
-```{r} +
-library(data.table) +
-mydf=data.frame(a=rep(LETTERS,​each=1e5),​b=rnorm(26*1e5)) +
-mydt=data.table(mydf) +
-setkey(mydt,​a) # We set the column that will be used as a key for the data table +
- +
-``` +
-Returns all rows with column a (the key) equal to F  +
-````{r} +
-mydt['​F'​] +
-```` +
- +
-Gives the mean value of column b for each letter in column a.  +
-```{r} +
-mydt[,​mean(b),​by=a] +
-``` +
- +
-Compare +
-```{r} +
-system.time(t1<​-mydt[,​mean(b),​by=a]) +
-``` +
- +
-### With tapply() +
-```{r} +
-system.time(t2<​-tapply(mydf$b,​mydf$a,​mean)) +
-``` +
- +
-### With reshape2 +
-```{r message=FALSE} +
-library(reshape2) +
-meltdf=melt(mydf) +
-system.time(t3<​-dcast(meltdf,​a~variable,​mean)) +
-``` +
- +
-### With plyr +
-```{r} +
-library(plyr) +
-system.time(t4<​-ddply(mydf,​.(a),​summarize,​mean(b))) +
-``` +
- +
-### With dplyr +
-```{r message=FALSE} +
-library(dplyr) +
-ti1<​-proc.time() +
-groups <- group_by(mydf,​ a) +
-t4b <- summarise(groups,​ total = mean(b)) +
-eltime<​-proc.time()-ti1 +
-eltime +
-``` +
- +
-### With sqldf +
-```{r} +
-library(sqldf) +
-system.time(t5<​-sqldf('​SELECT a, avg(b) FROM mydf GROUP BY a')) +
-``` +
- +
-### With a for loop +
-```{r} +
-ti1<​-proc.time() +
-t6<​-data.frame(letter=unique(mydf$a),​mean=rep(0,​26)) +
-for (i in t6$letter ){ +
-  t6[t6$letter==i,​2]=mean(mydf[mydf$a==i,​2]) +
-+
-eltime<​-proc.time()-ti1 +
-eltime +
-``` +
- +
- +
-### With a parallelized FOR loop +
-```{r} +
-library(foreach) +
-library(doMC) +
-registerDoMC(4) #Four-core processor +
-ti1<​-proc.time() +
-t7<​-data.frame(letter=unique(mydf$a),​mean=rep(0,​26)) +
-t7[,2] <- foreach(i=t7$letter,​ .combine='​c'​) %dopar% { +
- ​mean(mydf[mydf$a==i,​2]) +
-+
-eltime<​-proc.time()-ti1 +
-eltime +
-``` +
- +
- +
- +
-### RgoogleMaps +
-```{r message=FALSE} +
-library(RgoogleMaps) +
-myhome=getGeoCode('​Olympic Stadium, Montreal'​);​ +
-mymap<​-GetMap(center=myhome,​ zoom=14) +
-PlotOnStaticMap(mymap,​lat=myhome['​lat'​],​lon=myhome['​lon'​],​cex=5,​pch=10,​lwd=3,​col=c('​red'​));​ +
-``` +
- +
- +
-### Taxize +
-```{r message=FALSE,​ warning=FALSE} +
-library(taxize) +
-spp<​-tax_name(query=c("​american beaver"​),​get="​species",​ db = '​ncbi'​) +
-spp +
-fam<​-tax_name(query=c("​american beaver"​),​get="​family",​ db = '​ncbi'​) +
-fam +
-correctname <- tnrs(c("​fraxinus americanus"​)) +
-correctname +
-cla<​-classification("​acer rubrum",​ db = '​itis'​) +
-cla +
-``` +
- +
-### spocc +
-```{r message=FALSE} +
-library(spocc) +
-occ_data <- occ(query = 'Acer nigrum',​ from = '​gbif'​) +
-mapggplot(occ_data) +
- +
-``` +
- +
- +
-### Combine spocc and RgoogleMaps +
-```{r message=FALSE,​ warning=FALSE} +
-occ_data <- occ(query = 'Puma concolor',​ from = '​gbif'​) +
-occ_data_df=occ2df(occ_data) +
-occ_data_df<​-subset(occ_data_df,​!is.na(latitude) & latitude!=0) +
-mymap<​-GetMap(center=c(mean(occ_data_df$latitude),​mean(occ_data_df$longitude)),​ zoom=2) +
-PlotOnStaticMap(mymap,​lat=occ_data_df$latitude,​lon=occ_data_df$longitude,​cex=1,​pch=16,​lwd=3,​col=c('​red'​));​ +
-``` +
- +
- +
-### geonames +
-```{r message=FALSE} +
-library(geonames) +
-options(geonamesUsername="​glaroc"​) +
-res<​-GNsearch(q="​Mont Saint-Hilaire"​) +
-res[,​c('​toponymName','​fclName'​)] +
-dc<​-GNcities(45.4,​ -73.55, 45.7, -73.6, lang = "​en",​ maxRows = 10) +
-dc[,​c('​toponymName'​)] +
-``` +
-</​code>​ +
- +
-++++ +
-[[http://​qcbs.ca/​wp-content/​uploads/​2014/​12/​QCBS_Workshop8_knitr.html|View the resulting web page]].  +
- +
- +
-===== magrittr ===== +
-This packages provides a mechanism for chaining commands with a new forward-pipe operator, %>%. +
-<code rsplus>​ +
-iris %>% +
-subset(Sepal.Length > mean(Sepal.Length)) %$% +
-cor(Sepal.Length,​ Sepal.Width) +
-</​code>​ +
-===== Data Table ===== +
-[[http://​cran.r-project.org/​web/​packages/​data.table/​index.html|Data table]] is a very useful package which can facilitate and improve the efficiency of certain operations in R. Data tables are just like data frames. You can even create them from data frames.  +
- +
-[[http://​cran.r-project.org/​web/​packages/​data.table/​vignettes/​datatable-intro.pdf|Introduction to Data table (PDF)]] +
- +
-<code rsplus>​ +
-install.packages('​data.table'​) +
-library(data.table) +
-</​code>​ +
- +
-Generate very long data frame with one column with letters, and one column with random numbers +
-<file rsplus>​ +
-mydf<​-data.frame(a=rep(LETTERS,​each=1e5),​b=rnorm(26*1e5)) +
-</​file>​ +
- +
-Convert the data frame to a data table format.  +
-<file rsplus>​ +
-mydt<​-data.table(mydf) +
-</​file>​ +
- +
-Each data table has to be assigned a key, which is one (or more) of the columns from the table. This key defines the basis for the organization and the sorting of the table.  +
-<file rsplus>​ +
-setkey(mydt,​a) +
-</​file>​ +
- +
-Once the key is set, we can return all rows with column a (the key) equal to F  +
-<file rsplus>​ +
-mydt['​F'​] +
-</​file>​ +
- +
-Gives the mean value of column b for each letter in column a.  +
-<file rsplus>​ +
-mydt[,​mean(b),​by=a] +
-</​file>​ +
- +
-Let's compare the performance of Data table with other methods to achieve the same thing.  +
-<file rsplus>​ +
-system.time(t1<​-mydt[,​mean(b),​by=a]) +
-</​file>​ +
- +
-**With tapply()** +
-<file rsplus>​ +
-system.time(t2<​-tapply(mydf$b,​mydf$a,​mean)) +
-</​file>​ +
- +
-**With [[http://​cran.r-project.org/​web/​packages/​reshape2/​index.html|reshape2]]** +
- +
-**NOTE**: plyr and reshape2 where covered in [[r_workshop3|Workshop 3]].  +
-<file rsplus>​ +
-library(reshape2) +
-meltdf<​-melt(mydf) +
-system.time(t3<​-dcast(meltdf,​a~variable,​mean)) +
-</​file>​ +
- +
-**With [[http://​cran.r-project.org/​web/​packages/​plyr/​index.html|plyr]]** +
-, a set of tools to split up a data into homogeneous pieces, apply a function to each piece and combine all the results back together. +
-<file rsplus>​ +
-library(plyr) +
-system.time(t4<​-ddply(mydf,​.(a),​summarize,​mean(b))) +
-</​file>​ +
- +
-**With [[http://​cran.rstudio.com/​web/​packages/​dplyr/​vignettes/​introduction.html|dplyr]]** +
-, a newer and faster version of plyr, which is adapted to work only on data frames.  +
-<file rsplus>​ +
-library(dplyr) +
-ti1<​-proc.time() +
-groups <- group_by(mydf,​ a) +
-t5 <- summarise(groups,​ total = mean(b)) +
-eltime<​-proc.time()-ti1 +
-</​file>​ +
- +
-**With [[http://​cran.r-project.org/​web/​packages/​sqldf/​index.html|sqldf]]**. This package allows one to write Structured Query Language commands to perform queries on data frames.  +
-<file rsplus>​ +
-library(sqldf) +
-system.time(t6<​-sqldf('​SELECT a, avg(b) FROM mydf GROUP BY a')) +
-</​file>​ +
- +
-**With a basic FOR loop** +
-<file rsplus>​ +
-ti1<​-proc.time() +
-# Initialize an empty data frame with two columns and 26 rows +
-t7<​-data.frame(letter=unique(mydf$a),​mean=rep(0,​26)) +
-for (i in t6$letter ){ +
-  t7[t7$letter==i,​2]=mean(mydf[mydf$a==i,​2]) +
-+
-eltime<​-proc.time()-ti1 +
-eltime +
-</​file>​ +
- +
-**With a parallelized FOR loop** +
-Use [[http://​cran.r-project.org/​web/​packages/​foreach/​index.html|foreach]] and [[http://​cran.r-project.org/​web/​packages/​doMC/​index.html|doMC]] packages to run sections of code in parallel on computers with multiple cores. This is particularly suited to speed up some calculations involving FOR loops in which every iteration can be run independently of other iterations. Note that the doMC package may not work on computers running Windows. It should work on Linux or Mac OSX.  +
-<file rsplus>​ +
-library(foreach) +
-library(doMC) +
-registerDoMC(4) #Four-core processor +
-ti1<​-proc.time() +
-t8<​-data.frame(letter=unique(mydf$a),​mean=rep(0,​26)) +
-t8[,2] <- foreach(i=t8$letter,​ .combine='​c'​) %dopar% { +
- ​mean(mydf[mydf$a==i,​2]) +
-+
-eltime<​-proc.time()-ti1 +
-eltime +
-</​file>​ +
-===== RgoogleMaps! ===== +
-The [[http://​cran.r-project.org/​web/​packages/​RgoogleMaps/​index.html|RgoogleMaps]] package allows to very simply show Google maps or Google Satellite images in R, centered and zoomed on a location of your choice. You can also relatively easily overlay some spatial data from your R workspace on the map (see below). The getGeocode function tranforms a text search for a postal code or place name to latitude, longitude coordinates using Google web services.  +
-<file rsplus>​ +
-library(RgoogleMaps) +
-myhome=getGeoCode('​McGill University, Montreal'​);​ +
-mymap<​-GetMap(center=myhome,​ zoom=14) +
-PlotOnStaticMap(mymap,​lat=myhome['​lat'​],​lon=myhome['​lon'​],​cex=5,​pch=10,​lwd=3,​col=c('​red'​));​ +
-</​file>​ +
-===== Taxize ===== +
-The [[http://​ropensci.org/​|rOpenSci]] project supports the development of a number of R packages to facilitate access to a number of online data sources. Among them is the package [[http://​cran.r-project.org/​web/​packages/​taxize/​index.html|Taxize]],​ which can be used to get taxonomic information from many different databases, including taxonomic synonyms, hierarchies,​ common names, and more, from a dozen different sources. +
-<file rsplus>​ +
-library(taxize) +
-spp<​-tax_name(query=c("​american beaver"​),​get="​species"​) +
-fam<​-tax_name(query=c("​american beaver"​),​get="​family"​) +
-correctname <- tnrs(c("​fraxinus americanus"​)) +
-cla<​-classification("​acer rubrum",​ db = '​itis'​) +
-</​file>​ +
-===== Spocc ===== +
-Another useful package from rOpenSci is [[http://​cran.r-project.org/​web/​packages/​spocc/​index.html|Spocc]] which can be used to search for species occurrence data from a number of sources, including the [[http://​www.gbif.org/​|Global Biodiversity Information Facility]], a worldwide database containing hundreds of millions of species occurrences from collections and field data.  +
-<file rsplus>​ +
-library(spocc) +
-occ_data <- occ(query ​'Acer nigrum',​ from '​gbif'​) +
-mapggplot(occ_data) +
-</​file>​+
  
-Combine spocc and RgoogleMaps 
-<file rsplus> 
-occ_data <- occ(query = 'Puma concolor',​ from = '​gbif'​) 
-occ_data_df=occ2df(occ_data) 
-occ_data_df<​-subset(occ_data_df,​!is.na(latitude) & latitude!=0) 
-mymap<​-GetMap(center=c(mean(occ_data_df$latitude),​mean(occ_data_df$longitude)),​ zoom=2) 
-PlotOnStaticMap(mymap,​lat=occ_data_df$latitude,​lon=occ_data_df$longitude,​cex=1,​pch=16,​lwd=3,​col=c('​red'​));​ 
-</​file>​ 
-===== geonames ===== 
-[[http://​geonames.r-forge.r-project.org/​|Geonames]] connects R to [[http://​geonames.org|Geonames.org]],​ an online database of place names and toponyms. ​ 
-<file rsplus> 
-library(geonames) 
-options(geonamesUsername="​glaroc"​) 
-# Retrieve place names which contain the name "Mont Saint-Hilaire"​ 
-res<​-GNsearch(q="​Mont Saint-Hilaire"​) 
-res[,​c('​toponymName','​fclName'​)] 
-#Extract all citites within a bouding box defined by the coordinates of the four corners. ​ 
-dc<​-GNcities(45.4,​ -73.55, 45.7, -73.6, lang = "​en",​ maxRows = 10) 
-dc[,​c('​toponymName'​)] 
-</​file>​