Differences

This shows you the differences between two versions of the page.

r_workshop3 [2018/01/12 15:15]
mariehbrice [3. Data manipulation with dplyr]
r_workshop3 [2018/11/03 10:09] (current)
mariehbrice [Workshop 3: Intro to ggplot2]
Line 7: Line 7:
  
 //The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page//
-====== Workshop 3: Intro to ggplot2, tidyr & dplyr ======
+====== Workshop 3: Intro to ggplot2 ======
  
 Developed by: Xavier Giroux-Bougard, Maxwell Farrell, Amanda Winegardner, Étienne Low-Decarie and Monica Granados
  
-**Summary:** In this workshop we will build on the data manipulation and visualization skills you have learned in base R by introducing three additional R packages: ggplot2, tidyr and dplyr. We’ll learn how to use ggplot2, an excellent plotting alternative to base R that can be used for both diagnostic and publication quality plots. We will then introduce tidyr and dplyr, two powerful tools to manage and re-format your dataset, as well as apply simple or complex functions on subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages.
+**Summary:** In this workshop we will build on the data manipulation and visualization skills you have learned in base R by introducing ggplot2, an excellent plotting alternative to base R that can be used for both diagnostic and publication quality plots. We will then introduce tidyr and dplyr, two powerful tools to manage and re-format your dataset, as well as apply simple or complex functions on subsets of your data. This workshop will be useful for those progressing through the entire workshop series, but also for those who already have some experience in R and would like to become proficient with new tools and packages.
  
-Link to associated Prezi: [[http://prezi.com/daz9r0cj1si4/|Prezi]]
+**Link to new [[https://qcbsrworkshops.github.io/Workshops/workshop03/workshop03-en/workshop03-en.html|Rmarkdown presentation]]**
  
-Download the R script for this lesson: {{:workshop3.r|Script}}
+Link to old [[http://prezi.com/daz9r0cj1si4/|Prezi presentation]]
+
+Download the [[https://raw.githubusercontent.com/QCBSRworkshops/Workshops/dev/workshop03/workshop03-en/script_workshop3.r|R script]] for this lesson.
 ===== 1. Plotting in R using the Grammar of Graphics (ggplot2) =====
  
Line 499: Line 501:
 While hardcore programmers might laugh at you for using a GUI, there is no shame in using them! Jeroen Schouten, who is about as hardcore a programmer as you can get, understood that the learning curve for beginners could be steep and so designed an online [[http://rweb.stat.ucla.edu/ggplot2/|ggplot2 GUI]]. While it is not as fully functional as coding the grammar of graphics, it is very complete. You can import from Excel, Google spreadsheets, or any data format, and build a few plots using some tutorial videos. The great part is that it shows you the code you have generated to build your figure, which you can copy and paste into R as a skeleton on which to add some meat using more advanced features such as themes.
 {{::yeroon.png?600|}}
- 
  
  
  
 ---- ----
- 
----- 
- 
-===== 2. Using tidyr to reshape data frames ===== 
- 
-{{:​tidyrsticker.png?​200|}} 
-===== 2.1 Why "​tidy"​ your data? ===== 
- 
- 
-Tidying allows you to manipulate the structure of your data while preserving all original information. ​ 
-Many functions in R require or work better with a data structure that isn't the best for readability by people. ​ 
- 
-In contrast to aggregation,​ which reduces many cells in the original data set to one cell in the new dataset, tidying preserves a one-to-one connection. Although aggregation can be done with many functions in R, the tidyr package allows you to both reshape and aggregate within a single syntax. 
- 
- 
-Install / Load the ''tidyr'' package:
-<code rsplus | >
-if(!require(tidyr)){install.packages("tidyr")}
-library(tidyr)
-</code>
- 
-==== Data ==== 
- 
-In addition to ''​iris''​ and ''​CO2'',​ we will use the built-in dataset ''​airquality''​ for this part of the workshop 
- 
-Explore the datasets: 
- 
-<code rsplus | > 
-?airquality 
-str(airquality) 
-head(airquality) 
-names(airquality) 
-</​code>​ 
- 
-You can also use the following code to find other datasets available in R: ''​data()''​ 
- 
- 
-===== 2.2 Wide vs long data ===== 
- 
-Let's pretend you send out your field assistant to measure the diameter at breast height (DBH) and height of three tree species for you. The result is this "​wide"​ data set.  
- 
-<code rsplus | > 
-> wide <- data.frame(Species = c("​Oak",​ "​Elm",​ "​Ash"​),​ 
-                          DBH = c(12, 20, 13), 
-                       ​Height = c(56, 85, 55)) 
-> wide 
-  Species DBH Height 
-1     ​Oak ​ 12     56 
-2     ​Elm ​ 20     85 
-3     ​Ash ​ 13     55 
-</​code>​ 
- 
-"​Long"​ format data has a column stating the measured variable types and a column containing the values associated to those variables (each column is a variable, each row is an observation). This is considered "​tidy"​ data because it is easily interpreted by most packages for visualization and analysis in ''​R''​. 
- 
-The format of your data depends on your specific needs, but some functions and packages such as ''​ggplot2''​ work well with long format data. 
- 
-Additionally,​ long form data can more easily be aggregated and converted back into wide form data to provide summaries, or check the balance of sampling designs. 
- 
-We can use the ''​tidyr''​ package to: 
- 
-  * 1."​gather"​ our data (wide --> long) 
-  * 2."​spread"​ our data (long --> wide) 
- 
- 
- 
- 
-===== 2.3 Gather: Making your data long ===== 
-<code rsplus | > 
-?gather 
-</​code>​ 
- 
-Most of the packages in the Hadleyverse require long format data, where each row is an observation and each column is a variable. Let's "gather" this wide data using the ''gather()'' function in ''tidyr''. ''gather()'' takes multiple columns and gathers them into key-value pairs. Its arguments are, in order: the data frame, the name of the new key column (which will hold the old column names), the name of the new value column, and the columns to gather.
- 
-<code rsplus | > 
-> long <- gather(wide,​ Measurement,​ Value, DBH, Height) 
-> long 
-  Species Measurement Value 
-1     ​Oak ​        DBH 12 
-2     ​Elm ​        DBH 20 
-3     ​Ash ​        DBH 13 
-4     ​Oak ​     Height 56 
-5     ​Elm ​     Height 85 
-6     ​Ash ​     Height 55 
-</​code>​ 
- 
-Let's try this with the ''CO2'' dataset. Here we might want to collapse the last two quantitative variables:
- 
-<code rsplus | > 
-CO2.long <- gather(CO2, response, value, conc, uptake) 
-head(CO2) 
-head(CO2.long) 
-tail(CO2.long) 
-</​code>​ 
- 
- 
-===== 2.4 Spread: Making your data wide ===== 
- 
-Sometimes you might want to go from long to wide.
- 
-''spread()'' uses the same syntax as ''gather()'' (they are complements):
- 
-<code rsplus | > 
-> wide2 <- spread(long,​ Measurement,​ Value) 
-> wide2 
-  Species DBH Height 
-1     ​Ash ​ 13     55 
-2     ​Elm ​ 20     85 
-3     ​Oak ​ 12     56 
-</​code>​ 
- 
----- 
-===== tidyr Challenge # 4 =====
-//Using the ''​airquality''​ dataset, ''​gather()''​ all the columns (except Month and Day) into rows. Then ''​spread()''​ the resulting dataset to return the same data format as the original data.// 
- 
-++++Solution| ​ 
-<code rsplus | > 
-air.long <- gather(airquality,​ variable, value, -Month, -Day) 
-head(air.long) 
-# Note that the syntax used here indicates we wish to gather ALL the columns except "​Month"​ and "​Day"​ 
-air.wide <- spread(air.long , variable, value) 
-head(air.wide) 
-</​code>​ 
-++++ 
- 
----- 
- 
-===== 2.5 separate: Separate two (or more) variables in a single column ===== 
- 
- 
-Sometimes you might have really messy data with two variables in one column. Thankfully, the ''separate()'' function can (wait for it) separate the two variables into two columns.
- 
-Let's say you have this really messy data set  
- 
-<code rsplus | > 
-set.seed(8) 
-really.messy <- data.frame(id = 1:4, 
-                          trt = sample(rep(c('​control',​ '​farm'​),​ each = 2)), 
-               ​zooplankton.T1 = runif(4), 
-                      fish.T1 = runif(4), 
-               ​zooplankton.T2 = runif(4), 
-                      fish.T2 = runif(4)) 
-</​code>​ 
- 
-First we want to convert this wide dataset to long  
- 
-<code rsplus | > 
-really.messy.long <- gather(really.messy,​ taxa, count, -id, -trt) 
-</​code>​ 
- 
-Then we want to split the taxon from the sampling time (T1 and T2). The syntax is ''separate(data, column to split, into = new column names, sep = separator)''. The tricky part is telling R where to split the character string: ''sep'' is a regular expression describing the separator. Here the string should be split at the period (.), which must be escaped as ''\\.'' because an unescaped period matches any character in a regular expression.
- 
-<code rsplus | >
-really.messy.long.sep <- separate(really.messy.long, taxa, into = c("species", "time"), sep = "\\.")
-</code>
  
 ===== 2.6 Combining ggplot with tidyr =====
Line 775: Line 622:
  
 {{::weather2.png?nolink|}}
- 
-===== 3. Data manipulation with dplyr ===== 
- 
-{{:​dplyrsticker.png?​200|}} 
- 
- 
-===== 3.1 Intro - the dplyr mission ===== 
- 
-The vision of the ''​dplyr''​ package is to simplify data manipulation by distilling all the common data manipulation tasks to a set of intuitive verbs. The result is a comprehensive set of tools that allows users to easily translate their thoughts into ''​R''​ code. In addition to ease of use, it is also an amazing package because: 
-  * it can crunch huge datasets wicked fast (key parts are written in ''C++'')
-  * it plays nice with the RStudio IDE and other packages in the Hadleyverse ​ 
-  * it can interface with external databases and translate your R code into SQL queries 
-  * if Batman was an R package, he would be ''​dplyr''​ (mastering fear of data, adopting cool technologies) 
-  
-===== 3.2 Basic dplyr functions ===== 
- 
-The ''​dplyr''​ package is built around a core set of "​verbs"​ (or commands). We will start with the following 4 verbs because these operations are ubiquitous in data manipulation:​ 
- 
-  * ''​select()'':​ select columns from a data frame 
-  * ''​filter()'':​ filter rows according to defined criteria 
-  * ''​arrange()'':​ re-order data based on criteria (e.g. ascending, descending) 
-  * ''​mutate()'':​ create or transform values in a column ​ 
- 
-Let's load the ''​dplyr''​ package and explore these functions: 
- 
-<code rsplus | > 
-if(!require(dplyr)){install.packages("dplyr")}
-library(dplyr) 
-</​code>​ 
- 
-In these examples, we will use the ''​airquality''​ dataset. In the challenges we will use the ''​ChickWeight''​ dataset. ​ 
- 
-<code rsplus | > 
-?airquality 
-data(airquality) 
-?​ChickWeight 
-data(ChickWeight) 
-</​code>​ 
- 
-==== Select a subset of columns with ''​select()''​ ====  
- 
-The ''​airquality''​ dataset contains several columns: 
- 
-<code rsplus | > 
-> head(airquality) 
-  Ozone Solar.R Wind Temp Month Day 
-1    41     ​190 ​ 7.4   ​67 ​    ​5 ​  1 
-2    36     ​118 ​ 8.0   ​72 ​    ​5 ​  2 
-3    12     149 12.6   ​74 ​    ​5 ​  3 
-4    18     313 11.5   ​62 ​    ​5 ​  4 
-5    NA      NA 14.3   ​56 ​    ​5 ​  5 
-6    28      NA 14.9   ​66 ​    ​5 ​  6 
-</​code>​ 
- 
-Suppose we are only interested in the variation of "​Ozone"​ over time, then we can select the subset of required columns for further analysis: 
- 
-<code rsplus | > 
-> ozone <- select(airquality,​ Ozone, Month, Day) 
-> head(ozone) 
-  Ozone Month Day 
-1    41     ​5 ​  1 
-2    36     ​5 ​  2 
-3    12     ​5 ​  3 
-4    18     ​5 ​  4 
-5    NA     ​5 ​  5 
-6    28     ​5 ​  6 
-</​code>​ 
- 
-As you can see the general format for this function is ''​select(dataframe,​ column1, column2, ...)''​. Most ''​dplyr''​ functions will follow a similarly simple syntax. ​ 
- 
- 
-==== Select a subset of rows with ''​filter()''​ ====  
- 
-A common operation in data manipulation is the extraction of a subset based on specific conditions. For example, in the ''​airquality''​ dataset, suppose we are interested in analyses that focus on the month of August during high temperature events: 
- 
-<code rsplus | > 
-> august <- filter(airquality,​ Month == 8, Temp >= 90) 
-> head(august) 
-  Ozone Solar.R Wind Temp Month Day 
-1    89     229 10.3   ​90 ​    ​8 ​  8 
-2   ​110 ​    ​207 ​ 8.0   ​90 ​    ​8 ​  9 
-3    NA     ​222 ​ 8.6   ​92 ​    ​8 ​ 10 
-4    76     ​203 ​ 9.7   ​97 ​    ​8 ​ 28 
-5   ​118 ​    ​225 ​ 2.3   ​94 ​    ​8 ​ 29 
-6    84     ​237 ​ 6.3   ​96 ​    ​8 ​ 30 
-</​code>​ 
- 
-The syntax we employed here is ''​filter(dataframe,​ logical statement 1, logical statement 2, ...)''​. Remember that logical statements provide a TRUE or FALSE answer. The ''​filter()''​ function retains all the data for which the statement is TRUE. This can also be applied on characters and factors. 
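
As a minimal sketch of filtering on a factor, assuming the built-in ''iris'' dataset and a length threshold chosen only for illustration, the same ''filter()'' syntax can be used like this:

```r
library(dplyr)

# Species is a factor: compare it to a level name written as a string.
# Both conditions must be TRUE for a row to be kept.
setosa_big <- filter(iris, Species == "setosa", Sepal.Length > 5.5)

# %in% keeps rows that match any of several factor levels
two_species <- filter(iris, Species %in% c("setosa", "virginica"))

head(setosa_big)
nrow(two_species)
```

Separating conditions with commas combines them with a logical AND, which is why only large setosa flowers remain in the first result.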
- 
-==== Sort rows with ''arrange()'' ==== 
- 
-In data manipulation,​ we sometimes need to sort our data (e.g. numerically or alphabetically) for subsequent operations. A common example of this is a time series. First let's use the following code to create a scrambled version of the ''​airquality''​ dataset: 
- 
-<code rsplus | > 
-> air_mess <- sample_frac(airquality,​ 1) 
-> head(air_mess) 
-    Ozone Solar.R Wind Temp Month Day 
-21      1       ​8 ​ 9.7   ​59 ​    ​5 ​ 21 
-42     ​NA ​    259 10.9   ​93 ​    ​6 ​ 11 
-151    14     191 14.3   ​75 ​    ​9 ​ 28 
-108    22      71 10.3   ​77 ​    ​8 ​ 16 
-8      19      99 13.8   ​59 ​    ​5 ​  8 
-104    44     192 11.5   ​86 ​    ​8 ​ 12 
-</​code>​ 
- 
-Now let's arrange the data frame back into chronological order, sorting by ''​Month''​ then ''​Day'':​ 
- 
-<code rsplus | > 
-> air_chron <- arrange(air_mess,​ Month, Day) 
-> head(air_chron) 
-  Ozone Solar.R Wind Temp Month Day 
-1    41     ​190 ​ 7.4   ​67 ​    ​5 ​  1 
-2    36     ​118 ​ 8.0   ​72 ​    ​5 ​  2 
-3    12     149 12.6   ​74 ​    ​5 ​  3 
-4    18     313 11.5   ​62 ​    ​5 ​  4 
-5    NA      NA 14.3   ​56 ​    ​5 ​  5 
-6    28      NA 14.9   ​66 ​    ​5 ​  6 
-</​code>​ 
- 
-Note that we can also sort in descending order by placing the target column in ''​desc()''​ inside the ''​arrange()''​ function. ​ 
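
A short sketch of descending order, assuming we want the hottest days of ''airquality'' first (the object name ''air_hot_first'' is only illustrative):

```r
library(dplyr)

# Sort from the hottest to the coolest day;
# ties in Temp are broken by ascending Month, then ascending Day.
air_hot_first <- arrange(airquality, desc(Temp), Month, Day)
head(air_hot_first)
```

Only the column wrapped in ''desc()'' is reversed; the remaining sorting columns keep their default ascending order.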
- 
-==== Create and populate columns with ''​mutate()''​ ====  
- 
-Besides subsetting or sorting your data frame, you will often require tools to transform your existing data or generate some additional data based on existing variables. For example, suppose we would like to convert the temperature variable from degrees Fahrenheit to degrees Celsius:
-<code rsplus | > 
-> airquality_C <- mutate(airquality,​ Temp_C = (Temp-32)*(5/​9)) 
-> head(airquality_C) 
-  Ozone Solar.R Wind Temp Month Day   ​Temp_C 
-1    41     ​190 ​ 7.4   ​67 ​    ​5 ​  1 19.44444 
-2    36     ​118 ​ 8.0   ​72 ​    ​5 ​  2 22.22222 
-3    12     149 12.6   ​74 ​    ​5 ​  3 23.33333 
-4    18     313 11.5   ​62 ​    ​5 ​  4 16.66667 
-5    NA      NA 14.3   ​56 ​    ​5 ​  5 13.33333 
-6    28      NA 14.9   ​66 ​    ​5 ​  6 18.88889 
-</​code>​ 
-Note that the syntax here is quite simple, but within a single call of the ''​mutate()''​ function, we can replace existing columns, we can create multiple new columns, and each new column can be created using newly created columns within the same function call. 
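
The three behaviours described above can be sketched in one call. This is an illustrative example rather than part of the original workshop script: the Kelvin column and the wind conversion are assumptions added here (''Wind'' in ''airquality'' is recorded in miles per hour).

```r
library(dplyr)

airquality_K <- mutate(airquality,
                       Temp_C = (Temp - 32) * (5 / 9),  # new column
                       Temp_K = Temp_C + 273.15,        # uses Temp_C, created just above
                       Wind   = Wind * 1.609344)        # replaces Wind: mph -> km/h
head(airquality_K)
```

Columns are computed in the order they are written, so ''Temp_K'' can refer to ''Temp_C'' but not the other way around.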
- 
-===== 3.3 dplyr and magrittr, a match made in heaven ===== 
- 
-{{:​magrittrsticker.png?​200|}} 
- 
-The ''​magrittr''​ package brings a new and exciting tool to the table: a pipe operator. Pipe operators provide ways of linking functions together so that the output of a function flows into the input of next function in the chain. The syntax for the ''​magrittr''​ pipe operator is ''​%>​%''​. The ''​magrittr''​ pipe operator truly unleashes the full power and potential of ''​dplyr'',​ and we will be using it for the remainder of the workshop. First, let's install and load it: 
- 
-<code rsplus | > 
-if(!require(magrittr)){install.packages("magrittr")}
-library(magrittr)
-</​code>​ 
- 
-Using it is quite simple, and we will demonstrate that by combining some of the examples used above. Suppose we wanted to ''​filter()''​ rows to limit our analysis to the month of June, then convert the temperature variable to degrees Celsius. We can tackle this problem step by step, as before: 
- 
-<code rsplus | > 
-june_C <- mutate(filter(airquality,​ Month == 6), Temp_C = (Temp-32)*(5/​9)) 
-</​code>  ​ 
- 
-This code can be difficult to decipher because we start on the inside and work our way out. As we add more operations, the resulting code becomes increasingly illegible. Instead of wrapping each function one inside the other, we can accomplish these 2 operations by linking both functions together: 
- 
-<code rsplus | > 
-june_C <- airquality %>​% ​ 
-    filter(Month == 6) %>% 
-    mutate(Temp_C = (Temp-32)*(5/​9)) 
-</​code> ​ 
- 
-Notice that within each function, we have removed the first argument, which specifies the dataset. Instead, we specify our dataset first, then "pipe" it into the next function in the chain. This is similar to ''ggplot2'', in that we only specify the data frame once, not every single time we add a layer. The advantages of this approach are that our code is less redundant and functions are executed in the same order we read and write them, which makes it easier and quicker both to translate our thoughts into code and to read someone else's code and grasp what is being accomplished. As the complexity of your data manipulations increases, it quickly becomes apparent why this is a powerful and elegant approach to writing your ''dplyr'' code.
- 
-**Quick tip:** In RStudio we can insert this pipe quickly using the following hotkey: ''​Ctrl''​ (or ''​Cmd''​ for Mac) +''​Shift''​+''​M''​. 
- 
-===== 3.4 dplyr - Summaries and grouped operations ===== 
- 
-The ''​dplyr''​ verbs we have explored so far can be useful on their own, but they become especially powerful when we link them with each other using the pipe operator (''​%>​%''​) and by applying them to groups of observations. The following functions allow us to split our data frame into distinct groups on which we can then perform operations individually,​ such as aggregating/​summarising:​ 
- 
-  * ''​group_by()'':​ group data frame by a factor for downstream commands (usually summarise) 
-  * ''​summarise()'':​ summarise values in a data frame or in groups within the data frame with aggregation functions (e.g. ''​min()'',​ ''​max()'',​ ''​mean()'',​ etc...) 
- 
-These verbs provide the needed backbone for the Split-Apply-Combine strategy that was initially implemented in the ''plyr'' package, of which ''dplyr'' is the successor. Let's demonstrate the use of these with an example using the ''airquality'' dataset. Suppose we are interested in the mean temperature and standard deviation within each month:
- 
-<code rsplus | > 
-> month_sum <- airquality %>​% ​ 
-      group_by(Month) %>​% ​ 
-      summarise(mean_temp = mean(Temp), 
-                sd_temp = sd(Temp)) 
-> month_sum 
-Source: local data frame [5 x 3] 
- 
-  Month mean_temp ​ sd_temp 
-  (int)     ​(dbl) ​   (dbl) 
-1     ​5 ​ 65.54839 6.854870 
-2     ​6 ​ 79.10000 6.598589 
-3     ​7 ​ 83.90323 4.315513 
-4     ​8 ​ 83.96774 6.585256 
-5     ​9 ​ 76.90000 8.355671 
-</​code>​ 
  
 ---- ----
  
-===== dplyr CHALLENGE ===== 
-//Using the ''​ChickWeight''​ dataset, create a summary table which displays the difference in weight between the maximum and minimum weight of each chick in the study. Employ ''​dplyr''​ verbs and the ''​%>​%''​ operator.// 
- 
-++++Solution| ​ 
-<code rsplus | > 
-> weight_diff <- ChickWeight %>​% ​ 
-      group_by(Chick) %>​% ​ 
-      summarise(weight_diff = max(weight) - min(weight)) 
-> weight_diff 
-Source: local data frame [50 x 2] 
- 
-    Chick weight_diff 
-   ​(fctr) ​      (dbl) 
-1      18           4 
-2      16          16 
-3      15          27 
-4      13          55 
-5       ​9 ​         58 
-6      20          76 
-7      10          83 
-8       ​8 ​         92 
-9      17         100 
-10     ​19 ​        114 
-..    ...         ... 
-</​code>​ 
- 
-Note that we are only calculating the difference between max and min weight. This doesn'​t necessarily correspond to the difference in mass between the beginning and the end of the trials. Closely inspect the data for chick # 18 to understand why this is the case: 
- 
-<code rsplus | > 
-> chick_18 <- ChickWeight %>% filter(Chick == 18) 
-> chick_18 
-  weight Time Chick Diet 
-1     ​39 ​   0    18    1 
-2     ​35 ​   2    18    1 
-</​code>​ 
- 
-Here we notice that chick 18 has in fact lost weight (and probably died during the trial). From a scientific perspective,​ perhaps a more interesting question is which of the 4 diets results in the greatest weight gain in chicks. We could calculate this using 2 more useful ''​dplyr''​ functions: ''​first()''​ and ''​last()''​ allow us to access the (need I say respectively) first and last observation within a group. ​ 
-++++ 
----- 
- 
-==== Ninja Hint ==== 
- 
-Note that we can group the data frame using more than one factor, using the general syntax as follows: ''​group_by(group1,​ group2, ...)''​ 
- 
-Within ''​group_by()'',​ the multiple groups create a layered onion, and each subsequent single use of the ''​summarise()''​ function peels off the outer layer of the onion. In the above example, after we carried out a summary operation on ''​group2'',​ the resulting data set would remain grouped by ''​group1''​ for downstream operations. 
- 
----- 
- 
-===== dplyr NINJA CHALLENGE ===== 
-//Using the ''​ChickWeight''​ dataset, create a summary table which displays, for each diet, the average individual difference in weight between the end and the beginning of the study. Employ ''​dplyr''​ verbs and the ''​%>​%''​ operator. (Hint: ''​first()''​ and ''​last()''​ may be useful here.)// 
- 
-++++Solution| ​ 
-<code rsplus | > 
-> diet_summ <- ChickWeight %>​% ​ 
-      group_by(Diet,​ Chick) %>​% ​ 
-      summarise(weight_gain = last(weight) - first(weight)) %>​% ​ 
-      group_by(Diet) %>​% ​ 
-      summarise(mean_gain = mean(weight_gain)) 
-> diet_summ 
-# A tibble: 4 × 2 
-    Diet mean_gain 
-  <​fctr> ​    <​dbl>​ 
-1      1     114.9 
-2      2     174.0 
-3      3     229.5 
-4      4     188.3 
-</​code>​ 
- 
- 
-Given that the solution to the last challenge requires several operations in sequence, it provides a nice example of why the syntax implemented by ''dplyr'' and ''magrittr'' is so useful. An additional challenge, if you are well versed in base ''R'' functions, would be to reproduce the same operations using fewer keystrokes. We tried, and failed... Perhaps we are too accustomed to ''dplyr'' now.
-++++ 
----- 
- 
- 
-===== 3.5 dplyr - Merging data frames ===== 
- 
-In addition to all the operations we have explored, ''​dplyr''​ also provides some functions that allow you to join two data frames together. The syntax in these functions is simple relative to alternatives in other ''​R''​ packages: 
- 
-  * ''​left_join()''​ 
-  * ''​right_join()''​ 
-  * ''​inner_join()''​ 
-  * ''​anti_join()''​ 
- 
-These are beyond the scope of the current introductory workshop, but they provide extremely useful functionality you may eventually require for some more advanced data manipulation needs. 
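
For readers who want a first look anyway, here is a minimal sketch of how these joins behave. The two small data frames are hypothetical and invented for illustration (''trees'' mirrors the tree example used for ''tidyr''; ''traits'' and its ''Leaf'' column are assumptions):

```r
library(dplyr)

# Hypothetical field measurements: one row per species
trees <- data.frame(Species = c("Oak", "Elm", "Ash"),
                    DBH = c(12, 20, 13),
                    stringsAsFactors = FALSE)

# Hypothetical lookup table: note that Ash is missing and Pine is extra
traits <- data.frame(Species = c("Oak", "Elm", "Pine"),
                     Leaf = c("broadleaf", "broadleaf", "needle"),
                     stringsAsFactors = FALSE)

left  <- left_join(trees, traits, by = "Species")   # all rows of trees; Leaf is NA for Ash
right <- right_join(trees, traits, by = "Species")  # all rows of traits; DBH is NA for Pine
inner <- inner_join(trees, traits, by = "Species")  # only species found in both tables
anti  <- anti_join(trees, traits, by = "Species")   # rows of trees with no match in traits

left
anti
```

All four functions share the same pattern, ''join(x, y, by = "key column")'', and differ only in which unmatched rows they keep.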
  
 ===== 4. Resources =====
  
-Here are some great resources for learning ggplot2, tidyr and dplyr that we used when compiling this workshop:
+Here are some great resources for learning ggplot2 that we used when compiling this workshop:
  
 //ggplot2//
+  * [[https://www.r-graph-gallery.com|The R Graph gallery]]
+  * [[http://sape.inf.usi.ch/quick-reference/ggplot2|The Software and Programmer Efficiency Research Group ggplot2 Quick Reference guide]]
   * [[http://shinyapps.stat.ubc.ca/r-graph-catalog/|The R Graph Catalog]]
   * [[https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf|The RStudio ggplot2 Cheat Sheet]]
Line 1065: Line 639:
   * [[http://stat405.had.co.nz/lectures/11-adv-data-manip.pdf]]
   * [[http://stat405.had.co.nz/lectures/19-tables.pdf]]
- 
-//dplyr and tidyr// 
-  * [[https://​www.rstudio.com/​wp-content/​uploads/​2015/​02/​data-wrangling-cheatsheet.pdf|The RStudio Data Wrangling Cheat Sheet]] 
-  * [[https://​cran.rstudio.com/​web/​packages/​dplyr/​vignettes/​introduction.html|CRAN Intro to dplyr]] 
-  * [[http://​seananderson.ca/​2014/​09/​13/​dplyr-intro.html|Sean Anderson'​s Intro to dplyr and pipes]] 
-  * [[https://​rpubs.com/​bradleyboehmke/​data_wrangling|Bradley Boehmke'​s Intro to data wrangling]] 
- 
  
 **BONUS!** Check out R style guides to help format your scripts for easy reading:
   * [[http://adv-r.had.co.nz/Style.html]]
-