This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.
The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page
Developed by: Catherine Baltazar, Bérenger Bourgeois, Zofia Taranu
Summary: In this workshop, you will learn how to implement basic linear models commonly used in ecology in R such as simple regression, analysis of variance (ANOVA), analysis of covariance (ANCOVA), and multiple regression. After verifying visually and statistically the assumptions of these models and transforming your data when necessary, the interpretation of model outputs and the plotting of your final model will no longer keep secrets from you!
Link to associated Prezi: Prezi
Download the R script and data for this lesson:
Scientists have always been interested in determining relationships between variables. Depending on the kind of variables considered and their number, different statistical tools can be used to assess these relationships. The table below lists the five types of statistical analysis that will be covered in this workshop:
|Statistical analysis|| Type of response|
| Type of explanatory|
| Number of|
| Number of
|Simple linear regression||Continuous||Continuous||1|
|ANOVA||Categorical|| 1 (one-way ANOVA), 2 (two-|
way ANOVA) or more
|3 or more|
|2 or more||2 or more|
|Multiple regression||Continuous||2 or more|
The purpose of this analysis is to relate a response variable, y, and an explanatory variable, x, through a straight line. The mathematical model corresponding to linear regression is given by:
is the intercept of the regression line,
is the slope of the regression line,
is the continuous explanatory variable,
are the residuals of the model (i.e. the unexplained variation).
The goal here is to find a correct estimation of these two regression parameters and then assess the goodness of fit of the regression model. While several methods have been developed to calculate the intercept and slope coefficients of regression models, ordinary least squares is the most widely used estimation method, and also corresponds to the default method of the
lm function in R. Ordinary least squares fits a line such that the sum of the squared vertical distances between the observed data and the linear regression model (i.e. the residuals) is minimized. From this method, the intercept β0 and the slope β1 of the linear regression can be calculated as:
To be valid, a linear regression must meet 4 assumptions, otherwise the model results cannot be safely interpreted.
Using the bird dataset, we will first examine the linear regression of maximum abundance as a function of mass.
In R, linear regression is implemented using the lm function from the stats package:
lm (y ~ x)
|Note: before using a new function in R, users should refer to its help documentation (
The bird dataset contains 7 variables:
|Family||Common name of family||String|
|MaxAbund|| The highest observed abundance at any site|
in North America
|AvgAbund|| The average abundance across all sites|
where found in NA
|Mass||The body size in grams||Continuous/ numeric|
|Diet||Type of food consumed|| Discrete – 5 levels (Plant; PlantInsect;
Insect; InserctVert; Vertebrate)
|Passerine||Is it a songbird/ perching bird||Boolean (0/1)|
|Aquatic||Is it a bird that primarily lives in/ on/ next to the water||Boolean (0/1)|
We are now ready to run our linear model:
Residual vs Fitted plot - The first graph of the diagnostic plots called by
plot(lm1) illustrates the spread of the residuals between each fitted values, if homogenous then the homoscedasticty is respected. The Residuals vs Fitted values should show a similar scatter across all Fitted values (x-axis).
If the relationship between the response variable and the explanatory variable is not linear, then it will show up here.
Scale-location plot - The third graph of the diagnostic plots enables one to verify whether the residuals spread increases with a given fitted values (i.e. identifies whether the spread in the residuals is due to the selected explanatory variable). If the spread increases, the homoscedasticity assumption is not respected.
QQ plot - Independence can be assessed from the QQplot of the diagnostic plots. It allows you to check the distribution of the model residuals and verify the normality of the response and explanatory variables. This graph compares the probability distribution of the model residuals to the probability distribution of normal data series. If the standardized residuals lie linearly on the 1:1 line of the QQplot, the residuals can be considered normally distributed.
The points of the QQplot are nonlinear, which suggests that the residuals are not normally distributed.
Residuals vs Leverage plot - High leverage data can be visualised on the fourth diagnostic plots (i.e. residuals vs leverage), which identifies the observation numbers of the high leverage data point(s). If (and only if!) these observations correspond to mismeasurements or represent exceptions, they can be removed from the original dataset.
In the example provided above, the MaxAbund (response variable) and Mass (explanatory variable) were not normally distributed and the linear regression assumptions were violated. The next step is to try to normalize the variables using transformations. To assess the normality of a variable, draw a histogram using the function
hist, and check visually whether the data series appears to follow a normal distribution. For example:
# Plot Y ~ X and the regression line # Plot Y ~ X and the regression line plot(bird$MaxAbund ~ bird$Mass, pch=19, col="coral", ylab="Maximum Abundance", xlab="Mass") abline(lm1, lwd=2) ?plot # For further details on plot() arguments # see colours() for list of colours # Is the data normally distributed? hist(bird$MaxAbund,col="coral", main="Untransformed data", xlab="Maximum Abundance") hist(bird$Mass, col="coral", main="Untransformed data", xlab="Mass")
A second way to assess normality is to use the Shapiro-Wilk normality test that compares the distribution of the observed data series to a normal distribution using the function
The null and alternate hypotheses of this test are:
H0: the observed data series is normally distributed,
H1: the observed data series is not normally distributed,
The observed data series can be considered normally distributed when the p-value calculated by the Shapiro-Wilk normality test is greater than or equal to α, typically set to 0.05.
# Test null hypothesis that the sample came from a normally distributed population shapiro.test(bird$MaxAbund) shapiro.test(bird$Mass) # If p < 0.05, then distribution is not-normal # if p > 0.05, then distribution is normal
We can also evaluate the skewness of each distribution using the
skewness(bird$MaxAbund) skewness(bird$Mass) # where positive values indicate a left-skewed distribution, and negative value a right skew.
The histograms, Shapiro tests and Skewness all indicate that the variables need to be transformed to normalize (e.g. a log10 transformation).
In case of non-normality, response and explanatory variables can be transformed to enhance their normality following these rules:
|Type of distribution||Transformation||R function|
|Moderately positive skewness||sqrt(x)|
|Substantially positive skewness||log10(x)|
|Substantially positive skewness||log10(x + C) where C is a constant
added to each value of x so that
the smallest score is 1
|Moderately negative skewness|| sqrt(K - x) where K is a constant
subtracted from each value of x
so that the smallest score is 1
|Substantially negative skewness||log10(K - x)|
Thus, log10 transformations should be applied and saved in the bird data frame. The model can then be re-runned, verified and interpreted.
# Add log10() transformed variables to your dataframe bird$logMaxAbund <- log10(bird$MaxAbund) bird$logMass <- log10(bird$Mass) names(bird) # to view the dataframe + new transformed variables hist(bird$logMaxAbund,col="yellowgreen", main="Log transformed", xlab=expression("log"*"(Maximum Abundance)")) hist(bird$logMass,col="yellowgreen", main="Log transformed", xlab=expression("log"*"(Mass)")) shapiro.test(bird$logMaxAbund); skewness(bird$logMaxAbund) shapiro.test(bird$logMass); skewness(bird$logMass) # Re-run your analysis with the appropriate transformations lm2 <- lm(bird$logMaxAbund ~ bird$logMass) # Are there remaining problems with the diagnostics (heteroscedasticity, non-independence, high leverage)? opar <- par(mfrow=c(2,2)) plot(lm2, pch=19, col="gray") par(opar)
Once all these assumptions have been verified, the model results can be interpreted. These results are called in R using the function
# Now we can look at the model coefficients and p-values summary(lm2) # You can also just call up the coefficients of the model lm2$coef # What else? str(summary(lm2)) summary(lm2)$coefficients # where Std. Error is the standard error of each estimate summary(lm2)$r.squared # Coefficient of determination summary(lm2)$adj.r.squared # Adjusted coefficient of determination summary(lm2)$sigma # Residual standard error (square root of Error Mean Square) # etc… # You can also check for yourself the equation for R2: SSE = sum(resid(lm2)^2) SST = sum((bird$logMaxAbund - mean(bird$logMaxAbund))^2) R2 = 1 - ((SSE)/SST) R2
The output of this function presents all the results of your validated model:
lm(formula = logMaxAbund ~ logMass, data = bird) Residuals: Min 1Q Median 3Q Max -1.93562 -0.39982 0.05487 0.40625 1.61469 Estimate Std. Error t value Pr(>|t|) (Intercept) 1.6724 0.2472 6.767 1.17e-08 *** logMass -0.2361 0.1170 -2.019 0.0487 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6959 on 52 degrees of freedom Multiple R-squared: 0.07267, Adjusted R-squared: 0.05484 F-statistic: 4.075 on 1 and 52 DF, p-value: 0.04869
The coefficients of the regression model and their associated standard error appear in the second and third columns of the regression table, respectively. Thus,
β0 = 1.6724 ± 0.2472 is the intercept (± se) of the regression model,
β1 = -0.2361 ± 0.1170 is the slope (± se) of the regression model.
and finally: logMaxAbund = 1.6724 (± 0.2472) - 0.2361 (± 0.1170) x logMass
The t-value and their associated p-value (in the fourth and fifth columns of the regression table, respectively) test for a significant difference between the calculated coefficients and zero. In this case, we can see that logMass has a significant influence on logMaxAbund because the p-value associated with the slope of the regression model is inferior to 0.05. Moreover, these two variables are negatively related as the slope of the regression model is negative.
Users must, however, be aware that significant relationship between two variables does not always imply causality. Conversely, the absence of significant linear regression between y and x does not always imply an absence of relationship between these two variables; this is for example the case when a relationship is not linear.
The goodness of fit of the linear regression model is assessed from the adjusted-R2 (here, 0.05484), given by:
p is the total number of regressors and n is the sample size,
is the total sums of squares,
is the regression sums of squares - also called the explained sums of squares.
The higher the adjusted-R2 is, the better the data fit the statistical model, knowing that this coefficient varies between 0 and 1. In this case, the relationship between logMaxAbund and logMass is quite weak.
The last line of the R output represents the F-statistic of the model and its associated p-value. If this p-value is inferior to 0.05, the model explains the data relationship better than a null model.
Linear regression results are generally represented by a plot of the response variable as a function of the explanatory variable on which the regression line is added (and if needed the confidence intervals), using the R code:
plot(logMaxAbund ~ logMass, data=bird, pch=19, col="yellowgreen", ylab = expression("log"*"(Maximum Abundance)"), xlab = expression("log"*"(Mass)")) abline(lm2, lwd=2) # You may also flag the previously identified high-leveraged points points(bird$logMass, bird$logMaxAbund, pch=19, col="violet") points(bird$logMass, bird$logMaxAbund, pch=19, col="violet") points(bird$logMass, bird$logMaxAbund, pch=19, col="violet") # We can also plot the confidence intervals confit<-predict(lm2,interval="confidence") points(bird$logMass,confit[,2]) points(bird$logMass,confit[,3])
We may also run the analysis on a subset of observations, for example, on terrestrial birds only.
# Recall that you can exclude objects using "!" # We can analyze a subset of this data using this subset command in lm() lm3 <- lm(logMaxAbund ~ logMass, data=bird, subset =! bird$Aquatic) # removing the Aquatic birds # or equivalently lm3 <- lm(logMaxAbund ~ logMass, data=bird, subset=bird$Aquatic == 0) # Examine the model opar <- par(mfrow=c(2,2)) plot(lm3, pch=19, col=rgb(33,33,33,100,maxColorValue=225)) summary(lm3) par(opar) # Compare the two datasets opar <- par(mfrow=c(1,2)) plot(logMaxAbund ~ logMass, data=bird, main="All birds", ylab = expression("log"*"(Maximum Abundance)"), xlab = expression("log"*"(Mass)")) abline(lm2,lwd=2) plot(logMaxAbund ~ logMass, data=bird, subset=!bird$Aquatic, main="Terrestrial birds", ylab = expression("log"*"(Maximum Abundance)"), xlab = expression("log"*"(Mass)")) abline(lm3,lwd=2) opar(par)
Examine the relationship between log10(MaxAbund) and log10(Mass) for passerine birds.
HINT: Passerine is coded 0 and 1 just like Aquatic. You can verify this by viewing the structure
The Student’s t-test, or simply t-test enables the comparison of a continuous variable according to a categorical variable divided in two groups (or treatments), and only two groups. Based on the following mathematical model, t-test compares the response variables, y, between the two groups A1 and A2 of the explanatory variable:
µ is the mean of the response variable,
corresponds to the effect of group i,
i varies from 1 to 2,
are the residuals of the model (i.e. the unexplained variation).
The statistical hypotheses of the t-test evaluates the difference between the two explanatory groups, i.e.:
H0: µ1 = µ2
H1: µ1 ≠ µ2
µ1 is the mean of the response variable for group 1,
µ2 is the mean of the response variable for group 2.
The above bilateral alternative hypothesis H1 searches for a difference in the response variables between the two groups.
However, if the sense of the expected difference is supported by a biological hypothesis, unilateral alternative hypotheses can also be used:
For the t-test, the t statistic used to find the p-value calculation is calculated as:
1 and 2 are the means of the response variable y for group 1 and 2, respectively,
s12 and s22 are the variances of the response variable y for group 1 and 2, respectively,
n1 and n2 are the sample sizes of groups 1 and 2, respectively.
If the assumptions of the t-test are not met, the test can give misleading results. These assumptions concern the shape of the distribution of the data:
For the example above, with the pooled variance the type I error is actually larger than the α set from the group 1 sample. Thus, we would conclude do not reject, when actually the α cut-off was wrong and we should have rejected!
If variances between groups are not equal, it is possible to use corrections, like the Welch correction. If assumptions cannot be respected, the non-parametric equivalent of t-test is the Mann-Whitney test. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a Paired t-test.
In R, t-tests are implemented using the function
t.test. For example, to test for a mass difference between aquatic and non-aquatic birds, you should write:
# T-test boxplot(logMass ~ Aquatic, data=bird, ylab=expression("log"*"(Bird Mass)"), names=c("Non-Aquatic","Aquatic"), col=c("yellowgreen","skyblue")) # First, let's test the assumption of equal variance # Note: we do not need to test the assumption of normally distributed data since # we already log transformed the data above tapply(bird$logMass,bird$Aquatic,var) var.test(logMass~Aquatic,data=bird) # We are now ready to run the t-test ttest1 <- t.test(Mass~Aquatic, var.equal=TRUE, data=bird) # or equivalently ttest1 <- t.test(x=bird$logMass[bird$Aquatic==0], y=bird$logMass[bird$Aquatic==1], var.equal=TRUE) ttest1
Two Sample t-test data: logMass by Aquatic t = -7.7707, df = 52, p-value = 2.936e-10 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -1.6669697 -0.9827343 sample estimates: mean of x mean of y 1.583437 2.908289
Here, we show that the ratio of variances is not statistically different from 1, therefore variances are equal, and we proceeded with our t-test. Since p < 0.05, the hypothesis of no difference between the two bird types (Aquatic vs. terrestrial) was rejected.
Don't forget that the t-test is still a linear model and a specific case of ANOVA (see below) with one factor with 2 levels. As such, we can also run the t-test with the
lm() function in R:
Analysis of Variance Table Response: logMass Df Sum Sq Mean Sq F value Pr(>F) Aquatic 1 19.015 19.0150 60.385 2.936e-10 *** Residuals 52 16.375 0.3149 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
When variances are equal (i.e., two-sample t-test), we can show that t2 = F:
The alternative option of the
t.test function allows for the use of unilateral t-test. For example, if users want to test if non-aquatic birds are less heavy than aquatic birds, the function can be written:
In the R output, called by
uni.ttest1, the results of the t-test appear in the third line:
Two Sample t-test data: logMass by Aquatic t = -7.7707, df = 52, p-value = 1.468e-10 alternative hypothesis: true difference in means is less than 0 95 percent confidence interval: -Inf -1.039331 sample estimates: mean in group 0 mean in group 1 1.583437 2.908289
In this case, the calculated t-statistic is t = -7.7707 with df = 52 degrees of freedom that gives a p-value of p-value = 1.468e-10. As the calculated p-value is inferior to 0.05, the null hypothesis is rejected. Thus, aquatic birds are significantly heavier than non-aquatic birds.
Analysis of Variance (ANOVA) corresponds to a generalization of the Student’s t-test. ANOVA enables the comparison of a continuous variable according to a categorical variable divided in more than two groups (or treatments), whereas the number of groups in the Student’s t-test was restricted to two. Based on the following mathematical model, ANOVA thus compares the response variables y between the j groups of the explanatory variable:
µ is the general mean of the response variable,
Ai is the effect of the level i for the factor A,
i varies from 1 to n (n ≥ 2),
εij are the residuals of the model (i.e. the unexplained variation).
The statistical hypotheses of the ANOVA aim to identify if one of the j groups differ from the others in terms of the response variable y:
H0: µ1 = µ2 =… = µj =… = µn
H1: there is at least one µj that differs from the others
To determine whether the null hypothesis must be accepted or rejected, ANOVA calculations are based on the sum of squares partitioning and compares the within-treatment variance to the between-treatment variance. If the between-treatment variance is greater than the within-treatment variance, this means that the treatments affect the explanatory variable more than the random error (corresponding to the within-treatment variance), and that the explanatory variable is likely to be significantly influenced by the treatments.
In the ANOVA, the comparison of the between-treatment variance to the within-treatment variance is made through the calculation of the F-statistic that correspond to the ratio of the mean sum of squares of the treatment (MSTrt) on the mean sum of squares of the error (MSE). These two last terms are obtained by dividing their two respective sums of squares by their corresponding degrees of freedom, as presented in the following ANOVA table. Then, the p-value of the ANOVA is calculated from the F-statistic that follows a Chi-square (χ2) distribution.
| Source of|
| Degrees of|
|Sums of squares||Mean squares||F-statistic|
a: number of levels of the explanatory variable A; r: number of replicates per treatment; : general mean of the explanatory variable; i : mean of the explanatory variable for all the replicates of the treatment i.
As with the simple linear regression and t-test, ANOVA must meet different statistical assumptions to be valid, among which two are particularly important. These assumptions can be verified using the diagnostic plots or with parametric tests.
bartlett.test. If the p-value of this test is superior to 0.05, the null hypothesis H0: s12 = s22 =… = sj2 =… = sn2 is accepted and the homoscedasticity assumption is respected.
If assumptions are violated your can try to transform your data, which could potentially equalize variances and normalize residuals, and can convert a multiplicative effect into an additive effect. Or, if you can't (or don't want to) transform your data, the non-parametric equivalent of ANOVA is Kruskal-Wallis test.
The number of comparisons has to be lower or equal to the number of degrees of freedom of the ANOVA. Comparisons have to be independent from one another. For more details, see the advanced section on contrasts below.
First, let's visualize the data using
boxplot(). Recall that by default, R will order you groups in alphabetical order. We can reorder the groups according to the median of each Diet level.
Another way to graphically view the effect sizes is to use
plot.design(). This function will illustrate the levels of a particular factor along a vertical line, and the overall value of the response is drawn as a horizontal line.
# Default alphabetical order boxplot(logMaxAbund ~ Diet, data=bird) # Relevel factors med <- sort(tapply(bird$logMaxAbund, bird$Diet, median)) boxplot(logMaxAbund ~ factor(Diet, levels=names(med)), data=bird, col=c("white","lightblue1", "skyblue1","skyblue3","skyblue4")) plot.design(logMaxAbund ~ Diet, data=bird, ylab = expression("log"*"(Maximum Abundance)"))
Let's now run the ANOVA. In R, ANOVA can be called either directly with the
aov function, or with the
anova function performed on a linear model previously implemented with
# Plot for diagnostics opar <- par(mfrow=c(2,2)) plot(anov1) par(opar) # Test assumption of normality of residuals shapiro.test(resid(anov1)) # Test assumption of homogeneity of variance bartlett.test(logMaxAbund ~ Diet, data=bird)
Ideally the first diagnostic plot should show similar scatter for each Diet level. The Shapiro and Bartlett tests are both non-significant, therefore residuals are assumed to be normally distributed and variances are assumed to be equal.
Once your ANOVA model has been validated, its results can be interpreted. The R output of ANOVA model depends of the function that has been used to implement the ANOVA. If the
aov function is used to implement the ANOVA model
the results of the ANOVA can be visualized using the function
On the other hand, if
lm() is used
the ANOVA results must be called using the function
In both cases, the R output is as follows:
Df Sum Sq Mean Sq F value Pr(>F) Diet 4 5.106 1.276 2.836 0.0341 * Residuals 49 22.052 0.450 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This R output corresponds exactly to the ANOVA table of your model. This output also present the degrees of freedom, the sum of squares, the mean sum of squares and the F-value previously explained. For this example, the diet significantly influences the abundance of birds as the p-value is inferior to 0.05. The null hypothesis can then be rejected meaning that at least one of the diet treatments influenced the abundance differently than the other treatments.
Importantly, ANOVA cannot identify which treatment is different from the others in terms of response variable. To determine this, post-hoc tests that compare the levels of the explanatory variables (i.e. the treatments) two by two, must be performed. While several post-hoc tests exist (e.g. Fischer’s least significant difference, Duncan’s new multiple range test, Newman-Keuls method, Dunnett’s test, etc.), the Tukey’s range test is used in this example using the function
TukeyHSD as follows:
The R output for this test gives a table containing all the two by two comparisons of the explanatory variable levels and identify which treatment differ from the others:
Tukey multiple comparisons of means 95% family-wise confidence level factor levels have been ordered Fit: aov(formula = anov1) $Diet diff lwr upr p adj Vertebrate-InsectVert 0.3364295 -1.11457613 1.787435 0.9645742 Insect-InsectVert 0.6434334 -0.76550517 2.052372 0.6965047 Plant-InsectVert 0.8844338 -1.01537856 2.784246 0.6812494 PlantInsect-InsectVert 1.0657336 -0.35030287 2.481770 0.2235587 Insect-Vertebrate 0.3070039 -0.38670951 1.000717 0.7204249 Plant-Vertebrate 0.5480043 -0.90300137 1.999010 0.8211024 PlantInsect-Vertebrate 0.7293041 0.02128588 1.437322 0.0405485 Plant-Insect 0.2410004 -1.16793813 1.649939 0.9884504 PlantInsect-Insect 0.4223003 -0.19493574 1.039536 0.3117612 PlantInsect-Plant 0.1812999 -1.23473664 1.597336 0.9961844
In this case, the only significant difference in abundance occurs between the PlantInsect diet and the Vertebrate diet.
After having verified the assumptions of your ANOVA model, interpreted the ANOVA table and differentiated the effect of the treatments using post-hoc tests or contrasts, the ANOVA results can be graphically illustrated using a
barplot. This shows the response variable as a function of the explanatory variable levels, where standard errors can be superimposed on each bar as well as the different letters representing the treatment group (according to the post-hoc test).
# Graphical illustration of ANOVA model using barplot() sd <- tapply(bird$logMaxAbund,list(bird$Diet),sd) means <- tapply(bird$logMaxAbund,list(bird$Diet),mean) n <- length(bird$logMaxAbund) se <- 1.96*sd/sqrt(n) bp <- barplot(means, col=c("white","lightblue1","skyblue1","skyblue3","skyblue4"), ylab = expression("log"*"(Maximum Abundance)"), xlab="Diet", ylim=c(0,1.8)) # Add vertical se bars segments(bp, means - se, bp, means + se, lwd=2) # and horizontal lines segments(bp - 0.1, means - se, bp + 0.1, means - se, lwd=2) segments(bp - 0.1, means + se, bp + 0.1, means + se, lwd=2)
Click to display ⇲
Click to hide ⇱
A second R output of the ANOVA results, called contrasts, can be called to visualize the parameter estimates for each level of the explanatory variable in comparison to a baseline level. This output is called using the function
summary.lm(aov1) when the ANOVA model was implemented with the
aov function, and using the
summary(anov1) when the ANOVA was implemented with the
lm function. This output performs a linear regression for each level of the explanatory variable and calculates their associated parameters:
Call: lm(formula = logMaxAbund ~ Diet, data = bird) Residuals: Min 1Q Median 3Q Max -1.85286 -0.32972 -0.08808 0.47375 1.56075 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.1539 0.1500 7.692 5.66e-10 *** DietInsectVert -0.6434 0.4975 -1.293 0.2020 DietPlant 0.2410 0.4975 0.484 0.6303 DietPlantInsect 0.4223 0.2180 1.938 0.0585 . DietVertebrate -0.3070 0.2450 -1.253 0.2161 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.6709 on 49 degrees of freedom Multiple R-squared: 0.188, Adjusted R-squared: 0.1217 F-statistic: 2.836 on 4 and 49 DF, p-value: 0.0341
Users can note that the last line of this R output corresponds exactly to the previous R output shown in the ANOVA table. The F-statistic of the ANOVA model and its associated p-value (i.e. 2.836 and 0.0341), are the same one as the values obtained from the ANOVA table, indicating diet explains the abundance better than a null model, and so, diet significantly influence the abundance. The goodness-of-fit of the ANOVA model (i.e. adjusted R-square value) appears in the second to last line of this output. In this case, diet explains 12.17% of the abundance variability.
Contrasts adjust a linear regression of the response variable as function of each level of the categorical explanatory variable separately. In this case, 5 linear regressions (corresponding to the five lines of the coefficients table of the R output) are calculated by the lm function as the diet variable contains 5 levels. By default, the baseline level corresponding to the intercept is the first level of explanatory variables ranked in alphabetical order. So in this case, the “Insect” diet is automatically used as a baseline in R. In the R output, the coefficient estimate of the baseline level (here, 1.1539) is first compared to 0 using a t-test (in this case, the t-test is significant with a p-value of 5.66e-10), while the coefficient estimates of the other explanatory variable level are compared to the baseline level. In this case, only the PlantInsect diet differs from the Insect diet, with an associated p-value of 0.0585.
In other words, this R output allows us to determine the mean of the response variable for each of the diet levels, for example:
LogMaxAbund = 1.1539 for the Insect diet,
LogMaxAbund = 1.1539 – 0.6434 for the InsectVert diet,
LogMaxAbund = 1.1539 + 0.2410 for the Plant diet,
As this type of contrasts compares each level of the explanatory variable to a baseline level, they are called contr.treatment and constitute the default method of the
lm function in R. The baseline level can, however, be changed using the
relevel function. For example,
compares each diet treatment to the Plant diet, now defined as the baseline level.
The contrasts coefficient matrix of these contr.treatment comparisons can be called by
Insect InsectVert PlantInsect Vertebrate Plant 0 0 0 0 Insect 1 0 0 0 InsectVert 0 1 0 0 PlantInsect 0 0 1 0 Vertebrate 0 0 0 1
where each column corresponds to a comparison to the baseline Plant and each line to a diet level. For example, the first comparison compares the Insect diet to the Plant diet, the second one compares the InsectVert diet to the Plant diet, etc.
Users can create their own contrasts coefficient matrix in order to perform the comparison they desire using the contrasts function. For example,
creates this contrasts coefficient matrix:
[,1] [,2] [,3] [,4] Plant 4 0 0 0 Insect -1 1 0 1 InsectVert -1 1 0 -1 PlantInsect -1 -1 1 0 Vertebrate -1 -1 -1 0
Thus, for each column, the treatments with identical contrasts coefficient belong to the same two by two comparison group (e.g. in the column 1, the four treatment with a -1 coefficient belong to the first comparison group and are altogether compare to the second group corresponding to the treatment with a different coefficient, here the Plant diet with a coefficient 4). Thus, all comparison possible can be performed with personalized contrasts coefficient matrix. Only two conditions restrict the possible contrasts coefficient matrix:
This can be verified using the following R commands
sum(contrasts(bird$Diet)[,1]) # first condition for column 1 sum(contrasts(bird$Diet)[,1]*contrasts(bird$Diet)[,2]) # second condition for column 1 and 2
Conventional contrast coefficient matrices are already programmed in R, such as the contr.helmert or the contr.poly matric (cf.
To better explain the variability of a response variable, two categorical variables can also be used as explanatory variables in an ANOVA model instead of just one (cf. part 3). To mathematically describe a two-way ANOVA, the one-way ANOVA model must simply be rewritten to take into account the interaction term between the two explanatory variables:
µ is the general mean of the response variable,
Ai is the effect of the level i for the factor A,
Bj is the effect of the level j for the factor B,
AiBj is the interaction term between the two explanatory variables,
i and j vary from 1 to n (n ≥ 2),
εijk are the residuals of the model.
However, the statistical hypotheses for the two-way ANOVA are:
H01: No difference in mean among levels of A; µa1 = µa2 = … = µai =… = µan
H02: No difference in mean among levels of B; µb1 = µb2 = … = µbi =… = µbm
H03: No interaction between factors A and B.
The one-way ANOVA table also has to be rewritten to add the second explanatory term as well as the interaction term. Thus, a two-way ANOVA table corresponds to:
| Source of|
| Degrees of|
|Sums of squares||Mean squares||F-statistic|
a: number of levels of the explanatory variable A; b: number of levels of the explanatory variable B; r: number of replicates per treatment
In R, a two-way ANOVA model is implemented in the same fashion as a one-way ANOVA using the function
Examine the effects of the factors Diet, Aquatic, and their interaction on the maximum bird abundance.
Recall: Before interpreting the ANOVA results, the model must first be validated by verifying the statistical assumptions of ANOVA, namely the:
This verification can be done using the four diagnostic plots as previously explained for one-way ANOVA.
Interactions can also be viewed graphically using the function
interaction.plot(bird$Diet, bird$Aquatic, bird$logMaxAbund, col="black", ylab = expression("log"*"(Maximum Abundance)"), xlab="Diet")
What do the gaps in the line for the Aquatic group mean?
0 1 Insect 14 6 InsectVert 1 1 Plant 2 0 PlantInsect 17 1 Vertebrate 5 7
The design is unbalanced; unequal observations among diet levels for Aquatic (coded as 1) and Terrestrial (coded as 0). See advanced section below for details on unbalanced ANOVA designs.
Test the significance of the Aquatic factor by comparing nested models with and without this categorical variable.
Click to display ⇲
Click to hide ⇱
One-way and two-way ANOVA enabled us to determine the effect of categorical explanatory variables on a continuous response variable for the case of balanced experimental designs (i.e. when all levels of the explanatory variables contain the same number of replicates). However, loss of experimental units over the course of an experiment, or technical restriction of experimental designs can result in unbalanced designs. In this case, the above-mentioned ANOVA tests lead to misleading results related due to incorrect sum of squares calculations. For unbalanced experimental design, ANOVA must be modified to correctly account for the missing values of the response variable.
While mathematical model, statistical hypothesis and statistical assumptions for ANOVA with unbalanced designs remain the same as for ANOVA with balanced designs the sum of squares calculation changes.
For unbalanced design, ANOVA thus test the hypothesis:
H0: µ1 = µ2 =… = µi =… = µn
H1: there is at least one µi that differs from the others.
Using this mathematical model:
Recall the sum of squares calculation of the ANOVA with balanced design:
This corresponds to sequential sum of squares, or type I sum of squares, as the main effect of B is calculated after removing the main effect of A, and the interaction effect is calculated after removing the two main effects. These calculations are sample size dependent as the effect of each factor is calculated after removing the effect of the precedent factor.
For unbalanced design, ANOVA results will depend on the order in which each explanatory variable appears in the model. This can be seen by comparing the results of the following two models:
While the same explanatory variables are used in these two models, the ANOVA tables show different results due to the unbalanced design (i.e. different number of observations for aquatic and non-aquatic birds).
For unbalanced designs, marginal sum of squares, or type III sum of squares that perform calculations of main effect after removing the effect of all other factors ensure independence from sample size effects:
In R, ANOVA with type III sum of squares can be implemented using the Anova function of package ‘car’ and specifying “III” in the type option, for example:
By comparing ANOVA tables of models with different order of explanatory variables, users can now note that the ANOVA table results always remain the same. Type III sum of squares thus correctly calculates the ANOVA results by becoming independent of sample sizes.
After having verifying the model assumptions, results can finally be safely interpreted.
Analysis of covariance (ANCOVA) combines linear regression and ANOVA to test the influence of one categorical explanatory variable (or more) and one continuous explanatory variable (or more) on a continuous response variable. The underlying mathematical model of ANCOVA can be written as:
µ is the general mean of the response variable,
Ai is the treatment effect,
Bi is the effect of the continuous variable,
xij is the covariate measured for observation yij,
i is the average value of the covariate for treatment group i,
i varies from 1 to n (n > 2) treatments,
εij are the residuals of the model.
Notice that this model contains a term Ai for the treatment effect (as in ANOVA) and a slope term Βi for the covariate effect (as in regression). Thus, each treatment group is described by its own slope and intercept. In addition to testing if the response variable differs for at least one level of the categorical variable, ANCOVA also tests whether the response variable might be influenced by its relationship with the continuous variable (called the covariate in ANCOVA), and by any differences between group levels in the way that the continuous variable influences the response (i.e. the interaction). The ANCOVA hypotheses are thus:
H01: There is no effect of the categorical factor (i.e. µ1 = µ2 =… = µi =… = µn)
H02: There is no effect of the continuous factor (i.e. β = 0)
H03: There is no interacting effect of the categorical and continuous factors
As with models seen above, to be valid, ANCOVA models must also meet two statistical assumptions that can be verified using diagnostic plots, i.e.:
Note: A fixed variable is one that you are specifically interested in (i.e. bird mass). In contrast, a random variable is noise that you want to control for (i.e. site a bird was sampled in). If you have random variables, see the workshop on Linear Mixed-effect Models!
You can have any number of factors and/or covariates, but as their number increases, the interpretation of results gets more complex.
The most frequently used ANCOVAs are those with:
The different possible goals of the ANCOVA are to determine the effects of:
Importantly, these goals are only met if there is no significant interaction between the factor(s) and the covariate(s)! Examples of significant interactions between the factor and the covariate (for an ANCOVA with one factor and one covariate) are illustrated by the second and third panels below:
The same logic follows for ANCOVAs with multiple factors and/or covariates.
Running an ANCOVA in R is comparable to running a two-way ANOVA, using the function
lm. However, instead of using two categorical variables (Diet and Aquatic), we now use one categorical and one continuous variable.
For example, using a build in dataset called CO2, where the response variable is uptake, the continuous variable is conc and the factor is Treatment, the ANCOVA is:
If only your factor is significant, drop your covariate from the model: you will then have an ANOVA.
If only your covariate is significant, drop your factor from the model, you will then have a simple linear regression.
If your interaction covariate*factor is significant, you might want to test which level(s) of your factor ha(s)ve different slopes.
In the CO2 example above, both the covariate and factor are significant, but the interaction is non-significant. If your replace Treatment with Type, however, you will see an example of a significant interaction.
If you want to compare means across factors, you can use adjusted means, which uses the equations given by the ANCOVA to estimate the means of each level of the factor, corrected for the effect of the covariate:
install.packages("effects") library(effects) adj.means <- effect('Treatment', ancova.example) plot(adj.means) adj.means <- effect('conc*Treatment', ancova.example) plot(adj.means)
Run an ANCOVA to test the effect of Diet, Mass, and their interaction on MaxAbund.
Multiple regression tests the effects of several continuous explanatory variables on a response variable based on the following mathematical model:
β0 is the intercept of the regression line,
β1 is effect of the variable x1 (i.e. the slope of the regression line of variable x1),
β2 is effect of the variable x2 (i.e. the slope of the regression line of variable x2),
εi are the residuals of the model (i.e. the unexplained variation).
To be valid, each of the explanatory variables used in the multiple regression model must be:
shapiro.test). Usual transformations must be performed in the case of non-normality.
The model residuals assumptions are the same as those of the simple linear regression, that is:
If you see any pattern between two explanatory variables, they are collinear. Collinearity must be avoided as the effect of each explanatory variable will be confounded! Possible solutions are:
The Dickcissel dataset explores environmental variables that drive the abundance and presence/ absence of a grassland bird with peak abundances in Kansas, USA. It contains 15 variables:
|abund|| The number of individuals|
observed at each route
|Present|| Presence/ absence of the|
|Boolean (“Present”/ “Absent”)|
| broadleaf, conif, crop, grass,|
shrub, urban, wetland
| Land use variables within 20 km radius|
of the center route
|NDVI||Vegetation index (a measure of productivity)||Interger|
|clDD, clFD, clTma, clTmi, clP|| Climate date (DD = degree days,|
FD = frost days, Tma = max temperature,
Tmi = min temperature,
P = precipitation)
In R, multiple regression are implemented using the
lm function and its results are viewed using the
summary function. Using, for example, the Dickcissel data, we can test the effects of climate, productivity and land cover on the abundance of the Dickcissel species abundance by applying the model:
Is a transformation needed for the response variable abund?
As you likely noticed in Challenge 5, the abund variable could not be normalized, suggesting that we might need to relax the assumptions of a normally distributed response variable and move on to Generalized Linear Models, but that will wait until later!
For now, let's simply use the untransformed abund and compare the relative importance of the three variables (climate, productivity, and land cover) on abund
The R output enables one to visualize the significant explanatory variables:
lm(formula = abund ~ clTma + NDVI + grass, data = Dickcissel) Residuals: Min 1Q Median 3Q Max -35.327 -11.029 -4.337 2.150 180.725 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -83.60813 11.57745 -7.222 1.46e-12 *** clTma 3.27299 0.40677 8.046 4.14e-15 *** NDVI 0.13716 0.05486 2.500 0.0127 * grass 10.41435 4.68962 2.221 0.0267 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 22.58 on 642 degrees of freedom Multiple R-squared: 0.117, Adjusted R-squared: 0.1128 F-statistic: 28.35 on 3 and 642 DF, p-value: < 2.2e-16
In this case, the three explanatory variables significantly influence the abundance of the Dickcissel species, the most significant one being the climate (p-value=4.14e-15). Altogether these variables explain 11.28% of the Dickcissel abundance variability (Adjusted R-squared= 0.1128). The overall model is also significant and explains the Dickcissel abundance variability better than a null model (p-value: < 2.2e-16).
A plot of the response variable as a function of each explanatory variable can be used to represent graphically the model results:
Click to display ⇲
Click to hide ⇱
Response variables are not always linearly related to explanatory variables. In this case, linear regression that fits a straight line through the two variables is unable to correctly represent the data. Instead, polynomial regression that fits a polynomial curves between the response variable and the explanatory variables can be used to represent non-linear relationship based on the mathematical model:
for a polynomial of Degree 3
for a polynomial of Degree 2
β0 is the intercept of the regression line,
β1 is effect of the variable x,
β2 is effect of the variable x2,
β3 is effect of the variable x3,
εi are the residuals of the model (i.e. the unexplained variation).
The Degree is the largest exponent of that variable. When you know a Degree, you can also give the polynomial a name:
In R, these models are implemented with the
lm function and can be compared to a linear regression with the
The model comparison shows that the quadratic regression (i.e. the polynomial of degree 2) is the best model. The cubic term can thus be dropped from the final model:
Analysis of Variance Table Model 1: abund ~ clDD Model 2: abund ~ clDD + I(clDD^2) Model 3: abund ~ clDD + I(clDD^2) + I(clDD^3) Model Res.Df RSS Df Sum of Sq F Pr(>F) 1 644 365039 2 643 355871 1 9168.3 16.5457 5.34e-05 *** 3 642 355743 1 127.7 0.2304 0.6314
The R output for the final model is:
Call: lm(formula = abund ~ clDD + I(clDD^2), data = Dickcissel) Residuals: Min 1Q Median 3Q Max -14.057 -12.253 -8.674 1.495 190.129 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -1.968e+01 5.954e+00 -3.306 0.001 ** clDD 1.297e-02 2.788e-03 4.651 4.00e-06 *** I(clDD^2) -1.246e-06 3.061e-07 -4.070 5.28e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 23.53 on 643 degrees of freedom Multiple R-squared: 0.04018, Adjusted R-squared: 0.0372 F-statistic: 13.46 on 2 and 643 DF, p-value: 1.876e-06
In this example, the linear term influenced the response variable more than the quadratic term, as their p-values are respectively 4.00e-06 and 5.28e-05. They explained together 3.72% of the abundance variability (Adjusted R-squared).
To obtain a final multiple regression model, users can first implement a full regression model containing all the explanatory variables and then drop the non-significant variable using a stepwise selection procedure. In this method, non-significant variables are successively dropped one by one from the model and the goodness-of-fit of each successive model are compared based on AIC (Akaike’s Information Criterion), until all the explanatory variables of the model are significant. Note that a lower AIC value indicates a better goodness of fit (i.e., the best model is the one with the lowest AIC). In R, stepwise selection is implemented using the function
According to the stepwise selection, only 6 significant explanatory variables among the 13 tested are retained in the final model:
Call: lm(formula = abund ~ clDD + clFD + clTmi + clTma + clP + grass, data = Dickcissel) Residuals: Min 1Q Median 3Q Max -30.913 -9.640 -3.070 4.217 172.133 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -4.457e+02 3.464e+01 -12.868 < 2e-16 *** clDD 5.534e-02 8.795e-03 6.292 5.83e-10 *** clFD 1.090e+00 1.690e-01 6.452 2.19e-10 *** clTmi -6.717e+00 7.192e-01 -9.339 < 2e-16 *** clTma 3.172e+00 1.288e+00 2.463 0.014030 * clP 1.562e-01 4.707e-02 3.318 0.000959 *** grass 1.066e+01 4.280e+00 2.490 0.013027 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 19.85 on 639 degrees of freedom Multiple R-squared: 0.3207, Adjusted R-squared: 0.3144 F-statistic: 50.29 on 6 and 639 DF, p-value: < 2.2e-16
The model now accounts for 31.44% of the Dickcissel abundance variability, the clTmi being the most significant explanatory variable.
Nevertheless, some of the selected explanatory variables are highly correlated and should be dropped from the final model in order to remove uninformative variables.
Collinearity between explanatory variables can be assessed based on the variance inflation factor using the
vif function of package ‘HH’:
which gives the following output:
clFD clTmi clTma clP grass 13.605855 9.566169 4.811837 3.196599 1.165775
As variance inflation factor higher than 5 represents collinear variables, the R output shows that clDD, clFD and clTmi are highly collinear. Only one of these explanatory variables can thus be retained in the final regression model.
Click to display ⇲
Click to hide ⇱
To assess the relative contribution of two (or more) explanatory datasets to a response variable, the
varpart function of vegan package can be used. This function partitions the explained variance of a response variable to compare the contribution of various sets of explanatory variables. For example, the contribution of land cover variables on the first hand, and of climate variables on the other hand on Dickcissel abundance can be implemented using:
The R output of this function enables one to visualize the variance partitioning results:
Partition of variation in RDA Call: varpart(Y = Dickcissel$abund, X = Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")], Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")]) Explanatory tables: X1: Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")] X2: Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")] No. of explanatory tables: 2 Total variation (SS): 370770 Variance: 574.84 No. of observations: 646 Partition table: Df R.squared Adj.R.squared Testable [a+b] = X1 5 0.31414 0.30878 TRUE [b+c] = X2 6 0.03654 0.02749 TRUE [a+b+c] = X1+X2 11 0.32378 0.31205 TRUE Individual fractions [a] = X1|X2 5 0.28456 TRUE [b] 0 0.02423 FALSE [c] = X2|X1 6 0.00327 TRUE [d] = Residuals 0.68795 FALSE ---
rda to test significance of fractions of interest
This R output shows that the two explanatory datasets explain 31.205% ([a+b+c] = X1+X2) of the Dickcissel abundance variability, while the climate dataset contributes to 28.46% of the Dickcissel abundance variability ([a] = X1|X2) and the land cover dataset only contributes to 0.33% of the Dickcissel abundance variability ([c] = X2|X1). The interaction of these two datasets also explained 2.42% ([b]) of the variability.
The significance of each fraction can be tested using partial RDA and permutational ANOVA using the functions
# Climate set out.1 = rda(Dickcissel$abund, Dickcissel[,c("clDD", "clFD", "clTmi", "clTma", "clP")], Dickcissel[,c("broadleaf", "conif", "grass", "crop", "urban", "wetland")]) anova(out.1, step=1000, perm.max=1000) # Land cover set out.2 = rda(Dickcissel$abund, Dickcissel[,c("broadleaf", "conif", "grass", "crop", "urban", "wetland")], Dickcissel[,c("clDD", "clFD", "clTmi", "clTma", "clP")]) anova(out.2, step=1000, perm.max=1000)
with the following R ouputs:
Climate set Permutation test for rda under reduced model Model: rda(X = Dickcissel$abund, Y = Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")], Z = Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")]) Df Var F N.Perm Pr(>F) Model 5 165.12 53.862 999 0.001 *** Residual 634 388.72 Land cover set Permutation test for rda under reduced model Model: rda(X = Dickcissel$abund, Y=Dickcissel[, c("broadleaf", "conif", "grass", "crop", "urban", "wetland")], Z = Dickcissel[, c("clDD", "clFD", "clTmi", "clTma", "clP")]) Df Var F N.Perm Pr(>F) Model 6 5.54 1.5063 999 0.152 Residual 634 388.72
In this case, the fraction of variance explained by the climate set is significant (p-value=0.001) while the fraction explained by the land cover set is not (p-value=0.152).
The results of variance partitioning are generally graphically represented using Venn diagrams on which each explanatory datasets is represented by a circle inside which their corresponding fraction of explained variance is annotated:
Amazing! You are now ready to perform your own regression, ANOVA and ANCOVA! But never forget to correctly specify your model and verify its statistical assumptions before interpreting its results according to the ecological background of your data.
Some exciting books about linear regression and ANOVA: