QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.

Workshop 6: Linear mixed effects models

Developed by: Catherine Baltazar, Dalal Hanna, Jacob Ziegler

Summary: Mixed effects models allow ecologists to overcome a number of limitations associated with traditional linear models. In this workshop, you will learn when it is important to use a mixed effects model to analyze your data. We will walk you through the steps to conduct a linear mixed model analysis, check its assumptions, report results, and visually represent your model in R.

1. What is a linear mixed effects model (LMM) and why should I care?
2. How do I implement LMMs in R?
1. A priori model building and data exploration
2. Coding potential models and model selection
3. Model validation
4. Interpreting results and visualizing the model

Biological and ecological data are often messy. Usually, there is an inherent structure to data (i.e. single observations are not always independent), relationships between variables of interest might differ depending on grouping factors like species, and more often than not sample sizes are low making it difficult to fit models that require many parameters to be estimated. Linear mixed effects models (LMM) were developed to deal with these issues. They can be applied to a great number of ecological questions and take many different forms. In this workshop we will use a simple question-based approach to learn the basics of how LMM operate and how to fit them. We will then conclude with a thought experiment of applying LMM to a more diverse set of questions.

LMMs allow you to use all the data you have instead of using means of non-independent samples; they account for structure in your data (for example, quadrats nested in sites nested in forests); they allow relationships to vary by different grouping factors (sometimes referred to as random effects); and they require fewer parameter estimates than classical regression, which saves you degrees of freedom. But how do they do all of this? These questions will hopefully become clear during this section. First, let's start by getting to know the dataset.

1.1 Introduction to the dataset

# Remove prior commands in R
rm(list=ls())

# Place all workshop material in one folder on your computer
# Run the following line of code and use the browsing window to choose the QCBS_W6_Data.csv
# file in the folder that contains the workshop material
file.choose()

# Set the working directory to the folder which contains the lab material by copy and pasting
# all but the R file name from the output of the file.choose() command into the set working
# directory command.

# For example, paste "/Users/ziegljac/Documents/QCBS_R/" -> include the quotation marks
# NOT "/Users/ziegljac/Documents/QCBS_R/Get_Data_Func.R" -> i.e. the folder path, not the file name
setwd()

# Load useful libraries and data
# If you have never loaded these libraries before, you will have to use
# the install.packages() function before the library() function
library(ggplot2)
library(lme4)
library(arm)
library(AICcmodavg)

data <- read.csv('QCBS_W6_Data.csv')

The dataset we will be using deals with fish trophic positions. Individuals were sampled over a range of body lengths that come from three different fish species caught in six different lakes. Here is a visual representation to help wrap your head around all of this! Note: only three sizes of fish are shown within each species but in reality there are 10 individuals per species.

A simple question you might answer with this dataset is: does fish trophic position increase with fish size for all three fish species? This will be our motivating question for this workshop.

CHALLENGE 1

For your first challenge, reproduce plots 1-3 from the QCBS_W5_LMM.R code. Observe the plots and try to get a sense of what is occurring. Two key questions to ask yourself are:

1. Do I expect all species to increase in trophic position with length? In the exact same way?
2. Do I expect these relationships to be the same across lakes? What might differ?

1.2 How do I analyze this dataset?

Run many separate analyses

One way to analyze this data is to fit linear regressions for each species in each lake. Here is a plot of species 1 in lake 1:

Notice you would have to estimate a slope and an intercept parameter for each regression (2 parameters x 3 species x 6 lakes = 36 parameter estimates), and the sample size for each analysis would be 10. There is a decreased chance of finding an effect due to the low sample size, and an increased familywise error rate due to multiple comparisons.

Run one lumped analysis

Another way to analyze this data is to fit a single linear regression ignoring species and lake. Again here is the plot of all the data:

Notice you now have a huge sample size and far fewer parameters to estimate. But what about pseudoreplication? Things within the same lake and species will likely be correlated. Also, look at all that noise in the data, surely some of it is due to differences in lakes and species.

In this case we simply want to know if trophic position increases with length and we don't really care that this relationship might differ slightly among species due to biological processes that we didn't measure or among lakes due to unmeasured environmental variables. This is variation we simply want to control for (sometimes referred to as random factors).

Fit a LMM

LMMs are a balance between separating and lumping. They:

1. Provide slope and intercept parameters for each species and lake (separating) but estimate fewer parameters than classical regression.
2. Use all the data available (lumping) while accounting for pseudoreplication and controlling for differences among lakes and species.

1.3 Fixed and Random Effects

There is a debate in the literature about the definition of fixed and random effects. There are many possible definitions; we chose to present those we think are easiest to apply when doing your analyses.

Fixed effect

When a variable has a fixed effect, data are usually gathered from all of its possible levels. The person doing the analyses is also interested in making conclusions about the levels for which the data were gathered.

Example of a fixed effect: comparing mercury concentration in fish from three different habitats. Habitat is a fixed effect because we sampled in all three habitats and we are interested in making conclusions about these specific habitats.

Random effect

Variables with a random effect are also called random factors, as they are always categorical variables (not continuous variables). A random effect is used when the data include only a random sample of the factor's many possible levels, all of which are of interest. Random effects are usually grouping factors whose effect on the response you want to control for in your model, without being interested in the specific effect of each level.

Example of a random effect: studying mercury contamination in fish in Ugandan crater lakes. For logistical reasons, you can't sample all the crater lakes, so you sample only 8 of them. However, fish from a given lake might have some sort of correlation between themselves (auto-correlation) since they experience the same environmental conditions. Even though you're not interested in the effect of each lake specifically, you should account for this potential correlation with a random factor (crater lake) in order to make conclusions about crater lakes in general.

1.4 How do LMMs work?

A. Intercepts and/or slopes are allowed to vary by lake and species

In LMMs, allowing intercepts and/or slopes to vary by certain factors (random effects) simply means that you assume they come from a normal distribution. A mean and standard deviation of that distribution are estimated based on your data. The most likely intercepts and slopes from that distribution are then fit by optimization (e.g. maximum likelihood or restricted maximum likelihood).

Intercepts

In the case of species, only a mean and standard deviation of the distribution of species intercepts are estimated, instead of three separate species intercepts. The mean of this distribution is the 'species level model'. In this example, we only have three species. In general, the more levels you have for a given factor, the more accurately the parameters of the distribution can be estimated (three may be a little low for estimating a mean and standard deviation, but it makes for simpler graphs!). When you implement LMMs in R, note that the intercept in the summary is the species-level intercept (i.e. the mean of all random intercepts).

Likewise for lake, only the mean and standard deviation of the lake intercepts are estimated, saving you the need to estimate six different lake intercept parameters. This saves degrees of freedom, as fewer parameter estimates are needed given the amount of data.
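This idea can be sketched in a few lines of base R. The numbers below are invented for illustration (they are not estimates from the workshop data): the six lake intercepts are treated as draws from a single normal distribution, so the model only needs the two parameters that describe that distribution.

```r
# Toy sketch: lake intercepts as draws from one normal distribution.
# Instead of 6 separate intercept parameters, an LMM estimates only the
# mean and standard deviation of this distribution (2 parameters).
set.seed(123)
lake.mean <- 3.5   # hypothetical group-level intercept
lake.sd   <- 0.4   # hypothetical among-lake standard deviation
lake.intercepts <- rnorm(6, mean = lake.mean, sd = lake.sd)

# The two distribution parameters summarize all six intercepts:
mean(lake.intercepts)
sd(lake.intercepts)
```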

Slopes

The same concept is used for allowing slopes to vary by a given factor (random effect). This is a little harder to visualize than the intercepts. In the case of species, only a mean and standard deviation of the slope parameters are estimated, instead of three separate slopes. Again, when you implement LMMs in R, the slope in the summary is the species-level slope.

B. Intercepts, slopes, and associated confidence intervals are adjusted to account for the structure of data

If a certain species or lake is data-poor, as in the example below (we have a balanced design, so this is not the case in our scenario), the model will rely more heavily on the group-level model for the intercept and slope of that species or lake.

Confidence intervals of intercepts and slopes are adjusted to account for pseudoreplication based on the intraclass correlation coefficient (ICC): how much variation is there among groups versus within groups? The ICC determines your effective sample size, an adjusted sample size based on how correlated the data within groups are.
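The relationship between the ICC and the effective sample size can be sketched in a few lines of base R. The variance components below are hypothetical, not taken from the workshop model:

```r
# Hypothetical variance components from a fitted random-intercept model:
var.among  <- 0.8   # variance among lakes
var.within <- 0.2   # residual variance within lakes

# ICC: proportion of total variance attributable to differences among lakes
ICC <- var.among / (var.among + var.within)   # 0.8

# With k lakes of m fish each, the "design effect" 1 + (m - 1) * ICC
# shrinks the effective sample size below the raw n = k * m:
k <- 6; m <- 10
n.effective <- (k * m) / (1 + (m - 1) * ICC)  # about 7.3, far below 60
```

A high ICC therefore means a small effective sample size and wider confidence intervals; a low ICC means the opposite.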

High ICC

In this scenario the LMM will treat points within lake more like an overall mean because they are highly correlated. Therefore, the effective sample size will be smaller leading to larger confidence intervals around your slope and intercept parameters.

Low ICC

In this scenario the LMM will treat points within lake more independently because things are less correlated within groups compared to among groups. Therefore the effective sample size will be larger leading to smaller confidence intervals around your slope and intercept parameters.

CHALLENGE 2

For your second challenge answer these two questions with your neighbours. How will the ICC and confidence intervals be affected in these two scenarios:

1. Fish trophic positions are not variable among lakes?
2. Fish trophic positions are similar within lakes but variable among lakes?

The mixed model protocol in R:

2.1: A priori model building and data exploration
2.2: Coding potential models and model selection
2.3: Model validation
2.4: Interpreting results and visualizing the model

2.1 A priori model building and data exploration

We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes. So we know we want a model that looks something like this:

TP_ijk ~ Length_i + Lake_j + Species_k + ε_ijk

where

TP_ijk is the trophic position of individual (i) from lake (j) of species (k),
and
ε_ijk are the residuals of the model (i.e. the unexplained variation).

Data Exploration

Always make sure you've done “Housekeeping” on your data before you start building models! Is the data in the right structure? Use the following code to view the structure and variable types within your data frame.

| Data exploration
################Section 2#########################
# Running a mixed model in R
# Four Step Process to build a mixed model in R

# 1) A priori model building and data exploration####
# i)  Map out the model based on a priori knowledge
# We know that we want to build a model that evaluates the relationship
# between trophic position and length while accounting for lake and species variation
# Trophic Position ~ Length + Species + Lake

# ii)Housekeeping and data exploration
# Ensure that the structure of your data is correct
str(data)

Output

Look at the distribution of samples across factors:

| Data exploration
# Look at sample distribution across factors to check
# if there are any major unequal distributions
table(data$Lake)
table(data$Fish_Species)

Output

This data set is perfectly balanced, but mixed models can also handle unbalanced designs (as we so often have to in ecology!).

Look at the distribution of continuous variables:

| Data exploration
# Look at distribution of continuous variables
# Transform if necessary (will avoid future problems with homogeneity of model residuals)
hist(data$Fish_Length)
hist(data$Trophic_Pos)

Output

Major skews can lead to problems with model homogeneity down the road. So if necessary, make transformations. In this case, the data seems OK.

Check for collinearity between variables: it is important not to include two collinear variables in the same model; otherwise their effects on the response variable will be confounded, i.e. the model can't tell which of the collinear variables is responsible for the variation in the response variable. By default, the model would give a lot of explanatory power to the first variable in the model and little power to the following variables.

In this example, we have only one explanatory variable, so there is no risk of collinearity. If you did have another explanatory variable (Var2) and wanted to check for collinearity, you could use:

| Data exploration
# Check for collinearity between variables
plot(data)
cor(data$Fish_Length, data$Var2)

CHALLENGE 3
What are some additional measures we could have collected in the field that might have been collinear with length?
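Whichever measure you proposed, the check itself is mechanical. Here is a self-contained illustration with simulated values (the variables are hypothetical, not part of the workshop dataset):

```r
# Simulate a length-like variable and a mass-like variable that scales with it
set.seed(2)
length.sim <- rnorm(60, mean = 50, sd = 10)    # hypothetical fish lengths
mass.sim   <- 0.02 * length.sim^2 + rnorm(60)  # hypothetical body masses

# A correlation near 1 (or -1) signals collinearity:
# both variables should not go in the same model
cor(length.sim, mass.sim)
```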

Consider the scale of your data:
If two variables in the same model are on very different scales, the optimization criteria that mixed models use to come up with parameter estimates are likely to return 'convergence errors'. Z-correction standardizes your variables and avoids this scaling problem by placing all of them on the same scale, even if they were originally in different units:

Because the variables in our data are on very different scales (length is on a much larger scale than trophic position), we z-correct them.

| Data exploration
# Consider the scales of your variables
# Note: when 2 variables have very different ranges of scale the criteria mixed models use to come up
# with parameter estimates are likely to return 'convergence errors'
# Z correcting adjusts for this scaling problem:
# What is a z correction?: (z = (x - mean(x))/sd(x))
# Z-correct Length
data$Z_Length<-(data$Fish_Length-mean(data$Fish_Length))/sd(data$Fish_Length)
# Z-correct Trophic Position
data$Z_TP<-(data$Trophic_Pos-mean(data$Trophic_Pos))/sd(data$Trophic_Pos)
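As an aside, base R's scale() function performs the same z-correction as the manual formula above; here is a quick sanity check on toy numbers (not the workshop data):

```r
# Manual z-correction vs. the built-in scale() function
x <- c(2, 4, 6, 8)
z.manual <- (x - mean(x)) / sd(x)
z.scale  <- as.numeric(scale(x))    # scale() centers, then divides by sd
all.equal(z.manual, z.scale)        # TRUE
```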

To know if a mixed model is necessary for your data set, you must establish whether it is actually important to account for variation in the factors that might be affecting the relationship you're interested in, in this case Lake and Species.
We can do this by:

1. Creating a linear model without factors
2. Calculating the residuals of this linear model
3. Plotting the residuals of the linear model against the factors of interest
| Data exploration
# Find out if it is important to account for variation in "random effects"
# by comparing the residuals of a linear model without the random effects with
# the potential random effects
lm.test<-lm(Z_TP~Z_Length, data=data)
lm.test.resid<-rstandard(lm.test)
# Species Effect
plot(lm.test.resid ~ data$Fish_Species, xlab = "Species", ylab = "Standardized residuals")
abline(0, 0, lty = 2)
# Lake Effect
plot(lm.test.resid ~ data$Lake, xlab = "Lake", ylab = "Standardized residuals")
abline(0, 0, lty = 2)

Output

For this model, we should keep the random effects because the standardized residuals show variation across lake and species.

2.2 Coding potential models and model selection

Our a priori model

TP_ijk ~ Length_i + Lake_j + Species_k + ε_ijk

coded in R looks like this:

| Coding LMM's
lmer(Z_TP ~ Z_Length + (1|Fish_Species) + (1|Lake), data = data, REML = TRUE)

The lmer structure is not intuitive at first. The basic parts of the call are the response and fixed effect (Z_TP ~ Z_Length), the random effects in parentheses ((1|Fish_Species) and (1|Lake), read as "intercept varying by species" and "intercept varying by lake"), the data argument, and the estimation method (REML).

REML (Restricted Maximum Likelihood) is the default estimation method in the lmer function. REML estimates can be used when comparing models with the same fixed effects (i.e. models that differ only in their random effects). However, if you are comparing models whose fixed effects differ, maximum likelihood should be used to fit the parameters, as this method is not dependent on the coefficients of the fixed effects. Fitting by maximum likelihood is done by setting REML=FALSE in the lmer call.

What about coding varying slopes as well as intercepts?

CHALLENGE 4
Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species.

| Coding LMM's
lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data = data, REML=TRUE)

To determine if you've built the best linear mixed model based on your a priori knowledge, you must compare this a priori model to possible alternative models. With the dataset we are working with, there are many potential models that might best fit the data.

CHALLENGE 5
Make a list of 7 alternative models to the following model that could be built and compared. Note: if we had different fixed effects among models, we would have to indicate REML=FALSE in order to compare them with likelihood methods like AIC (see the explanation above). However, you should report parameter estimates from the 'best' model refit with REML=TRUE.

| Coding LMM's
lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data = data, REML=TRUE)

Now that we have a list of potential models, we can compare them to get a better idea of what is going on in our dataset and to select the one(s) with the highest predictive power given the data. Models can be compared using the “AICc” function from the “AICcmodavg” package. The Akaike Information Criterion (AIC) is a measure of model quality that can be used to compare models. AICc corrects for bias created by small sample size when estimating AIC.
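To see what the small-sample correction does, here is AICc computed by hand for a simple base-R linear model (the toy data below are simulated, not the workshop data):

```r
# AICc adds a small-sample penalty to AIC: AICc = AIC + 2k(k+1)/(n - k - 1)
set.seed(1)
x <- rnorm(30)
y <- 2 * x + rnorm(30)
m <- lm(y ~ x)

k <- attr(logLik(m), "df")   # parameters estimated: intercept, slope, residual sd
n <- nobs(m)
aic  <- AIC(m)
aicc <- aic + (2 * k * (k + 1)) / (n - k - 1)
aicc - aic                   # the correction shrinks toward 0 as n grows
```

With k = 3 and n = 30 the correction is 24/26, i.e. just under 1 AIC unit; AICcmodavg's AICc() applies the same formula for you.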

| Comparing LMM's
# 2) Coding potential models and model selection####
# i) Coding all potential models
# List of all Potential models-->
# Note: you can choose to not code ones that do not make biological sense.
# Linear model with no random effects
M0<-lm(Z_TP~Z_Length,data=data)
# Full model with varying intercepts
M1<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data=data, REML=FALSE)
# Full model with varying intercepts and slopes
M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE)
# No Lake, varying intercepts only
M3<-lmer(Z_TP~Z_Length + (1|Fish_Species), data=data, REML=FALSE)
# No Species, varying intercepts only
M4<-lmer(Z_TP~Z_Length + (1|Lake), data=data, REML=FALSE)
# No Lake, varying intercepts and slopes
M5<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species), data=data, REML=FALSE)
# No Species, varying intercepts and slopes
M6<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake), data=data, REML=FALSE)
# Full model with varying intercepts and slopes only varying by lake
M7<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE)
# Full model with varying intercepts and slopes only varying by species
M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=FALSE)

# ii) Compare models using AICc values
# Compute AICc values for each model
AICc<-c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4), AICc(M5), AICc(M6), AICc(M7), AICc(M8))
# Put values into one table for easy comparison
Model<-c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8")
AICtable<-data.frame(Model=Model, AICc=AICc)
AICtable
# M8 has the lowest AICc value so it is the best fit model
# M2 is also a good fit, but all other models are not nearly as good.
# Note when you compare models they must be fit by
# Maximum Likelihood (ML) and not by Restricted Maximum Likelihood (REML)

Output

The model with the lowest AICc value has the most predictive power given the data. A common rule of thumb is that models within 2 AICc units of each other are equally plausible. In this case, we can take a closer look at M8 and M2, but all the other models have AICc values so much higher that they explain far less variation than M8 and should not be considered plausible explanations of the patterns in the data.
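The "within 2 AICc units" rule can be applied mechanically by computing delta-AICc values. The numbers below are hypothetical, for illustration only; use the values from your own AICtable:

```r
# Hypothetical AICc values for four of the candidate models (illustration only)
AICc.vals <- c(M0 = 150.2, M1 = 110.1, M2 = 94.0, M8 = 93.0)

# delta-AICc: difference from the best (lowest-AICc) model
delta <- AICc.vals - min(AICc.vals)

# models within 2 units of the best model are considered equally plausible
plausible <- names(delta)[delta <= 2]
plausible   # "M2" "M8"
```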

CHALLENGE 6
Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that its AICc value was second best?

| Coding LMM's
M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE)

2.3 Model validation

To look at independence, plot the model fitted values vs. the residual values:

| Model validation
# 3) Checking model assumptions####
# Checking for M8
# A. Look at independence: plot fitted values vs residuals
E1 <- resid(M8)
F1<-fitted(M8)
plot(x = F1,
y = E1,
xlab = "Fitted Values",
ylab = "Normalized residuals")
abline(h = 0, lty = 2)

Output

The even spread of the residuals suggests that the model is a good fit for the data.

To check the assumption of homogeneity, plot the residuals vs. each covariate in the model:

| Model validation
# B. Look at independence:
# i. plot residuals vs each covariate in the model
# Fish_Length
plot(x = data$Z_Length, y = E1, xlab = "Z Length", ylab = "Normalized residuals")
abline(h = 0, lty = 2)
# Note: observed groupings are created by the nature of the data because in the data set
# we only measured individuals from 5 categories of lengths (big, small, and three groups in between)
# Species
boxplot(E1 ~ Fish_Species, ylab = "Normalized residuals", data = data, xlab = "Species")
abline(h = 0, lty = 2)
# Lake
boxplot(E1 ~ Lake, ylab = "Normalized residuals", data = data, xlab = "Lake")
abline(h = 0, lty = 2)

Output

The equal spread above and below zero indicates that there are no homogeneity problems with these variables.

Ideally, you would also do the above analysis with every covariate not in your model. If you observe patterns in these plots, you will know that there is variation in your dataset that could be accounted for by these covariates, and you should reconsider their inclusion in the model. However, because we measured no covariates that were left out of the model, this point is not applicable to this analysis.

It is also important to check the distribution of the residual error. Normally distributed residuals indicate that your model predictions are not biased high or low:

| Model validation
# D. Look at normality: histogram
hist(E1)

Output

2.4 Interpreting results and visualizing the model

You can view the model summary using:

| Interpreting results
# Look at model summary
# This allows you to get an idea of the variance explained by the different components
# of the model and the "significance" of fixed effects
summary(M8)

Output

The output is broken up into descriptions of the Random effects (things we allow to vary based on the normal distribution) and Fixed effects (things we estimate in the same way as classical regression).

To determine if the slope and, therefore, the effect of length on trophic position is significantly different from zero, you first have to calculate the confidence interval (CI) for the slope parameter (the estimate for Z_Length in the fixed effects section, 0.4223). The CI is the parameter estimate plus or minus 1.96 times the standard error of the estimate. If the CI overlaps zero, then the slope is not significantly different from zero at the 0.05 level.

CHALLENGE 7
a) What is the slope and confidence interval of the Z_Length variable in the M8 model?
b) Is the Z_Length slope significantly different from 0?

To visualize this model you must obtain the coefficients (intercept and slope) of each component of the model. The group-level coefficients (in this case just one intercept and one slope) can be found in the fixed effects section of the model summary. Coefficients for each of the model levels (in this case, Lake and Species) that were fit from the normal distribution can be obtained using the coef() function.

Two ways to visualize this data are:

1. To show the group level model
2. To show species and lake level models

1. To show the group level model:

Obtain the parameters of interest and plot the data with the model overlaid.

| Visualizing model
# Visualizing model results####
# There are several ways of visualizing the results of a mixed model, all of which
# involve using the coefficients generated by the model.
# So the first step is to obtain the model coefficients to be able to add them to the figures
coef(M8)
# Now put the coefs into dataframes to make them easier to manipulate
Lake.coef <- as.data.frame(coef(M8)$Lake)
colnames(Lake.coef) <- c("Intercept","Slope")
Species.coef <- as.data.frame(coef(M8)$Fish_Species)
colnames(Species.coef) <- c("Intercept","Slope")

# Plot 1 - All Data
# Make a plot that includes all the data
# Note: 'fig' is a ggplot2 theme object added to every plot below; it is not defined
# earlier in this script, so define one yourself first, for example:
fig <- theme_bw() + theme(panel.grid.minor = element_blank(),
                          panel.grid.major = element_blank())
plot <- ggplot(aes(Z_Length, Z_TP), data = data)
Plot_AllData <- plot + geom_point() + xlab("Length (mm)") + ylab("Trophic Position") +
  labs(title = "All Data") + fig
# Add a layer with an abline using the intercept and slope of the relationship between
# length and trophic position.
# Note that you can obtain the intercept and slope of the fixed factor directly from the model summary
summary(M8)
Plot_AllData + geom_abline(intercept = -0.0009059, slope = 0.4222697)

Output

2. To show species and lake level models:

Obtain the parameters of interest

and plot the data with the model overlaid:

| Visualizing model
# Plot 2 - By Species
# Plot the data color coded by Species
Plot_BySpecies<-plot + geom_point(aes(colour = factor(Fish_Species)), size = 4) + xlab("Length (mm)") +
ylab("Trophic Position") + labs(title="By Species") + fig
# Add the regression line with the intercepts and slopes specific to each species
Plot_BySpecies + geom_abline(intercept = Species.coef[1,1], slope =Species.coef[1,2], colour="coral2") +
geom_abline(intercept = Species.coef[2,1], slope =Species.coef[2,2], colour = "green4") +
geom_abline(intercept = Species.coef[3,1], slope =Species.coef[3,2], colour="blue1")

# Plot 3 - By Lake
# Plot the data color coded by lake
Plot_ByLake<-plot + geom_point(aes(colour = factor(Lake)), size = 4) + xlab("Length (mm)") +
ylab("Trophic Position") + labs(title="By Lake") + fig
# Add in regression lines with the intercepts specific to each lake
Plot_ByLake + geom_abline(intercept = Lake.coef[1,1], slope =Lake.coef[1,2], colour="coral2") +
geom_abline(intercept = Lake.coef[2,1], slope =Lake.coef[2,2], colour="khaki4") +
geom_abline(intercept = Lake.coef[3,1], slope =Lake.coef[3,2], colour="green4") +
geom_abline(intercept = Lake.coef[4,1], slope =Lake.coef[4,2], colour="darkgoldenrod") +
geom_abline(intercept = Lake.coef[5,1], slope =Lake.coef[5,2], colour="royalblue1") +
geom_abline(intercept = Lake.coef[6,1], slope =Lake.coef[6,2], colour="magenta3")

Output

Mixed models are really good at accounting for variation in ecological data while not losing too many degrees of freedom.

We covered only a small amount of what LMMs can do. Provided below are a couple of other thought experiments with data structures similar to the workshop data, as well as two textbooks that fully detail the utility of LMMs.

CHALLENGE 8
Situation: You collected biodiversity estimates from 1000 quadrats that are within 10 different sites which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity.

What mixed model could you use for this dataset?

CHALLENGE 9
Situation: You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. You are mainly interested in knowing if habitat is a good predictor of mercury concentration.

What mixed model could you use for this data set?