This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.
The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.
Developed by: Catherine Baltazar, Dalal Hanna, Jacob Ziegler
Summary: Mixed effects models allow ecologists to overcome a number of limitations associated with traditional linear models. In this workshop, you will learn when it is important to use a mixed effects model to analyze your data. We will walk you through the steps to conduct a linear mixed model analysis, check its assumptions, report results, and visually represent your model in R.
Link to associated Prezi: Prezi
Download the R script and data for this lesson:
Biological and ecological data are often messy. Usually, there is an inherent structure to data (i.e. single observations are not always independent), relationships between variables of interest might differ depending on grouping factors like species, and more often than not sample sizes are low, making it difficult to fit models that require many parameters to be estimated. Linear mixed effects models (LMM) were developed to deal with these issues. They can be applied to a great number of ecological questions and take many different forms. In this workshop we will use a simple question-based approach to learn the basics of how LMMs operate and how to fit them. We will then conclude with a thought experiment of applying LMMs to a more diverse set of questions.
LMMs allow you to use all the data you have instead of using means of non-independent samples; they account for structure in your data (for example, quadrats nested in sites nested in forests); they allow relationships to vary by different grouping factors (sometimes referred to as random effects); and they require fewer parameter estimates than classical regression, which saves you degrees of freedom. But how do they do all of this? These questions will hopefully become clear during this section. First, let's start by getting to know the dataset.
# Remove prior commands in R
rm(list=ls())

# Place all workshop material in one folder on your computer
# Run the following line of code and use the browsing window to choose the QCBS_W6_Data.csv
# file in the folder that contains the workshop material
file.choose()

# Set the working directory to the folder which contains the lab material by copying and pasting
# all but the file name from the output of the file.choose() command into the setwd() command.
# For example paste "/Users/ziegljac/Documents/QCBS_R/" -> include the quotation marks
# NOT "/Users/ziegljac/Documents/QCBS_R/Get_Data_Func.R"
setwd()

# Load useful libraries and data
# If you have never loaded these libraries before you will have to use
# the install.packages() function before the library() function
library(ggplot2)
library(lme4)
library(arm)
library(AICcmodavg)

data <- read.csv('QCBS_W6_Data.csv')
The dataset we will be using deals with fish trophic positions. Individuals were sampled over a range of body lengths that come from three different fish species caught in six different lakes. Here is a visual representation to help wrap your head around all of this! Note: only three sizes of fish are shown within each species but in reality there are 10 individuals per species.
A simple question you might answer with this dataset is does fish trophic position increase with fish size for all three fish species? This will be our motivating question for this workshop.
For your first challenge, reproduce plots 1-3 from the QCBS_W5_LMM.R code. Observe the plots and try to get a sense of what is occurring. Two key questions to ask yourself are:
One way to analyze this data is to fit linear regressions for each species in each lake. Here is a plot of species 1 in lake 1:
Notice that you would have to estimate a slope and an intercept parameter for each regression (2 parameters × 3 species × 6 lakes = 36 parameter estimates), and the sample size for each analysis would be only 10. There is a decreased chance of detecting an effect due to the low sample size, and an increased familywise error rate due to the multiple comparisons.
Another way to analyze this data is to fit a single linear regression ignoring species and lake. Again here is the plot of all the data:
Notice that you now have a huge sample size and far fewer parameters to estimate. But what about pseudoreplication? Data points from the same lake and species are likely to be correlated. Also, look at all that noise in the data; surely some of it is due to differences among lakes and species.
In this case we simply want to know if trophic position increases with length and we don't really care that this relationship might differ slightly among species due to biological processes that we didn't measure or among lakes due to unmeasured environmental variables. This is variation we simply want to control for (sometimes referred to as random factors).
LMMs are a balance between separating and lumping. They:
There is a debate in the literature about the definition of fixed and random effects. Many definitions exist; we chose to present those we think are easiest to apply when doing your analyses.
When a variable has a fixed effect, data are usually gathered from all of its possible levels. The person doing the analysis is also interested in drawing conclusions about the specific levels for which the data were gathered.
Example of a fixed effect: comparing mercury concentration in fish from three different habitats. Habitat has a fixed effect because we sampled in each of the three habitats and we are interested in drawing conclusions about these specific habitats.
Variables with a random effect are also called random factors, as they are always categorical variables (not continuous variables). A variable has a random effect when the data include only a random sample of the factor's many possible levels, and it is that whole population of levels that is of interest. Random factors are usually grouping factors whose effect you want to control for in your model, without being interested in their specific effect on the response variable.
Example of a random effect: studying mercury contamination in fish in Ugandan crater lakes. For logistical reasons, you can't sample all the crater lakes, so you sample only 8 of them. However, fish from a given lake might have some sort of correlation between themselves (auto-correlation) since they experience the same environmental conditions. Even though you're not interested in the effect of each lake specifically, you should account for this potential correlation with a random factor (crater lake) in order to make conclusions about crater lakes in general.
In LMMs, allowing intercepts and/or slopes to vary by certain factors (random effects) simply means you assume they come from a normal distribution. The mean and standard deviation of that distribution are estimated from your data. The most likely intercepts and slopes from that distribution are then fit by optimization (e.g. maximum likelihood or restricted maximum likelihood).
In the case of species, only a mean and standard deviation are estimated for the distribution of species intercepts, instead of three separate species intercepts. The mean of this distribution is the 'species-level model'. In this example, we only have three species. In general, the more levels you have for a given factor, the more accurately the parameters of the distribution can be estimated (three may be a little low for estimating a mean and standard deviation, but it makes for simpler graphs!). When you implement LMMs in R, note that the intercept in the summary is the species-level intercept (i.e. the mean of all random intercepts).
Likewise, for lake, only the mean and standard deviation of lake intercepts are estimated, saving you the need to estimate six different lake intercept parameters. This saves degrees of freedom, as fewer parameter estimates are needed given the amount of data.
The same concept applies when allowing slopes to vary by a given factor (random effect), though this is a little harder to visualize than the intercepts. In the case of species, only a mean and standard deviation of slope parameters are estimated, instead of three separate slopes. Again, when you implement LMMs in R, the slope in the summary is the species-level slope.
If a certain species or lake is data-poor, as in the example below (our design is balanced, so this is not the case in our scenario), the model will rely more heavily on the group-level model for the intercept and slope of that species or lake.
Confidence intervals of intercepts and slopes are adjusted to account for pseudoreplication based on the intraclass correlation coefficient (ICC): how much variation is there among groups versus within groups? The ICC determines your effective sample size, an adjusted sample size based on how correlated the data within groups are.
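To make the ICC concrete, here is a minimal sketch. The variance values below are hypothetical, purely for illustration; with a fitted lmer() model you would take the variance components from VarCorr() instead.

```r
# The ICC is the among-group variance divided by the total variance.
# With a fitted lmer() model m, the components come from VarCorr(m), e.g.:
#   vc <- as.data.frame(VarCorr(m))
#   icc <- vc$vcov[vc$grp == "Lake"] / sum(vc$vcov)
# Illustrative arithmetic with hypothetical variances:
var_among  <- 0.8  # variance among lakes (hypothetical)
var_within <- 0.2  # residual variance within lakes (hypothetical)
icc <- var_among / (var_among + var_within)
icc  # 0.8: data are highly correlated within lakes, so the effective sample size is small
```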
In this scenario, the LMM will treat points within a lake more like an overall mean because they are highly correlated. The effective sample size will therefore be smaller, leading to larger confidence intervals around your slope and intercept parameters.
In this scenario, the LMM will treat points within a lake more independently because the data within groups are less correlated. The effective sample size will therefore be larger, leading to smaller confidence intervals around your slope and intercept parameters.
For your second challenge, answer these two questions with your neighbours. How will the ICC and confidence intervals be affected in these two scenarios:
The mixed model protocol in R:
2.1: A priori model building and data exploration
2.2: Coding potential models and model selection
2.3: Model validation
2.4: Interpreting results and visualizing the model
We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes. So we know we want a model that looks something like this:
TP_ijk ~ Length_ijk + Lake_j + Species_k + ε_ijk
where TP_ijk is the trophic position of individual (i) from lake (j) of species (k), and ε_ijk are the residuals of the model (i.e. the unexplained variation).
Always make sure you've done “Housekeeping” on your data before you start building models! Is the data in the right structure? Use the following code to view the structure and variable types within your data frame.
################ Section 2 #########################
# Running a mixed model in R
# Four-step process to build a mixed model in R

# 1) A priori model building and data exploration ####
# i) Map out the model based on a priori knowledge
# We know that we want to build a model that evaluates the relationship
# between trophic position and length while accounting for lake and species variation
# Trophic Position ~ Length + Species + Lake

# ii) Housekeeping and data exploration
# Ensure that the structure of your data is correct
str(data)
Look at the distribution of samples across factors:
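The code chunk for this step is not shown on this page; a minimal check, assuming the data frame loaded above, could be:

```r
# Cross-tabulate the number of observations per Lake x Species combination;
# a balanced design shows the same count in every cell
table(data$Lake, data$Fish_Species)
```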
This data set is perfectly balanced, but mixed models can also deal with unbalanced designs (as is so often the case in ecology!).
Look at the distribution of continuous variables:
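The corresponding code chunk is not shown on this page; a minimal version, assuming the variable names used throughout this workshop, could be:

```r
# Histograms of the continuous variables; a strong skew would suggest a transformation
hist(data$Fish_Length, xlab = "Length (mm)", main = "Fish length")
hist(data$Trophic_Pos, xlab = "Trophic position", main = "Trophic position")
```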
Major skews can lead to problems with model homogeneity down the road, so if necessary, make transformations. In this case, the data seem OK.
Check for collinearity between variables: it is important not to include 2 collinear variables in the same model, otherwise their effects on the response variable will be confounded, i.e. the model can't tell which of the collinear variables is responsible for the variation in the response variable. By default, the model would give a lot of explanatory power to the first variable in the model, and little power to the following variables.
In this example, we have only one explanatory variable, so there is no risk of collinearity. If you did have another explanatory variable (Var2) and wanted to check for collinearity, you could use:
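Var2 is hypothetical (it does not exist in the workshop dataset); this sketch shows the kind of check you would run:

```r
# Correlation between the two explanatory variables; as a rule of thumb,
# |r| greater than ~0.7 signals problematic collinearity (Var2 is hypothetical)
cor(data$Z_Length, data$Var2)
plot(data$Z_Length, data$Var2)
```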
What are some additional measures we could have gotten in the wild that might have been collinear with length?
Consider the scale of your data:
If two variables in the same model are measured on very different scales, the algorithm mixed models use to estimate parameters is likely to return 'convergence errors'. Z-correction standardizes your variables and avoids this scaling problem by placing all your variables on the same scale even if they were originally measured in different units:
Because the data we are dealing with have such different scales (length is on a much larger scale than trophic position), we z-correct them.
# Consider the scales of your variables
# Note: when 2 variables have very different scales, the algorithm mixed models use to come up
# with parameter estimates is likely to return 'convergence errors'
# Z-correction adjusts for this scaling problem
# What is a z-correction? z = (x - mean(x)) / sd(x)

# Z-correct Length
data$Z_Length <- (data$Fish_Length - mean(data$Fish_Length)) / sd(data$Fish_Length)

# Z-correct Trophic Position
data$Z_TP <- (data$Trophic_Pos - mean(data$Trophic_Pos)) / sd(data$Trophic_Pos)
To know if a mixed model is necessary for your data set, you must establish whether it is actually important to account for variation in the factors that might be affecting the relationship you're interested in: Lake and Species in this case.
We can do this by:
# Find out if it is important to account for variation in "random effects"
# by comparing the residuals of a linear model without the random effects with
# the potential random effects
lm.test <- lm(Z_TP ~ Z_Length, data = data)
lm.test.resid <- rstandard(lm.test)

# Species effect
plot(lm.test.resid ~ data$Fish_Species, xlab = "Species", ylab = "Standardized residuals")
abline(0, 0, lty = 2)

# Lake effect
plot(lm.test.resid ~ data$Lake, xlab = "Lake", ylab = "Standardized residuals")
abline(0, 0, lty = 2)
For this model, we should keep the random effects because the standardized residuals show variation across lake and species.
Our a priori model
REML (Restricted Maximum Likelihood) is the default estimation method in the “lmer” function. REML estimates can be used when comparing models with the same fixed effects (i.e. nested models). However, if you are comparing models where the fixed effects differ among models then maximum likelihood should be used to fit parameters as this method is not dependent on the coefficients of the fixed effects. Fitting using maximum likelihood is done by setting REML=FALSE in the lmer command.
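The code block for this step is missing from this page; in lmer syntax, and consistent with the models coded below, the a priori model with varying intercepts would look something like:

```r
# A priori model: trophic position as a function of length, with random intercepts
# for species and lake (REML = TRUE, the lmer default, for reporting estimates)
M1 <- lmer(Z_TP ~ Z_Length + (1|Fish_Species) + (1|Lake), data = data, REML = TRUE)
```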
Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species.
To determine if you've built the best linear mixed model based on your a priori knowledge you must compare this “a-priori” model to possible alternative models. With the dataset we are working with there are many potential models that might best fit the data.
Make a list of 7 alternative models to the following model that can be built and compared: Note - If we had different fixed effects among models we would have to indicate “REML=FALSE” to compare with likelihood methods like AIC (see explanation above). However, you should report parameter estimates from the 'best' model using “REML=TRUE”.
Now that we have a list of potential models, we can compare them to get a better idea of what is going on in our dataset and to select the one(s) with the highest predictive power given the data. Models can be compared using the “AICc” function from the “AICcmodavg” package. The Akaike Information Criterion (AIC) is a measure of model quality that can be used to compare models. AICc corrects for bias created by small sample size when estimating AIC.
# 2) Coding potential models and model selection ####
# i) Coding all potential models
# List of all potential models
# Note: you can choose not to code ones that do not make biological sense.

# Linear model with no random effects
M0 <- lm(Z_TP ~ Z_Length, data = data)
# Full model with varying intercepts
M1 <- lmer(Z_TP ~ Z_Length + (1|Fish_Species) + (1|Lake), data = data, REML = FALSE)
# Full model with varying intercepts and slopes
M2 <- lmer(Z_TP ~ Z_Length + (1 + Z_Length|Fish_Species) + (1 + Z_Length|Lake), data = data, REML = FALSE)
# No Lake, varying intercepts only
M3 <- lmer(Z_TP ~ Z_Length + (1|Fish_Species), data = data, REML = FALSE)
# No Species, varying intercepts only
M4 <- lmer(Z_TP ~ Z_Length + (1|Lake), data = data, REML = FALSE)
# No Lake, varying intercepts and slopes
M5 <- lmer(Z_TP ~ Z_Length + (1 + Z_Length|Fish_Species), data = data, REML = FALSE)
# No Species, varying intercepts and slopes
M6 <- lmer(Z_TP ~ Z_Length + (1 + Z_Length|Lake), data = data, REML = FALSE)
# Full model with varying intercepts, slopes varying only by lake
M7 <- lmer(Z_TP ~ Z_Length + (1|Fish_Species) + (1 + Z_Length|Lake), data = data, REML = FALSE)
# Full model with varying intercepts, slopes varying only by species
M8 <- lmer(Z_TP ~ Z_Length + (1 + Z_Length|Fish_Species) + (1|Lake), data = data, REML = FALSE)

# ii) Compare models using AICc values
# Compute AICc values for each model
AICc <- c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4),
          AICc(M5), AICc(M6), AICc(M7), AICc(M8))

# Put values into one table for easy comparison
Model <- c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8")
AICtable <- data.frame(Model = Model, AICc = AICc)
AICtable
# M8 has the lowest AICc value, so it is the best-fit model
# M2 is also a good fit, but all other models are not nearly as good.
# Note: when you compare models they must be fit by
# Maximum Likelihood (ML) and not by Restricted Maximum Likelihood (REML)
The model with the lowest AICc value has the most predictive power given the data. Some suggest that models within 2 AICc units of each other are equally plausible. In this case, we can take a closer look at M8 and M2, but all the other models have such high AICc values that they do not explain as much variation in the data as M8 and should not be considered plausible options to explain the patterns in the data.
Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that its AICc value was second best?
To check the model fit and homogeneity, plot model fitted values vs residual values:
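The plotting code is not shown on this page; a minimal sketch, assuming the selected model is M8, could be:

```r
# Extract normalized (Pearson) residuals and fitted values from the selected model
E1 <- resid(M8, type = "pearson")
F1 <- fitted(M8)
plot(x = F1, y = E1, xlab = "Fitted values", ylab = "Normalized residuals")
abline(h = 0, lty = 2)
```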
The even spread of the residuals suggest that the model is a good fit for the data.
To check independence and homogeneity across groups, also plot the residuals vs each covariate in the model:
# B. Look at independence:
# i. Plot residuals vs each covariate in the model
# (E1 contains the normalized residuals, e.g. E1 <- resid(M8, type = "pearson"))

# Fish_Length
plot(x = data$Z_Length, y = E1, xlab = "Z Length", ylab = "Normalized residuals")
abline(h = 0, lty = 2)
# Note: the groupings observed are created by the nature of the data, because
# we only measured individuals from 5 categories of length (big, small, and three groups in between)

# Species
boxplot(E1 ~ Fish_Species, ylab = "Normalized residuals", data = data, xlab = "Species")
abline(h = 0, lty = 2)

# Lake
boxplot(E1 ~ Lake, ylab = "Normalized residuals", data = data, xlab = "Lake")
abline(h = 0, lty = 2)
The equal spread above and below zero indicates that there are no homogeneity problems with these variables.
Ideally you would also repeat the above analysis with every covariate not included in your model. If you observe patterns in these plots, you will know that there is variation in your dataset that could be accounted for by these covariates, and you should reconsider including them in your model. However, because we did not measure any covariates that were left out of the model, this point does not apply to this analysis.
It is also important to check the distribution of residual error. Normally distributed residuals indicate that your model predictions are not biased high or low:
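A minimal sketch of this check, again assuming the selected model is M8:

```r
# Histogram and QQ plot of the normalized residuals; a roughly bell-shaped histogram
# and points falling along the QQ line suggest no systematic bias in the predictions
E1 <- resid(M8, type = "pearson")
hist(E1, xlab = "Normalized residuals", main = "")
qqnorm(E1)
qqline(E1)
```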
You can view the model summary using:
# Look at model summary
# This allows you to get an idea of the variance explained by the different components
# of the model and the "significance" of fixed effects
summary(M8)
The output is broken up into descriptions of the Random effects (things we allow to vary based on the normal distribution) and Fixed effects (things we estimate in the same way as classical regression):
To determine whether the slope, and therefore the effect of length on trophic position, is significantly different from zero, you first have to calculate the confidence interval (CI) for the slope parameter (the estimate for Z_Length in the fixed effects section, 0.4223). The CI is the parameter estimate plus or minus 1.96 times its standard error. If the CI overlaps zero, the slope is not significantly different from zero at the 0.05 level.
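As a sketch, the CI can be computed directly from the model object; se.fixef() comes from the arm package loaded at the start of the workshop (the exact numbers depend on your fitted model):

```r
# 95% confidence interval for the Z_Length slope of M8
slope <- fixef(M8)["Z_Length"]      # parameter estimate (~0.4223)
se    <- se.fixef(M8)["Z_Length"]   # its standard error (from the arm package)
c(lower = slope - 1.96 * se, upper = slope + 1.96 * se)
# If this interval does not overlap zero, the slope is significant at the 0.05 level
```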
a) What is the slope and confidence interval of the Z_Length variable in the M8 model?
b) Is the Z_Length slope significantly different from 0?
To visualize this model you must obtain the coefficients (intercept and slope) of each component of the model. The group-level coefficients (aka 'coefs'; in this case, just one intercept and one slope) can be found in the fixed effects section of the model summary. The coefs for each model level (in this case, Lake and Species) that were fit from a normal distribution can be obtained using the coef() function.
Two ways to visualize this data are:
1. To show the group level model:
Obtain the parameters of interest
and plot the data with the model overlaid
# Visualizing model results ####
# There are several ways of visualizing the results of a mixed model, all of which
# involve using the coefficients generated by the model.
# So the first step is to obtain the model coefficients to be able to add them to the figures
coef(M8)

# Now put the coefs into dataframes to make them easier to manipulate
Lake.coef <- as.data.frame(coef(M8)$Lake)
colnames(Lake.coef) <- c("Intercept", "Slope")
Species.coef <- as.data.frame(coef(M8)$Fish_Species)
colnames(Species.coef) <- c("Intercept", "Slope")

# Plot 1 - All data
# Make a plot that includes all the data
# (note: 'fig' is a ggplot theme object defined earlier in the full workshop script)
plot <- ggplot(aes(Z_Length, Z_TP), data = data)
Plot_AllData <- plot + geom_point() + xlab("Length (mm)") + ylab("Trophic Position") +
  labs(title = "All Data") + fig

# Add a layer with an abline using the intercept and slope of the relationship between
# length and trophic position.
# Note that you can obtain the intercept and slope of the fixed factor directly from the model summary
summary(M8)
Plot_AllData + geom_abline(intercept = -0.0009059, slope = 0.4222697)
2. To show species and lake level models:
Obtain the parameters of interest
and plot the data with the model overlay:
# Plot 2 - By species
# Plot the data colour-coded by species
Plot_BySpecies <- plot + geom_point(aes(colour = factor(Fish_Species)), size = 4) +
  xlab("Length (mm)") + ylab("Trophic Position") + labs(title = "By Species") + fig

# Add the regression lines with the intercepts and slopes specific to each species
Plot_BySpecies +
  geom_abline(intercept = Species.coef[1,1], slope = Species.coef[1,2], colour = "coral2") +
  geom_abline(intercept = Species.coef[2,1], slope = Species.coef[2,2], colour = "green4") +
  geom_abline(intercept = Species.coef[3,1], slope = Species.coef[3,2], colour = "blue1")

# Plot 3 - By lake
# Plot the data colour-coded by lake
Plot_ByLake <- plot + geom_point(aes(colour = factor(Lake)), size = 4) +
  xlab("Length (mm)") + ylab("Trophic Position") + labs(title = "By Lake") + fig

# Add the regression lines with the intercepts and slopes specific to each lake
Plot_ByLake +
  geom_abline(intercept = Lake.coef[1,1], slope = Lake.coef[1,2], colour = "coral2") +
  geom_abline(intercept = Lake.coef[2,1], slope = Lake.coef[2,2], colour = "khaki4") +
  geom_abline(intercept = Lake.coef[3,1], slope = Lake.coef[3,2], colour = "green4") +
  geom_abline(intercept = Lake.coef[4,1], slope = Lake.coef[4,2], colour = "darkgoldenrod") +
  geom_abline(intercept = Lake.coef[5,1], slope = Lake.coef[5,2], colour = "royalblue1") +
  geom_abline(intercept = Lake.coef[6,1], slope = Lake.coef[6,2], colour = "magenta3")
Mixed models are really good at accounting for variation in ecological data while not losing too many degrees of freedom.
We covered only a small part of what LMMs can do. Below are a couple of other thought experiments with data structures similar to the workshop data, and two textbooks that detail the utility of LMMs in full.
Situation: You collected biodiversity estimates from 1000 quadrats that are within 10 different sites which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity.
What mixed model could you use for this dataset?
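One plausible answer (not the only defensible one; variable and data frame names here are hypothetical): treat productivity as the fixed effect, and let intercepts vary by site nested within forest.

```r
# Random intercepts for sites nested within forests;
# (1|Forest/Site) expands to (1|Forest) + (1|Forest:Site)
lmer(Biodiversity ~ Productivity + (1|Forest/Site), data = forest.data)
```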
Situation: You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. You are mainly interested in knowing if habitat is a good predictor of mercury concentration.
What mixed model could you use for this data set?
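One plausible answer (again, names are hypothetical): habitat is the fixed effect of interest, length is a covariate, and site is a random factor controlling for the non-independence of fish from the same site.

```r
# Habitat and length as fixed effects, random intercept for site
lmer(Mercury ~ Habitat + Length + (1|Site), data = fish.data)
```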
Gelman, A., and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models (Cambridge University Press).
Zuur, A., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. (2009). Mixed effects models and extensions in ecology with R (Springer).