MAJOR UPDATE

As of Fall 2021, this wiki has been discontinued and is no longer being actively developed.

All updated materials and announcements for the QCBS R Workshop Series are now housed on the QCBS R Workshop website. Please update your bookmarks accordingly to avoid outdated material and/or broken links.

Thank you for your understanding,

Your QCBS R Workshop Coordinators.

QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the QCBS both for members of the QCBS and the larger community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page.

Workshop 7: Linear and generalized linear mixed models

Developed by: Catherine Baltazar, Dalal Hanna, Jacob Ziegler

Section on GLMMs developed by: Cédric Frenette Dussault, Vincent Fugère, Thomas Lamy, Zofia Taranu

Summary: Mixed effects models allow ecologists to overcome a number of limitations associated with traditional linear models. In this workshop, you will learn when it is important to use a mixed effects model to analyze your data. We will walk you through the steps to conduct a linear mixed model analysis, check its assumptions, report results, and visually represent your model in R. We will also build on the previous workshop to combine knowledge on linear mixed models and extend it to generalized linear mixed effect models.

Link to new Rmarkdown presentation

Link to old Prezi presentation

Link to old Prezi presentation covering GLMMs

Download the R script and data for this lesson:

  1. GLMM Script (this is the script for Workshop 6 but lines 434-784 cover the GLMM section of this workshop)
  2. glmm_funs (code to download for use in the GLMM section)

In addition, you can look at a more up-to-date and detailed presentation on mixed models.

  1. What is a linear mixed effects model (LMM) and why should I care?
  2. How do I implement LMMs in R?
    1. A priori model building and data exploration
    2. Coding potential models and model selection
    3. Model validation
    4. Interpreting results and visualizing the model
  3. Using LMMs when the assumptions of normality and error distribution are not met (GLMMs).

Biological and ecological data are often messy. Usually, there is an inherent structure to the data (i.e. single observations are not always independent), relationships between variables of interest might differ depending on grouping factors like species, and more often than not sample sizes are low, making it difficult to fit models that require many parameters to be estimated. Linear mixed effects models (LMMs) were developed to deal with these issues. They can be applied to a great number of ecological questions and take many different forms. In this workshop we will use a simple question-based approach to learn the basics of how LMMs operate and how to fit them. We will then conclude with a thought experiment on applying LMMs to a more diverse set of questions.

LMMs allow you to use all the data you have instead of using means of non-independent samples; they account for structure in your data (for example, quadrats nested in sites nested in forests); they allow relationships to vary by different grouping factors (sometimes referred to as random effects); and they require fewer parameter estimates than classical regression, which saves you degrees of freedom. But how do they do all of this? These questions will hopefully become clear during this section. First, let's start by getting to know the dataset.

1.1 Introduction to the dataset

| Load your data
# Remove prior commands in R
rm(list=ls()) 
 
# Place all workshop material in one folder on your computer
# Run the following line of code and use the browsing window to choose the QCBS_W6_Data.csv 
# file in the folder that contains the workshop material
file.choose()
 
# Set the working directory to the folder which contains the lab material by copy and pasting
# all but the R file name from the output of the file.choose() command into the set working 
# directory command. 
 
# For example paste "/Users/ziegljac/Documents/QCBS_R/" -> include the quotations
# NOT "/Users/ziegljac/Documents/QCBS_R/Get_Data_Func.R" -> include the quotations
setwd()
 
# Load useful libraries and data
# If you have never loaded these libraries before you will have to use 
# The install.packages() function before the library() function
library(ggplot2)
library(lme4)
library(arm)
library(AICcmodavg)
 
data <- read.csv('QCBS_W6_Data.csv')

The dataset we will be using deals with fish trophic positions. Individuals were sampled over a range of body lengths that come from three different fish species caught in six different lakes. Here is a visual representation to help wrap your head around all of this! Note: only three sizes of fish are shown within each species but in reality there are 10 individuals per species.

A simple question you might answer with this dataset is does fish trophic position increase with fish size for all three fish species? This will be our motivating question for this workshop.


CHALLENGE 1

For your first challenge, reproduce plots 1-3 from the QCBS_W5_LMM.R code. Observe the plots and try to get a sense of what is occurring. Two key questions to ask yourself are:

  1. Do I expect all species to increase in trophic position with length? In the exact same way?
  2. Do I expect these relationships to be the same across lakes? What might differ?

Challenge 1 Answer

1.2 How do I analyze this dataset?

Run many separate analyses

One way to analyze this data is to fit linear regressions for each species in each lake. Here is a plot of species 1 in lake 1:

Notice that you would have to estimate a slope and an intercept parameter for each regression (2 parameters × 3 species × 6 lakes = 36 parameter estimates), and the sample size for each analysis would be only 10. There is a decreased chance of finding an effect due to the low sample size, and an increased familywise error rate due to the multiple comparisons.
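
A quick sketch of what this separating approach could look like in R (illustrative helper code, not part of the original script):

| Many separate analyses
# One linear regression per species x lake combination (36 parameters in total)
fits <- by(data, list(data$Fish_Species, data$Lake),
           function(d) lm(Trophic_Pos ~ Fish_Length, data = d))
summary(fits[[1]]) # e.g. the first species-lake combination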

Run one lumped analysis

Another way to analyze this data is to fit a single linear regression ignoring species and lake. Again here is the plot of all the data:

Notice that you now have a huge sample size and far fewer parameters to estimate. But what about pseudoreplication? Observations from the same lake and species are likely to be correlated. Also, look at all that noise in the data; surely some of it is due to differences between lakes and species.

In this case we simply want to know if trophic position increases with length and we don't really care that this relationship might differ slightly among species due to biological processes that we didn't measure or among lakes due to unmeasured environmental variables. This is variation we simply want to control for (sometimes referred to as random factors).

Fit a LMM

LMMs are a balance between separating and lumping. They:

  1. Provide slope and intercept parameters for each species and lake (separating) but estimate fewer parameters than classical regression.
  2. Use all the data available (lumping) while accounting for pseudoreplication and controlling for differences among lakes and species.

1.3 Fixed and Random Effects

There is a debate in the literature about the definition of fixed and random effects. There are many possible definitions, and we chose to present those we think are easiest to apply when doing your analyses.

Fixed effect

When a variable has a fixed effect, data are usually gathered from all of its possible levels. The person doing the analyses is also interested in drawing conclusions about the levels for which the data were gathered.

Example of a fixed effect: comparing mercury concentration in fish from three different habitats. Habitat is a fixed effect because we sampled in each of the three habitats and we are interested in drawing conclusions about these specific habitats.

Random effect

Variables with a random effect are also called random factors, as they are always categorical variables (not continuous variables). A variable has a random effect when the data include only a random sample of the factor's many possible levels, all of which are of interest. Random factors are usually grouping factors whose effect you want to control for in your model, without being interested in their specific effects on the response variable.

Example of a random effect: studying mercury contamination in fish in Ugandan crater lakes. For logistical reasons, you can't sample all the crater lakes, so you sample only 8 of them. However, fish from a given lake might have some sort of correlation between themselves (auto-correlation) since they experience the same environmental conditions. Even though you're not interested in the effect of each lake specifically, you should account for this potential correlation with a random factor (crater lake) in order to make conclusions about crater lakes in general.
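
As a hedged sketch of how such a random factor enters an lme4 model (the data frame and variable names here are hypothetical, for illustration only):

| Coding a random effect
# Hypothetical crater lake example: a random intercept for lake absorbs
# the within-lake correlation among fish
library(lme4)
hg.mod <- lmer(mercury ~ fish_length + (1 | lake), data = hg_data)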

1.4 How do LMM's work?

A. Intercepts and/or slopes are allowed to vary by lake and species

In LMMs, allowing intercepts and/or slopes to vary by certain factors (random effects) simply means you assume they come from a normal distribution. The mean and standard deviation of that distribution are estimated based on your data. The most likely intercepts and slopes from that distribution are then fit by optimization (e.g. maximum likelihood or restricted maximum likelihood).

Intercepts

In the case of species, only the mean and standard deviation of the distribution of species intercepts are estimated, instead of three separate species intercepts. The mean of this distribution is the 'species-level model'. In this example, we only have three species. In general, the more levels you have for a given factor, the more accurately the parameters of the distribution can be estimated (three may be a little low for estimating a mean and standard deviation, but it makes for simpler graphs!). When you implement LMMs in R, note that the intercept in the summary is the species-level intercept (i.e. the mean of all random intercepts).

Likewise, for lake, only the mean and standard deviation of the lake intercepts are estimated, saving you the need to estimate six separate lake intercept parameters. This saves degrees of freedom, as fewer parameter estimates are needed given the amount of data.

Slopes

The same concept is used for allowing slopes to vary by a given factor (random effect), although this is a little harder to visualize than the intercepts. In the case of species, only the mean and standard deviation of the slope parameters are estimated, instead of three separate slopes. Again, when you implement LMMs in R, the slope in the summary is the species-level slope.
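
Previewing the lmer syntax from Section 2, these two cases map onto the random-effect terms as follows (the same calls appear below as models M3 and M5):

| Coding LMM's
# Intercepts vary by species:
lmer(Z_TP ~ Z_Length + (1 | Fish_Species), data = data)
# Intercepts AND slopes vary by species:
lmer(Z_TP ~ Z_Length + (1 + Z_Length | Fish_Species), data = data)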

B. Intercepts, slopes, and associated confidence intervals are adjusted to account for the structure of data

If a certain species or lake is data-poor, as in the example below (not the case in our scenario, since we have a balanced design), the model will rely more heavily on the group-level model for the intercept and slope of that species or lake.

Confidence intervals of intercepts and slopes are adjusted to account for pseudoreplication based on the intraclass correlation coefficient (ICC): how much variation is there among groups versus within groups? The ICC determines your effective sample size, an adjusted sample size based on how correlated the data within groups are.

High ICC

In this scenario, the LMM will treat points within a lake more like an overall mean because they are highly correlated. Therefore, the effective sample size will be smaller, leading to larger confidence intervals around your slope and intercept parameters.

Low ICC

In this scenario, the LMM will treat points within a lake more independently because things are less correlated within groups than among groups. Therefore, the effective sample size will be larger, leading to smaller confidence intervals around your slope and intercept parameters.
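
As a rough sketch, the ICC can be computed from the variance components of a fitted lmer model; here we assume a simple lake-level random intercept model built from the raw workshop variables:

| Computing the ICC
# Among-lake variance relative to total (among-lake + residual) variance
m.icc <- lmer(Trophic_Pos ~ Fish_Length + (1 | Lake), data = data)
vc <- as.data.frame(VarCorr(m.icc)) # variance components
vc$vcov[vc$grp == "Lake"] / sum(vc$vcov) # the ICC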


CHALLENGE 2

For your second challenge, answer these two questions with your neighbours. How will the ICC and the confidence intervals be affected in these two scenarios:

  1. Fish trophic positions are not variable among lakes?
  2. Fish trophic positions are similar within lakes but variable among lakes?

Challenge 2 Answer


The mixed model protocol in R:

2.1: A priori model building and data exploration
2.2: Coding potential models and model selection
2.3: Model validation
2.4: Interpreting results and visualizing the model

2.1 A priori model building and data exploration

We are interested in finding out if trophic position can be predicted by length, while accounting for variation between species and lakes. So we know we want a model that looks something like this:

TP_ijk ∼ Length_i + Lake_j + Species_k + ε

where

TP_ijk is the trophic position of individual (i) from lake (j) of species (k),
and
ε represents the residuals of the model (i.e. the unexplained variation).

Data Exploration

Always make sure you've done “Housekeeping” on your data before you start building models! Is the data in the right structure? Use the following code to view the structure and variable types within your data frame.

| Data exploration
################Section 2#########################
# Running a mixed model in R
# Four Step Process to build a mixed model in R
 
# 1) A priori model building and data exploration####
# i)  Map out the model based on a priori knowledge
# We know that we want to build a model that evaluates the relationship 
# between trophic position and length while accounting for lake and species variation
# Trophic Position ~ Length + Species + Lake
 
# ii)Housekeeping and data exploration
# Ensure that the structure of your data is correct
str(data)

Output

Look at the distribution of samples across factors:

| Data exploration
# Look at sample distribution across factors to check 
# if there are any major unequal distributions
table(data$Lake)
table(data$Fish_Species)

Output

This data set is perfectly balanced, but mixed models can handle unbalanced designs (as we so often encounter in ecology!).

Look at the distribution of continuous variables:

| Data exploration
# Look at distribution of continuous variables
# Transform if necessary (will avoid future problems with homogeneity of model residuals)
hist(data$Fish_Length)
hist(data$Trophic_Pos)

Output

Major skews can lead to problems with model homogeneity down the road, so if necessary, make transformations. In this case, the data seem OK.

Check for collinearity between variables: it is important not to include two collinear variables in the same model, otherwise their effects on the response variable will be confounded, i.e. the model can't tell which of the collinear variables is responsible for the variation in the response variable. By default, the model would give a lot of explanatory power to the first variable in the model, and little power to the following variables.

In this example, we have only one explanatory variable, so there is no risk of collinearity. If you did have another explanatory variable (Var2) and wanted to check for collinearity, you could use:

| Data exploration
# Check for collinearity between variables
plot(data)
cor(data$Fish_Length, data$Var2)

CHALLENGE 3
What are some additional measures we could have gotten in the wild that might have been collinear with length?

Challenge 3 Answer


Consider the scale of your data:
If two variables in the same model are measured on very different scales, the estimation criteria mixed models use are likely to return 'convergence errors'. Z-correction standardizes your variables and adjusts for this scaling problem by placing all your variables on the same scale, even if they were originally in different units:

z = (x - mean(x)) / sd(x)

Because the data we are dealing with are on very different scales (length is on a much larger scale than trophic position), we z-correct them.

| Data exploration
# Consider the scales of your variables
# Note: when 2 variables have very different ranges of scale the criteria mixed models use to come up
# with parameter estimates are likely to return 'convergence errors'
# Z correcting adjusts for this scaling problem:
# What is a z correction?: (z = (x - mean(x))/sd(x))
# Z-correct Length
data$Z_Length<-(data$Fish_Length-mean(data$Fish_Length))/sd(data$Fish_Length)
# Z-correct Trophic Position
data$Z_TP<-(data$Trophic_Pos-mean(data$Trophic_Pos))/sd(data$Trophic_Pos)
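
Note: base R's scale() function performs the same z-correction in one step; an equivalent sketch:

| Data exploration
# Same z-correction using scale(); as.numeric() drops the matrix attributes
data$Z_Length <- as.numeric(scale(data$Fish_Length))
data$Z_TP <- as.numeric(scale(data$Trophic_Pos))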

To know whether a mixed model is necessary for your data set, you must establish whether it is actually important to account for variation in the factors that might be affecting the relationship you're interested in; in this case, Lake and Species.
We can do this by:

  1. Creating a linear model without factors
  2. Calculating the residuals of this linear model
  3. Plotting the residuals of the linear model against the factors of interest
| Data exploration
# Find out if it is important to account for variation in "random effects"
# by comparing the residuals of a linear model without the random effects with 
# the potential random effects
lm.test<-lm(Z_TP~Z_Length, data=data)
lm.test.resid<-rstandard(lm.test)
# Species Effect
plot(lm.test.resid~ data$Fish_Species, xlab = "Species", ylab="Standardized residuals")
abline(0,0, lty=2)
# Lake Effect
plot(lm.test.resid~ data$Lake, xlab = "Lake", ylab="Standardized residuals")
abline(0,0, lty=2)

Output

For this model, we should keep the random effects because the standardized residuals show variation across lake and species.

2.2 Coding potential models and model selection

Our a priori model

{TP_ijk} ∼ {Length_i} + {Lake_j} + {Species_k} + {ε}

coded in R looks like this:
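
| Coding LMM's
# A priori model: random intercepts for both Lake and Fish_Species
# (the same call used in the challenges below)
lmer(Z_TP ~ Z_Length + (1 | Lake) + (1 | Fish_Species), data = data, REML = TRUE)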

The lmer structure is not intuitive at first. The basic parts of the function are the response and fixed effect (Z_TP ~ Z_Length), the random effects in parentheses ((1|Lake) and (1|Fish_Species)), the data argument, and the estimation method (REML=TRUE).

REML (Restricted Maximum Likelihood) is the default estimation method in the “lmer” function. REML estimates can be used when comparing models with the same fixed effects (i.e. nested models). However, if you are comparing models where the fixed effects differ among models then maximum likelihood should be used to fit parameters as this method is not dependent on the coefficients of the fixed effects. Fitting using maximum likelihood is done by setting REML=FALSE in the lmer command.

What about coding varying slopes as well as intercepts?


CHALLENGE 4
Re-write the following code so that the slopes of the trophic position/length relationship also vary by lake and species.

| Coding LMM's
lmer(Z_TP~Z_Length + (1|Lake) + (1|Fish_Species), data = data, REML=TRUE)

Challenge 4 Answer


To determine whether you've built the best linear mixed model based on your a priori knowledge, you must compare this a priori model to possible alternative models. With the dataset we are working with, there are many potential models that might best fit the data.


CHALLENGE 5
Make a list of 7 alternative models to the following model that can be built and compared. Note: if we had different fixed effects among models, we would have to set REML=FALSE to compare them with likelihood-based methods like AIC (see explanation above). However, you should report parameter estimates from the 'best' model refit with REML=TRUE.

| Coding LMM's
lmer(Z_TP~Z_Length + (1|Lake) + (1|Fish_Species), data = data, REML=TRUE)

Challenge 5 Answer


Now that we have a list of potential models, we can compare them to get a better idea of what is going on in our dataset and to select the one(s) with the highest predictive power given the data. Models can be compared using the “AICc” function from the “AICcmodavg” package. The Akaike Information Criterion (AIC) is a measure of model quality that can be used to compare models. AICc corrects for bias created by small sample size when estimating AIC.

| Comparing LMM's
# 2) Coding potential models and model selection####
# i) Coding all potential models
# List of all Potential models-->
# Note: you can choose not to code models that do not make biological sense.
# Linear model with no random effects
M0<-lm(Z_TP~Z_Length,data=data)
# Full model with varying intercepts
M1<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1|Lake), data=data, REML=FALSE)
# Full model with varying intercepts and slopes
M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE)
# No Lake, varying intercepts only
M3<-lmer(Z_TP~Z_Length + (1|Fish_Species), data=data, REML=FALSE)
# No Species, varying intercepts only
M4<-lmer(Z_TP~Z_Length + (1|Lake), data=data, REML=FALSE)
# No Lake, varying intercepts and slopes
M5<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species), data=data, REML=FALSE)
# No Species, varying intercepts and slopes
M6<-lmer(Z_TP~Z_Length + (1+Z_Length|Lake), data=data, REML=FALSE)
# Full model with varying intercepts and slopes only varying by lake
M7<-lmer(Z_TP~Z_Length + (1|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE)
# Full model with varying intercepts and slopes only varying by species
M8<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1|Lake), data=data, REML=FALSE)
 
 
# ii) Compare models using AICc values
# Compute AICc values for each model
AICc<-c(AICc(M0), AICc(M1), AICc(M2), AICc(M3), AICc(M4), AICc(M5), AICc(M6), AICc(M7), AICc(M8))
# Put values into one table for easy comparison
Model<-c("M0", "M1", "M2", "M3", "M4", "M5", "M6", "M7", "M8")
AICtable<-data.frame(Model=Model, AICc=AICc)
AICtable
# M8 has the lowest AICc value so it is the best fit model 
# M2 is also a good fit, but all other models are not nearly as good.
# Note when you compare models they must be fit by 
# Maximum Likelihood (ML) and not by Restricted Maximum Likelihood (REML)

Output

The model with the lowest AICc value has the most predictive power given the data. Some suggest that models within 2 AICc units of each other are equally plausible. In this case, we can take a closer look at M8 and M2, but all other models have such high AICc values that they do not explain as much variation in the data as M8 and should not be considered plausible options to explain the patterns in the data.


CHALLENGE 6
Take 2 minutes with your neighbour to draw out the model structure of M2. Biologically, how does it differ from M8? Why is it not surprising that its AICc value was second best?

| Coding LMM's
M2<-lmer(Z_TP~Z_Length + (1+Z_Length|Fish_Species) + (1+Z_Length|Lake), data=data, REML=FALSE)

Challenge 6 Answer


2.3 Model validation

To look at independence, plot the model's fitted values against the residual values:

| Model validation
# 3) Checking model assumptions####
# Checking for M8
# A. Look at independence: plot fitted values vs residuals
E1 <- resid(M8)
F1<-fitted(M8)
plot(x = F1, 
     y = E1, 
     xlab = "Fitted Values",
     ylab = "Normalized residuals")
abline(h = 0, lty = 2)

Output

The even spread of the residuals suggests that the model is a good fit for the data.

To check the assumption of homogeneity plot residuals vs each covariate in the model:

| Model validation
# B. Look at homogeneity:
# i. plot residuals vs each covariate in the model
# Fish_Length
plot(x = data$Z_Length, 
     y = E1, 
     xlab = "Z Length",
     ylab = "Normalized residuals")
abline(h = 0, lty = 2)
# Note: observed groupings are created by the nature of the data because in the data set 
# we only measured individuals from 5 categories of lengths (big, small, and three groups in between)
# Species
boxplot(E1 ~ Fish_Species,   
        ylab = "Normalized residuals",
        data = data, xlab = "Species")
abline(h = 0, lty = 2)
#Lake
boxplot(E1 ~ Lake,   
        ylab = "Normalized residuals",
        data = data, xlab = "Lake")
abline(h = 0, lty = 2)

Output

The equal spread above and below zero indicates that there are no homogeneity problems with these variables.

Ideally, you would also repeat the above analysis with every covariate not included in your model. If you observe patterns in these plots, there is variation in your dataset that could be accounted for by those covariates, and you should reconsider including them in the model. However, because we did not measure any covariates that were left out of the model, this point is not applicable to this analysis.

It is also important to check the distribution of residual error. Normally distributed residuals indicate that your model predictions are not biased high or low:

| Model validation
# D. Look at normality: histogram
hist(E1)

Output
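
A normal quantile-quantile plot is another quick check of the same assumption (a small addition to the workshop script):

| Model validation
# Q-Q plot of the residuals; points should fall close to the line
qqnorm(E1)
qqline(E1)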

2.4 Interpreting results and visualizing the model

You can view the model summary using:

| Interpreting results
# Look at model summary
# This allows you to get an idea of the variance explained by the different components 
# of the model and the "significance" of fixed effects
summary(M8)

Output

The output is broken up into descriptions of the Random effects (things we allow to vary based on the normal distribution) and Fixed effects (things we estimate in the same way as classical regression):

To determine whether the slope, and therefore the effect of length on trophic position, is significantly different from zero, you first have to calculate the confidence interval (CI) for the slope parameter (the estimate for Z_Length in the fixed effects section, 0.4223). The 95% CI is the parameter estimate ± 1.96 × the standard error of the estimate. If the CI overlaps zero, the slope is not significantly different from zero at the 0.05 level.
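
A minimal sketch of this calculation in R, pulling the estimate and standard error from the fixed effects table of the model summary:

| Interpreting results
# 95% CI for the Z_Length slope: estimate +/- 1.96 * SE
coefs <- summary(M8)$coefficients
est <- coefs["Z_Length", "Estimate"]
se <- coefs["Z_Length", "Std. Error"]
c(lower = est - 1.96 * se, upper = est + 1.96 * se)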


CHALLENGE 7
a) What is the slope and confidence interval of the Z_Length variable in the M8 model?

b) Is the Z_Length slope significantly different from 0?

Challenge 7 Answer


To visualize this model you must obtain the coefficients (intercept and slope) of each component of the model. Overall, our group-level model coefficients ('coefs'; in this case, just one intercept and one slope) can be found in the fixed effects section of the model summary. Coefs for each of the model levels (in this case, Lake and Species) that were fit from a normal distribution can be obtained using the coef() function.

Two ways to visualize this data are:

  1. To show the group level model
  2. To show species and lake level models

1. To show the group level model:

Obtain the parameters of interest

and plot the data with the model overlaid:

| Visualizing model
# Visualizing model results####
# There are several ways of visualizing the results of a mixed model, all of which
# involve using the coefficients generated by the model. 
# So the first step is to obtain the model coefficients to be able to add them to the figures
coef(M8)
# Now put the coefs into dataframes to make them more easy to manipulate
Lake.coef <- as.data.frame(coef(M8)$Lake)
colnames(Lake.coef) <- c("Intercept","Slope")
Species.coef <- as.data.frame(coef(M8)$Fish_Species)
colnames(Species.coef) <- c("Intercept","Slope")
 
# Plot 1 - All Data
# Make a plot that includes all the data
plot <- ggplot(aes(Z_Length,Z_TP),data=data)
# Note: 'fig' below is a ggplot theme object used throughout the workshop
# script; if it is not defined in your session, a simple stand-in will do:
fig <- theme_bw()
Plot_AllData <- plot + geom_point() + xlab("Length (mm)") + ylab("Trophic Position") + 
                labs(title="All Data") + fig
# Add a layer that has an abline with the intercept and slope of the relationship between length and 
# trophic position.
# Note that you can obtain the intercept and slope of the fixed factor directly from the model summary
summary(M8)
Plot_AllData + geom_abline(intercept = -0.0009059, slope =0.4222697)

Output

2. To show species and lake level models:

Obtain the parameters of interest

and plot the data with the model overlaid:

| Visualizing model
# Plot 2 - By Species 
# Plot the data color coded by Species
Plot_BySpecies<-plot + geom_point(aes(colour = factor(Fish_Species)), size = 4) + xlab("Length (mm)") +
                ylab("Trophic Position") + labs(title="By Species") + fig
# Add the regression line with the intercepts and slopes specific to each species
Plot_BySpecies + geom_abline(intercept = Species.coef[1,1], slope =Species.coef[1,2], colour="coral2") + 
                 geom_abline(intercept = Species.coef[2,1], slope =Species.coef[2,2], colour = "green4") + 
                 geom_abline(intercept = Species.coef[3,1], slope =Species.coef[3,2], colour="blue1")
 
# Plot 3 - By Lake 
# Plot the data color coded by lake
Plot_ByLake<-plot + geom_point(aes(colour = factor(Lake)), size = 4) + xlab("Length (mm)") + 
                    ylab("Trophic Position") + labs(title="By Lake") + fig
# Add in regression lines with the intercepts specific to each lake
Plot_ByLake + geom_abline(intercept = Lake.coef[1,1], slope =Lake.coef[1,2], colour="coral2") + 
              geom_abline(intercept = Lake.coef[2,1], slope =Lake.coef[2,2], colour="khaki4") + 
              geom_abline(intercept = Lake.coef[3,1], slope =Lake.coef[3,2], colour="green4") + 
              geom_abline(intercept = Lake.coef[4,1], slope =Lake.coef[4,2], colour="darkgoldenrod") + 
              geom_abline(intercept = Lake.coef[5,1], slope =Lake.coef[5,2], colour="royalblue1") + 
              geom_abline(intercept = Lake.coef[6,1], slope =Lake.coef[6,2], colour="magenta3")

Output

Mixed models are really good at accounting for variation in ecological data while not losing too many degrees of freedom.

We covered only a small amount of what LMMs can do. Provided below are a couple of other thought experiments with data structures similar to the workshop data, along with two textbooks that fully detail the utility of LMMs.


CHALLENGE 8
Situation: You collected biodiversity estimates from 1000 quadrats that are within 10 different sites which are within 10 different forests. You measured the productivity in each quadrat as well. You are mainly interested in knowing if productivity is a good predictor of biodiversity.

What mixed model could you use for this dataset?

Challenge 8 Answer



CHALLENGE 9
Situation: You collected 200 fish from 12 different sites evenly distributed across 4 habitat types found within 1 lake. You measured the length of each fish and the quantity of mercury in its tissue. You are mainly interested in knowing if habitat is a good predictor of mercury concentration.

What mixed model could you use for this data set?

Challenge 9 Answer


As seen above, heterogeneous response distributions can be accurately modelled by relaxing the assumptions of normally distributed errors and letting the errors take on different distribution families (e.g., Poisson or negative binomial). What happens if there is a structure in the dataset such that observations sampled in certain sites are more similar (correlated) than observations sampled in different sites? As we saw in Workshop 6, we can account for this lack of independence by adding random effects in the model. Recall that intercepts and/or slopes are allowed to vary among grouping factors (random effects) and the variability in slopes/ intercepts among groups is assumed to follow a normal distribution. Random effects are also referred to as shrinkage estimates because they represent a weighted average of the group-level and the overall fit (fixed effect). The amount of shrinking towards the overall fit is more severe if the within-group variability is large compared to the among-group variability (i.e., when we have less confidence in the random estimate due to low sample size or high within-group variability, the estimate is 'pulled' towards the fixed effect).

The properties of Linear Mixed Models (Workshop 5), which incorporate random effects, and those of Generalized Linear Models (seen above), which handle non-normal data, can be combined to effectively model such instances in a Generalized Linear Mixed Model (GLMM) framework. The extension of GLMs to account for this additional structure follows similar steps to those introduced in the mixed modelling workshop.

To provide a brief overview of GLMMs, we will visit a case study presented by Bolker et al. (2009) and later by Bolker et al. (2011) where the structure (or lack of independence) in the dataset was due to observations being measured across different populations. The authors evaluated the extent to which the response of Arabidopsis thaliana (mouse-ear cress) to its environmental drivers (nutrient availability and herbivory) varied as a result of the population and/or genotype they belonged to.

Using the Arabidopsis dataset and the detailed analysis presented by Bolker and colleagues, we will explore the effects of nutrient levels, herbivory, and their interaction (fixed effects) on the fruit production of Arabidopsis thaliana, and the across-population and/or across-genotype variability in these relationships (random effects).

| Loading
# First load and view the dataset
dat.tf <- read.csv("Banta_TotalFruits.csv")
str(dat.tf)
 
# X  observation number 
# reg  factor for one of three regions; Netherlands, Spain or Sweden
# popu  factor with a level for each population (random effect)
# gen  factor with a level for each genotype (random effect)
# rack  nuisance factor for one of two greenhouse racks
# nutrient  factor with levels for low (value = 1) or high (value = 8) nutrients (fixed effect)
# amd  factor with levels for no damage or simulated herbivory (apical meristem damage) (fixed effect)
# status  nuisance factor for germination method
# total.fruits  the response variable; an integer indicating the number of fruits per plant 
 
# 2-3 genotypes nested within each of the 9 populations
table(dat.tf$popu,dat.tf$gen)
 
# Housekeeping: make integers into factors, relevel clipping (amd) and rename nutrient levels
dat.tf <- transform(dat.tf,
                    X=factor(X),
                    gen=factor(gen),
                    rack=factor(rack),
                    amd=factor(amd,levels=c("unclipped","clipped")),
                    nutrient=factor(nutrient,labels=c("Low","High")))
 
# Install/load packages
if(!require(lme4)){install.packages("lme4")}
require(lme4)
if(!require(coefplot2)){install.packages("coefplot2",repos="http://www.math.mcmaster.ca/bolker/R",type="source")}
require(coefplot2)     
if(!require(reshape)){install.packages("reshape")}
require(reshape) 
if(!require(ggplot2)){install.packages("ggplot2")}
require(ggplot2)
if(!require(plyr)){install.packages("plyr")}
require(plyr)
if(!require(gridExtra)){install.packages("gridExtra")}
require(gridExtra)
if(!require(emdbook)){install.packages("emdbook")}
require(emdbook)
source("glmm_funs.R")

Structure in dataset

When we look at the interaction between nutrient and clipping across the 9 different populations, we note a consistent nutrient effect (a higher baseline under the High nutrient treatment). The effect of clipping is weaker, but in some populations we note a negative slope.

| Facet by popu
ggplot(dat.tf,aes(x=amd,y=log(total.fruits+1),colour=nutrient)) +
  geom_point() + 
  stat_summary(aes(x= as.numeric(amd)),fun.y=mean,geom="line") +
  theme_bw() + theme(panel.margin=unit(0,"lines")) +
  scale_color_manual(values=c("#3B9AB2","#F21A00")) + # from Wes Anderson Zissou palette
  facet_wrap(~popu)

A similar plot can be made for the 24 different genotypes (changing facet_wrap(~gen)).

Choosing an error distribution

The response variable is count data which suggests that a Poisson distribution should be used. Recall that an important property of the Poisson distribution is that the variance is equal to the mean. However, as we will see below, the group variances increase with the mean much more rapidly than expected under the Poisson distribution.

| Heterogeneity across groups
# Create new variables that represent every combination of nutrient x clipping x random factor
dat.tf <- within(dat.tf,
{
  # genotype x nutrient x clipping
  gna <- interaction(gen,nutrient,amd)
  gna <- reorder(gna, total.fruits, mean)
  # population x nutrient x clipping
  pna <- interaction(popu,nutrient,amd)
  pna <- reorder(pna, total.fruits, mean)
})
 
 
# Boxplot of total fruits vs new variable (genotype x nutrient x clipping)
ggplot(data = dat.tf, aes(factor(x = gna),y = log(total.fruits + 1))) +
   geom_boxplot(colour = "skyblue2", outlier.shape = 21, outlier.colour = "skyblue2") + 
   theme_bw() + theme(axis.text.x=element_text(angle=90)) + 
   stat_summary(fun.y=mean, geom="point", colour = "red") 
 
# Boxplot of total fruits vs new variable (population x nutrient x clipping)
ggplot(data = dat.tf, aes(factor(x = pna),y = log(total.fruits + 1))) +
  geom_boxplot(colour = "skyblue2", outlier.shape = 21, outlier.colour = "skyblue2") + 
  theme_bw() + theme(axis.text.x=element_text(angle=90)) + 
  stat_summary(fun.y=mean, geom="point", colour = "red") 

As illustrated above, there is a large amount of heterogeneity among the group variances even when the response variable is transformed. We also note that some groups have a mean and variance of zero.

To determine which distribution family to use, we can run a diagnostic plot of the group variances vs their respective means. We provide an example below for the genotype x nutrient x clipping grouping.

  1. If we observe a linear relationship between the variance and the mean with a slope of 1, then the Poisson family is appropriate,
  2. If we observe a linear mean-variance relationship with a slope > 1 (i.e. Var = φµ where φ > 1), then the quasi-Poisson family (as introduced above) should be applied,
  3. Finally, a quadratic relationship between the variance and the mean (i.e. Var = µ(1 + αµ) or µ(1 + µ/k)) is characteristic of overdispersed data driven by an underlying heterogeneity among samples. In this case, the negative binomial (Poisson-gamma) distribution would be more appropriate.
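
A sketch of this diagnostic for the genotype x nutrient x clipping grouping (the gna variable created above); the dashed line shows the Poisson expectation:

| Mean-variance relationship
# Group-level means and variances of total fruits
grpMeans <- tapply(dat.tf$total.fruits, dat.tf$gna, mean)
grpVars <- tapply(dat.tf$total.fruits, dat.tf$gna, var)
plot(grpVars ~ grpMeans, xlab = "Group means", ylab = "Group variances")
abline(a = 0, b = 1, lty = 2) # Poisson expectation: variance = mean
ok <- is.finite(grpMeans) & is.finite(grpVars)
lines(lowess(grpMeans[ok], grpVars[ok]), col = "red") # smoothed empirical trend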

From the plot above we note that a linear quasi-Poisson may be better than the negative binomial, but additional modelling is needed.

Poisson GLMM

Let's build a GLMM using the glmer function of the lme4 package. This model has random intercepts for genotype and population. We include the nuisance variables (rack and germination method) as fixed effects. Given the mean-variance relationship observed above, we will most likely need a model that allows for overdispersion, but let's start with a Poisson model.

| Poisson GLMM
mp1 <- glmer(total.fruits ~ nutrient*amd + rack + status +
               (1|popu)+
               (1|gen),
             data=dat.tf, family="poisson")

We can check for overdispersion using the function overdisp_fun provided by Bolker et al. (2011), which divides the sum of the squared Pearson residuals by the residual degrees of freedom and tests whether this ratio differs significantly from 1. A low p-value indicates that the data are overdispersed (i.e. the ratio is significantly different from 1).

| Overdispersion check
# Overdispersion?
overdisp_fun(mp1)
 
# Or as above, we can approximate this by dividing the residual deviance by the residual df
summary(mp1)
# residual deviance = 18253.7 and resid df = 616

The ratio is significantly greater than unity, so as expected, we need to try a different distribution where the variance increases more rapidly than the mean, such as the negative binomial.

Negative binomial (Poisson-gamma) GLMM

Recall that the negative binomial (or Poisson-gamma) distribution allows the variance to grow with the square of the mean.

| NB
# Note: This model converges well if you are running R version 3.0.2, but may not converge with later 
# versions. If you are having convergence issues, please try with version 3.0.2.
 
mnb1 <- glmer.nb(total.fruits ~ nutrient*amd + rack + status + 
               (1|popu)+
               (1|gen),
             data=dat.tf, control=glmerControl(optimizer="bobyqa"))
# Although beyond the scope of this workshop, the new "control" argument specifies the way we
# optimize the parameter values (i.e. by taking the derivative of a function or proceeding by iteration). 
# When taking the derivative is not possible, an iterative algorithm such as bobyqa (Bound Optimization
# BY Quadratic Approximation) is used.
 
# Overdispersion?
overdisp_fun(mnb1)

The ratio is now much closer to unity (although the p-value is still less than 0.05).

Poisson-lognormal GLMM

Another option that we have not yet seen is to use a Poisson-lognormal distribution. This approach deals with overdispersion by implementing an observation-level random effect (OLRE; see Harrison (2014)), which models the extra-Poisson variation in the response variable using a random effect with a unique level for every data point.

A Poisson-lognormal model effectively places a lognormal prior on the observation-level error ε_i. A Poisson-lognormal distribution with mean µ and lognormal prior variance σ² has variance:

var(y) = µ + µ²(exp(σ²) - 1)

In contrast, for the negative binomial, we saw that the variance was given by:

var(y) = µ + µ²/k

More generally, the variance term σ2 in the Poisson-lognormal distribution will depend on the grouping level we select (e.g., at the individual, genotype or population level). That is, the Poisson-lognormal model can allow for a more flexible approach to assigning the observed aggregation to different sources of heterogeneity. Here, to implement the observation-level random effect, we will evaluate it at the individual level.

| PL GLMM
mpl1 <- glmer(total.fruits ~ nutrient*amd + rack + status + 
               (1|X) +
               (1|popu)+
               (1|gen),
             data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))
 
overdisp_fun(mpl1)

We did it! The ratio between the residual deviance and the residual df now meets our criterion (in fact, it is even smaller than 1).

Random intercepts

Now that we have the appropriate error distribution, we can test the importance of the random intercepts (for population and genotype) by comparing nested models with and without the random effects of interest using either:

  1. an information theoretic approach (such as, Akaike Information Criterion; AIC), which as we saw in Workshop 6 examines several competing hypotheses (models) simultaneously to identify the model with the highest predictive power given the data. As before, we will use the AICc to correct for small sample sizes. Or,
  2. a frequentist approach (traditional null hypothesis testing or drop1 approach), where the significance of each term is evaluated in turn (by comparing nested models together using the anova() function and the significance of the likelihood ratio test; LRT). It's important to keep in mind that with this approach we are testing a null hypothesis of zero variance for the random effects, but given that we cannot have negative variance, we are testing the parameter on the boundary of its feasible region. Therefore, the reported p value is approximately twice what it should be (i.e., we've truncated half of the possible values that fall below 0).
| Evaluating the random intercepts
summary(mpl1)$varcor
 
# popu only
mpl1.popu <- glmer(total.fruits ~ nutrient*amd + rack + status + 
                     (1|X) +
                     (1|popu), 
                     data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))
 
# gen only
mpl1.gen <-glmer(total.fruits ~ nutrient*amd + rack + status + 
                   (1|X) +
                   (1|gen), 
                   data=dat.tf, family="poisson", control=glmerControl(optimizer="bobyqa"))
 
# IC approach using AICc
ICtab(mpl1, mpl1.popu, mpl1.gen, type = c("AICc"))
#            dAICc df
# mpl1       0.0   10
# mpl1.popu  2.0   9 
# mpl1.gen   16.1  9 
 
# Frequentist approach using LRT
anova(mpl1,mpl1.popu)
# Data: dat.tf
# Df         AIC  BIC      logLik  deviance  Chisq   Chi    Df    Pr(>Chisq)  
# mpl1.popu  9    5017.4   5057.4  -2499.7   4999.4                           
# mpl1       10   5015.4   5059.8  -2497.7   4995.4  4.0639  1    0.04381 *
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
anova(mpl1,mpl1.gen)
# Df        AIC  BIC     logLik deviance  Chisq    Chi     Df   Pr(>Chisq)    
# mpl1.gen  9    5031.5  5071.5 -2506.8   5013.5                             
# mpl1      10   5015.4  5059.8 -2497.7   4995.4   18.177  1   2.014e-05 ***
#   ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The model without the random intercept for genotype is within 2 AICc units of the full model, which indicates that the two are equally plausible (i.e. there is little evidence for including a random intercept for genotype). However, when using the likelihood approach, and keeping in mind the boundary effect (p-values are inflated by a factor of 2), we note that p < 0.05 in both anova tests. Thus the model with both random terms (mpl1) is selected.

Parameter plots

A graphical representation of the model parameters can be obtained using the coefplot2 function. For example, to view the variance terms of our three random intercepts:

| Coefplot2
# Variance terms
coefplot2(mpl1,ptype="vcov",intercept=TRUE,main="Random effect variance")
 
# Fixed effects
coefplot2(mpl1,intercept=TRUE,main="Fixed effect coefficient")

Note that the error bars are only shown for the fixed effects because glmer doesn't give us information on the uncertainty of the variance terms.

Random intercept plots

We can also extract the random effect (or group-level) deviations from the fixed intercept using the ranef function. This will tell us how much the intercept is shifted up or down in particular populations or genotypes relative to the fixed intercept. The deviations can then be plotted using dotplot, which will return a two-faceted plot, one facet for each random effect (i.e. popu and gen). Note: the grid.arrange function was used to omit the observation-level random effect (i.e. (1|X)).

| dotplot
pp <- list(layout.widths=list(left.padding=0, right.padding=0),
           layout.heights=list(top.padding=0, bottom.padding=0))
r2 <- ranef(mpl1,condVar=TRUE)
d2 <- dotplot(r2, par.settings=pp)
grid.arrange(d2$gen,d2$popu,nrow=1)

From this plot we can see a hint of regional variability among populations where the Spanish populations (SP) have larger values than Swedish (SW) and Dutch (NL) populations. The difference among genotypes seems to be largely driven by genotype 34.

As seen in Workshop 5, we could have also used the coef() function to plot the random effect predictions or “estimates”, where the sum of random intercept deviation (obtained from ranef()) and the fixed effect (obtained from fixef()) is equal to the parameter estimate obtained from coef().
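
A quick sketch verifying this relationship for the population-level intercepts of mpl1:

| coef = fixef + ranef
# The fixed intercept plus each population's random deviation should match
# the per-population intercepts returned by coef()
fix.int <- fixef(mpl1)["(Intercept)"]
ran.dev <- ranef(mpl1)$popu[, "(Intercept)"]
all.equal(unname(fix.int + ran.dev), coef(mpl1)$popu[, "(Intercept)"])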

Random slopes

Bonus Section: Random slopes

Final model

We end this section on GLMM with the evaluation of the fixed effects. We first code potential models and compare them using AICc.

| AICc approach
mpl2 <- update(mpl1, . ~ . - rack) # model without rack
mpl3 <- update(mpl1, . ~ . - status) # model without status
mpl4 <- update(mpl1, . ~ . - amd:nutrient) # without amd:nutrient interaction
ICtab(mpl1,mpl2,mpl3,mpl4, type = c("AICc"))
#     dAICc df
# mpl1  0.0  10
# mpl4  1.4  9 
# mpl3  1.5  8 
# mpl2 55.0  9 

Alternatively, we can use the functions drop1 and dfun, where dfun converts the AIC values returned by drop1 into ΔAIC values (producing a table similar to the ICtab output above).
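
dfun comes from the glmm_funs.R script sourced earlier; conceptually, it simply rescales drop1's AIC column to differences from the smallest value. A rough sketch of the idea (the actual helper may differ in its details):

| dfun sketch
# Convert the AIC column of a drop1 table into delta-AIC values
dfun <- function(dd) {
  dd$AIC <- dd$AIC - min(dd$AIC)
  names(dd)[names(dd) == "AIC"] <- "dAIC"
  dd
}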

| Frequentist approach (drop1)
(dd_LRT <- drop1(mpl1,test="Chisq"))
# Model:
#   total.fruits ~ nutrient * amd + rack + status + (1 | X) + (1 | popu) + (1 | gen)
#              Df    AIC   LRT   Pr(Chi)    
# <none>          5015.4                     
# rack          1 5070.5 57.083 4.179e-14 ***
# status        2 5017.0  5.612   0.06044 .  
# nutrient:amd  1 5016.8  3.444   0.06349 .  
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(dd_AIC <- dfun(drop1(mpl1)))
#               Df   dAIC
# <none>           0.000
# rack          1 55.083
# status        2  1.612
# nutrient:amd  1  1.444

There is a strong effect of rack (a change in AIC of 55 if we remove this variable), whereas the effects of status and of the interaction term are weak (the change in AIC when either is dropped is less than 2). We can thus start by removing the non-significant interaction term so that we can test the main effects of nutrient and clipping.

| Dropping interaction term
mpl2 <- update(mpl1, . ~ . - amd:nutrient)
 
# Use AICc:
mpl3 <- update(mpl2, . ~ . - rack) # model without rack or interaction
mpl4 <- update(mpl2, . ~ . - status) # model without status or interaction
mpl5 <- update(mpl2, . ~ . - nutrient) # without nutrient or interaction
mpl6 <- update(mpl2, . ~ . - amd) # without clipping or interaction
ICtab(mpl2,mpl3,mpl4,mpl5,mpl6, type = c("AICc"))
#     dAICc df
# mpl2   0.0 9 
# mpl4   1.2 7 
# mpl6  10.2 8 
# mpl3  54.2 8 
# mpl5 135.6 8 
 
# Or use drop1:
(dd_LRT2 <- drop1(mpl2,test="Chisq"))
# Model:
#   total.fruits ~ nutrient + amd + rack + status + (1 | X) + (1 | popu) + (1 | gen)
#            Df    AIC     LRT   Pr(Chi)    
#   <none>      5016.8                      
#   nutrient  1 5152.5 137.688 < 2.2e-16 ***
#   amd       1 5027.0  12.218 0.0004734 ***
#   rack      1 5071.0  56.231 6.443e-14 ***
#   status    2 5018.1   5.286 0.0711639 .  
# ---
#   Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
 
(dd_AIC2 <- dfun(drop1(mpl2)))
#           Df    dAIC
# <none>        0.000
# nutrient  1 135.688
# amd       1  10.218
# rack      1  54.231
# status    2   1.286
 
summary(mpl2)

Both main effects, nutrient and clipping, are strong (large changes in AIC of 135.6 (mpl5) and 10.2 (mpl6) when nutrient or clipping, respectively, is dropped). Our final model includes the fixed nutrient (strong positive) and clipping (negative) effects, the nuisance variable rack, the observation-level random effect (1 | X) to account for overdispersion, and random variation in overall fruit set at the population and genotype levels.

CHALLENGE 10

Using the inverts dataset (larval development times (PLD) of 74 marine invertebrate and vertebrate species reared at different temperatures and time), answer the following questions:

  1. What is the effect of feeding type and climate (fixed effects) on PLD?
  2. Does this relationship vary among taxa (random effects)?
  3. What is the best distribution family for this count data?
  4. Finally, once you determined the best distribution family, re-evaluate your random and fixed effects.

Challenge: Solution 1

Challenge: Solution 2

Challenge: Solution 3

Challenge: Solution 4

LMM Textbooks

Gelman, A., and Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models (Cambridge University Press).

Zuur, A., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. (2009). Mixed effects models and extensions in ecology with R (Springer).

GLMM resources

Books

B. Bolker (2009) Ecological Models and Data in R. Princeton University Press.
A. Zuur et al. (2009) Mixed Effects Models and Extensions in Ecology with R. Springer.

Articles

Harrison, X. A., L. Donaldson, M. E. Correa-Cano, J. Evans, D. N. Fisher, C. E. D. Goodwin, B. S. Robinson, D. J. Hodgson, and R. Inger. 2018. A brief introduction to mixed effects modelling and multi-model inference in ecology. PeerJ 6:e4794.

Websites

GLMM for ecologists: A great website on GLMM with a Q&A section!