This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the Quebec Centre for Biodiversity Science both for members of the QCBS and the larger community.
The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page
Developed by: Eric Pedersen and Zofia Taranu
Summary: The goal of today's workshop will be to first examine what we mean by a non-linear model, and how GAMs (Generalized Additive Models) allow us to handle non-linear relationships. We will also go over how to plot and interpret these non-linear relationships, how to include interaction terms, autocorrelated errors and expand on previous workshop by briefly examining a mixed modelling framework. Lastly, we will briefly touch upon what GAMs are doing behind the scenes.
Link to associated Prezi: Prezi
Download the R script and data for this lesson:
Prerequisites: Some experience in R (enough to be able to run a script and examine data and R objects) and a basic knowledge of regression (you should know what we mean by linear regression and ANOVA).
What do we mean by the linear model? Regression is the workhorse of statistics. It allows us to model a response variable as a function of predictors plus error. Linear regression is what most people first encounter in statistics. As we saw in the linear models workshop, regression makes four major assumptions:
There's only one way for the linear model to be right:
And yet so many ways for it to fail:
So how can we fix it? We must first know what the regression model is trying to do:
Linear models do this by finding the best fit straight line that passes through the data. In contrast, additive models do this by fitting a curve through the data, but controlling how wiggly the line can get (more on this later).
Let's look at an example. First, we'll generate some data, and plot it.
We can appreciate that if we were to fit this as a linear regression model, we would violate the assumptions listed above. Let's have a look, using the
gam() command from the mgcv package here to model an ordinary least squares regression (we will see below how to use
gam() to specify a smoothed, non-linear term).
We can see in the model summary that our linear model explains quite a bit of variance (R2(adj) = 0.639), however, diagnostic plots of the model residuals would quickly show that the error variance is not normally distributed nor homoscedastic, and that an important non-linear pattern remains. Let's now try to fix this by fitting the data using a smoothed (non-linear) term.
We will revisit this later, but briefly, GAMs are effectively a nonparametric form of regression where the *xi of a linear regression is replaced by an arbitrary smooth functions of the explanatory variables, f(xi), and the model becomes:
where yi is the response variable, xi covariate, and f is the smooth function.
Importantly, given that the smooth function f(xi) is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response. That is, as opposed to one fixed coefficient , the function f can continually change over the range of xi. The degree of smoothness (or wiggliness) of f is controlled using penalized regression determined automatically in mgcv using a generalized cross-validation (GCV) routine (Wood 2006).
gam() smooth terms are specified by expressions of the form:
s(x) where x is the covariate which the smooth is a function of.
The variance explained by our model has increased by 20% (R2(adj) = 0.859) and when we compare the fit of the linear (red) and non-linear (blue) models, it is clear that the latter is the winner.
The mgcv package also includes a default plot to look at the smooths:
How do we test whether the non-linear model offers a significant improvement over the linear model? We can use the
anova() commands to formally test whether an assumption of linearity is justified. To do, we must simply set our smoothed model so that it is nested in our linear model; that is, we create model object that includes both
x (linear) and
s(x) (non-linear) and we ask whether adding
s(x) to the model with only
x as a covariate is supported by the data.
The non-linear term is significant:
Analysis of Deviance Table Model 1: y_obs ~ x Model 2: y_obs ~ s(x) + x Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 248.00 6.5846 2 240.68 2.4988 7.3168 4.0858 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Note that the model
y_obs~s(x) gives exactly the same results as
y_obs~s(x)+x, as shown above for the
nested_gam_model. We used the
s(x)+x to illustrate the nestedness of the model, however, we will omit the
+x in the nested model comparisons that follow.
We will now try this test with some new simulated data, just to get a handle on it. First generate the data using the code below, then fit a linear and smoothed GAM model to the relation between
What is the effective degrees of freedom of the smoothed term? Determine if linearity is justified for this data.
GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions. For this section, we'll be using a dataset automatically generated by the
gamSim() command of the mgcv package. For further details on what type of data you can simulate with this function, see
?gamSim. For this section, we will use the example 5 (eg=5) of gamSim, which simulates an additive example plus a factor variable.
We will now look at how we can predict
y based on the other variables. Start with a basic model, with one smoothed term (x1) and one categorical predictor (x0, which has 4 levels).
p.table provides the significance table for each parametric term and the
s.table is the significance table for smooths. Note that for the latter, the wiggleness of the smoothed term
s(x1) is indicated by the edf parameter (effective degrees of freedom); the higher the edf, the more non-linear the smoothing spline. A high value (8–10 or higher) means that the curve is highly non-linear, whereas a smoother with 1 effective degree of freedom is a straight line. In contrast, in linear regression the model degrees of freedom is equivalent to the number of non-redundant free parameters, p, in the model (and the residual degrees of freedom are given by n-p). We will revisit the edf later in this workshop.
print(basic_summary$p.table) Estimate Std. Error t value Pr(>|t|) (Intercept) 8.550030 0.3655849 23.387258 1.717989e-76 x02 2.418682 0.5165515 4.682364 3.908046e-06 x03 4.486193 0.5156501 8.700072 9.124666e-17 x04 6.528518 0.5204234 12.544629 1.322632e-30 > print(basic_summary$s.table) edf Ref.df F p-value s(x1) 1.923913 2.406719 42.84268 1.076396e-19
In our basic model the edf of smooth function s(x1) is ~2, which suggests a non-linear curve. The plot of the model nicely illustrates the shape of this non-linear smoother:
We can add a second term, x2, but specify a linear relationship with y (that is, both linear and non-linear terms can be included in GAM). The new linear term, x2, will appear in the
p.table, for which a regression coefficient estimate is indicated. In the
s.table, we will once again find the non-linear smoother, s(x1), and its wiggleness parameter.
If we wish to explore whether the relationship between y and x2 is non-linear, we can model x2 as non-linear smooth term instead. As before, we can use an ANOVA to test if the smoothed term is necessary.
When more than one covariable is included in the model, as above, the fitted response can be partitioned into the contributions of each variable as shown. Here we can appreciate the varying magnitude of the effect of each variable; where the y-axis represents the contribution (effect) of each covariate to the fitted response, centered on 0. If the confidence intervals had overlapped with zero for certain values of x (or throughout the entire range), this would imply a non-significant effect at those x values (or of x in entirety). When the contribution for an individual covariate changes along the range of x-axis, the change in that covariate is associated with a change in the response.
Analysis of Deviance Table Model 1: y ~ x0 + s(x1) Model 2: y ~ x0 + s(x1) + x2 Model 3: y ~ x0 + s(x1) + s(x2) Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 394.08 5231.6 2 393.10 4051.3 0.97695 1180.2 < 2.2e-16 *** 3 385.73 1839.5 7.37288 2211.8 < 2.2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The best fit model is the model with both smooth terms for x1 and x2.
Create two new models, with x3 as a linear and smoothed term. Use plots, coefficient tables and the anova function to determine if x3 is an important term to include.
There are two ways to include interactions between variables:
Where the 'by' argument lets you have a smooth term vary between different levels of a factor. We will examine this using our categorical variable x0 and ask whether the non-linear smoother s(x2) varies across different levels of x0. To determine whether smooth terms differ significantly among factor levels, we will use an ANOVA test on the interaction term.
categorical_interact <- gam(y~x0+s(x1)+s(x2,by=x0),data=gam_data) categorical_interact_summary <- summary(categorical_interact) print(categorical_interact_summary$s.table) plot(categorical_interact,page=1) # or alternatively: plot using vis.gam function, where theta is the degree rotation on the x-y plane vis.gam(categorical_interact,view=c("x2","x0"),theta=40,n.grid=500,border=NA) anova(two_smooth_model, categorical_interact,test="Chisq")
Visually, we see that the shape of smooth terms are comparable among the four levels of x0. The anova test confirms this as well (Deviance = 98.6, p = 0.2347).
Finally we'll look at the interactions two smooth terms, x1 and x2. Here the 'by' argument is dropped (both quantitative variables).
smooth_interact <- gam(y~x0+s(x1,x2),data=gam_data) smooth_interact_summary <- summary(smooth_interact) print(smooth_interact_summary$s.table) plot(smooth_interact,page=1,scheme=3) # plot(smooth_interact,page=1,scheme=1) will give a similar plot to the vis.gam() vis.gam(smooth_interact,view=c("x1","x2"),theta=40,n.grid=500,border=NA) anova(two_smooth_model,smooth_interact,test="Chisq")
The interaction between s(x1) and s(x2) is significant and the 2D plot nicely illustrates this non-linear interactions, where y is greatest at high values of x1 but low to mid-values of x2 (try playing around with
theta to rotate the figure around. If you plan to run a lot of these, delete the argument
n.grid=500 to make the plots run faster).
We won't go too much further into this in this workshop, but you should know you can expand on the basic model we've looked at today with:
We will first have a look at changing the basis, followed by a quick intro to using other distribution and GAMMs (Generalized Additive Mixed Models). Let's look at one place where knowing how to change the basis can help: cyclical data. That is, data where you want the predictor to match at the ends. Imagine you have a time series of climate data, broken down by monthly measurement, and you want to see if there's a time-trend in yearly temperature. We'll use the Nottingham temperature time series for this:
Using the nottem data, we have created three new vectors;
n_years corresponds to the number of years of data (20 years),
nottem_month is a categorical variable coding for the 12 months of the year, for every year sampled (sequence 1 to 12 repeated 20 times), and
nottem_year is a variable where the year corresponding to each month is provided.
Monthly data from the years 1920 through to 1940:
To model this, we need to use what's called a cyclical cubic spline, or “cc”, to model month effects as well as a term for year.
There is about 1-1.5 degree rise in temperature over the period, but within a given year there is about 20 degrees variation in temperature, on average. The actual data vary around these values and that is the unexplained variance. Here we can see one of the very interesting bonuses of using GAMs. We can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.
Over longer time scales, for example using paleolimnological data, others (Simpson & Anderson 2009; Fig. 3c) have used GAMs to plot the contribution (effect) of temperature on algal assemblages in lakes, to illustrate how significant contributions only occurred during two extreme cold events (that is, the contribution is significant when the confidence intervals do not overlap zero, at around 300 and 100 B.C.). This allowed the authors to not only determine how much variance was explained by temperature over the last few centuries, but also pinpoint when in time this effect was significant. If of interest to you, the code to plot either the response surface (
type = “response”) or the terms (
type = “terms”) is given below. When terms is selected, you obtain the same figure as above.
To provide a brief overview on how to use GAMs when the response variable does not follow a normal distributions and is either count or proportion data (e.g., Gamma, binomial, Poisson, negative binomial), the example that follows uses a dataset where a binomial family distribution is needed and a non-linear relationship with the explanatory variable is evident. Here, the response variable represents the number of successes (event happened) versus failures over the course of an experiment.
'data.frame': 514 obs. of 4 variables: $ prop : num 1 1 1 1 0 1 1 1 1 1 ... $ total: int 4 20 20 18 18 18 20 20 20 20 ... $ x1 : int 550 650 750 850 950 650 750 850 950 550 ... $ fac : Factor w/ 4 levels "f1","f2","f3",..: 1 1 1 1 1 1 1 1 1 1 ...
prop is the response variable, and is equal to the proportion of successes/ (successes + failures). Note that there are numerous cases where the proportion equals 1 or 0 which indicates that the outcomes were always successes or failures, respectively, at a given time point in the experiment.
x1 is the time since the start of experiment (our explanatory variable).
total represents the number of successes + failures observed at any time point of the experiment.
fac is a factor coding for trials 1 through 4 of the experiment (we will not use this variable in this section).
Let's start by visualizing the data. We are interested in the number of successes in comparison to failures as x1 increases. We will calculate the grand averages of the proportions of successes per time bin (x1) given that there are repeated measures per x1 value (numerous trials and observations per trial).
As x1 increases, so does the probability of successes. Would you say that this trend is linear or non-linear? We will test this using a logistic GAM (we use a
binomial family distribution given that our response is proportion data).
Estimate Std. Error z value Pr(>|z|) (Intercept) 1.173978 0.02709613 43.32641 0
edf Ref.df Chi.sq p-value s(x1) 4.591542 5.615235 798.9407 2.027751e-164
What does the intercept represent in this model?
- If successes = failures, the ratio is 1 and the logit is 0 (log(1) = 0).
- If successes have a larger count than failures, the ratio is greater than 1 and the logit has a positive value (e.g., log(2) = 0.69).
- If successes have a smaller count than failures, the ratio is lower than 1 and the logit has a negative value (e.g., log(0.5) = -0.69).
Thus, the intercept is the log odds ratio of successes to failures (logit), and indicates whether on average there are more successes than failures. Here, the estimated intercept coefficient is positive, which means that there are more successes than failures overall.
What does the smooth term indicate?
Lastly, we will see the different ways this relationship could be represented graphically.
par(mfrow=c(1,2)) plot(prop_model, select=1, scale=0, shade=TRUE) abline(h=0) out <- plot_smooth(prop_model, view="x1",main="") (diff <- find_difference(out$fv$fit, out$fv$CI, xVals=out$fv$x1)) addInterval(0, lowVals=diff$start, highVals = diff$end, col='red', lwd=2) abline(v=c(diff$start, diff$end), lty=3, col='red') text(mean(c(diff$start, diff$end)), 2.1, "sign. more \n success", col='red', font=3)
What do these plots tell us about successes versus failures?
Lastly, to help interpret the results, we could transform the summed effects back to proportions with the function
plot_smooth from the itsadug package:
As we already derived from the logit plot, we see that at around x1=400 the proportion of successes increases significantly above 0.5.
When observations are not independent, GAMs can be used to either incorporate:
That is, in addition to changing the basis as with the nottem example above, we can also add complexity to the model by incorporating an autocorrelation structure or mixed effects using the gamm or the gamm4 package. To start, let's have a look at the first case; a model with temporal autocorrelation in the residuals. We will revisit the Nottingham temperature model and test for correlated errors using the (partial) autocorrelation function.
The autocorrelation function plot (ACF; first panel above) provides the cross-correlation of a time series with itself at different points in time (i.e. similarity between observations at increasingly large time lags). In contrast, the partial autocorrelation function (PACF) gives the partial correlation of a time series with its own lagged values after controlling for the values of the time series at all shorter lags. The ACF and pACF plots are thus used to identify after how many time steps do observations start to be independent from one another. The ACF plot of our model residuals suggests a significant lag of 1, and perhaps a lag of 2. Therefore, a low-order AR model is likely needed. We can test this by adding AR structures to the Nottingham temperature model, one with an AR(1) (correlation at 1 time step) and one with an AR(2) (correlation at two times steps), and test for the best-fit model using ANOVA.
year_gam <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc")) year_gam_AR1 <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"), correlation = corARMA(form = ~ 1|nottem_year, p = 1)) year_gam_AR2 <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"), correlation = corARMA(form = ~ 1|nottem_year, p = 2)) anova(year_gam$lme,year_gam_AR1$lme,year_gam_AR2$lme)
Model df AIC BIC logLik Test L.Ratio p-value year_gam$lme 1 5 1109.908 1127.311 -549.9538 year_gam_AR1$lme 2 6 1101.218 1122.102 -544.6092 1 vs 2 10.689206 0.0011 year_gam_AR2$lme 3 7 1101.598 1125.962 -543.7988 2 vs 3 1.620821 0.2030
The AR(1) provides a significant increase in fit over the naive model (LRT = 10.69, p = 0.0011), but there is very little improvement in moving to the AR(2) (LRT = 1.62, p = 0.203). So it is best to include only the AR(1) structure in our model.
As we saw in the previous section,
bs specifies the type of underlying base function. For random intercepts and linear random slopes we use bs=“re”, but for random smooths we use bs=“fs”.
Three different types of random effects are distinguished when using GAMMs (where fac is a categorial variable coding for the random effect and x0 is a continuous fixed effect):
We will first examine a GAMM with only a random intercept. As before, we will use the
gamSim() function to automatically generate a dataset, this time with a random effect component generate, then run a model with a random intercept using fac as the random factor.
Note that there is now a smoother term for the random intercept in the summary table. You can plot and view the random intercepts for each level of fac as follows:
plot(gamm_intercept, select=2) # select=2 because the random effect appears as the second entry in the summary table.
We can also use the
plot_smooth function to visualize the model, which in contrast to the default
plot.gam, allows us to plot a smooth of the summed effects of a GAM (based on predictions) as we saw earlier, but in addition, optionally removes the random effects. Here, we will plot the summed effects for the x0 without random effects, and then plot the predictions of all four levels of the random fac effect:
par(mfrow=c(1,2), cex=1.1) plot_smooth(gamm_intercept, view="x0", rm.ranef=TRUE, main="intercept + s(x1)", rug=FALSE) plot_smooth(gamm_intercept, view="x0", cond=list(fac="1"), main="... + s(fac)", col='orange', ylim=c(8,21), rug=FALSE) plot_smooth(gamm_intercept, view="x0", cond=list(fac="2"), add=TRUE, col='red') plot_smooth(gamm_intercept, view="x0", cond=list(fac="3"), add=TRUE, col='purple') plot_smooth(gamm_intercept, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise')
Next we will run and plot a model with random slopes:
gamm_slope <- gam(y ~ s(x0) + s(x0, fac, bs="re"), data=gam_data2) summary(gamm_slope)$s.table plot_smooth(gamm_slope, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE) plot_smooth(gamm_slope, view="x0", cond=list(fac="1"), main="... + s(fac)", col='orange',ylim=c(7,22), rug=FALSE) plot_smooth(gamm_slope, view="x0", cond=list(fac="2"), add=TRUE, col='red') plot_smooth(gamm_slope, view="x0", cond=list(fac="3"), add=TRUE, col='purple') plot_smooth(gamm_slope, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise')
We will now include both a random intercept and slope term.
gamm_int_slope <- gam(y ~ s(x0) + s(fac, bs="re") + s(fac, x0, bs="re"), data=gam_data2) summary(gamm_int_slope)$s.table plot_smooth(gamm_int_slope, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE) plot_smooth(gamm_int_slope, view="x0", cond=list(fac="1"), main="... + s(fac) + s(fac, x0)", col='orange', ylim=c(7,22), rug=FALSE) plot_smooth(gamm_int_slope, view="x0", cond=list(fac="2"), add=TRUE, col='red', xpd=TRUE) plot_smooth(gamm_int_slope, view="x0", cond=list(fac="3"), add=TRUE, col='purple', xpd=TRUE) plot_smooth(gamm_int_slope, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise', xpd=TRUE)
Note that the random slope is static in this case:
plot(gamm_int_slope, select=3) # select=3 because the random slope appears as the third entry in your summary table.
Lastly, we will examine a model with a random smooth.
Here, if the random slope varied along x0, we would see different curves for each level:
plot(gamm_smooth, select=1) # select=1 because the smooth slope appears as the first entry in your summary table.
All of the above models can in turn be compared using
anova() as in the previous sections to determine the best fit model.
We will now take a few minutes to look at what GAMs are doing behind the scenes. Lets first consider a model containing one smooth function of one covariate, xi:
To estimate the smooth function f, we need to represented the above equation in such a way that it becomes a linear model. This can be done by choosing a basis, bi(x), defining the space of functions of which f is an element:
A simple example: a polynomial basis
Suppose that f is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains f. A basis for this space would then be:
so that f(x) becomes:
and the full model now becomes:
The basis functions are each multiplied by a real valued parameter, βi, and are then summed to give the final curve f(x).
By varying the βi we can vary the form of f(x) to produce any polynomial function of order 4 or lower.
Another example: a cubic spline basis
A cubic spline is a curve constructed from sections of a cubic polynomial joined together so that they are continuous in value. Each section of cubic has different coefficients.
Here's a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:
In this example, the knots are evenly spaced through the range of observed x values. However, the choice of the degree of model smoothness is controlled by the the number of knots, which was arbitrary. Is there a better way to select the knot locations?
Controlling the degree of smoothing with penalized regression splines
Instead of controlling smoothness by altering the number of knots, we keep that fixed to size a little larger than reasonably necessary, and control the model’s smoothness by adding a “wiggleness” penalty.
So, rather than fitting the model by minimizing (as with least squares regression):
it can be fit by minimizing:
As goes to , the model becomes linear. The selection of the best fit smoothing parameter, , makes use of a cross-validation approach. If is too high then the data will be over smoothed, and if it is too low then the data will be under smoothed. Ideally, it would be good to choose so that the predicted f is as close as possible to f. A suitable criterion might be to choose to minimize:
Since f is unknown, M is estimated using a generalized cross validation technique that leaves out each datum from the data in turn and considers the average ability of models fitted to the remaining data to predict the left out datum (for further details, see Wood 2006).
Illustration of the principle behind cross validation
In the first panel, the curve fits many of the data poorly and does no better with the missing point. In the third panel, the curve fits the noise as well as the signal and the extra variability induced causes it to predict the missing datum rather poorly. In the second panel, however, we see that the curve fits the underlying signal quite well, smoothing through the noise and the missing datum is reasonably well predicted.
Brief note on effective degrees of freedom (edf)
How many degrees of freedom does a fitted GAM have?
Instead of providing the output of the cross-validation in terms of (some tuning parameter modulating the fit’s complexity), the GAM function in the mgcv package uses a term called the effective degrees of freedom (edf) as a consistent way to quantify model complexity (that is, to justify our intuition that the degrees of freedom tracks model complexity). Because the number of free parameters in GAMs is difficult to define, the edf are instead related to , where the effect of the penalty is to reduce the effective degrees of freedom. So let's say the arbitrarily large number of knots is set to k=10, then k-1 sets the upper limit on the degrees of freedom associated with a smooth term. This number then decreases as the penalty increases until the best fit penalty is found by cross-validation.
There's a great deal more out there on GAMs… this was just the very surface.
Simon Wood, the author of the mgcv package has a very useful website with introductory talks and notes on how to use GAMs:
He's also written a book, Generalized Additive Models: An Introduction with R, which we used as reference for this workshop.
Material from this workshop were also obtained from the following fantastic blogs and tutorials:
Finally, the help pages, available through ?gam in R, are an excellent resource.