QCBS R Workshops

This series of 10 workshops walks participants through the steps required to use R for a wide array of statistical analyses relevant to research in biology and ecology. These open-access workshops were created by members of the Quebec Centre for Biodiversity Science both for members of the QCBS and the larger community.

The content of this workshop has been peer-reviewed by several QCBS members. If you would like to suggest modifications, please contact the current series coordinators, listed on the main wiki page

Workshop 8: Generalized additive models

Developed by: Eric Pedersen and Zofia Taranu

Summary: The goal of today's workshop will be to first examine what we mean by a non-linear model, and how GAMs (Generalized Additive Models) allow us to handle non-linear relationships. We will also go over how to plot and interpret these non-linear relationships, how to include interaction terms, autocorrelated errors and expand on previous workshop by briefly examining a mixed modelling framework. Lastly, we will briefly touch upon what GAMs are doing behind the scenes.

Link to associated Prezi: Prezi

Download the R script and data for this lesson:

  1. Use the mgcv package to fit non-linear relationships,
  2. Understand the output of a GAM to help you understand your data,
  3. Use tests to determine if a non-linear model fits better than a linear one,
  4. Include smooth interactions between variables,
  5. Understand the idea of a basis function, and why it makes GAMs so powerful,
  6. Account for dependence in data (autocorrelation, hierarchical structure) using GAMMs.

Prerequisites: Some experience in R (enough to be able to run a script and examine data and R objects) and a basic knowledge of regression (you should know what we mean by linear regression and ANOVA).

What do we mean by the linear model? Regression is the workhorse of statistics. It allows us to model a response variable as a function of predictors plus error. Linear regression is what most people first encounter in statistics. As we saw in the linear models workshop, regression makes four major assumptions:

  1. normally distributed error
  2. the variance of the errors doesn't change (homoscedastic)
  3. i.i.d. (each error is independent of the others)
  4. a linear response: {y} = {β_0} + {β_1}{x}

There's only one way for the linear model to be right:

And yet so many ways for it to fail:

So how can we fix it? We must first know what the regression model is trying to do:

  • fit a line that goes through the middle of the data,
  • do this without overfitting the data, such as simply using a line between each point and its neighbours.

Linear models do this by finding the best fit straight line that passes through the data. In contrast, additive models do this by fitting a curve through the data, but controlling how wiggly the line can get (more on this later).

Let's look at an example. First, we'll generate some data, and plot it.

| Generate and plot data
library(ggplot2)
set.seed(10)
n = 250
x = runif(n,0,5)
y_model = 3*x/(1+2*x)
y_obs = rnorm(n,y_model,0.1)
data_plot = qplot(x, y_obs) + 
            geom_line(aes(y=y_model)) + 
            theme_bw()
print(data_plot)

We can appreciate that if we were to fit this as a linear regression model, we would violate the assumptions listed above. Let's have a look, using the gam() command from the mgcv package here to model an ordinary least squares regression (we will see below how to use gam() to specify a smoothed, non-linear term).

| Linear model
library(mgcv)
linear_model = gam(y_obs~x)
model_summary=summary(linear_model)
print(model_summary)
data_plot = data_plot+
             geom_line(colour="red",
             aes(y=fitted(linear_model)))
print(data_plot)

We can see in the model summary that our linear model explains quite a bit of variance (R2(adj) = 0.639), however, diagnostic plots of the model residuals would quickly show that the error variance is not normally distributed nor homoscedastic, and that an important non-linear pattern remains. Let's now try to fix this by fitting the data using a smoothed (non-linear) term.

We will revisit this later, but briefly, GAMs are effectively a nonparametric form of regression where the beta*xi of a linear regression is replaced by an arbitrary smooth functions of the explanatory variables, f(xi), and the model becomes:

{y_i} = f(x_{i}) + {ε_i}

where yi is the response variable, xi covariate, and f is the smooth function.

Importantly, given that the smooth function f(xi) is non-linear and local, the magnitude of the effect of the explanatory variable can vary over its range, depending on the relationship between the variable and the response. That is, as opposed to one fixed coefficient beta, the function f can continually change over the range of xi. The degree of smoothness (or wiggliness) of f is controlled using penalized regression determined automatically in mgcv using a generalized cross-validation (GCV) routine (Wood 2006).

In gam() smooth terms are specified by expressions of the form: s(x) where x is the covariate which the smooth is a function of.

| GAM
gam_model = gam(y_obs~s(x))
summary(gam_model)
data_plot = data_plot +  
     geom_line(colour="blue",aes(y=fitted(gam_model)))
print(data_plot)

The variance explained by our model has increased by 20% (R2(adj) = 0.859) and when we compare the fit of the linear (red) and non-linear (blue) models, it is clear that the latter is the winner.

The mgcv package also includes a default plot to look at the smooths:

| smooth plot
plot(gam_model)

How do we test whether the non-linear model offers a significant improvement over the linear model? We can use the gam() and anova() commands to formally test whether an assumption of linearity is justified. To do, we must simply set our smoothed model so that it is nested in our linear model; that is, we create model object that includes both x (linear) and s(x) (non-linear) and we ask whether adding s(x) to the model with only x as a covariate is supported by the data.

| Test for linearity
linear_model = gam(y_obs~x)
nested_gam_model = gam(y_obs~s(x)+x)
print(anova(linear_model, nested_gam_model, test="Chisq"))

The non-linear term is significant:

 Analysis of Deviance Table
 Model 1: y_obs ~ x
 Model 2: y_obs ~ s(x) + x
       Resid. Df    Resid. Dev     Df        Deviance    Pr(>Chi)    
 1    248.00        6.5846                              
 2    240.68        2.4988         7.3168    4.0858      < 2.2e-16 ***
 ---
 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that the model y_obs~s(x) gives exactly the same results as y_obs~s(x)+x, as shown above for the nested_gam_model. We used the s(x)+x to illustrate the nestedness of the model, however, we will omit the +x in the nested model comparisons that follow.


CHALLENGE 1

We will now try this test with some new simulated data, just to get a handle on it. First generate the data using the code below, then fit a linear and smoothed GAM model to the relation between x_test and y_test_obs. What is the effective degrees of freedom of the smoothed term? Determine if linearity is justified for this data.

Generate data
n <- 250
x_test <- runif(n,-5,5)
y_test_fit <- 4*dnorm(x_test)
y_test_obs <- rnorm(n,y_test_fit, 0.2)

Challenge 1 Solution

GAMs make it easy to include both smooth and linear terms, multiple smoothed terms, and smoothed interactions. For this section, we'll be using a dataset automatically generated by the gamSim() command of the mgcv package. For further details on what type of data you can simulate with this function, see ?gamSim. For this section, we will use the example 5 (eg=5) of gamSim, which simulates an additive example plus a factor variable.

| Generate non-linear data
gam_data = gamSim(eg=5)
head(gam_data) 

We will now look at how we can predict y based on the other variables. Start with a basic model, with one smoothed term (x1) and one categorical predictor (x0, which has 4 levels).

| Generate non-linear data
basic_model = gam(y~x0+s(x1), data= gam_data)
basic_summary = summary(basic_model)
print(basic_summary$p.table)
print(basic_summary$s.table)
plot(basic_model)

Here, the p.table provides the significance table for each parametric term and the s.table is the significance table for smooths. Note that for the latter, the wiggleness of the smoothed term s(x1) is indicated by the edf parameter (effective degrees of freedom); the higher the edf, the more non-linear the smoothing spline. A high value (8–10 or higher) means that the curve is highly non-linear, whereas a smoother with 1 effective degree of freedom is a straight line. In contrast, in linear regression the model degrees of freedom is equivalent to the number of non-redundant free parameters, p, in the model (and the residual degrees of freedom are given by n-p). We will revisit the edf later in this workshop.

 print(basic_summary$p.table)
             Estimate Std. Error   t value     Pr(>|t|)
 (Intercept) 8.550030  0.3655849 23.387258 1.717989e-76
 x02         2.418682  0.5165515  4.682364 3.908046e-06
 x03         4.486193  0.5156501  8.700072 9.124666e-17
 x04         6.528518  0.5204234 12.544629 1.322632e-30
 > print(basic_summary$s.table)
            edf   Ref.df        F      p-value
 s(x1) 1.923913 2.406719 42.84268 1.076396e-19

In our basic model the edf of smooth function s(x1) is ~2, which suggests a non-linear curve. The plot of the model nicely illustrates the shape of this non-linear smoother:

We can add a second term, x2, but specify a linear relationship with y (that is, both linear and non-linear terms can be included in GAM). The new linear term, x2, will appear in the p.table, for which a regression coefficient estimate is indicated. In the s.table, we will once again find the non-linear smoother, s(x1), and its wiggleness parameter.

| two term
two_term_model <- gam(y~x0+s(x1)+x2, data=gam_data)
two_term_summary <- summary(two_term_model)
print(two_term_summary$p.table)
print(two_term_summary$s.table)

If we wish to explore whether the relationship between y and x2 is non-linear, we can model x2 as non-linear smooth term instead. As before, we can use an ANOVA to test if the smoothed term is necessary.

| two smooth
two_smooth_model <- gam(y~x0+s(x1)+s(x2), data=gam_data)
two_smooth_summary <- summary(two_smooth_model)
print(two_smooth_summary$p.table)
print(two_smooth_summary$s.table)
plot(two_smooth_model,page=1)

When more than one covariable is included in the model, as above, the fitted response can be partitioned into the contributions of each variable as shown. Here we can appreciate the varying magnitude of the effect of each variable; where the y-axis represents the contribution (effect) of each covariate to the fitted response, centered on 0. If the confidence intervals had overlapped with zero for certain values of x (or throughout the entire range), this would imply a non-significant effect at those x values (or of x in entirety). When the contribution for an individual covariate changes along the range of x-axis, the change in that covariate is associated with a change in the response.

| ANOVA
anova(basic_model,two_term_model,two_smooth_model, test="Chisq")
  Analysis of Deviance Table
  Model 1: y ~ x0 + s(x1)
  Model 2: y ~ x0 + s(x1) + x2
  Model 3: y ~ x0 + s(x1) + s(x2)
    Resid. Df Resid. Dev      Df Deviance  Pr(>Chi)    
  1    394.08     5231.6                               
  2    393.10     4051.3 0.97695   1180.2 < 2.2e-16 ***
  3    385.73     1839.5 7.37288   2211.8 < 2.2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The best fit model is the model with both smooth terms for x1 and x2.


CHALLENGE 2

Create two new models, with x3 as a linear and smoothed term. Use plots, coefficient tables and the anova function to determine if x3 is an important term to include.

Challenge 2 solution

There are two ways to include interactions between variables:

  • if one variable is smoothed, and the other isn't: using the 'by' argument,
  • fully smoothing both variables.

Where the 'by' argument lets you have a smooth term vary between different levels of a factor. We will examine this using our categorical variable x0 and ask whether the non-linear smoother s(x2) varies across different levels of x0. To determine whether smooth terms differ significantly among factor levels, we will use an ANOVA test on the interaction term.

| categorical interaction
categorical_interact <- gam(y~x0+s(x1)+s(x2,by=x0),data=gam_data)
categorical_interact_summary <- summary(categorical_interact)
print(categorical_interact_summary$s.table)
plot(categorical_interact,page=1)
# or alternatively: plot using vis.gam function, where theta is the degree rotation on the x-y plane
vis.gam(categorical_interact,view=c("x2","x0"),theta=40,n.grid=500,border=NA) 
anova(two_smooth_model, categorical_interact,test="Chisq")

Visually, we see that the shape of smooth terms are comparable among the four levels of x0. The anova test confirms this as well (Deviance = 98.6, p = 0.2347).

Finally we'll look at the interactions two smooth terms, x1 and x2. Here the 'by' argument is dropped (both quantitative variables).

| smooth interact
smooth_interact <- gam(y~x0+s(x1,x2),data=gam_data)
smooth_interact_summary <- summary(smooth_interact)
print(smooth_interact_summary$s.table)
plot(smooth_interact,page=1,scheme=3)
# plot(smooth_interact,page=1,scheme=1) will give a similar plot to the vis.gam()
vis.gam(smooth_interact,view=c("x1","x2"),theta=40,n.grid=500,border=NA) 
anova(two_smooth_model,smooth_interact,test="Chisq")

The interaction between s(x1) and s(x2) is significant and the 2D plot nicely illustrates this non-linear interactions, where y is greatest at high values of x1 but low to mid-values of x2 (try playing around with theta to rotate the figure around. If you plan to run a lot of these, delete the argument n.grid=500 to make the plots run faster).

We won't go too much further into this in this workshop, but you should know you can expand on the basic model we've looked at today with:

  • more complicated smooths, by changing the basis,
  • other distributions: anything you can do with a GLM using the family argument,
  • mixed effect models, using gamm or the gamm4 package.

We will first have a look at changing the basis, followed by a quick intro to using other distribution and GAMMs (Generalized Additive Mixed Models). Let's look at one place where knowing how to change the basis can help: cyclical data. That is, data where you want the predictor to match at the ends. Imagine you have a time series of climate data, broken down by monthly measurement, and you want to see if there's a time-trend in yearly temperature. We'll use the Nottingham temperature time series for this:

| cyclic basis
data(nottem)
n_years <- length(nottem)/12
nottem_month <- rep(1:12, times=n_years)
nottem_year <- rep(1920:(1920+n_years-1),each=12)
nottem_plot <- qplot(nottem_month,nottem, 
                    colour=factor(nottem_year), 
                    geom="line") + theme_bw()
print(nottem_plot)

Using the nottem data, we have created three new vectors; n_years corresponds to the number of years of data (20 years), nottem_month is a categorical variable coding for the 12 months of the year, for every year sampled (sequence 1 to 12 repeated 20 times), and nottem_year is a variable where the year corresponding to each month is provided.

Monthly data from the years 1920 through to 1940:

To model this, we need to use what's called a cyclical cubic spline, or “cc”, to model month effects as well as a term for year.

| cyclic gam
year_gam <- gam(nottem~s(nottem_year)+s(nottem_month, bs="cc"))
summary(year_gam)$s.table
plot(year_gam,page=1, scale=0)

There is about 1-1.5 degree rise in temperature over the period, but within a given year there is about 20 degrees variation in temperature, on average. The actual data vary around these values and that is the unexplained variance. Here we can see one of the very interesting bonuses of using GAMs. We can either plot the response surface (fitted values) or the terms (contribution of each covariate) as shown here. You can imagine these as plots of the changing regression coefficients, and how their contribution (or effect size) varies over time. In the first plot, we see that positive contributions of temperature occurred post-1930.

Over longer time scales, for example using paleolimnological data, others (Simpson & Anderson 2009; Fig. 3c) have used GAMs to plot the contribution (effect) of temperature on algal assemblages in lakes, to illustrate how significant contributions only occurred during two extreme cold events (that is, the contribution is significant when the confidence intervals do not overlap zero, at around 300 and 100 B.C.). This allowed the authors to not only determine how much variance was explained by temperature over the last few centuries, but also pinpoint when in time this effect was significant. If of interest to you, the code to plot either the response surface (type = “response”) or the terms (type = “terms”) is given below. When terms is selected, you obtain the same figure as above.

Contribution plot

To provide a brief overview on how to use GAMs when the response variable does not follow a normal distributions and is either count or proportion data (e.g., Gamma, binomial, Poisson, negative binomial), the example that follows uses a dataset where a binomial family distribution is needed and a non-linear relationship with the explanatory variable is evident. Here, the response variable represents the number of successes (event happened) versus failures over the course of an experiment.

Loading the data
gam_data3 <- read.csv("other_dist.csv")
summary(gam_data3)
str(gam_data3)
'data.frame':	514 obs. of  4 variables:
 $ prop : num  1 1 1 1 0 1 1 1 1 1 ...
 $ total: int  4 20 20 18 18 18 20 20 20 20 ...
 $ x1   : int  550 650 750 850 950 650 750 850 950 550 ...
 $ fac  : Factor w/ 4 levels "f1","f2","f3",..: 1 1 1 1 1 1 1 1 1 1 ...

prop is the response variable, and is equal to the proportion of successes/ (successes + failures). Note that there are numerous cases where the proportion equals 1 or 0 which indicates that the outcomes were always successes or failures, respectively, at a given time point in the experiment.
x1 is the time since the start of experiment (our explanatory variable).
total represents the number of successes + failures observed at any time point of the experiment.
fac is a factor coding for trials 1 through 4 of the experiment (we will not use this variable in this section).

Let's start by visualizing the data. We are interested in the number of successes in comparison to failures as x1 increases. We will calculate the grand averages of the proportions of successes per time bin (x1) given that there are repeated measures per x1 value (numerous trials and observations per trial).

Data visualization
emptyPlot(range(gam_data3$x1), c(0,1), h=.5,
          main="Probability of successes", ylab="Probability",xlab="x1")
 
avg <- aggregate(prop ~ x1, data=gam_data3, mean, na.rm=TRUE)
lines(avg$x1, avg$prop, col="orange",lwd=2)

As x1 increases, so does the probability of successes. Would you say that this trend is linear or non-linear? We will test this using a logistic GAM (we use a binomial family distribution given that our response is proportion data).

Logistic GAM
prop_model <- gam(prop~ s(x1), data=gam_data3, weights=total, family="binomial")
prop_summary <- summary(prop_model)
print(prop_summary$p.table)
print(prop_summary$s.table)
 
plot(prop_model)
             Estimate  Std. Error  z value   Pr(>|z|)
(Intercept)  1.173978  0.02709613  43.32641  0
        edf       Ref.df    Chi.sq     p-value
s(x1)   4.591542  5.615235  798.9407   2.027751e-164

What does the intercept represent in this model?

  • Recall that the model uses the count outcomes to calculate the logit, which is the log odds ratio between successes and failures:

logit = log({N_{success}/N_{failures}})

- If successes = failures, the ratio is 1 and the logit is 0 (log(1) = 0).
- If successes have a larger count than failures, the ratio is greater than 1 and the logit has a positive value (e.g., log(2) = 0.69).
- If successes have a smaller count than failures, the ratio is lower than 1 and the logit has a negative value (e.g., log(0.5) = -0.69).

Thus, the intercept is the log odds ratio of successes to failures (logit), and indicates whether on average there are more successes than failures. Here, the estimated intercept coefficient is positive, which means that there are more successes than failures overall.

What does the smooth term indicate?

  • This represents how the log odds of successes vs failures changes over x1 (time in this example). So, we see that since the edf>1, the proportion of successes increases faster over time (if for example the response represents the count of species A versus species B, and nutrient concentrations are increased over time, these results would indicate that species A is increasingly observed as nutrient concentrations approach its niche optima over the course of the experiment).

Visualizing the trend over time

Lastly, we will see the different ways this relationship could be represented graphically.

Plotting the trend
par(mfrow=c(1,2))
plot(prop_model, select=1, scale=0, shade=TRUE)
abline(h=0)
 
out <- plot_smooth(prop_model, view="x1",main="")
(diff <- find_difference(out$fv$fit, out$fv$CI, xVals=out$fv$x1))
addInterval(0, lowVals=diff$start, highVals = diff$end, col='red', lwd=2)
abline(v=c(diff$start, diff$end), lty=3, col='red')
text(mean(c(diff$start, diff$end)), 2.1, "sign. more \n success", col='red', font=3)

What do these plots tell us about successes versus failures?

  • Left plot: contribution/ partial effect (if we had more than one explanatory variable). Over time the log odds increases, so over time successes increase and failures decrease.
  • Right plot: fitted values, intercept included (summed effect if we had more than one explanatory variable included in the model). Here we see that the log odds ratio is estimated around zero at the start of the experiment; this means there are equal amounts of successes and failures. Gradually successes increase, and at around x1=400 there are significantly more successes than failures (the smooth is significantly different from zero). We also illustrated how we could use the plot to determine at which x1 value this occurs.

Lastly, to help interpret the results, we could transform the summed effects back to proportions with the function plot_smooth from the itsadug package:

Transformed plot
par(mfrow=c(1,1))
plot_smooth(prop_model, view="x1", main="",
            transform=plogis, ylim=c(0,1))
abline(h=.5, v=diff$start, col='red', lty=2)

As we already derived from the logit plot, we see that at around x1=400 the proportion of successes increases significantly above 0.5.

Dealing with non-independence

When observations are not independent, GAMs can be used to either incorporate:

  • a serial correlation structure to model residual autocorrelation (autoregressive AR, moving average MA, or a combination of the two ARMA),
  • random effects that model independence among observations from the same site.

That is, in addition to changing the basis as with the nottem example above, we can also add complexity to the model by incorporating an autocorrelation structure or mixed effects using the gamm or the gamm4 package. To start, let's have a look at the first case; a model with temporal autocorrelation in the residuals. We will revisit the Nottingham temperature model and test for correlated errors using the (partial) autocorrelation function.

Correlated errors
par(mfrow=c(1,2))
acf(resid(year_gam), lag.max = 36, main = "ACF")
pacf(resid(year_gam), lag.max = 36, main = "pACF")

The autocorrelation function plot (ACF; first panel above) provides the cross-correlation of a time series with itself at different points in time (i.e. similarity between observations at increasingly large time lags). In contrast, the partial autocorrelation function (PACF) gives the partial correlation of a time series with its own lagged values after controlling for the values of the time series at all shorter lags. The ACF and pACF plots are thus used to identify after how many time steps do observations start to be independent from one another. The ACF plot of our model residuals suggests a significant lag of 1, and perhaps a lag of 2. Therefore, a low-order AR model is likely needed. We can test this by adding AR structures to the Nottingham temperature model, one with an AR(1) (correlation at 1 time step) and one with an AR(2) (correlation at two times steps), and test for the best-fit model using ANOVA.

AR models
year_gam <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"))
year_gam_AR1 <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"),
                     correlation = corARMA(form = ~ 1|nottem_year, p = 1))
year_gam_AR2 <- gamm(nottem~s(nottem_year)+s(nottem_month, bs="cc"),
                     correlation = corARMA(form = ~ 1|nottem_year, p = 2))
anova(year_gam$lme,year_gam_AR1$lme,year_gam_AR2$lme)
                 Model df      AIC      BIC    logLik   Test   L.Ratio p-value
year_gam$lme         1  5 1109.908 1127.311 -549.9538                         
year_gam_AR1$lme     2  6 1101.218 1122.102 -544.6092 1 vs 2 10.689206  0.0011
year_gam_AR2$lme     3  7 1101.598 1125.962 -543.7988 2 vs 3  1.620821  0.2030

The AR(1) provides a significant increase in fit over the naive model (LRT = 10.69, p = 0.0011), but there is very little improvement in moving to the AR(2) (LRT = 1.62, p = 0.203). So it is best to include only the AR(1) structure in our model.

Mixed modelling

As we saw in the previous section, bs specifies the type of underlying base function. For random intercepts and linear random slopes we use bs=“re”, but for random smooths we use bs=“fs”.

Three different types of random effects are distinguished when using GAMMs (where fac is a categorial variable coding for the random effect and x0 is a continuous fixed effect):

  • random intercepts adjust the height of other model terms with a constant value: s(fac, bs=“re”)
  • random slopes adjust the slope of the trend of a numeric predictor: s(fac, x0, bs=“re”)
  • random smooths adjust the trend of a numeric predictor in a nonlinear way: s(x0, factor, bs=“fs”, m=1), where the argument m=1 sets a heavier penalty for the smooth moving away from 0, causing shrinkage to the mean.

We will first examine a GAMM with only a random intercept. As before, we will use the gamSim() function to automatically generate a dataset, this time with a random effect component generate, then run a model with a random intercept using fac as the random factor.

random intercept
# generate and view data
gam_data2 <- gamSim(eg=6)
head(gam_data2)
 
# run random intercept model
gamm_intercept <- gam(y ~ s(x0) + s(fac, bs="re"), data=gam_data2)
 
# examine model output
summary(gamm_intercept)$s.table

Note that there is now a smoother term for the random intercept in the summary table. You can plot and view the random intercepts for each level of fac as follows:

plot of random intercepts
plot(gamm_intercept, select=2) 
# select=2 because the random effect appears as the second entry in the summary table.

We can also use the plot_smooth function to visualize the model, which in contrast to the default plot.gam, allows us to plot a smooth of the summed effects of a GAM (based on predictions) as we saw earlier, but in addition, optionally removes the random effects. Here, we will plot the summed effects for the x0 without random effects, and then plot the predictions of all four levels of the random fac effect:

| plot factor levels
par(mfrow=c(1,2), cex=1.1)
plot_smooth(gamm_intercept, view="x0", rm.ranef=TRUE, main="intercept + s(x1)", rug=FALSE)
plot_smooth(gamm_intercept, view="x0", cond=list(fac="1"), 
            main="... + s(fac)", col='orange', ylim=c(8,21), rug=FALSE)
plot_smooth(gamm_intercept, view="x0", cond=list(fac="2"), add=TRUE, col='red')
plot_smooth(gamm_intercept, view="x0", cond=list(fac="3"), add=TRUE, col='purple')
plot_smooth(gamm_intercept, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise')

Next we will run and plot a model with random slopes:

| random slope
gamm_slope <- gam(y ~ s(x0) + s(x0, fac, bs="re"), data=gam_data2)
summary(gamm_slope)$s.table
 
plot_smooth(gamm_slope, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE)
plot_smooth(gamm_slope, view="x0", cond=list(fac="1"), 
            main="... + s(fac)", col='orange',ylim=c(7,22), rug=FALSE)
plot_smooth(gamm_slope, view="x0", cond=list(fac="2"), add=TRUE, col='red')
plot_smooth(gamm_slope, view="x0", cond=list(fac="3"), add=TRUE, col='purple')
plot_smooth(gamm_slope, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise')

We will now include both a random intercept and slope term.

random intercept and slope
gamm_int_slope <- gam(y ~ s(x0) + s(fac, bs="re") 
                      + s(fac, x0, bs="re"), data=gam_data2)
summary(gamm_int_slope)$s.table
 
plot_smooth(gamm_int_slope, view="x0", rm.ranef=TRUE, main="intercept + s(x0)", rug=FALSE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="1"), 
            main="... + s(fac) + s(fac, x0)", col='orange', ylim=c(7,22), rug=FALSE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="2"), add=TRUE, col='red', xpd=TRUE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="3"), add=TRUE, col='purple', xpd=TRUE)
plot_smooth(gamm_int_slope, view="x0", cond=list(fac="4"), add=TRUE, col='turquoise', xpd=TRUE)

Note that the random slope is static in this case:

plot of random slopes
plot(gamm_int_slope, select=3) 
# select=3 because the random slope appears as the third entry in your summary table.

Lastly, we will examine a model with a random smooth.

random smooth
gamm_smooth <- gam(y ~ s(x0, fac, bs="fs", m=1), data=gam_data2)
summary(gamm_smooth)$s.table

Here, if the random slope varied along x0, we would see different curves for each level:

plot of random slopes
plot(gamm_smooth, select=1) 
# select=1 because the smooth slope appears as the first entry in your summary table.

All of the above models can in turn be compared using anova() as in the previous sections to determine the best fit model.

We will now take a few minutes to look at what GAMs are doing behind the scenes. Lets first consider a model containing one smooth function of one covariate, xi:

{y_i} = f(x_{i}) + {ε_i}

To estimate the smooth function f, we need to represented the above equation in such a way that it becomes a linear model. This can be done by choosing a basis, bi(x), defining the space of functions of which f is an element:

f(x) = sum{i=1}{q}{b_{i}(x)β_{i}}

A simple example: a polynomial basis

Suppose that f is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains f. A basis for this space would then be:

b_{1}(x)=1,   b_{2}(x)=x,   b_{3}(x)=x^{2},   b_{4}(x)=x^{3},   b_{5}(x)=x^{4}

so that f(x) becomes:

f(x) = β_{1} + x_{i}β_{2} +  x^{2}_{i}β_{3} + x^{3}_{i}β_{4}(x) + x^{4}_{i}β_{5}

and the full model now becomes:

y_{i} = β_{1} + x_{i}β_{2} +  x^{2}_{i}β_{3} + x^{3}_{i}β_{4}(x) + x^{4}_{i}β_{5} + ε_{i}

The basis functions are each multiplied by a real valued parameter, βi, and are then summed to give the final curve f(x).

By varying the βi we can vary the form of f(x) to produce any polynomial function of order 4 or lower.

Another example: a cubic spline basis

A cubic spline is a curve constructed from sections of a cubic polynomial joined together so that they are continuous in value. Each section of cubic has different coefficients.

Here's a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:

In this example, the knots are evenly spaced through the range of observed x values. However, the choice of the degree of model smoothness is controlled by the the number of knots, which was arbitrary. Is there a better way to select the knot locations?

Controlling the degree of smoothing with penalized regression splines

Instead of controlling smoothness by altering the number of knots, we keep that fixed to size a little larger than reasonably necessary, and control the model’s smoothness by adding a “wiggleness” penalty.

So, rather than fitting the model by minimizing (as with least squares regression):

||y - XB||^{2}

it can be fit by minimizing:

||y - XB||^{2} + {lambda}int{0}{1}{[f^{''}(x)]^{2}dx}

As lambda goes to infty, the model becomes linear. The selection of the best fit smoothing parameter, lambda, makes use of a cross-validation approach. If lambda is too high then the data will be over smoothed, and if it is too low then the data will be under smoothed. Ideally, it would be good to choose lambda so that the predicted f is as close as possible to f. A suitable criterion might be to choose lambda to minimize:

M = 1/n sum{i=1}{n}{(hat{f_{i}} - f_{i})^{2}}

Since f is unknown, M is estimated using a generalized cross validation technique that leaves out each datum from the data in turn and considers the average ability of models fitted to the remaining data to predict the left out datum (for further details, see Wood 2006).

Illustration of the principle behind cross validation

In the first panel, the curve fits many of the data poorly and does no better with the missing point. In the third panel, the curve fits the noise as well as the signal and the extra variability induced causes it to predict the missing datum rather poorly. In the second panel, however, we see that the curve fits the underlying signal quite well, smoothing through the noise and the missing datum is reasonably well predicted.

Brief note on effective degrees of freedom (edf)

How many degrees of freedom does a fitted GAM have?

Instead of providing the output of the cross-validation in terms of lambda (some tuning parameter modulating the fit’s complexity), the GAM function in the mgcv package uses a term called the effective degrees of freedom (edf) as a consistent way to quantify model complexity (that is, to justify our intuition that the degrees of freedom tracks model complexity). Because the number of free parameters in GAMs is difficult to define, the edf are instead related to lambda, where the effect of the penalty is to reduce the effective degrees of freedom. So let's say the arbitrarily large number of knots is set to k=10, then k-1 sets the upper limit on the degrees of freedom associated with a smooth term. This number then decreases as the penalty lambda increases until the best fit penalty is found by cross-validation.

There's a great deal more out there on GAMs… this was just the very surface.

Simon Wood, the author of the mgcv package has a very useful website with introductory talks and notes on how to use GAMs:

He's also written a book, Generalized Additive Models: An Introduction with R, which we used as reference for this workshop.

Material from this workshop were also obtained from the following fantastic blogs and tutorials:

Finally, the help pages, available through ?gam in R, are an excellent resource.