```
setwd("F:/Google Drive/Courses and lab meetings/2018 courses/Statistical methods/entom4940/Mixed models")
```

```
library(dplyr)  # provides %>% and select()

data <-
  "exerc1.csv" %>%
  read.csv %>%
  select(IQ, sex, age, ses, distcat, school, BMI, Pb)
```

# Introduction to mixed models

*“Mixed modeling is rarely, if ever, covered in even upper-level statistics courses. Trying to learn it on your own is like prepping to perform your own appendectomy – and just about as painful…”*

– https://thecraftofstatisticalanalysis.com/random-intercept-random-slope-models/

Mixed models are a form of linear modeling used for hierarchical data when the response variable has a normal distribution and the predictor variables are a mix of fixed and random effects. These models are also good when data points might not be fully independent of each other, for example students grouped into school or plants grouped into quadrats.

### Repeated measures

When analyzing data that involves repeated measures for the same subject, mixed models can be a better choice than a repeated measures ANOVA for a few reasons, including:

- A mixed model can handle missing values, but a repeated measures ANOVA must drop the subject entirely if it is missing even a single measurement.
- A mixed model can handle hierarchical clustering, but a repeated measures ANOVA cannot.
- Repeated measures can be spaced at irregular intervals when using a mixed model.

### Hierarchical data

Hierarchical data have grouping factors such as populations, species, or sites, where each observation belongs to exactly one group at each level. This is different from clustered data, where individual points may belong to more than one group at the same level.

An example of hierarchical data might be:

- students clustered within schools, which are in turn clustered within districts

### Fixed and random effects

A mixed model is a type of linear model with multiple predictor variables.

`Y ~ B0 + B1 X1 + B2 X2 + ... + E`

The coefficients tell you how much the response variable changes for one unit of change in the predictor variables. `B0` is the intercept, and `E` is the error term – the amount of variability not explained by your model.

What makes something a mixed model is that the various effect terms (e.g. `B1`, `B2`) are a mix of fixed and random effects.

**Fixed effects** are the same no matter how much or where you sample. For example, `sex` is a fixed effect because you have your values `male`, `female`, and `other`, and no matter how much data you sample you’ll still have those same factor levels.

**Random effects** are factors whose levels are samples from a larger population about which we’re trying to draw conclusions. Individual `subject ID`, for example, is a random effect because a) you could go out and sample more individuals who would be different from the ones you have, and b) you’re inferring something about the whole population from the sub-group included in your analysis.

Another way to think about random effects is that they capture variation that exists but that isn’t relevant to your question, like variation between individual subjects for which you have repeated measures. Where you have non-independence or pseudoreplication, include a random effect for that element even if it is not significant in the model, so that reviewers know you properly accounted for it.

### Interaction terms

An interaction occurs when one variable changes how another variable affects the response variable. When including an interaction term, be sure to leave in lower order terms for each interaction.
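The "leave in lower order terms" rule is built into R's formula syntax: the `*` operator expands to the main effects plus the interaction, so `a * b` and `a + b + a:b` build the same model matrix. A minimal base-R sketch with hypothetical variables `a` and `b`:

```r
# Hypothetical toy data: a binary predictor and a continuous predictor
d <- data.frame(a = c(0, 0, 1, 1), b = c(1.2, 3.4, 5.6, 7.8))

# `a * b` expands to the main effects plus the interaction column a:b
colnames(model.matrix(~ a * b, data = d))
# "(Intercept)" "a" "b" "a:b"
```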

### Nested vs crossed

An example from https://www.theanalysisfactor.com/the-difference-between-crossed-and-nested-factors/:

Imagine you have 18 people assigned to two training groups, and their BMI is measured at three time points. There are both **crossed** and **nested** factors in this experiment.

Crossed: Subject is crossed with time because every combination of subject and time point is represented by a value of BMI.

Nested: Each subject is assigned to only one training group, and so not every possible combination of subject and group is represented by a value of BMI. By knowing the subject ID, you know exactly which training group they belong to. Thus subject is nested within training group.

For crossed factors, you can look at an interaction term. For nested factors, you can’t.
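One way to check whether two factors are crossed or nested is to cross-tabulate them: crossed factors fill every cell of the table, while a nested factor appears in only one cell per row. A base-R sketch using a layout like the one above (6 subjects instead of 18, made up for brevity):

```r
# Hypothetical design: 6 subjects measured at 3 time points,
# each subject assigned to one of two training groups
d <- expand.grid(subject = 1:6, time = 1:3)
d$group <- ifelse(d$subject <= 3, "A", "B")

# Crossed: every subject x time combination is observed
all(table(d$subject, d$time) > 0)                  # TRUE

# Nested: each subject occurs in exactly one group
all(rowSums(table(d$subject, d$group) > 0) == 1)   # TRUE
```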

### Overfitting

Avoid overfitting, which occurs when your model has too many predictor variables for its sample size. An overfit mixed model will often produce a warning that it failed to converge.

# Data exploration and rescaling

Categorical variables need to be coded consistently before analysis (`as.factor`, which converts a variable to a factor, is one way to do this in R). In this data, sex is already recoded from M/F to 0/1, and socioeconomic status (ses) from low, medium, and high to 0, 1, and 2.

Standardizing your explanatory variables to a mean of zero and a standard deviation of one (using the function `scale`) is also useful. It makes it easier to compare effect sizes because all your estimated coefficients are on the same scale. Centering in this way also helps with interpreting the intercept of your model.
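As a quick check of what `scale` does (it subtracts the mean, then divides by the standard deviation), using made-up numbers:

```r
x <- c(150, 160, 170, 180, 190)   # hypothetical values, e.g. heights
z <- as.numeric(scale(x))         # equivalent to (x - mean(x)) / sd(x)

mean(z)   # 0
sd(z)     # 1
```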

Let’s take a look at the data:

```
str(data)
```

## 'data.frame':    457 obs. of  8 variables:
##  $ IQ     : int  99 103 134 71 97 111 100 111 99 110 ...
##  $ sex    : int  1 1 0 0 0 0 0 1 1 1 ...
##  $ age    : num  6.67 7 7.17 6.92 8.5 ...
##  $ ses    : int  1 1 0 1 1 0 1 1 1 1 ...
##  $ distcat: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ school : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ BMI    : num  17 20 13 16.3 16.5 ...
##  $ Pb     : num  6.47 5.93 12.96 10.39 9.15 ...

```
summary(data)
```

##        IQ           sex             age             ses       
##  Min.   : 55   Min.   :0.000   Min.   :6.167   Min.   :0.000  
##  1st Qu.: 93   1st Qu.:0.000   1st Qu.:6.667   1st Qu.:1.000  
##  Median :103   Median :0.000   Median :6.917   Median :1.000  
##  Mean   :102   Mean   :0.442   Mean   :6.945   Mean   :0.831  
##  3rd Qu.:112   3rd Qu.:1.000   3rd Qu.:7.167   3rd Qu.:1.000  
##  Max.   :145   Max.   :1.000   Max.   :8.500   Max.   :2.000  
##  NA's   :7                                     NA's   :19     
##     distcat           school            BMI              Pb        
##  Min.   :0.0000   Min.   : 1.000   Min.   :12.74   Min.   : 1.99  
##  1st Qu.:0.0000   1st Qu.: 3.000   1st Qu.:15.26   1st Qu.: 7.60  
##  Median :1.0000   Median : 7.000   Median :16.28   Median :10.57  
##  Mean   :0.5733   Mean   : 7.112   Mean   :16.94   Mean   :11.98  
##  3rd Qu.:1.0000   3rd Qu.:11.000   3rd Qu.:17.83   3rd Qu.:14.30  
##  Max.   :1.0000   Max.   :14.000   Max.   :29.21   Max.   :43.82  
##                                    NA's   :3       NA's   :4      

It is important to visually examine your data to ensure you select an appropriate analysis. For mixed models, your response variable should be normal, but your predictor variables don’t have to be. Check that the response variable is normal:

```
library(ggplot2)

ggplot(data, aes(IQ)) + geom_histogram() + ggtitle("Distribution of the response variable")
```

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 7 rows containing non-finite values (stat_bin).

This looks pretty good. Now let’s check whether we want to rescale any of the variables. Age, BMI, and Pb are continuous variables and so seem like good candidates.

```
ggplot(data, aes(age)) + geom_histogram()
```

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

```
ggplot(data, aes(BMI)) + geom_histogram()
```

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3 rows containing non-finite values (stat_bin).

```
ggplot(data, aes(Pb)) + geom_histogram()
```

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 4 rows containing non-finite values (stat_bin).

```
data <-
data %>%
mutate(
age = as.numeric(scale(age)),
BMI = as.numeric(scale(BMI)),
Pb = as.numeric(scale(Pb))
)
```

# Analysis

## Plain linear model

Does the distance category of a school from a foundry, which could potentially be causing lead pollution, reduce the IQ of schoolchildren going to that school?

Just looking at a linear model of data from 14 schools near the foundry, you would think so, but there are some problems with this analysis.

```
m.linear <- lm(IQ ~ sex + age + ses + Pb + distcat, data = data)
summary(m.linear)
```

## 
## Call:
## lm(formula = IQ ~ sex + age + ses + Pb + distcat, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.368  -9.048   0.653   9.698  37.630 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   99.170      1.727  57.410  < 2e-16 ***
## sex           -2.821      1.448  -1.949  0.05200 .  
## age           -1.384      0.736  -1.880  0.06073 .  
## ses            2.046      1.314   1.557  0.12029    
## Pb            -2.135      0.814  -2.623  0.00903 ** 
## distcat        4.388      1.649   2.662  0.00807 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 14.89 on 429 degrees of freedom
##   (22 observations deleted due to missingness)
## Multiple R-squared:  0.08475, Adjusted R-squared:  0.07408 
## F-statistic: 7.945 on 5 and 429 DF,  p-value: 3.566e-07

```
anova(m.linear)
```

## Analysis of Variance Table
## 
## Response: IQ
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## sex         1    921   921.3  4.1559  0.042103 *  
## age         1   1107  1106.5  4.9915  0.025986 *  
## ses         1    946   945.8  4.2667  0.039465 *  
## Pb          1   4262  4261.8 19.2251 1.463e-05 ***
## distcat     1   1570  1570.3  7.0837  0.008071 ** 
## Residuals 429  95100   221.7                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This analysis yields incorrect standard errors because it violates the assumption that the residuals are independent. The subjects are grouped into schools in this dataset, and that grouping is not accounted for by a linear model. Note the inflated degrees of freedom for distcat that result from its pseudoreplication.

## Mixed modeling

Because the data are hierarchical, with multiple measurements at multiple levels of organization (e.g. school and student), a mixed model would be better here. We’ll use the `nlme` package for our models, but the `lme4` package has a very similar format.

To estimate the variability in IQ within vs. between schools, we can run a model with only the random effect `school`. This lets the intercept vary with `school`.

```
# using nlme package
library(nlme)
m1 <- lme(IQ ~ 1, random = ~1|school, method = "REML", na.action = na.omit, data = data)
summary(m1)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3713.927 3726.249 -1853.964
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    5.290319 14.58551
## 
## Fixed effects: IQ ~ 1 
##                Value Std.Error  DF  t-value p-value
## (Intercept) 102.5002  1.586903 436 64.59134       0
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.48155658 -0.57505992  0.03681583  0.66896768  2.54776677 
## 
## Number of Observations: 450
## Number of Groups: 14

```
# using lme4 package
library(lme4)
m2 <- lmer(IQ ~ 1 + (1|school), REML = TRUE, na.action = na.omit, data = data)
summary(m2)
```

## Linear mixed model fit by REML ['lmerMod']
## Formula: IQ ~ 1 + (1 | school)
##    Data: data
## 
## REML criterion at convergence: 3707.9
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.4816 -0.5751  0.0368  0.6690  2.5478 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  school   (Intercept)  27.99    5.29   
##  Residual             212.74   14.59   
## Number of obs: 450, groups:  school, 14
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  102.500      1.587   64.59

The `1`s used in these models are constants that act as placeholders when no covariate is used in that position. The random covariate to the right of the bar (here, `school`) allows the intercept to vary with that covariate.

Parameters (like p-value, standard error, etc.) are only going to be estimated for the fixed effects, not the random effects.

Let’s add covariates to our model.

```
m3 <- lme(IQ ~ sex + age + ses + Pb + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m3)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3573.304 3605.795 -1778.652
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    3.847656 14.49946
## 
## Fixed effects: IQ ~ sex + age + ses + Pb + distcat 
##                Value Std.Error  DF  t-value p-value
## (Intercept) 99.79718 2.3134839 417 43.13718  0.0000
## sex         -2.71006 1.4212686 417 -1.90679  0.0572
## age         -1.37318 0.7231642 417 -1.89886  0.0583
## ses          1.94821 1.2970263 417  1.50206  0.1338
## Pb          -1.94705 0.8068593 417 -2.41313  0.0162
## distcat      4.18923 2.6462699  12  1.58307  0.1394
##  Correlation: 
##         (Intr) sex    age    ses    Pb    
## sex     -0.295                            
## age     -0.031  0.011                     
## ses     -0.436 -0.020  0.052              
## Pb      -0.232  0.080 -0.029  0.041       
## distcat -0.648  0.056  0.011 -0.039  0.276
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.38994659 -0.63216092  0.06822743  0.66448634  2.32685110 
## 
## Number of Observations: 435
## Number of Groups: 14

The estimate for each fixed effect represents the change in the response variable when that fixed effect changes by one unit, while all other fixed effects are held constant at their mean value. The estimate for the intercept gives the value of the response variable when all the other fixed effects are at zero – which is why it’s helpful to have centered your predictors.

Check the estimated variance for the random effect. If it’s indistinguishable from zero, then that random effect doesn’t matter and you can do a regular linear model instead (which has only fixed effects).

Also check the correlation between the fixed effects (not with the intercept) – if fixed effects are highly correlated, you have a situation called “multicollinearity.” Highly correlated effects should not be included in one model together. Instead, make separate models that each include one of them, and use model comparison to decide which covariate to keep.
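A simple way to screen for multicollinearity before fitting is to look at the correlation matrix of the candidate predictors. A base-R sketch with made-up predictors, where `x2` is deliberately a near-copy of `x1`:

```r
set.seed(1)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.01)   # nearly identical to x1
x3 <- rnorm(100)                   # independent of the others

# Off-diagonal values near 1 or -1 flag collinear pairs
round(cor(cbind(x1, x2, x3)), 2)
# x1 and x2 correlate at ~1.00: keep only one of them in the model
```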

### Random slopes

What about situations when there is a different relationship between the predictor and response variables across the different levels of the random factor? You can allow each random factor to vary not only in intercept but also in slope.

Here we allow each school to have its own slope for age. A random slope can improve the model; compare AIC with the random-intercept model to check whether it does (here AIC rises slightly, from 3573.3 to 3576.2, so the random slope is not an improvement for these data).

```
m4 <- lme(IQ ~ sex + age + ses + Pb + distcat, random = ~age|school, na.action = na.omit, data = data)
summary(m4)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3576.195 3616.809 -1778.097
## 
## Random effects:
##  Formula: ~age | school
##  Structure: General positive-definite, Log-Cholesky parametrization
##             StdDev    Corr  
## (Intercept)  3.820315 (Intr)
## age          1.562607 -0.509
## Residual    14.429239       
## 
## Fixed effects: IQ ~ sex + age + ses + Pb + distcat 
##                 Value Std.Error  DF  t-value p-value
## (Intercept) 100.23237 2.2887284 417 43.79391  0.0000
## sex          -2.83425 1.4207802 417 -1.99485  0.0467
## age          -1.61320 0.8498488 417 -1.89821  0.0584
## ses           1.82189 1.2962083 417  1.40555  0.1606
## Pb           -1.91173 0.8047680 417 -2.37551  0.0180
## distcat       3.69328 2.5751671  12  1.43419  0.1771
##  Correlation: 
##         (Intr) sex    age    ses    Pb    
## sex     -0.298                            
## age     -0.150  0.016                     
## ses     -0.447 -0.020  0.047              
## Pb      -0.235  0.075 -0.027  0.045       
## distcat -0.641  0.058  0.014 -0.029  0.284
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.43016244 -0.62288497  0.07439884  0.66379767  2.27899865 
## 
## Number of Observations: 435
## Number of Groups: 14

### Interactions

You can also include interaction terms as fixed effects. When using an interaction term, remember to still include the components of the interaction as individual fixed effects.

```
m5 <- lme(IQ ~ sex + age + sex * age, random = ~1|school, na.action = na.omit, data = data)
summary(m5)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3705.574 3730.176 -1846.787
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    5.102384 14.51071
## 
## Fixed effects: IQ ~ sex + age + sex * age 
##                 Value Std.Error  DF  t-value p-value
## (Intercept) 103.64198 1.6594041 433 62.45735  0.0000
## sex          -2.56997 1.3914538 433 -1.84696  0.0654
## age          -1.59131 0.9417289 433 -1.68977  0.0918
## sex:age       0.16938 1.3936810 433  0.12153  0.9033
##  Correlation: 
##         (Intr) sex    age   
## sex     -0.372              
## age     -0.001  0.003       
## sex:age  0.002 -0.016 -0.679
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.42321384 -0.60395169  0.06851778  0.67403992  2.41075782 
## 
## Number of Observations: 450
## Number of Groups: 14

This interaction term is far from significant (p = 0.90) and so should be removed from the model.

### Model comparison using p-values

Check whether specific fixed and random effects are needed by comparing models. Start with your most complex model containing all your effects of interest, then drop terms one by one, refitting each time. Drop the highest-order interaction term with the highest p-value first. Note that if you compare models that differ in their fixed effects using likelihoods or AIC, the models should be refit with maximum likelihood (`method = "ML"` in `lme`) rather than the default REML.
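The same backward-elimination idea can be sketched with base R's `drop1`, which tests each droppable term while respecting the marginality rule: main effects involved in an interaction are not offered for dropping until the interaction is gone. Toy data, made up for illustration:

```r
set.seed(42)
d <- data.frame(x1 = rnorm(80), x2 = rnorm(80))
d$y <- 2 * d$x1 + rnorm(80)            # x2 has no real effect on y

full <- lm(y ~ x1 * x2, data = d)
drop1(full, test = "F")                # only x1:x2 is eligible to drop

reduced <- update(full, . ~ . - x1:x2) # drop the interaction first
drop1(reduced, test = "F")             # now x1 and x2 are each tested
```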

```
m6 <- lme(IQ ~ sex + age + ses + distcat + sex * age, random = ~1|school, na.action = na.omit, data = data)
summary(m6)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3603.355 3635.903 -1793.678
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    4.061579 14.59614
## 
## Fixed effects: IQ ~ sex + age + ses + distcat + sex * age 
##                Value Std.Error  DF  t-value p-value
## (Intercept) 98.61655 2.3181073 420 42.54184  0.0000
## sex         -2.47411 1.4194741 420 -1.74298  0.0821
## age         -1.53614 0.9518906 420 -1.61377  0.1073
## ses          1.92242 1.2993137 420  1.47957  0.1397
## distcat      6.00745 2.6422159  12  2.27364  0.0422
## sex:age      0.18848 1.4375219 420  0.13112  0.8957
##  Correlation: 
##         (Intr) sex    age    ses    distct
## sex     -0.283                            
## age     -0.036  0.006                     
## ses     -0.427 -0.018  0.054              
## distcat -0.632  0.035  0.013 -0.049       
## sex:age  0.006 -0.011 -0.662 -0.006 -0.002
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.31719015 -0.59891749  0.05048137  0.69607666  2.37667763 
## 
## Number of Observations: 438
## Number of Groups: 14

```
m7 <- lme(IQ ~ sex + age + ses + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m7)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC     BIC    logLik
##   3603.935 3632.43 -1794.968
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    4.054413 14.57974
## 
## Fixed effects: IQ ~ sex + age + ses + distcat 
##                Value Std.Error  DF  t-value p-value
## (Intercept) 98.61437 2.3146960 421 42.60359  0.0000
## sex         -2.47208 1.4177821 421 -1.74362  0.0820
## age         -1.45352 0.7126513 421 -2.03960  0.0420
## ses          1.92354 1.2978215 421  1.48213  0.1391
## distcat      6.00839 2.6380622  12  2.27758  0.0419
##  Correlation: 
##         (Intr) sex    age    ses   
## sex     -0.283                     
## age     -0.042 -0.002              
## ses     -0.428 -0.018  0.067       
## distcat -0.632  0.035  0.015 -0.049
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.32311635 -0.60335879  0.04583759  0.69979123  2.37919401 
## 
## Number of Observations: 438
## Number of Groups: 14

```
m8 <- lme(IQ ~ sex + age + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m8)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3699.696 3724.298 -1843.848
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    4.187035 14.49864
## 
## Fixed effects: IQ ~ sex + age + distcat 
##                 Value Std.Error  DF  t-value p-value
## (Intercept) 100.08993 2.1238104 434 47.12752  0.0000
## sex          -2.50206 1.3899761 434 -1.80008  0.0725
## age          -1.49630 0.6905278 434 -2.16689  0.0308
## distcat       6.11356 2.6791162  12  2.28193  0.0415
##  Correlation: 
##         (Intr) sex    age   
## sex     -0.320              
## age     -0.014 -0.011       
## distcat -0.725  0.041  0.019
## 
## Standardized Within-Group Residuals:
##         Min          Q1         Med          Q3         Max 
## -3.33360320 -0.61168680  0.05185314  0.65908991  2.43283910 
## 
## Number of Observations: 450
## Number of Groups: 14

```
m9 <- lme(IQ ~ age + distcat, random = ~1|school, na.action = na.omit, data = data)
summary(m9)
```

## Linear mixed-effects model fit by REML
##  Data: data 
##        AIC      BIC    logLik
##   3703.425 3723.938 -1846.712
## 
## Random effects:
##  Formula: ~1 | school
##         (Intercept) Residual
## StdDev:    4.261853 14.53057
## 
## Fixed effects: IQ ~ age + distcat 
##                Value Std.Error  DF  t-value p-value
## (Intercept) 98.87479 2.0397098 435 48.47493  0.0000
## age         -1.50892 0.6920584 435 -2.18034  0.0298
## distcat      6.30695 2.7130164  12  2.32470  0.0384
##  Correlation: 
##         (Intr) age   
## age     -0.018       
## distcat -0.752  0.020
## 
## Standardized Within-Group Residuals:
##        Min         Q1        Med         Q3        Max 
## -3.4189986 -0.5865094  0.0714204  0.6599321  2.5134284 
## 
## Number of Observations: 450
## Number of Groups: 14

# Online resources

https://ase.tufts.edu/gsc/gradresources/guidetomixedmodelsinr/mixed%20model%20guide.html

https://gkhajduk.github.io/2017-03-09-mixed-models/

https://www.cscu.cornell.edu/workshops/multilevel.php

The CSCU website is the source for the dataset used in this script.