em.lmem <- function(y,id,X,A=X,niter=50) {
  uid <- unique(id)
  y <- as.matrix(y)
  X <- as.matrix(X)
  A <- as.matrix(A)
  N <- length(uid)
  n <- length(y)
  nb.eta <- ncol(A)
  # initial estimates: OLS for beta, identity matrix for Omega, sigma2 = 1
  beta <- as.vector(solve(t(X)%*%X)%*%t(X)%*%y)
  Omega <- diag(rep(1,nb.eta))
  sigma2 <- 1
  z <- as.vector(y - X%*%beta)
  for (k in 1:niter) {
    iO <- solve(Omega)
    T <- R <- C <- 0
    mu <- u <- NULL
    # E-step: conditional moments of the random effects, given the current estimates
    for (i in uid) {
      row.i <- which(id==i)
      Xi <- X[row.i,]
      Ai <- A[row.i,]
      AAi <- t(Ai)%*%Ai
      zi <- z[row.i]
      Gammai <- solve(AAi/sigma2 + iO)
      mui <- (Gammai%*%t(Ai)%*%zi)/sigma2
      mu <- c(mu, mui)
      u <- c(u, Ai%*%mui)
      Si <- Gammai + mui%*%t(mui)
      R <- R + Si
      T <- T + sum(diag(Si%*%AAi))
      C <- C + t(mui)%*%t(Ai)%*%zi
    }
    # M-step: update beta, sigma2 and Omega
    beta <- as.vector(solve(t(X)%*%X)%*%t(X)%*%(y-u))
    z <- as.vector(y - X%*%beta)
    sigma2 <- (sum(z^2) - 2*C[1] + T)/n
    Omega <- as.matrix(R/N)
  }
  # log-likelihood, AIC and BIC of the fitted model
  z <- as.vector(y - X%*%beta)
  LL <- -0.5*n*log(2*pi)
  for (i in uid) {
    row.i <- which(id==i)
    Ai <- A[row.i,]
    zi <- z[row.i]
    Gi <- Ai%*%Omega%*%t(Ai) + diag(sigma2, nrow=length(row.i))
    LL <- LL - 0.5*log(det(Gi)) - 0.5*t(zi)%*%solve(Gi)%*%zi
  }
  nb.param <- length(beta) + nb.eta*(nb.eta+1)/2 + 1
  AIC <- -2*LL + 2*nb.param
  BIC <- -2*LL + log(n)*nb.param
  names(beta) <- colnames(X)
  return(list(beta=beta, Omega=Omega, sigma2=sigma2, LL=c(logLik=LL, AIC=AIC, BIC=BIC)))
}
An EM Algorithm for Linear Mixed Effects Models
Lecture Notes
1 The model
Consider the following model:
y_i = X_i \, \beta + A_i \, \eta_i + \varepsilon_i \quad ; \quad 1 \leq i \leq N where
- y_i is a n_i-vector of observations for individual i
- X_i is a n_i \times p design matrix
- \beta is a p-vector of fixed effects
- \eta_i is a q-vector of random effects
- \varepsilon_i is a n_i-vector of residual errors
The random effects are normally distributed: \eta_i \overset{\mathrm{iid}}{\sim} \mathcal{N}(0_q \ , \ \Omega)
The vector of residual errors \varepsilon_i is also normally distributed. Furthermore, the components \varepsilon_{ij} are assumed to be independent and identically distributed:
\varepsilon_i \sim \mathcal{N}(0_{n_i} \ , \ \sigma^2 I_{n_i})
Then, y_i is also normally distributed:
y_i \sim \mathcal{N}(X_i \beta \ , \ A_i \Omega A_i^\prime + \sigma^2 I_{n_i})
We can rewrite the model in matrix form for the whole data as follows:
y = X\beta + A\eta + \varepsilon
where
y = \left( \begin{array}{c} y_{1} \\ y_{2} \\ \vdots \\ y_{N} \end{array}\right) \quad ,\quad X = \left( \begin{array}{c} X_{1} \\ X_{2} \\ \vdots \\ X_{N} \end{array}\right) \quad , \quad A = \left( \begin{array}{cccc} A_{1} & 0 & \ldots & 0 \\ 0 & A_{2} & \ldots & 0 \\ \vdots &\vdots &\ddots &\vdots \\ 0&0&\ldots & A_{N} \end{array}\right) \quad ,\quad \eta = \left( \begin{array}{c} \eta_{1} \\ \eta_{2} \\ \vdots \\ \eta_{N} \end{array}\right) \quad ,\quad \varepsilon = \left( \begin{array}{c} \varepsilon_{1} \\ \varepsilon_{2} \\ \vdots \\ \varepsilon_{N} \end{array}\right)
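To fix ideas, here is a small simulation sketch showing how the stacked vectors and the block-diagonal matrix A can be built in R. It is an illustration only and is not part of the notes: the dimensions and parameter values are arbitrary, and the helpers MASS::mvrnorm and Matrix::bdiag are only used to simulate and assemble the data.
set.seed(1)
N <- 3                                   # number of individuals
q <- 2                                   # dimension of the random effects
beta <- c(1, 2)                          # fixed effects (p = 2)
Omega <- diag(c(0.5, 0.1))               # variance of the random effects
sigma <- 0.3                             # residual standard deviation
ni <- c(4, 5, 6)                         # n_i: number of observations per individual
Xi <- lapply(ni, function(n) cbind(1, 1:n))          # X_i (here A_i = X_i)
eta <- lapply(1:N, function(i) MASS::mvrnorm(1, rep(0, q), Omega))
yi <- lapply(1:N, function(i) as.vector(Xi[[i]] %*% (beta + eta[[i]]) + rnorm(ni[i], 0, sigma)))
y <- unlist(yi)                          # stacked vector of observations
X <- do.call(rbind, Xi)                  # stacked fixed-effects design matrix
A <- as.matrix(Matrix::bdiag(Xi))        # block-diagonal random-effects design matrix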
2 Maximum likelihood estimation of the model parameters
2.1 Maximization of the complete likelihood
Let \theta = (\beta,\Omega,\sigma^2) be the set of model parameters.
If \eta is known, the ML estimator of \theta maximizes the complete log-likelihood
\begin{aligned} \log\ell_c(\theta) & = \log(\mathbb{P}(y, \eta \ ; \ \theta))\\ & = \log(\mathbb{P}(y | \eta \ ; \ \theta)) + \log(\mathbb{P}(\eta \ ; \ \theta)) \\ & = \log(\mathbb{P}(y | \eta \ ; \ \beta, \sigma^2)) + \log(\mathbb{P}(\eta \ ; \ \Omega)) \end{aligned}
Then, (\hat{\beta}_c, \hat{\sigma}_c^2) minimizes
-2\log(\mathbb{P}(y | \eta \ ; \ \beta, \sigma^2)) = n\log(2\pi\sigma^2) + \frac{\| y - X\beta - A\eta \|^2 }{\sigma^2}
where n = \sum_{i=1}^N n_i is the total number of observations, while \hat{\Omega}_c minimizes
-2\log(\mathbb{P}( \eta \ ; \ \Omega)) = Nq\log(2\pi) + N\log(|\Omega|) + \sum_{i=1}^N \eta_i^\prime \Omega^{-1} \eta_i
Then,
\begin{aligned} \hat{\beta}_c &= (X^\prime X)^{-1}X^\prime (y - A\eta) \\ \hat{\Omega}_c &= \frac{1}{N} \sum_{i=1}^N \eta_i \eta_i^\prime \\ \hat{\sigma}_c^2 &= \frac{1}{n}\| y - X\hat{\beta}_c - A\eta \|^2 = \frac{1}{n} \left( \| y - X\hat{\beta}_c \|^2 + \|A\eta \|^2 - 2 <y - X\hat{\beta}_c, A\eta> \right) \end{aligned}
Remark that
\begin{aligned} \|A\eta \|^2 & = \sum_{i=1}^N \|A_i\eta_i \|^2 \\ &= \sum_{i=1}^N \eta_i^\prime A_i^\prime A_i \eta_i \\ &= \sum_{i=1}^N {\rm Tr}\left(\eta_i^\prime A_i^\prime A_i \eta_i \right) \\ &= \sum_{i=1}^N {\rm Tr}\left(A_i^\prime A_i \eta_i \eta_i^\prime \right) \end{aligned}
The set of individual statistics used for estimating \theta is therefore
S(y,\eta) = (\eta_1, \eta_2, \ldots, \eta_N, \eta_1 \eta_1^\prime, \eta_2 \eta_2^\prime, \ldots, \eta_N \eta_N^\prime)
Indeed, the definition of (\hat{\beta}_c, \hat{\Omega}_c, \hat{\sigma}_c^2 ) above defines a function \hat{\Theta} such that
\hat{\theta}_c = \hat{\Theta}(S(y,\eta))
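For concreteness, \hat{\Theta} can be transcribed directly in R. The sketch below is for illustration only (the function name theta.hat and its arguments are not used elsewhere in these notes); it takes the stacked y, X and A together with the list of individual random effects \eta_i:
theta.hat <- function(y, X, A, eta.list) {
  N <- length(eta.list)
  n <- length(y)
  eta <- unlist(eta.list)                                      # stacked random effects
  beta <- as.vector(solve(t(X)%*%X)%*%t(X)%*%(y - A%*%eta))    # fixed effects
  Omega <- Reduce(`+`, lapply(eta.list, function(e) e %*% t(e))) / N
  sigma2 <- sum((y - X%*%beta - A%*%eta)^2) / n                # residual variance
  list(beta=beta, Omega=Omega, sigma2=sigma2)
}
For instance, with the simulated objects built above, theta.hat(y, X, A, eta) returns the complete-data estimates.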
2.2 The EM algorithm
The maximum likelihood (ML) estimator of \theta maximizes the log-likelihood function defined as
\begin{aligned} \log\ell(\theta) & = \log(\mathbb{P}(y_1, y_2, \ldots , y_N \ ; \ \theta))\\ & = \sum_{i=1}^{N}\log(\mathbb{P}(y_i \ ; \ \theta))\\ &= \sum_{i=1}^{N} \left\{ -\frac{n_i}{2}\log(2\pi) - \frac{1}{2}\log(|A_i \Omega A_i^\prime + \sigma^2 I_{n_i}|) - \frac{1}{2}(y_i - X_i \beta)^\prime (A_i \Omega A_i^\prime + \sigma^2 I_{n_i})^{-1} (y_i - X_i \beta) \right\} \end{aligned}
When the random effects (\eta_i, 1\leq i \leq N) are unknown, the statistics S(y,\eta) cannot be computed. The idea of EM is then to replace S(y,\eta) by its conditional expectation \mathbb{E}[S(y,\eta)|y ;\theta].
The problem is that this conditional expectation depends on the unknown parameter \theta. EM is therefore an iterative procedure, where, at iteration k:
- the E-step computes S_k(y) = \mathbb{E}[S(y,\eta)|y ;\theta_{k-1}]
- the M-step updates the parameter estimate: \theta_k = \hat\Theta(S_k(y))
Here, computing \mathbb{E}[S(y,\eta)|y ;\theta] reduces to computing \mathbb{E}[\eta_i|y_i ;\theta] and \mathbb{E}[\eta_i \eta_i^\prime|y_i ;\theta] for i=1, 2, \ldots, N.
Since the marginal distributions of y_i and \eta_i are both Gaussian, the conditional distribution of \eta_i is also Gaussian with a mean and a variance that can be computed. Indeed, from Bayes Theorem,
\begin{aligned} \mathbb{P}(\eta_i \, | \, y_i \, ; \, \theta) &= \frac{\mathbb{P}(y_i \, | \, \eta_i \, ; \, \theta)\mathbb{P}(\eta_i \, ; \, \theta)}{\mathbb{P}( y_i \, ; \, \theta)} \\ &= C_1 \times \exp\left\{-\frac{1}{2\sigma^2}\| y_i-X_i\beta-A_i\eta_i \|^2 -\frac{1}{2}\eta_i^\prime\Omega^{-1} \eta_i \right\} \\ & = C_2 \times \exp\left\{-\frac{1}{2}(\eta_i-\mu_i)^\prime\Gamma_i^{-1} (\eta_i-\mu_i) \right\} \end{aligned}
where
\Gamma_i = \left(\frac{A_i^\prime A_i}{\sigma^2} + \Omega^{-1}\right)^{-1} \quad ; \quad \mu_i = \frac{\Gamma_i A_i^\prime(y_i - X_i\beta)}{\sigma^2}
Then,
\begin{aligned} \mathbb{E}[\eta_i|y_i ;\theta] &= \mu_i \\ \mathbb{E}[\eta_i \eta_i^\prime|y_i ;\theta] &= \mathbb{V}[\eta_i|y_i ;\theta] + \mathbb{E}[\eta_i|y_i ;\theta] \mathbb{E}[\eta_i|y_i ;\theta]^\prime \\ &= \Gamma_i + \mu_i \mu_i^\prime \end{aligned}
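As an illustration, the E-step quantities for one individual follow directly from these formulas. The sketch below assumes that yi, Xi, Ai, beta, Omega and sigma2 are already defined; this is essentially what the inner loop of em.lmem computes:
zi     <- as.vector(yi - Xi %*% beta)                   # residual y_i - X_i beta
Gammai <- solve(t(Ai) %*% Ai / sigma2 + solve(Omega))   # conditional variance of eta_i
mui    <- as.vector(Gammai %*% t(Ai) %*% zi) / sigma2   # E[eta_i | y_i]
Ei     <- Gammai + mui %*% t(mui)                       # E[eta_i eta_i' | y_i]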
Then, the k-th iteration of the EM algorithm for a linear mixed effects model consists in
- computing \mathbb{E}[\eta_i|y_i ;\theta_{k-1}] and \mathbb{E}[\eta_i\eta_i^\prime|y_i ;\theta_{k-1}] for i=1,2,\ldots,N ,
- computing \theta_{k}= (\beta_k, \Omega_k,\sigma_{k}^2) where
\begin{aligned} {\beta}_k &= (X^\prime X)^{-1}X^\prime (y - A\mathbb{E}[\eta|y ;\theta_{k-1}]) \\ {\Omega}_k &= \frac{1}{N} \sum_{i=1}^N \mathbb{E}[\eta_i \eta_i^\prime|y_i ;\theta_{k-1}] \\ {\sigma}_k^2 &= \frac{1}{n} \left( \| y - X{\beta}_k \|^2 + \sum_{i=1}^N {\rm Tr}\left(A_i^\prime A_i \mathbb{E}[\eta_i \eta_i^\prime|y_i ;\theta_{k-1}] \right) - 2 \sum_{i=1}^N (y_i - X_i{\beta}_k)^\prime A_i \mathbb{E}[\eta_i|y_i ;\theta_{k-1}] \right) \end{aligned}
Of course, some arbitrary initial estimates \theta_0 should also be provided.
The function em.lmem given at the beginning of these notes implements this algorithm: it returns the EM estimate \theta_K and the corresponding log-likelihood \log(\mathbb{P}(y \ ; \ \theta_K)), together with AIC and BIC.
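In em.lmem, the initial estimates \theta_0 are the ordinary least squares estimate for \beta, the identity matrix for \Omega and \sigma^2 = 1:
beta   <- as.vector(solve(t(X)%*%X)%*%t(X)%*%y)   # OLS estimate of beta
Omega  <- diag(rep(1, ncol(A)))                   # identity matrix
sigma2 <- 1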
2.3 A slightly simplified version of EM
By construction,
\begin{aligned} \mathbb{E}[ <y - X\hat{\beta} - A \eta, X \hat\beta > | y , \hat{\theta} ] &= <y - A \mathbb{E}[\eta | y , \hat{\theta}] - X\hat{\beta}, X \hat\beta > = 0 \end{aligned}
On the other hand, we make the approximation
\mathbb{E}[ <y - X\hat{\beta} - A \eta, A \eta > | y , \hat{\theta} ] \approx 0
Then, \mathbb{E}[\| A\eta \|^2 | y , \hat{\theta}] \approx <y - X\hat{\beta}, A \mathbb{E}[\eta | y , \hat{\theta}] > and
\begin{aligned} \hat{\sigma}^2 &= \frac{1}{n} \left( \| y - X\hat{\beta} \|^2 - <y - X\hat{\beta}, A \mathbb{E}[\eta|y ;\hat\theta]> \right) \\ &= \frac{1}{n} <y, y - X\hat{\beta} - A \mathbb{E}[\eta|y ;\hat\theta]> \end{aligned}
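In R, this simplified update is a single line (it is the sigma2 update used in the function em2.lmem below, where u stands for A\,\mathbb{E}[\eta|y ;\hat\theta]):
sigma2 <- mean(y*(y - X%*%beta - u))   # (1/n) <y , y - X beta - A E[eta|y]>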
Implementation of EM is then simplified:
em2.lmem <- function(y,id,X,A=X,niter=50,C=NULL) {
  uid <- unique(id)
  y <- as.matrix(y)
  X <- as.matrix(X)
  A <- as.matrix(A)
  N <- length(uid)
  n <- length(y)
  if (is.null(C))
    C <- matrix(1,ncol=ncol(A),nrow=ncol(A))

  # initial estimates: OLS for beta, identity matrix for Omega, sigma2 = 1
  beta <- as.vector(solve(t(X)%*%X)%*%t(X)%*%y)
  Omega <- diag(rep(1,ncol(A)))
  sigma2 <- 1
  for (k in 1:niter) {
    iO <- solve(Omega)
    z <- as.vector(y - X%*%beta)
    R <- 0
    u <- NULL
    # E-step: conditional moments of the random effects
    for (i in uid) {
      row.i <- which(id==i)
      Ai <- A[row.i,]
      zi <- z[row.i]
      Gammai <- solve(t(Ai)%*%Ai/sigma2 + iO)
      mui <- (Gammai%*%t(Ai)%*%zi)/sigma2
      u <- c(u, Ai%*%mui)
      R <- R + Gammai + mui%*%t(mui)
    }
    # M-step: update beta, Omega and sigma2 (simplified sigma2 update)
    beta <- as.vector(solve(t(X)%*%X)%*%t(X)%*%(y-u))
    Omega <- as.matrix(R/N)*C
    sigma2 <- mean(y*(y - X%*%beta - u))
  }
  names(beta) <- row.names(Omega)
  return(list(beta=beta, Omega=Omega, sigma2=sigma2))
}
3 Application to rat weight data
3.1 Fitting a polynomial model
Let us use our EM algorithm on the rat weight data, fitting a polynomial of degree 2 with individual coefficients to each individual series of weights.
d <- read.csv(file="../../data/ratWeight.csv")
d['week2'] <- d["week"]^2
X <- cbind(1,d["week"],d["week2"])
res.em1 <- em.lmem(y=d["weight"],id=d[["id"]],X=X,A=X)
print(res.em1)
$beta
1 week week2
169.030083 31.241507 -1.102003
$Omega
1 week week2
1 823.291463 284.806010 -9.3176817
week 284.806010 157.156986 -5.4430666
week2 -9.317682 -5.443067 0.2012885
$sigma2
[1] 66.24102
$LL
logLik AIC BIC
-8691.351 17402.701 17459.821
Let us check that the two versions of EM give the same results:
res.em1b <- em2.lmem(y=d["weight"],id=d[["id"]],X=X,A=X)
print(res.em1b)
$beta
1 week week2
169.030150 31.241562 -1.102004
$Omega
1 week week2
1 823.291479 284.806000 -9.3176812
week 284.806000 157.156984 -5.4430666
week2 -9.317681 -5.443067 0.2012885
$sigma2
[1] 66.241
We can compare these results with those provided by the lmer function:
library(lme4)
r1 <- lmer(weight ~ week + week2 + (week + week2 |id), data=d, REML=FALSE)
res.lmer1 <- list(beta=fixef(r1), Omega=VarCorr(r1)$id[,], sigma2=attr(VarCorr(r1), "sc")^2)
print(res.lmer1)
$beta
(Intercept) week week2
169.087783 31.268990 -1.102913
$Omega
(Intercept) week week2
(Intercept) 808.24868 280.687036 -9.1923803
week 280.68704 155.524942 -5.3886311
week2 -9.19238 -5.388631 0.1994466
$sigma2
[1] 66.37679
print(c(logLik=logLik(r1), AIC=AIC(r1), BIC=BIC(r1)))
logLik AIC BIC
-8691.365 17402.731 17459.851
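As a quick check, the fixed effects estimated by the two approaches can be displayed side by side (a sketch using the res.em1 and res.lmer1 objects created above):
print(cbind(EM = res.em1$beta, lmer = res.lmer1$beta))   # compare the fixed effects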
Let us now fit a model assuming different coefficients for males and females:
X <- cbind(1,d["week"],d["week2"],(d["gender"]=="Male"),d["week"]*(d["gender"]=="Male"),d["week2"]*(d["gender"]=="Male"))
colnames(X) <- c("Intercept","week", "week2", "gender", "Male:week", "Male:week2")
res.em2 <- em.lmem(y=d["weight"],id=d[["id"]],X=X,A=X[,1:3])
print(res.em2)
$beta
Intercept week week2 gender Male:week Male:week2
142.7060852 19.9228293 -0.7266020 52.7738330 22.7150997 -0.7533026
$Omega
Intercept week week2
Intercept 127.3827390 -14.612490 0.61391916
week -14.6124903 28.483661 -1.17684359
week2 0.6139192 -1.176844 0.05979207
$sigma2
[1] 66.25732
$LL
logLik AIC BIC
-8480.746 16987.493 17061.749
r2 <- lmer(weight ~ gender*week + gender*week2 + (week + week2 |id), data=d, REML=FALSE)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 1.00951 (tol = 0.002, component 1)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, : Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?
res.lmer2 <- list(beta=fixef(r2), Omega=VarCorr(r2)$id[,], sigma2=attr(VarCorr(r2), "sc")^2)
print(res.lmer2)
$beta
(Intercept) genderMale week week2
142.7060852 52.7644663 19.9228293 -0.7266020
genderMale:week genderMale:week2
22.6914568 -0.7525101
$Omega
(Intercept) week week2
(Intercept) 144.4469725 -17.308995 0.7504677
week -17.3089953 28.766028 -1.1928112
week2 0.7504677 -1.192811 0.0606887
$sigma2
[1] 65.79438
print(c(logLik=logLik(r2), AIC=AIC(r2), BIC=BIC(r2)))
logLik AIC BIC
-8481.058 16988.116 17062.372
anova(r1,r2)
Data: d
Models:
r1: weight ~ week + week2 + (week + week2 | id)
r2: weight ~ gender * week + gender * week2 + (week + week2 | id)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
r1 10 17403 17460 -8691.4 17383
r2 13 16988 17062 -8481.1 16962 420.61 3 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
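The likelihood ratio test reported by anova can also be reproduced by hand from the two maximized log-likelihoods (a sketch; r2 extends r1 by 3 fixed effects):
lrt <- as.numeric(2*(logLik(r2) - logLik(r1)))   # chi-square statistic of the LRT
pchisq(lrt, df=3, lower.tail=FALSE)              # p-value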
3.2 Testing differences between control and GMO groups
Let us test whether the coefficients for the control and GMO male groups are different (the intercepts are assumed to be the same).
Using the EM algorithm,
dm <- subset(d, gender=="Male")
X <- cbind(1,dm["week"],dm["week2"],dm["week"]*(dm["regime"]=="GMO"),dm["week2"]*(dm["regime"]=="GMO"))
em.H0 <- em.lmem(y=dm["weight"],id=dm[["id"]],X=X[,1:3],A=X[,1:3]) #H0: no difference
print(em.H0)
$beta
1 week week2
195.460168 42.647597 -1.480638
$Omega
1 week week2
1 154.907759 -29.912537 1.3426990
week -29.912537 49.469278 -2.0907193
week2 1.342699 -2.090719 0.1097153
$sigma2
[1] 103.8647
$LL
logLik AIC BIC
-4480.915 8981.830 9031.996
em.H1 <- em.lmem(y=dm["weight"],id=dm[["id"]],X=X,A=X[,1:3]) #H1: different coefficients c1 and c2
print(em.H1)
$beta
1 week week2 week week2
195.4576401 44.0082633 -1.5548673 -2.7121645 0.1479159
$Omega
1 week week2
1 154.731659 -29.443131 1.3185729
week -29.443131 47.596812 -1.9866343
week2 1.318573 -1.986634 0.1040984
$sigma2
[1] 103.8273
$LL
logLik AIC BIC
-4479.098 8982.196 9042.395
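The likelihood ratio test can also be computed directly from the EM output (a sketch using the em.H0 and em.H1 objects above; H_1 has 2 additional fixed effects):
lrt <- 2*(em.H1$LL["logLik"] - em.H0$LL["logLik"])   # chi-square statistic of the LRT
pchisq(as.numeric(lrt), df=2, lower.tail=FALSE)      # p-value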
Or, using the lmer function:
lmer.H0 <- lmer(weight ~ week + week2 + (week + week2 |id), data=dm, REML=FALSE)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.687327 (tol = 0.002, component 1)
lmer.H1 <- lmer(weight ~ week + week:regime + week2 + week2:regime + (week + week2 |id), data=dm, REML=FALSE)
Warning in checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 0.0982178 (tol = 0.002, component 1)
summary(lmer.H1)
Linear mixed model fit by maximum likelihood ['lmerMod']
Formula: weight ~ week + week:regime + week2 + week2:regime + (week +
week2 | id)
Data: dm
AIC BIC logLik deviance df.resid
8982.2 9042.4 -4479.1 8958.2 1103
Scaled residuals:
Min 1Q Median 3Q Max
-12.6683 -0.4326 0.0197 0.4653 6.1631
Random effects:
Groups Name Variance Std.Dev. Corr
id (Intercept) 155.5611 12.4724
week 47.6115 6.9001 -0.34
week2 0.1042 0.3228 0.33 -0.89
Residual 103.8052 10.1885
Number of obs: 1115, groups: id, 80
Fixed effects:
Estimate Std. Error t value
(Intercept) 195.45587 1.75361 111.459
week 43.90084 1.11954 39.213
week2 -1.55142 0.05600 -27.705
week:regimeGMO -2.55616 1.48377 -1.723
regimeGMO:week2 0.14290 0.07396 1.932
Correlation of Fixed Effects:
(Intr) week week2 wk:GMO
week -0.348
week2 0.354 -0.888
week:rgmGMO 0.000 -0.663 0.577
regmGMO:wk2 -0.001 0.579 -0.663 -0.873
optimizer (nloptwrap) convergence code: 0 (OK)
Model failed to converge with max|grad| = 0.0982178 (tol = 0.002, component 1)
anova(lmer.H0, lmer.H1)
Data: dm
Models:
lmer.H0: weight ~ week + week2 + (week + week2 | id)
lmer.H1: weight ~ week + week:regime + week2 + week2:regime + (week + week2 | id)
npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
lmer.H0 10 8981.9 9032.0 -4480.9 8961.9
lmer.H1 12 8982.2 9042.4 -4479.1 8958.2 3.6816 2 0.1587
We see that, according to AIC, BIC or the LRT, H_0 cannot be rejected: based on this experiment, there is no good reason to conclude that the growth curves differ between the two groups.