library(tidyverse)
theme_set(theme_bw())
Mixture models
Exercices
1 Preliminary
Only functions from R
-base and stats (preloaded) are required plus packages from the tidyverse for data representation and manipulation.
2 Introduction to mixture models
The faithful
data consist of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.
data("faithful")
%>%
faithful ggplot() + aes(x=waiting) + geom_histogram() + xlab("waiting (mm)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Fit variants of a mixture of two Gaussian distributions to the faithful data.
assuming different proportions, means and variances for the 2 distributions
assuming same variances
assuming same means
assuming same proportions
2.1 Hint: skeleton of EM for 1D mixture of Gaussian
We follow what has been done during the course: we define the function that computes the value of the likelihood for each point in each component. It will be used to compute the posterior probabilities in the E-step.
## likelihood (density) of each point in each component
## data (x) is fixed while theta is variable
<- function(theta, x) {
dcomponents mapply(
function(pi, mu, sigma) pi * dnorm(x, mu, sigma),
$pi, theta$mu, theta$sigma,
thetaSIMPLIFY = TRUE
) }
Then, we define an EM where the function that computes the M step is given as an argument:
<-
mixture_gaussian1D function(x, # a n-vector of data
# starting values of vector paramter
theta0, M_step = M_step_general, # an R function
max_iter = 100,
threshold = 1e-6) {
## initialization
<- length(x)
n <- dcomponents(theta0, x)
likelihoods <- numeric(max_iter)
deviance 1] <- -2 * sum(log(rowSums(likelihoods)))
deviance[
for (t in 1:max_iter) {
# E step
<- likelihoods / rowSums(likelihoods)
tau
# M step
<- M_step(tau, x)
theta
## Assessing convergence
<- dcomponents(theta, x)
likelihoods +1] <- - 2 * sum(log(rowSums(likelihoods)))
deviance[t
## prepare next iterations
if (abs(deviance[t + 1] - deviance[t]) < threshold)
break
}
list(theta = theta, deviance = deviance[t + 1])
}
3 Epilepsy data
The data seizures.csv
consists of daily counts of epileptic seizures for 6 patients.
<- readr::read_csv('../../data/seizures.csv')
seizures %>%
seizures ggplot() + aes(x = time, y = nsz) + geom_point() + facet_wrap( ~ id)
Select the id 12 and fit a Poisson distribution to the number of seizures for this patient.
Implement and use a EM algorithm for fitting a mixture of 2 Poisson distributions to this data
Compare the two models