Mixture models

Exercices

1 Preliminary

Only functions from R-base and stats (preloaded) are required plus packages from the tidyverse for data representation and manipulation.

library(tidyverse)
theme_set(theme_bw())

2 Introduction to mixture models

The faithful data consist of the waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

data("faithful")
faithful %>% 
  ggplot() + aes(x=waiting) + geom_histogram() + xlab("waiting (mm)")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Fit variants of a mixture of two Gaussian distributions to the faithful data.

  1. assuming different proportions, means and variances for the 2 distributions

  2. assuming same variances

  3. assuming same means

  4. assuming same proportions

2.1 Hint: skeleton of EM for 1D mixture of Gaussian

We follow what has been done during the course: we define the function that computes the value of the likelihood for each point in each component. It will be used to compute the posterior probabilities in the E-step.

## likelihood (density) of each point in each component
## data (x) is fixed while theta is variable
dcomponents <- function(theta, x) {
    mapply(
      function(pi, mu, sigma) pi * dnorm(x, mu, sigma),
      theta$pi, theta$mu, theta$sigma,
      SIMPLIFY = TRUE
    )
}

Then, we define an EM where the function that computes the M step is given as an argument:

mixture_gaussian1D <- 
  function(x,      # a n-vector of data
           theta0, # starting values of vector paramter
           M_step    = M_step_general, # an R function
           max_iter  = 100,
           threshold = 1e-6) {

  ## initialization
  n <- length(x)
  likelihoods  <- dcomponents(theta0, x) 
  deviance     <- numeric(max_iter)
  deviance[1]  <- -2 * sum(log(rowSums(likelihoods)))

  for (t in 1:max_iter) {
    
    # E step
    tau <- likelihoods / rowSums(likelihoods)

    # M step
    theta <- M_step(tau, x)
    
    ## Assessing convergence
    likelihoods   <- dcomponents(theta, x)
    deviance[t+1] <- - 2 * sum(log(rowSums(likelihoods)))

    ## prepare next iterations
    if (abs(deviance[t + 1] - deviance[t]) < threshold)
      break
    
  }

  list(theta = theta, deviance = deviance[t + 1])
}

3 Epilepsy data

The data seizures.csv consists of daily counts of epileptic seizures for 6 patients.

seizures <- readr::read_csv('../../data/seizures.csv')
seizures %>% 
  ggplot() + aes(x = time, y = nsz) + geom_point() + facet_wrap( ~ id)

  1. Select the id 12 and fit a Poisson distribution to the number of seizures for this patient.

  2. Implement and use a EM algorithm for fitting a mixture of 2 Poisson distributions to this data

  3. Compare the two models