Statistical tests: exercices

1 Preliminary

Only functions from R-base and stats (preloaded) are required plus packages from the tidyverse for data representation and manipulation:

library(tidyverse)
theme_set(theme_bw())

2 Gene expression

The dataset geHT.csv consists of gene expression measurements for ten genes under control and treatment conditions, with four replicates each.

gene_data  <- readr::read_csv("../../data/geHT.csv")
controls   <- gene_data %>% dplyr::select(starts_with("c")) %>% unlist()
treatments <- gene_data %>% dplyr::select(starts_with("t")) %>% unlist()
  1. Plot the data using boxplots to compare control versus treatment

  2. Test the hypothesis that the mean of the control expression values is 2000.

Hint: Use the t.test function to perform the test (> ?t.test for more information about this function)

  1. Test that there is no difference overall between the treatments and controls for any of the genes (test that the whole experiment didn’t work or there are no differentially expressed genes)

Hint: Compare the results ignoring/knowing that the data are paired data.

  1. Test if the variances for the gene expression are the same under treatment or control conditions

Hint: Perform a F test to compare the two variances using the var.test function (> ?var.test for more information about this function)

  1. Considering that the measures of control and treatment are made on the same gene, do a paired t-test.

Hint: Use t.test with option paired = TRUE.

  1. For a significance level \alpha=10^{-10}, plot the power of the paired and Welch t-test for n varying from 3 to 20, taking the observed value of the statistic as \delta.

Hint: Use the power.t.test function.

  1. For n=10 samples, plot the power of the paired and Welch t-test for \alpha varying from $10^{-12} to $10^{-5}, taking the observed value of the statistic as \delta.

3 Smoking, no smoking

  1. There are 88 smokers among a group of 300 people of a same population. Test that the proportion of smokers in this population is less than or equal to 0.25, greater than or equal to 0.25, equal to 0.25. Show that we can use an exact test, or a test relying on an approximation.

Hint: What is the probability distribution of the number X_1 of smokers in this group of n_1=300 people ? Look at the binom.test and prop.test functions.

  1. There are 90 smokers in another group of 400 people, coming from another population. Can we conclude that the proportion of smokers are different in these two populations?

Hint: look at the fisher.test and prop.test functions.

## help in building thecontingency table
smokers     <- c(88, 90)
non.smokers <- c(300-88, 400-90)
smoke <- matrix(c(smokers,non.smokers), nrow = 2)
smoke
     [,1] [,2]
[1,]   88  212
[2,]   90  310

4 Efficacy of the BNT162b2 mRNA Covid-19 Vaccine

In an ongoing multinational, placebo-controlled, observer-blinded, pivotal efficacy trial, a total of 43,548 participants underwent randomization, of whom 43,448 received injections: 21,720 with BNT162b2 and 21,728 with placebo. There were 8 cases of Covid-19 with onset at least 7 days after the second dose among participants assigned to receive BNT162b2 and 162 cases among those assigned to placebo.

Based on this result, Pfizer concludes that BNT162b2 was 95% effective in preventing Covid-19.

  1. How was this value obtained?

Hint: The idea is to compare the observed number of BNT162b2 participants who were really infected with the expected number of BNT162b2 participants who would have been infected if the vaccine was not more effective than the placebo.

Let p_0 and p_1 be the probabilities to be infected in the placebo and BNT162b2 groups. How can p_0 and p_1 be estimated? How can the expected number of BNT162b2 participants be estimated under the null hypothesis (p_0=p_1)?

  1. Derive a 95% confidence interval for the vaccine efficacy VE using a normal approximation for \log(\hat{p}_1/\hat{p}_0) where \hat{p}_0 and \hat{p}_1 are the empirical proportions of cases of Covid-19 in the two groups,

Hint: What are the asymptotic distributions of \hat{p}_0 and \hat{p}_1? Using the delta-method, compute the asymptotic distribution for \log(\hat{p}_0) and \log(\hat{p}_1) and \log(\hat{p}_1/\hat{p}_0). Derive a confidence interval for \log(p_1/p0) and then for VE.

  1. Derive a 95% confidence interval for the vaccine efficacy approximating a confidence interval for {p}_1/{p}_0 by a prediction interval for \hat{p}_1/\hat{p}_0 and using a Monte-Carlo simulation for estimating this prediction interval.

Hint:

K <- 1000000
x0 <- rbinom(K, n0, hat.p0)
x1 <- rbinom(K, n1, hat.p1)
## just complete to estimate the target quantities

5 Identification of genes

Breast cancer is the most common malignant disease in Western women. In these patients, it is not the primary tumour, but its metastases at distant sites that are the main cause of death.

Prognostic markers are needed to identify patients who are at the highest risk for developing metastases, which might enable oncologists to begin tailoring treatment strategies to individual patients. Gene-expression signatures of primary breast tumours might be one way to identify the patients who are most likely to develop metastatic cancer.

The datafile geneMFS.csv contains the expression level of 11 genes and the metastasis-free survival (the period until metastasis is detected) for 527 patients.

cancer <- read_csv("../../data/geneMFS.csv") 

The objective of this study is to identify which genes may be good or poor prognosis for the development of metastasis.

  1. Graphically compare the distribution of the gene expressions in the groups of patients with early metastasis (MFS <1000) and late metastasis (MFS>1000).

Hint: Consider using scale_y_log10()

  1. Compare the gene expression levels in these two groups using a parametric test.

Hint: Perform a t-test for comparing the log-expression of each gene. Use the p.adjust function to apply the Bonferroni correction and the Benjamini-Hochberg correction to these p-values.

  1. Compare these results with those obtained using a non parametric test.

Hint: Perform Wilcoxon rank sum tests (using the wilcox.test function) instead of t tests.