Preliminaries

Remark

If you want to enjoy and discover subtle Rmarkdown syntax, you can access the source code of this document on the github repository: https://github.com/jchiquet/MAP573/blob/master/homework_1_reporting_correction.Rmd

Dependencies

Let us load a couple of packages useful in our analyses:

library(tidyverse)
library(simputation)

Part 1: Swirl - no correction

The first part of this assignment is to use the swirl package to continue your basic R training.

A bit long but essentially straightforward.

Part 2: Basic exploratory data analysis

The ‘fertility.csv’ data set contains the female fertility in Europe in 2012 for 39 European countries splitted by age groups 15-19, 20-24, 25-29, 30-34, 35-39, 40 and over. The fertility is given as the mean number of children born per 1000 women.

  1. Download and read the data set from the data directory of MAP573’s github repository (use, e.g. read.csv). Store it into a data frame called fertility.

I suggest two solutions, one in base R and another one using readr.

## base R
fertility <- read.csv("data/fertility.csv", row.names = 1, header = TRUE) %>% as_tibble()
## readr
fertility <- read_csv("data/fertility.csv") %>% select(-id)
  1. Display the first 10 rows in the R markdown report.

The kable function is useful to format table-like printing The formating is adapted to the type of output (PDF, HTML, doc).

fertility %>% head() %>% knitr::kable()
Pays Region Age Fertilite
Albanie Europe sud 15-19 20.8
Albanie Europe sud 20-24 107.1
Albanie Europe sud 25-29 126.7
Albanie Europe sud 30-34 72.3
Albanie Europe sud 35-39 24.4
Albanie Europe sud >=40 5.5
  1. Make a summary of the data and display it (command summary)

The tibble’s command glimpse allows us to have a quick overview of the structure of your table:

fertility %>% tibble::glimpse()
## Rows: 234
## Columns: 4
## $ Pays      <chr> "Albanie", "Albanie", "Albanie", "Albanie", "Albanie", "Alb…
## $ Region    <chr> "Europe sud", "Europe sud", "Europe sud", "Europe sud", "Eu…
## $ Age       <chr> "15-19", "20-24", "25-29", "30-34", "35-39", ">=40", "15-19…
## $ Fertilite <dbl> 20.8, 107.1, 126.7, 72.3, 24.4, 5.5, 8.0, 36.4, 78.9, 94.6,…

The command summary - when applied on a table like object - gives a summary statistics of each column, depending on its type (factor, numeric).

fertility %>% summary()
##      Pays              Region              Age              Fertilite     
##  Length:234         Length:234         Length:234         Min.   :  3.50  
##  Class :character   Class :character   Class :character   1st Qu.: 14.82  
##  Mode  :character   Mode  :character   Mode  :character   Median : 47.75  
##                                                           Mean   : 52.59  
##                                                           3rd Qu.: 84.97  
##                                                           Max.   :136.70
  1. Find the country and age group with the highest, smallest value in each Region

Let us try to answer this in a (more or less) tidy way:

fertility %>% 
  group_by(Region) %>% 
  summarise(
    `fertility min `     = min(Fertilite),
    `smallest (Country)` = Pays[which.min(Fertilite)],
    `smallest (Age)`     = Age[which.min(Fertilite)],
    `fertility max `     = max(Fertilite),
    `highest (Country)`  = Pays[which.max(Fertilite)],
    `highest (Age)`      = Age[which.max(Fertilite)]) %>% knitr::kable()
## `summarise()` ungrouping output (override with `.groups` argument)
Region fertility min smallest (Country) smallest (Age) fertility max highest (Country) highest (Age)
Europe est 3.8 Moldavie >=40 106.7 Bielorussie 25-29
Europe nord 4.4 Danemark 15-19 134.1 Suede 30-34
Europe Ouest 3.5 Suisse 15-19 136.7 Pays-Bas 30-34
Europe sud 4.4 Macedoine >=40 126.7 Albanie 25-29
  1. Display boxplots of the fertility for each age group

I first recode the levels in Age for presenting them in the more interpretable and natural order.

fertility %>% 
  mutate(Age = fct_relevel(Age, c("15-19", "20-24", "25-29", "30-34", "35-39", ">=40"))) %>%  
  ggplot() + aes(x = Age, y = Fertilite) + geom_boxplot()

Bonus: split by region.

fertility %>% 
  mutate(Age = fct_relevel(Age, c("15-19", "20-24", "25-29", "30-34", "35-39", ">=40"))) %>%  
  ggplot() + aes(x = Age, y = Fertilite, fill = Region) + 
    geom_boxplot() + geom_jitter(alpha = 0.5, size = 1)

  1. Display boxplots of the fertility for each ‘Region’ for the youngest age group

Same idea: use filter from dplyr:

fertility %>% 
  filter(Age == "15-19") %>%
  group_by(Region) %>%  # 
  ggplot() + aes(x = Region, y = Fertilite) + geom_boxplot()

  1. Extract all the data rows with Country starting by “A”

I had to make a research on the internet to find the appropriate “scoping” (function stringr::str_detect())

filter(fertility, str_detect(Pays, "^A")) %>% knitr::kable()
Pays Region Age Fertilite
Albanie Europe sud 15-19 20.8
Albanie Europe sud 20-24 107.1
Albanie Europe sud 25-29 126.7
Albanie Europe sud 30-34 72.3
Albanie Europe sud 35-39 24.4
Albanie Europe sud >=40 5.5
Allemagne Europe Ouest 15-19 8.0
Allemagne Europe Ouest 20-24 36.4
Allemagne Europe Ouest 25-29 78.9
Allemagne Europe Ouest 30-34 94.6
Allemagne Europe Ouest 35-39 50.9
Allemagne Europe Ouest >=40 9.3
Autriche Europe Ouest 15-19 8.6
Autriche Europe Ouest 20-24 44.5
Autriche Europe Ouest 25-29 89.2
Autriche Europe Ouest 30-34 94.6
Autriche Europe Ouest 35-39 46.8
Autriche Europe Ouest >=40 9.6
  1. Extract the fertility values of the Region from North Europe and make a histogram

Some filtering again, and adjusted binwidth (the default choice send warning)

fertility %>% 
  filter(Region == 'Europe nord') %>% 
  ggplot() + aes(x = Fertilite) + geom_histogram(binwidth = 10)

  1. Extract the fertility values of the Region from East Europe and plot a kernel density estimate (function density)

The function density is from the stats package, loaded at start-up. It produces an object for which a plot method is available:

fertility %>% 
  filter(Region == 'Europe est') %>% pull(Fertilite) %>% 
  density() %>% plot()

  1. Include in your report the mathematical expression of the kernel density estimator as given at https://en.wikipedia.org/wiki/Kernel_density_estimation

I just had to edit the wikipedia page, extract the formula and copy-paste. Here it is:

\[ \widehat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big), \]

Part 3: Basic handling of missing values

The ‘ozoneNA.csv’ data set contains 112 observations about air pollution collected during the summer of 2001 in Rennes. The aim of this study is to analyze the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. Hence, the variables recorded in this data set are the following:

  • maxO3 (maximum daily ozone)
  • maxO3v (maximum daily ozone the previous day)
  • T12 (temperature at midday)
  • T9 (Temp at 9am)
  • T15 (Temp at 3pm)
  • Vx12 (projection of the wind speed vector on the east-west axis at midday)
  • Vx9 and Vx15 (same at 9am and 3pm)
  • Ne9, Ne12, Ne15 Nebulosity (cloud)

This data is incomplete - there are missing values. Indeed, it occurs frequently to have machines that fail one day, leading to some information not recorded. In this exercice, we will perform extremely basic imputation of missing values for the variable maxO3, measuring the maximum value of ozone of the day.

  1. Import the data as a data frame named ozone.

The first column, which contains no header in the csv file, is an id from the date of sampling. We do not need it so we remove it just after reading the data:

ozone <- read_csv("data/ozoneNA.csv") %>% select(-X1)
## Warning: Missing column names filled in: 'X1' [1]
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   X1 = col_double(),
##   maxO3 = col_double(),
##   T9 = col_double(),
##   T12 = col_double(),
##   T15 = col_double(),
##   Ne9 = col_double(),
##   Ne12 = col_double(),
##   Ne15 = col_double(),
##   Vx9 = col_double(),
##   Vx12 = col_double(),
##   Vx15 = col_double(),
##   maxO3v = col_double(),
##   WindDirection = col_character()
## )
  1. Represent an histogram of the variable maxO3.
ozone %>% 
  ggplot() + aes(x = maxO3) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16 rows containing non-finite values (stat_bin).

  1. Perform a very basic imputation by replacing NA values in maxO3 by the mean value. Create a vector maxO3_NA_mean.

Let us do this in base R. Notice the na.rm option in the call to function mean:

maxO3_NA_mean <- ozone$maxO3
maxO3_NA_mean[is.na(maxO3_NA_mean)] <- mean(maxO3_NA_mean, na.rm = TRUE)
  1. We suggest now to perform imputation via the k-nearest neighbor approach (see https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm ). Use the package simputation to perform knn-imputation of the missing-value in maxO3 based on a distance computed with all remaining variables. To do so, use
simputation::impute_knn(ozone, maxO3 ~ .)

where ozone is your data frame. The string maxO3 ~ . is a formula explaining that maxO3 is the variable on which imputation must be performed, and while ~ . says that all remaining variables of the data frame are used to perform the imputation. Create a vector maxO3_NA_knn with the vector with imputed values.

We apply the imputation method on the while data frame and extract the correspoding column to store the reslut in a simple numeric vector:

maxO3_NA_knn <- ozone %>% as.data.frame() %>% 
  simputation::impute_knn(maxO3 ~ .) %>% pull(maxO3)
  1. Compare visually the distributions of the three vectors maxO3,maxO3_NA_mean and maxO3_NA_knn.

We put all results together in the same data frame (remark the name exlicitly given), reshape it with pivot_longer a make an histogram for each. Remark the detrimental effect of mean imputation on the histogram, compare to KNN (far from ideal, though).

tibble(
  no_imputation   = ozone$maxO3,
  mean_imputation = maxO3_NA_mean, 
  knn_imputation  = maxO3_NA_knn
  ) %>% 
  pivot_longer(everything()) %>% 
  ggplot() + aes(x = value) + geom_histogram() + facet_grid(~ name)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16 rows containing non-finite values (stat_bin).