If you want to enjoy and discover subtle Rmarkdown syntax, you can access the source code of this document on the github repository: https://github.com/jchiquet/MAP573/blob/master/homework_1_reporting_correction.Rmd
Let us load a couple of packages useful in our analyses:
library(tidyverse)
library(simputation)
The first part of this assignment is to use the swirl package to continue your basic R training.
A bit long but essentially straightforward.
The ‘fertility.csv’ data set contains the female fertility in Europe in 2012 for 39 European countries splitted by age groups 15-19, 20-24, 25-29, 30-34, 35-39, 40 and over. The fertility is given as the mean number of children born per 1000 women.
read.csv
). Store it into a data frame called fertility.I suggest two solutions, one in base R and another one using readr
.
## base R
<- read.csv("data/fertility.csv", row.names = 1, header = TRUE) %>% as_tibble()
fertility ## readr
<- read_csv("data/fertility.csv") %>% select(-id) fertility
The kable
function is useful to format table-like printing The formating is adapted to the type of output (PDF, HTML, doc).
%>% head() %>% knitr::kable() fertility
Pays | Region | Age | Fertilite |
---|---|---|---|
Albanie | Europe sud | 15-19 | 20.8 |
Albanie | Europe sud | 20-24 | 107.1 |
Albanie | Europe sud | 25-29 | 126.7 |
Albanie | Europe sud | 30-34 | 72.3 |
Albanie | Europe sud | 35-39 | 24.4 |
Albanie | Europe sud | >=40 | 5.5 |
summary
)The tibble’s command glimpse
allows us to have a quick overview of the structure of your table:
%>% tibble::glimpse() fertility
## Rows: 234
## Columns: 4
## $ Pays <chr> "Albanie", "Albanie", "Albanie", "Albanie", "Albanie", "Alb…
## $ Region <chr> "Europe sud", "Europe sud", "Europe sud", "Europe sud", "Eu…
## $ Age <chr> "15-19", "20-24", "25-29", "30-34", "35-39", ">=40", "15-19…
## $ Fertilite <dbl> 20.8, 107.1, 126.7, 72.3, 24.4, 5.5, 8.0, 36.4, 78.9, 94.6,…
The command summary
- when applied on a table like object - gives a summary statistics of each column, depending on its type (factor, numeric).
%>% summary() fertility
## Pays Region Age Fertilite
## Length:234 Length:234 Length:234 Min. : 3.50
## Class :character Class :character Class :character 1st Qu.: 14.82
## Mode :character Mode :character Mode :character Median : 47.75
## Mean : 52.59
## 3rd Qu.: 84.97
## Max. :136.70
Let us try to answer this in a (more or less) tidy way:
%>%
fertility group_by(Region) %>%
summarise(
`fertility min ` = min(Fertilite),
`smallest (Country)` = Pays[which.min(Fertilite)],
`smallest (Age)` = Age[which.min(Fertilite)],
`fertility max ` = max(Fertilite),
`highest (Country)` = Pays[which.max(Fertilite)],
`highest (Age)` = Age[which.max(Fertilite)]) %>% knitr::kable()
## `summarise()` ungrouping output (override with `.groups` argument)
Region | fertility min | smallest (Country) | smallest (Age) | fertility max | highest (Country) | highest (Age) |
---|---|---|---|---|---|---|
Europe est | 3.8 | Moldavie | >=40 | 106.7 | Bielorussie | 25-29 |
Europe nord | 4.4 | Danemark | 15-19 | 134.1 | Suede | 30-34 |
Europe Ouest | 3.5 | Suisse | 15-19 | 136.7 | Pays-Bas | 30-34 |
Europe sud | 4.4 | Macedoine | >=40 | 126.7 | Albanie | 25-29 |
I first recode the levels in Age for presenting them in the more interpretable and natural order.
%>%
fertility mutate(Age = fct_relevel(Age, c("15-19", "20-24", "25-29", "30-34", "35-39", ">=40"))) %>%
ggplot() + aes(x = Age, y = Fertilite) + geom_boxplot()
Bonus: split by region.
%>%
fertility mutate(Age = fct_relevel(Age, c("15-19", "20-24", "25-29", "30-34", "35-39", ">=40"))) %>%
ggplot() + aes(x = Age, y = Fertilite, fill = Region) +
geom_boxplot() + geom_jitter(alpha = 0.5, size = 1)
Same idea: use filter
from dplyr:
%>%
fertility filter(Age == "15-19") %>%
group_by(Region) %>% #
ggplot() + aes(x = Region, y = Fertilite) + geom_boxplot()
I had to make a research on the internet to find the appropriate “scoping” (function stringr::str_detect()
)
filter(fertility, str_detect(Pays, "^A")) %>% knitr::kable()
Pays | Region | Age | Fertilite |
---|---|---|---|
Albanie | Europe sud | 15-19 | 20.8 |
Albanie | Europe sud | 20-24 | 107.1 |
Albanie | Europe sud | 25-29 | 126.7 |
Albanie | Europe sud | 30-34 | 72.3 |
Albanie | Europe sud | 35-39 | 24.4 |
Albanie | Europe sud | >=40 | 5.5 |
Allemagne | Europe Ouest | 15-19 | 8.0 |
Allemagne | Europe Ouest | 20-24 | 36.4 |
Allemagne | Europe Ouest | 25-29 | 78.9 |
Allemagne | Europe Ouest | 30-34 | 94.6 |
Allemagne | Europe Ouest | 35-39 | 50.9 |
Allemagne | Europe Ouest | >=40 | 9.3 |
Autriche | Europe Ouest | 15-19 | 8.6 |
Autriche | Europe Ouest | 20-24 | 44.5 |
Autriche | Europe Ouest | 25-29 | 89.2 |
Autriche | Europe Ouest | 30-34 | 94.6 |
Autriche | Europe Ouest | 35-39 | 46.8 |
Autriche | Europe Ouest | >=40 | 9.6 |
Some filtering again, and adjusted binwidth (the default choice send warning)
%>%
fertility filter(Region == 'Europe nord') %>%
ggplot() + aes(x = Fertilite) + geom_histogram(binwidth = 10)
density
)The function density
is from the stats package, loaded at start-up. It produces an object for which a plot method is available:
%>%
fertility filter(Region == 'Europe est') %>% pull(Fertilite) %>%
density() %>% plot()
I just had to edit the wikipedia page, extract the formula and copy-paste. Here it is:
\[ \widehat{f}_h(x) = \frac{1}{n}\sum_{i=1}^n K_h (x - x_i) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big), \]
The ‘ozoneNA.csv’ data set contains 112 observations about air pollution collected during the summer of 2001 in Rennes. The aim of this study is to analyze the relationship between the maximum daily ozone level (maxO3) and the other meteorological variables. Hence, the variables recorded in this data set are the following:
This data is incomplete - there are missing values. Indeed, it occurs frequently to have machines that fail one day, leading to some information not recorded. In this exercice, we will perform extremely basic imputation of missing values for the variable maxO3
, measuring the maximum value of ozone of the day.
ozone
.The first column, which contains no header in the csv
file, is an id from the date of sampling. We do not need it so we remove it just after reading the data:
<- read_csv("data/ozoneNA.csv") %>% select(-X1) ozone
## Warning: Missing column names filled in: 'X1' [1]
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## X1 = col_double(),
## maxO3 = col_double(),
## T9 = col_double(),
## T12 = col_double(),
## T15 = col_double(),
## Ne9 = col_double(),
## Ne12 = col_double(),
## Ne15 = col_double(),
## Vx9 = col_double(),
## Vx12 = col_double(),
## Vx15 = col_double(),
## maxO3v = col_double(),
## WindDirection = col_character()
## )
maxO3
.%>%
ozone ggplot() + aes(x = maxO3) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16 rows containing non-finite values (stat_bin).
NA
values in maxO3
by the mean value. Create a vector maxO3_NA_mean
.Let us do this in base R
. Notice the na.rm
option in the call to function mean
:
<- ozone$maxO3
maxO3_NA_mean is.na(maxO3_NA_mean)] <- mean(maxO3_NA_mean, na.rm = TRUE) maxO3_NA_mean[
maxO3
based on a distance computed with all remaining variables. To do so, use::impute_knn(ozone, maxO3 ~ .) simputation
where ozone
is your data frame. The string maxO3 ~ .
is a formula explaining that maxO3
is the variable on which imputation must be performed, and while ~ .
says that all remaining variables of the data frame are used to perform the imputation. Create a vector maxO3_NA_knn
with the vector with imputed values.
We apply the imputation method on the while data frame and extract the correspoding column to store the reslut in a simple numeric vector:
<- ozone %>% as.data.frame() %>%
maxO3_NA_knn ::impute_knn(maxO3 ~ .) %>% pull(maxO3) simputation
maxO3
,maxO3_NA_mean
and maxO3_NA_knn
.We put all results together in the same data frame (remark the name exlicitly given), reshape it with pivot_longer
a make an histogram for each. Remark the detrimental effect of mean imputation on the histogram, compare to KNN (far from ideal, though).
tibble(
no_imputation = ozone$maxO3,
mean_imputation = maxO3_NA_mean,
knn_imputation = maxO3_NA_knn
%>%
) pivot_longer(everything()) %>%
ggplot() + aes(x = value) + geom_histogram() + facet_grid(~ name)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 16 rows containing non-finite values (stat_bin).