Preliminaries

Package requirements

We start by loading a couple of packages for data manipulation and representation. Please install them using install.packages(“name_of_the_pacakge”) if not already done.

library(tidyverse)    # advanced data manipulation and vizualisation
library(knitr)        # R notebook export and formatting 
library(ggsci)        # Fancy color palettes inspired by scientific journals
library(cowplot)      # To combine several plots on the same canvas

Exercise 1

Import data

This exercise relies on the native R dataset “UCBAdmissions” which stores student admissions at UC Berkeley for the six largest departments in 1973 in function of the student gender. Please note that the data is originally stored as a 3-dimensional array, which we convert into a data frame for the rest of the exercise.

#TO RUN 
UCBAdmissions #3-dim array
## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317
UCBAdmissions <- as.data.frame(UCBAdmissions) #Convert into a data frame
head(UCBAdmissions)
##      Admit Gender Dept Freq
## 1 Admitted   Male    A  512
## 2 Rejected   Male    A  313
## 3 Admitted Female    A   89
## 4 Rejected Female    A   19
## 5 Admitted   Male    B  353
## 6 Rejected   Male    B  207

Question 1 : Have a closer look at the dataset : how many rows? How many variables? How many observations (i.e. number of applications)? How many different college departments are there? Briefly describe the variables.

UCBAdmissions %>% nrow() #24 rows
## [1] 24
UCBAdmissions %>% ncol() #4 variables
## [1] 4
UCBAdmissions %>% select(Dept) %>% unique() %>% nrow() #Number of departments
## [1] 6
sum(UCBAdmissions$Freq) #Total number of observations
## [1] 4526

Correction: The dataset comprises 24 rows and 4 variables : Admit (Admitted or Rejected), Gender (Male or Female), Dept (the university department) and Freq (the number of students). In total 4526 observations are stored (which is the sum of the column Freq). There are 6 different departments whose names go from A to F.

Data description

Question 2 : Using what you have learned about tidy data and the dplyr package, compute the total number of admissions and rejections per Gender as well as the admission rate (number of admissions divided by number of aplications) per Gender.

Do you see a difference of admission rates per Gender at Berkeley?

UCBAdmissions %>%
  group_by(Gender, Admit) %>%
  summarise(count = sum(Freq)) %>%
  group_by(Gender) %>%
  mutate(rate_per_gender = count / sum(count))
## # A tibble: 4 x 4
## # Groups:   Gender [2]
##   Gender Admit    count rate_per_gender
##   <fct>  <fct>    <dbl>           <dbl>
## 1 Male   Admitted  1198           0.445
## 2 Male   Rejected  1493           0.555
## 3 Female Admitted   557           0.304
## 4 Female Rejected  1278           0.696

Correction : It seem that there is a huge difference between male admissions (about 45% percent of applications) and female admissions (only 30% of applications).

Question 3 : Compute now the number of :

  1. Applications by departement
  2. Admission/rejection rates per department (without distinction on the gender).

Are there departments with more applications than others? Do the admission rates differ according to the department?

UCBAdmissions %>%
  group_by(Dept, Admit) %>%
  summarise(count = sum(Freq)) %>%
  group_by(Dept) %>%
  mutate(total_number_applications = sum(count)) %>%
  mutate(rate_per_dpt = count / total_number_applications) %>%
  filter(Admit == "Admitted") #No need to keep the rows for rejection since they do not add more information. 
## # A tibble: 6 x 5
## # Groups:   Dept [6]
##   Dept  Admit    count total_number_applications rate_per_dpt
##   <fct> <fct>    <dbl>                     <dbl>        <dbl>
## 1 A     Admitted   601                       933       0.644 
## 2 B     Admitted   370                       585       0.632 
## 3 C     Admitted   322                       918       0.351 
## 4 D     Admitted   269                       792       0.340 
## 5 E     Admitted   147                       584       0.252 
## 6 F     Admitted    46                       714       0.0644

Correction The number of applications per departement does not vary much : it goes from 584 (department E) to 933 (department A). There is yet a huge difference in admission rates according to the departments, since it goes from about 6% (department F) to 64% (department A).

Question 4 : Compute finally the rates of male/female applications per department (without taking into account the admission results).

Can you draw an hypothesis to explain the difference of rate admissions between male and female (question 2)?

UCBAdmissions %>%
  group_by(Dept, Gender) %>%
  summarise(count = sum(Freq)) %>%
  group_by(Dept) %>%
  mutate(total_number_applications = sum(count)) %>%
  mutate(gender_rate_per_dpt = count / sum(count))
## # A tibble: 12 x 5
## # Groups:   Dept [6]
##    Dept  Gender count total_number_applications gender_rate_per_dpt
##    <fct> <fct>  <dbl>                     <dbl>               <dbl>
##  1 A     Male     825                       933              0.884 
##  2 A     Female   108                       933              0.116 
##  3 B     Male     560                       585              0.957 
##  4 B     Female    25                       585              0.0427
##  5 C     Male     325                       918              0.354 
##  6 C     Female   593                       918              0.646 
##  7 D     Male     417                       792              0.527 
##  8 D     Female   375                       792              0.473 
##  9 E     Male     191                       584              0.327 
## 10 E     Female   393                       584              0.673 
## 11 F     Male     373                       714              0.522 
## 12 F     Female   341                       714              0.478

Correction There is a huge difference in application rates per department according to the Gender : males tend to apply more to departments A (88%) and B (95%) , women tend to apply more to departments C (64%) and E (67%). There is about a 50% gender rate application for departments D and F. When crossing these results with the results from the previous question, it seems that women tend to apply more to departments with lower application rates (30 and 25% respectively for department C and E). This may explain the huge difference in admission rates between men and women.

Fancy data representation

The paradox you have illustrated in the previous part is called the Simpson’s paradox. You can read more about it here.

The goal of this section is to illustrate this paradox by plotting the sex admission rate (in percentage) per department.

Question 5 Please observe the plot thereafter and reproduce it using ggplot2.

include_graphics("resources/sex_admission_rate.png")

Hints :

  1. The color palette is “JAMA”, which comes from the package ‘ggsci’ loaded at the beginning of the homework. You can use the function “scale_color_jama()” or “scale_fill_jama()”.

  2. You can use the function round to round the percentage of admission per sex.

  3. You can use the function paste0 to concatenate in a single string the number of applications (Freq) and the sex admission rate in percentage. The special character ‘\n’ is used for line break.

  4. To use geom bar with pre-computed Frequencies (and thus with an “x” and a “y” aesthetics) you should use stat = “identity”.

  5. To ask geom_text to use a dodge position, you can set the argument “position” to the value position_dodge(width =1).

  6. The argument “size” from the function geom_text enables to change the font size.

#Step 1 : compute the pourcentage of admission rate per gender and per  departments
UCBAdmissions_sexrate <- UCBAdmissions %>%
  group_by(Dept, Gender) %>%
  mutate(percent_admission_sex = round(100* Freq / sum(Freq))) %>%
  filter(Admit == "Admitted") %>%
  mutate(label = paste0("n = ", Freq, "\n(", percent_admission_sex,"%)"))

head(UCBAdmissions_sexrate)
## # A tibble: 6 x 6
## # Groups:   Dept, Gender [6]
##   Admit    Gender Dept   Freq percent_admission_sex label           
##   <fct>    <fct>  <fct> <dbl>                 <dbl> <chr>           
## 1 Admitted Male   A       512                    62 "n = 512\n(62%)"
## 2 Admitted Female A        89                    82 "n = 89\n(82%)" 
## 3 Admitted Male   B       353                    63 "n = 353\n(63%)"
## 4 Admitted Female B        17                    68 "n = 17\n(68%)" 
## 5 Admitted Male   C       120                    37 "n = 120\n(37%)"
## 6 Admitted Female C       202                    34 "n = 202\n(34%)"
#Step 2 : plot
ggplot(data = UCBAdmissions_sexrate, aes(x = Dept, y = percent_admission_sex, fill = Gender)) + 
  geom_bar(stat = "identity",position = "dodge") +
  scale_fill_jama() +
  geom_text(aes(label  = label, y = percent_admission_sex/2), position = position_dodge(width=1), color = "white", size = 2) + 
  theme_minimal() +
  xlab("Department") +
  ylab("Sex admission rate (in percentage)") +
  theme(legend.position = "bottom") +
  labs(title = "Admission rate at UC Berkeley in 1993 per sex and per department", subtitle = "Illustration of the Simpson's paradox", caption = "Data coming from UCBAdmissions dataset") #+

  #ggsave("resources/sex_admission_rate.png")

Exercise 2

Import data

Exercise 2 relies on the dataset “relig_income” coming from the tidyverse package. The dataset store the results of a religion vs income survey : number of survey respondents in each category religion/income. The incomes are cut into 10k dollar ranges. Note that there are special categories for “missing” data : “Don’t know/ refused”.

relig_income
## # A tibble: 18 x 11
##    religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
##    <chr>      <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
##  1 Agnostic      27        34        60        81        76       137        122
##  2 Atheist       12        27        37        52        35        70         73
##  3 Buddhist      27        21        30        34        33        58         62
##  4 Catholic     418       617       732       670       638      1116        949
##  5 Don’t k…      15        14        15        11        10        35         21
##  6 Evangel…     575       869      1064       982       881      1486        949
##  7 Hindu          1         9         7         9        11        34         47
##  8 Histori…     228       244       236       238       197       223        131
##  9 Jehovah…      20        27        24        24        21        30         15
## 10 Jewish        19        19        25        25        30        95         69
## 11 Mainlin…     289       495       619       655       651      1107        939
## 12 Mormon        29        40        48        51        56       112         85
## 13 Muslim         6         7         9        10         9        23         16
## 14 Orthodox      13        17        23        32        32        47         38
## 15 Other C…       9         7        11        13        13        14         18
## 16 Other F…      20        33        40        46        49        63         46
## 17 Other W…       5         2         3         4         2         7          3
## 18 Unaffil…     217       299       374       365       341       528        407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## #   know/refused` <dbl>

The goal of this exercise is to plot the distribution of religious beliefs faceted by income ranges. You see that the way the data is presented (with the income ranges in different columns) prevents us from directly using “facet_wrap” or “group_by” function. One commonly says that the dataset is in a “wide” format. We will present two methods to override this issue :

  1. Draw separately one plot per income range and gather all the plots using the package cowplot. Though being a good introduction to cowplot, we will see why this method proves to be very inefficient here.

  2. Transform the dataset from a wide format to a long format with tidyverse tools in order to use “facet_wrap”. This method is much more recommended.

Method 1

Question 1

Let’s focus for this question on a given income range (let’s say less than 10k dollars). Note that you surround the colnames by “`” since they contain sepcial character (which is something you should definitely avoid when creating datasets).

  1. Compute the rounded percentage of religious affiliations for this income range. Arrange the dataset per decreasing percentage and keep only the 5th top rows (you can use the function head)

  2. Plot the distibution of the top 5 religious affiliations as a bar plot. Make sure to rename the y axis label properly as well as to choose a color palette for religion. Use geom_text to write the percentage of each religion for this income range on top of the bars (you can use vjust = 0 to align the text to the top). Remove the x axis text and x axis title since it is redondant with the legend and position the legend at the bottom of the plot. Once done, rename your ggplot object as relig_income_less_10k.

#Step 1 
relig_income_one_range <- relig_income %>% 
  select(religion, `<$10k`) %>%
  mutate(percent = round(100 *`<$10k` / sum(`<$10k`))) %>%
  arrange(desc(percent)) %>%
  head(5)

#Step 2
relig_income_less_10k = ggplot(data = relig_income_one_range, aes(x = religion, y = `<$10k`, fill = religion)) +
  geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) + 
  geom_bar(stat = "identity") +
  scale_fill_npg(name = "Religion") + 
  xlab("") +
  ylab("Count") + 
  ggtitle("Distibution of the religous beliefs for <$10k incomes")+ 
  theme(legend.position = "bottom", 
        axis.text.x = element_blank())
relig_income_less_10k

Question 2

Reproduce the code above for data income range “more than 150k dollars”. Call your plot relig_income_more_150k. What issue can you see with the color palette?

#Step 1 
relig_income_one_range <- relig_income %>% 
  select(religion, `>150k`) %>%
  mutate(percent = round(100 *`>150k` / sum(`>150k`))) %>%
  arrange(desc(percent)) %>%
  head(5)

#Step 2
relig_income_more_150k = ggplot(data = relig_income_one_range, aes(x = religion, y = `>150k`, fill = religion)) +
  geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) + 
  geom_bar(stat = "identity") +
  scale_fill_npg(name = "Religion") + 
  xlab("") +
  ylab("Count") + 
  ggtitle("Distibution of the religous beliefs for >*150k incomes")+ 
  theme(legend.position = "bottom", 
        axis.text.x = element_blank())
relig_income_more_150k

Correction The top five religions differ among the two plots. Since the two plots were generated independently with the same color palette, the color affiliation changes : there is no common legend.

Question 3

Use the function plot_grid from the cowplot package to gather the two plots into one figure (on the same row). Are you satsified with the results?

#?plot_grid
plot_grid(relig_income_less_10k,
          relig_income_more_150k,
          nrow = 1 )

Correction : Plot_grid function is very useful to gather plots (easy to use, automatic alginements). Yet, the plots to gather must be drawn independently. As a result, there is no common legend in the final plot. Though being vey useful to gather plots coming from different sources into one figure, cowplot is less efficient that facet_wrap to stratify studies according to one variable (the income range in this case).

You have seen how to gather two plots into a single one using plot_grid. Acutually, plot_grid can take as argument a list of plots, meaning that we could have used it to draw bar plots for each level of incomes (using a for loop or a function such as lapply to generate the list of plots).

In next section, we will foccus on another method, much more useful in practice, to plot hte distribution of top 5 religious beliefs per income range.

Method 2

The relig_income dataset is stored into a wide format. In this section, we are going to convert it into a “long” format meaning that we are going to create a new variable storing income ranges. The dataset in a long format will have only 3 columns (religion, income and count) but more rows than the original dataset in wide format : one row per couple religion/income range.

Question 4

Transform relig_income into relig_income_long using the function pivot_longer function from the tidyverse package.

?pivot_longer

relig_income_long <- relig_income %>%
  pivot_longer(!religion,
               names_to = "income", 
               values_to = "count")

nrow(relig_income_long) #18 religions * 10 income ranges = 180 
## [1] 180

Question 5

Compute the rounded percentage of people belonging to each religion per income range. Keep only the 5 top religions per income range (you can use group_by and slice_max function).

relig_income_long_top5 <- relig_income_long %>%
  group_by(income) %>%
  dplyr::mutate(percent = round(100*count/sum(count))) %>%
  slice_max(order_by = count, n = 5)
relig_income_long_top5
## # A tibble: 50 x 4
## # Groups:   income [10]
##    religion                income count percent
##    <chr>                   <chr>  <dbl>   <dbl>
##  1 Evangelical Prot        <$10k    575      30
##  2 Catholic                <$10k    418      22
##  3 Mainline Prot           <$10k    289      15
##  4 Historically Black Prot <$10k    228      12
##  5 Unaffiliated            <$10k    217      11
##  6 Mainline Prot           >150k    634      24
##  7 Catholic                >150k    633      24
##  8 Evangelical Prot        >150k    414      16
##  9 Unaffiliated            >150k    258      10
## 10 Jewish                  >150k    151       6
## # … with 40 more rows

Question 6

Plot the distribution of top 5 religions facetted by income ranges (using facet_wrap). Make sure to rename the y axis label properly as well as to choose a color palette for religion. Use geom_text to write the percentage of each religion for this income range on top of the bars (you can use vjust = 0 to align the text to the top). Remove the x axis text and title since it is redondant with the legend and position the legend at the bottom of the plot.

ggplot(data = relig_income_long_top5, aes(x = religion, y = percent, fill = religion)) +
  geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) + 
  geom_bar(stat = "identity") +
  scale_fill_npg(name = "Religion") + 
  xlab("") +
  ylab("Count") + 
  facet_wrap( .~ income) + 
  ggtitle("Distribution of the religious beliefs")+ 
  theme(legend.position = "bottom", 
        axis.text.x = element_blank())

You should see that facet_wrap creates a common legend (a religion has the same color in all the plots) and leaves blanks when a religion does stand in the top 5 religion for a given income range.

Question 7

We still have an issue concerning the order of the facet wrap plots which seems to follow the alphabetical order, while we would have liked it to be arranged from lower to higher incomes.

One way to cope with this is to set the levels of the variable income (seen as a factor variable) either by hand or by using regular expressions. In this case, we can use the order defined by the columns of the wide relig_income dataset, which were already in the good order. Re-do the plot after arranging the levels of the variable income.

relig_income_long_top5$income <- factor(relig_income_long_top5$income, levels = setdiff(colnames(relig_income),"religion"))

ggplot(data = relig_income_long_top5, aes(x = religion, y = percent, fill = religion)) +
  geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) + 
  geom_bar(stat = "identity") +
  scale_fill_npg(name = "Religion") + 
  xlab("") +
  ylab("Count") + 
  facet_wrap( .~ income) + 
  ggtitle("Distribution of the religious beliefs")+ 
  theme(legend.position = "bottom", 
        axis.text.x = element_blank())