We start by loading a couple of packages for data manipulation and representation. Please install them using install.packages(“name_of_the_pacakge”) if not already done.
library(tidyverse) # advanced data manipulation and vizualisation
library(knitr) # R notebook export and formatting
library(ggsci) # Fancy color palettes inspired by scientific journals
library(cowplot) # To combine several plots on the same canvas
This exercise relies on the native R dataset “UCBAdmissions” which stores student admissions at UC Berkeley for the six largest departments in 1973 in function of the student gender. Please note that the data is originally stored as a 3-dimensional array, which we convert into a data frame for the rest of the exercise.
#TO RUN
#3-dim array UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
<- as.data.frame(UCBAdmissions) #Convert into a data frame
UCBAdmissions head(UCBAdmissions)
## Admit Gender Dept Freq
## 1 Admitted Male A 512
## 2 Rejected Male A 313
## 3 Admitted Female A 89
## 4 Rejected Female A 19
## 5 Admitted Male B 353
## 6 Rejected Male B 207
Question 1 : Have a closer look at the dataset : how many rows? How many variables? How many observations (i.e. number of applications)? How many different college departments are there? Briefly describe the variables.
%>% nrow() #24 rows UCBAdmissions
## [1] 24
%>% ncol() #4 variables UCBAdmissions
## [1] 4
%>% select(Dept) %>% unique() %>% nrow() #Number of departments UCBAdmissions
## [1] 6
sum(UCBAdmissions$Freq) #Total number of observations
## [1] 4526
Correction: The dataset comprises 24 rows and 4 variables : Admit (Admitted or Rejected), Gender (Male or Female), Dept (the university department) and Freq (the number of students). In total 4526 observations are stored (which is the sum of the column Freq). There are 6 different departments whose names go from A to F.
Question 2 : Using what you have learned about tidy data and the dplyr package, compute the total number of admissions and rejections per Gender as well as the admission rate (number of admissions divided by number of aplications) per Gender.
Do you see a difference of admission rates per Gender at Berkeley?
%>%
UCBAdmissions group_by(Gender, Admit) %>%
summarise(count = sum(Freq)) %>%
group_by(Gender) %>%
mutate(rate_per_gender = count / sum(count))
## # A tibble: 4 x 4
## # Groups: Gender [2]
## Gender Admit count rate_per_gender
## <fct> <fct> <dbl> <dbl>
## 1 Male Admitted 1198 0.445
## 2 Male Rejected 1493 0.555
## 3 Female Admitted 557 0.304
## 4 Female Rejected 1278 0.696
Correction : It seem that there is a huge difference between male admissions (about 45% percent of applications) and female admissions (only 30% of applications).
Question 3 : Compute now the number of :
Are there departments with more applications than others? Do the admission rates differ according to the department?
%>%
UCBAdmissions group_by(Dept, Admit) %>%
summarise(count = sum(Freq)) %>%
group_by(Dept) %>%
mutate(total_number_applications = sum(count)) %>%
mutate(rate_per_dpt = count / total_number_applications) %>%
filter(Admit == "Admitted") #No need to keep the rows for rejection since they do not add more information.
## # A tibble: 6 x 5
## # Groups: Dept [6]
## Dept Admit count total_number_applications rate_per_dpt
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 A Admitted 601 933 0.644
## 2 B Admitted 370 585 0.632
## 3 C Admitted 322 918 0.351
## 4 D Admitted 269 792 0.340
## 5 E Admitted 147 584 0.252
## 6 F Admitted 46 714 0.0644
Correction The number of applications per departement does not vary much : it goes from 584 (department E) to 933 (department A). There is yet a huge difference in admission rates according to the departments, since it goes from about 6% (department F) to 64% (department A).
Question 4 : Compute finally the rates of male/female applications per department (without taking into account the admission results).
Can you draw an hypothesis to explain the difference of rate admissions between male and female (question 2)?
%>%
UCBAdmissions group_by(Dept, Gender) %>%
summarise(count = sum(Freq)) %>%
group_by(Dept) %>%
mutate(total_number_applications = sum(count)) %>%
mutate(gender_rate_per_dpt = count / sum(count))
## # A tibble: 12 x 5
## # Groups: Dept [6]
## Dept Gender count total_number_applications gender_rate_per_dpt
## <fct> <fct> <dbl> <dbl> <dbl>
## 1 A Male 825 933 0.884
## 2 A Female 108 933 0.116
## 3 B Male 560 585 0.957
## 4 B Female 25 585 0.0427
## 5 C Male 325 918 0.354
## 6 C Female 593 918 0.646
## 7 D Male 417 792 0.527
## 8 D Female 375 792 0.473
## 9 E Male 191 584 0.327
## 10 E Female 393 584 0.673
## 11 F Male 373 714 0.522
## 12 F Female 341 714 0.478
Correction There is a huge difference in application rates per department according to the Gender : males tend to apply more to departments A (88%) and B (95%) , women tend to apply more to departments C (64%) and E (67%). There is about a 50% gender rate application for departments D and F. When crossing these results with the results from the previous question, it seems that women tend to apply more to departments with lower application rates (30 and 25% respectively for department C and E). This may explain the huge difference in admission rates between men and women.
The paradox you have illustrated in the previous part is called the Simpson’s paradox. You can read more about it here.
The goal of this section is to illustrate this paradox by plotting the sex admission rate (in percentage) per department.
Question 5 Please observe the plot thereafter and reproduce it using ggplot2.
include_graphics("resources/sex_admission_rate.png")
Hints :
The color palette is “JAMA”, which comes from the package ‘ggsci’ loaded at the beginning of the homework. You can use the function “scale_color_jama()” or “scale_fill_jama()”.
You can use the function round to round the percentage of admission per sex.
You can use the function paste0 to concatenate in a single string the number of applications (Freq) and the sex admission rate in percentage. The special character ‘\n’ is used for line break.
To use geom bar with pre-computed Frequencies (and thus with an “x” and a “y” aesthetics) you should use stat = “identity”.
To ask geom_text to use a dodge position, you can set the argument “position” to the value position_dodge(width =1).
The argument “size” from the function geom_text enables to change the font size.
#Step 1 : compute the pourcentage of admission rate per gender and per departments
<- UCBAdmissions %>%
UCBAdmissions_sexrate group_by(Dept, Gender) %>%
mutate(percent_admission_sex = round(100* Freq / sum(Freq))) %>%
filter(Admit == "Admitted") %>%
mutate(label = paste0("n = ", Freq, "\n(", percent_admission_sex,"%)"))
head(UCBAdmissions_sexrate)
## # A tibble: 6 x 6
## # Groups: Dept, Gender [6]
## Admit Gender Dept Freq percent_admission_sex label
## <fct> <fct> <fct> <dbl> <dbl> <chr>
## 1 Admitted Male A 512 62 "n = 512\n(62%)"
## 2 Admitted Female A 89 82 "n = 89\n(82%)"
## 3 Admitted Male B 353 63 "n = 353\n(63%)"
## 4 Admitted Female B 17 68 "n = 17\n(68%)"
## 5 Admitted Male C 120 37 "n = 120\n(37%)"
## 6 Admitted Female C 202 34 "n = 202\n(34%)"
#Step 2 : plot
ggplot(data = UCBAdmissions_sexrate, aes(x = Dept, y = percent_admission_sex, fill = Gender)) +
geom_bar(stat = "identity",position = "dodge") +
scale_fill_jama() +
geom_text(aes(label = label, y = percent_admission_sex/2), position = position_dodge(width=1), color = "white", size = 2) +
theme_minimal() +
xlab("Department") +
ylab("Sex admission rate (in percentage)") +
theme(legend.position = "bottom") +
labs(title = "Admission rate at UC Berkeley in 1993 per sex and per department", subtitle = "Illustration of the Simpson's paradox", caption = "Data coming from UCBAdmissions dataset") #+
#ggsave("resources/sex_admission_rate.png")
Exercise 2 relies on the dataset “relig_income” coming from the tidyverse package. The dataset store the results of a religion vs income survey : number of survey respondents in each category religion/income. The incomes are cut into 10k dollar ranges. Note that there are special categories for “missing” data : “Don’t know/ refused”.
relig_income
## # A tibble: 18 x 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # … with 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>, `Don't
## # know/refused` <dbl>
The goal of this exercise is to plot the distribution of religious beliefs faceted by income ranges. You see that the way the data is presented (with the income ranges in different columns) prevents us from directly using “facet_wrap” or “group_by” function. One commonly says that the dataset is in a “wide” format. We will present two methods to override this issue :
Draw separately one plot per income range and gather all the plots using the package cowplot. Though being a good introduction to cowplot, we will see why this method proves to be very inefficient here.
Transform the dataset from a wide format to a long format with tidyverse tools in order to use “facet_wrap”. This method is much more recommended.
Question 1
Let’s focus for this question on a given income range (let’s say less than 10k dollars). Note that you surround the colnames by “`” since they contain sepcial character (which is something you should definitely avoid when creating datasets).
Compute the rounded percentage of religious affiliations for this income range. Arrange the dataset per decreasing percentage and keep only the 5th top rows (you can use the function head)
Plot the distibution of the top 5 religious affiliations as a bar plot. Make sure to rename the y axis label properly as well as to choose a color palette for religion. Use geom_text to write the percentage of each religion for this income range on top of the bars (you can use vjust = 0 to align the text to the top). Remove the x axis text and x axis title since it is redondant with the legend and position the legend at the bottom of the plot. Once done, rename your ggplot object as relig_income_less_10k.
#Step 1
<- relig_income %>%
relig_income_one_range select(religion, `<$10k`) %>%
mutate(percent = round(100 *`<$10k` / sum(`<$10k`))) %>%
arrange(desc(percent)) %>%
head(5)
#Step 2
= ggplot(data = relig_income_one_range, aes(x = religion, y = `<$10k`, fill = religion)) +
relig_income_less_10k geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) +
geom_bar(stat = "identity") +
scale_fill_npg(name = "Religion") +
xlab("") +
ylab("Count") +
ggtitle("Distibution of the religous beliefs for <$10k incomes")+
theme(legend.position = "bottom",
axis.text.x = element_blank())
relig_income_less_10k
Question 2
Reproduce the code above for data income range “more than 150k dollars”. Call your plot relig_income_more_150k. What issue can you see with the color palette?
#Step 1
<- relig_income %>%
relig_income_one_range select(religion, `>150k`) %>%
mutate(percent = round(100 *`>150k` / sum(`>150k`))) %>%
arrange(desc(percent)) %>%
head(5)
#Step 2
= ggplot(data = relig_income_one_range, aes(x = religion, y = `>150k`, fill = religion)) +
relig_income_more_150k geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) +
geom_bar(stat = "identity") +
scale_fill_npg(name = "Religion") +
xlab("") +
ylab("Count") +
ggtitle("Distibution of the religous beliefs for >*150k incomes")+
theme(legend.position = "bottom",
axis.text.x = element_blank())
relig_income_more_150k
Correction The top five religions differ among the two plots. Since the two plots were generated independently with the same color palette, the color affiliation changes : there is no common legend.
Question 3
Use the function plot_grid from the cowplot package to gather the two plots into one figure (on the same row). Are you satsified with the results?
#?plot_grid
plot_grid(relig_income_less_10k,
relig_income_more_150k,nrow = 1 )
Correction : Plot_grid function is very useful to gather plots (easy to use, automatic alginements). Yet, the plots to gather must be drawn independently. As a result, there is no common legend in the final plot. Though being vey useful to gather plots coming from different sources into one figure, cowplot is less efficient that facet_wrap to stratify studies according to one variable (the income range in this case).
You have seen how to gather two plots into a single one using plot_grid. Acutually, plot_grid can take as argument a list of plots, meaning that we could have used it to draw bar plots for each level of incomes (using a for loop or a function such as lapply to generate the list of plots).
In next section, we will foccus on another method, much more useful in practice, to plot hte distribution of top 5 religious beliefs per income range.
The relig_income dataset is stored into a wide format. In this section, we are going to convert it into a “long” format meaning that we are going to create a new variable storing income ranges. The dataset in a long format will have only 3 columns (religion, income and count) but more rows than the original dataset in wide format : one row per couple religion/income range.
Question 4
Transform relig_income into relig_income_long using the function pivot_longer function from the tidyverse package.
?pivot_longer
<- relig_income %>%
relig_income_long pivot_longer(!religion,
names_to = "income",
values_to = "count")
nrow(relig_income_long) #18 religions * 10 income ranges = 180
## [1] 180
Question 5
Compute the rounded percentage of people belonging to each religion per income range. Keep only the 5 top religions per income range (you can use group_by and slice_max function).
<- relig_income_long %>%
relig_income_long_top5 group_by(income) %>%
::mutate(percent = round(100*count/sum(count))) %>%
dplyrslice_max(order_by = count, n = 5)
relig_income_long_top5
## # A tibble: 50 x 4
## # Groups: income [10]
## religion income count percent
## <chr> <chr> <dbl> <dbl>
## 1 Evangelical Prot <$10k 575 30
## 2 Catholic <$10k 418 22
## 3 Mainline Prot <$10k 289 15
## 4 Historically Black Prot <$10k 228 12
## 5 Unaffiliated <$10k 217 11
## 6 Mainline Prot >150k 634 24
## 7 Catholic >150k 633 24
## 8 Evangelical Prot >150k 414 16
## 9 Unaffiliated >150k 258 10
## 10 Jewish >150k 151 6
## # … with 40 more rows
Question 6
Plot the distribution of top 5 religions facetted by income ranges (using facet_wrap). Make sure to rename the y axis label properly as well as to choose a color palette for religion. Use geom_text to write the percentage of each religion for this income range on top of the bars (you can use vjust = 0 to align the text to the top). Remove the x axis text and title since it is redondant with the legend and position the legend at the bottom of the plot.
ggplot(data = relig_income_long_top5, aes(x = religion, y = percent, fill = religion)) +
geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) +
geom_bar(stat = "identity") +
scale_fill_npg(name = "Religion") +
xlab("") +
ylab("Count") +
facet_wrap( .~ income) +
ggtitle("Distribution of the religious beliefs")+
theme(legend.position = "bottom",
axis.text.x = element_blank())
You should see that facet_wrap creates a common legend (a religion has the same color in all the plots) and leaves blanks when a religion does stand in the top 5 religion for a given income range.
Question 7
We still have an issue concerning the order of the facet wrap plots which seems to follow the alphabetical order, while we would have liked it to be arranged from lower to higher incomes.
One way to cope with this is to set the levels of the variable income (seen as a factor variable) either by hand or by using regular expressions. In this case, we can use the order defined by the columns of the wide relig_income dataset, which were already in the good order. Re-do the plot after arranging the levels of the variable income.
$income <- factor(relig_income_long_top5$income, levels = setdiff(colnames(relig_income),"religion"))
relig_income_long_top5
ggplot(data = relig_income_long_top5, aes(x = religion, y = percent, fill = religion)) +
geom_text(aes(label= paste0(percent,"%")), color = "black", vjust = 0) +
geom_bar(stat = "identity") +
scale_fill_npg(name = "Religion") +
xlab("") +
ylab("Count") +
facet_wrap( .~ income) +
ggtitle("Distribution of the religious beliefs")+
theme(legend.position = "bottom",
axis.text.x = element_blank())