We start by loading a couple of packages for data manipulation, dimension reduction, clustering, fancy representations, etc.
library(tidyverse) # opinionated collection of packages for data manipulation
library(aricode) # fast computation of clustering measures
theme_set(theme_bw()) # plots themes
We start with the sample of scRNA data presented during the course on non-linear dimension reduction approaches. It consists of the normalized gene-level expression of 100 representative genes for a collection of 301 cells spread across 11 cell lines.
The data are imported as follows:
load("data/scRNA.RData")
scRNA <- pollen$data %>% t() %>% as_tibble() %>%
  add_column(cell_type = pollen$celltypes)
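As a quick sanity check (not part of the original analysis), one can verify the announced dimensions and the cell-line composition:
dim(scRNA)              # expected: 301 cells x 101 columns (100 genes + cell_type)
count(scRNA, cell_type) # number of cells per cell line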
Run the k-means algorithm (function kmeans) for 11 clusters a hundred times and keep track of the within sum of squares. We replicate 100 runs of the k-means with a single start:
scRNA_matrix <- scRNA %>% select(-cell_type) %>% as.matrix()
kmeans_out <- replicate(100, kmeans(scRNA_matrix, centers = 11, nstart = 1), simplify = FALSE)
Overall, there can be a variation of up to 50% of the total within sum of squares across different initializations of the k-means:
data.frame(WSS = map_dbl(kmeans_out, "tot.withinss"), run = as.character(1:100)) %>%
ggplot() + aes(y = WSS/max(WSS)) + geom_boxplot() + ggtitle("Variation of the rescaled Total sum of squares")
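To read this spread numerically rather than from the boxplot, one can for instance look at the range of the criterion across runs:
tot_WSS <- map_dbl(kmeans_out, "tot.withinss") # total within sum of squares of each run
range(tot_WSS) / max(tot_WSS)                  # relative spread over the 100 initializations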
At the cluster level, one can see that some clusters are more stable (or unstable) than others, probably because of individuals at the boundary that are more difficult to assign to a given group.
map(kmeans_out, "withinss") %>% # transformation to a named data frame/tibble
do.call("rbind", .) %>% as_tibble() %>% setNames(paste0("cluster", 1:11)) %>%
pivot_longer(everything(), names_to = "cluster", values_to = "WSS") %>%
ggplot() + aes(x = cluster, y = WSS) + geom_boxplot() + ggtitle("Within sum of squares per cluster")
Now run the k-means for a varying number of clusters, with nstart set to 20, and save the results. map or any other apply-like functional is your friend! We go up to 30 clusters and make 50 restarts for each run, in order to get rather smooth ARI/NID curves:
nb_clusters <- 1:30
kmeans_var_cluster <- map(nb_clusters, ~kmeans(scRNA_matrix, centers = ., nstart = 50))
A consensus appears around 10 clusters: note that the curves are not perfectly smooth, partly due to the convergence of the k-means algorithm to local minima.
ARI <- map(kmeans_var_cluster, "cluster") %>% # extract the clusterings
  map_dbl(aricode::ARI, scRNA$cell_type)      # compute the ARI
NID <- map(kmeans_var_cluster, "cluster") %>% # extract the clusterings
  map_dbl(aricode::NID, scRNA$cell_type)      # compute the NID
tibble(ARI = ARI, NID = NID, nb_clusters = nb_clusters) %>%
pivot_longer(-nb_clusters, values_to = "value", names_to = "measure") %>%
group_by(measure) %>%
ggplot() + aes(x = nb_clusters, y = value, color = measure) + geom_line() +
ggtitle("K-means Clustering Performance")
Perform the same analysis with hierarchical clustering (hclust + cutree). Alright then, hclust + cutree with the Ward criterion do a job similar to the k-means:
hclust_var_cluster <- dist(scRNA_matrix, method = "euclidean") %>% hclust(method = "ward.D2")
ARI <- cutree(hclust_var_cluster, nb_clusters) %>% apply(2, ARI, scRNA$cell_type)
NID <- cutree(hclust_var_cluster, nb_clusters) %>% apply(2, NID, scRNA$cell_type)
tibble(ARI = ARI, NID = NID, nb_clusters = nb_clusters) %>%
pivot_longer(-nb_clusters, values_to = "value", names_to = "measure") %>%
group_by(measure) %>%
ggplot() + aes(x = nb_clusters, y = value, color = measure) + geom_line() +
ggtitle("Hierarchical Clustering Performance")
We use the SNP data studied during homework #3. We analyze the 5500 most variant SNPs for 728 individuals of various origins; the main descriptors are the genotype matrix (data$Geno) and the population of origin (data$origin).
The data are imported as follows:
load("data/SNP.RData")
snp <- data$Geno %>% as_tibble() %>% replace(is.na(.), 0) %>%
  add_column(origin = data$origin, .before = 1)
Try the various algorithms seen during the course (k-means, hierarchical clustering, spectral clustering, kernel versions), possibly with a dimension-reduction step first (e.g., PCA, t-SNE, UMAP). Try to get the best result in terms of ARI/NID compared to the natural classification (the origin of the population). If you have time, assess the robustness of your result in terms of ARI by resampling the initial table snp (i.e., by running the analysis several times on a random subsample of your data, say 80%, and then averaging your results in terms of ARI); a sketch of such a check is given at the end of this section.
A generally good baseline is hierarchical clustering (to choose the number of clusters visually) + k-means. In this case, they gave similar results:
hclust_snp <- dist(select(snp, -origin)) %>% hclust(method = "ward.D2")
ARI_hc <- cutree(hclust_snp, 1:20) %>% apply(2, ARI, snp$origin)
kmeans_snp <- kmeans(select(snp, -origin), centers = which.max(ARI_hc), nstart = 20)
ARI_km <- ARI(kmeans_snp$cluster, snp$origin)
cat("\nHierarchical clustering: ", max(ARI_hc))
##
## Hierarchical clustering: 0.8194639
cat("\nK-means clustering: ", max(ARI_km))
##
## K-means clustering: 0.7898888
It’s going to be hard to beat!
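To back up the claim that the two baseline partitions are close to each other (and not only close to the origins), a small additional check, reusing the objects computed above, could be:
best_k <- which.max(ARI_hc)                          # number of groups retained from the tree cut
ARI(cutree(hclust_snp, best_k), kmeans_snp$cluster)  # agreement between the two partitions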
I have tried UMAP with a custom configuration, which does not do so badly:
library(umap)
custom.settings <- umap.defaults
custom.settings$n_components <- 4
custom.settings$n_neighbors <- 20
umap_out <- select(snp, -origin) %>% as.matrix() %>%
  umap(config = custom.settings)
umap_out$layout %>% as.data.frame() %>% add_column(origin = snp$origin) %>%
  ggplot(aes(x = V1, y = V2, color = origin)) +
geom_point(size=1.25) +
guides(colour = guide_legend(override.aes = list(size=6)))
ARI(kmeans(umap_out$layout, centers = 6, nstart = 100)$cluster, snp$origin)
## [1] 0.68618
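The exercise also mentions spectral and kernel approaches. A minimal, untested sketch with kernlab (assuming the package is installed; default Gaussian kernel with automatic bandwidth, and 6 clusters as elsewhere in this section):
library(kernlab)
snp_mat <- snp %>% dplyr::select(-origin) %>% as.matrix() # plain numeric genotype matrix
spec_out <- specc(snp_mat, centers = 6)                   # spectral clustering with the RBF kernel
ARI(spec_out@.Data, snp$origin)                           # agreement with the population origins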
Finally, dimension reduction via SVD and k-means on the projected points does a little better than hierarchical clustering.
SVD <- select(snp, -origin) %>% scale(TRUE, FALSE) %>% svd(nv = 10)
snp_proj <- SVD$u[, 1:10] %*% diag(SVD$d[1:10]) %>% data.frame()
snp_proj %>% ggplot(aes(x = X1, y = X2, color = snp$origin)) +
snp_proj geom_point(size=1.25) +
guides(colour = guide_legend(override.aes = list(size=6)))
ARI(kmeans(snp_proj, centers = 6, nstart = 100)$cluster, snp$origin)
## [1] 0.8364261
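Finally, as suggested above, the robustness of this SVD + k-means result can be assessed by resampling. A minimal sketch, which for simplicity reuses the 10-dimensional projection computed on the full table rather than recomputing the SVD on each subsample:
set.seed(42)                                                        # illustrative seed, for reproducibility
ARI_resampled <- replicate(20, {
  keep <- sample(nrow(snp_proj), floor(0.8 * nrow(snp_proj)))       # random 80% of the individuals
  ARI(kmeans(snp_proj[keep, ], centers = 6, nstart = 100)$cluster,  # cluster the subsample
      snp$origin[keep])                                             # compare to their origins
})
mean(ARI_resampled)                                                 # average ARI over the 20 subsamples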