Harmony in motion: visualize an iterative algorithm for aligning multiple datasets

2019-08-25

Harmony is a an algorithm for aligning multiple high-dimensional datasets, described by Ilya Korsunsky et al. in this paper. When analyzing multiple single-cell RNA-seq datasets, we often encounter the problem that each dataset is separated from the others in low dimensional space – even when we know that all of the datasets have similar cell types. To address this problem, the Harmony algorithm iteratively clusters and adjusts high dimensional datasets to integrate them on a single manifold. In this note, we will create animated visualizations to see how the algorithm works and develop an intuitive understanding.

🐎 Harmony in motion

Let’s make this animation:

Single-cell RNA-seq data reduced to two dimensions with PCA and UMAP. 3,500 peripheral blood mononuclear cells (PBMCs). Each cell is from one of the donors (A, B, and C). Six markers are shown for different cell types. Color indicates gene expression as log2(CPM+1) (CPM is counts per million). The animation starts with unadjusted PCs, and then cycles through 3 iterations of the Harmony algorithm applied to the PCs.

GeneCell type
CD3DT cells
CD8ACD8 T cells
GZMKNKT cells
CD20B cells
CD14Monocytes
CD16Natural killer cells, neutrophils, monocytes, and macrophages

Install packages#

These are the critical R packages I used to write this note:

devtools::install_github("jefworks/MUDAN")

devtools::install_github("immunogenomics/harmony")
devtools::install_github("immunogenomics/presto")

install.packages("gganimate")

Run Principal Component Analysis#

The MUDAN package comes pre-loaded with some single-cell RNA-seq data published by 10x Genomics. Let’s load it into our R session.

data("pbmcA")
dim(pbmcA)
#> [1] 13939  2896
data("pbmcB")
dim(pbmcB)
#> [1] 15325  7765
data("pbmcC")
dim(pbmcC)
#> [1] 16144  9493

Some of the cells with very high read counts or very low read counts can be difficult to cluster together with the other cells. For this reason, we exclude some of the outlier cells from each dataset.

exclude_outliers <- function(mat, low = 0.06, high = 0.94) {
  mat_sum <- colSums(mat)
  qs <- quantile(mat_sum, c(low, high))
  ix <- mat_sum > qs[1] & mat_sum < qs[2]
  return(mat[, ix])
}

pbmcA <- exclude_outliers(pbmcA)
pbmcB <- exclude_outliers(pbmcB)
pbmcC <- exclude_outliers(pbmcC)

# Genes and cells in each counts matrix.
dim(pbmcA)
#> [1] 13939  2548
dim(pbmcB)
#> [1] 15325  6831
dim(pbmcC)
#> [1] 16144  8351

pbmcA <- pbmcA[, 1:500] # take 500 cells
pbmcB <- pbmcB[, 1:2000] # take 2000 cells
pbmcC <- pbmcC[, 1:1000] # take 1000 cells

# Concatenate into one counts matrix.
genes.int <- intersect(rownames(pbmcA), rownames(pbmcB))
genes.int <- intersect(genes.int, rownames(pbmcC))
counts <- cbind(pbmcA[genes.int,], pbmcB[genes.int,], pbmcC[genes.int,])
rm(pbmcA, pbmcB, pbmcC)

# A dataframe that indicates which donor each cell belongs to.
meta <- str_split_fixed(colnames(counts), "_", 2)
meta <- as.data.frame(meta)
colnames(meta) <- c("donor", "cell")
head(meta)
#>   donor           cell
#> 1     A AAACATTGCACTAG
#> 2     A AAACATTGGCTAAC
#> 3     A AAACATTGTAACCG
#> 4     A AAACCGTGTGGTCA
#> 5     A AAACCGTGTTACCT
#> 6     A AAACGCACACGGGA

# Number of cells from each donor.
table(meta$donor)
#> 
#>    A    B    C 
#>  500 2000 1000

# How many genes (rows) and cells (columns)?
dim(counts)
#> [1] 13414  3500

# How many entries in the counts matrix are non-zero?
sparsity <- 100 * (1 - (length(counts@x) / Reduce("*", counts@Dim)))

The matrix of read counts is 94.9% sparse. That means that the vast majority of values in the matrix are zero.

Now, we can use functions provided by the MUDAN R package to:

  • normalize the counts matrix for number of reads per cell
  • normalize the variance for each gene
  • run principal component analysis (PCA)

# CPM normalization
counts_to_cpm <- function(A) {
  A@x <- A@x / rep.int(Matrix::colSums(A), diff(A@p))
  A@x <- 1e6 * A@x
  return(A)
}

cpm <- counts_to_cpm(counts)
log10cpm <- cpm
log10cpm@x <- log10(1 + log10cpm@x)

# variance normalize, identify overdispersed genes
cpm_info <- MUDAN::normalizeVariance(cpm, details = TRUE, verbose = FALSE) 

# 30 PCs on overdispersed genes
set.seed(42)
pcs <- getPcs(
  mat     = log10cpm[cpm_info$ods,],
  nGenes  = length(cpm_info$ods),
  nPcs    = 30,
  verbose = FALSE
)
pcs[1:5, 1:5]
#>                        PC1        PC2         PC3      PC4        PC5
#> A_AAACATTGCACTAG  1.767734 -14.418014   0.5740947 8.818121  2.7819395
#> A_AAACATTGGCTAAC  1.052417   6.005906 -10.2430982 4.035912 -1.5828038
#> A_AAACATTGTAACCG -2.031748   2.600434   1.5062893 5.392586  0.2284509
#> A_AAACCGTGTGGTCA -1.069945  -3.784165  -0.5634285 3.477583  4.9772192
#> A_AAACCGTGTTACCT  1.496703   5.978402 -11.1087743 3.943560 -1.5968947

dat_pca <- as.data.frame(
  str_split_fixed(rownames(pcs), "_", 2)
)
colnames(dat_pca) <- c("donor", "cell")
dat_pca <- cbind.data.frame(dat_pca, pcs)

p <- ggplot(dat_pca[sample(nrow(dat_pca)),]) +
  scale_size_manual(values = c(0.9, 0.3, 0.5)) +
  scale_color_manual(values = donor_colors) +
  theme(legend.position = "none")

p1 <- p + geom_point(aes(PC1, PC2, color = donor, size = donor))
p2 <- p + geom_point(aes(PC3, PC4, color = donor, size = donor))
p3 <- p + geom_point(aes(PC5, PC6, color = donor, size = donor))
p4 <- p + geom_point(aes(PC7, PC8, color = donor, size = donor))

this_title <- sprintf(
  "PCA with %s genes and %s cells from 3 PBMC samples, published by 10x Genomics",
  scales::comma(nrow(log10cpm[cpm_info$ods,])),
  scales::comma(nrow(dat_pca))
)

p1 + p2 + p3 + p4 + plot_layout(ncol = 4) + plot_annotation(title = this_title)

plot of chunk pca

Next, we can plot the density of cells along each PC, grouped by donor. Below, we can see that the distributions for each donor are shifted away from each other, especially along PC4.

plot of chunk pca-density

Use Harmony to adjust the PCs#

We can run Harmony to adjust the principal component (PC) coordinates from the two datasets. When we look at the adjusted PCs, we can see that the donors no longer clearly separate along any of the principal components.

plot of chunk harmonize

And the densitites:

plot of chunk pca-density-adjusted

Now we can proceed with our analysis using the adjusted principal component coordinates.

Group the cells into clusters#

To group cells into clusters, we can use the MUDAN R package again. First, we build a nearest neighbor network and then we can identify clusters (or communities) in the network of cells with the Louvain community detection algorithm or the Infomap algorithm.

# Joint clustering
set.seed(42)
com <- MUDAN::getComMembership(
  mat = harmonized,
  k = 30,
  # method = igraph::cluster_louvain
  method = igraph::cluster_infomap
)
#> [1] "finding approximate nearest neighbors ..."
#> [1] "calculating clustering ..."
#> [1] "graph modularity: 0.683895626832917"
#> [1] "identifying cluster membership ..."
#> com
#>    1    2    3    4    5    6    7    8    9 
#> 1086  613  563  421  328  183  141  126   39
dat_harmonized$cluster <- com

plot of chunk plot-mudan-clustering

Compute differential gene expression for each cluster#

To find genes that are highly expressed in each cluster, we can use the presto R package by Ilya Korsunsky. It will efficiently compute differential expression statistics:

gene_stats <- presto::wilcoxauc(log10cpm, com)

# Sort genes by the difference between:
# - percent of cells expressing the gene in the cluster
# - percent of cells expressing the gene outside the cluster
gene_stats %>%
  group_by(group) %>%
  top_n(n = 2, wt = pct_in - pct_out) %>%
  mutate_if(is.numeric, signif, 3) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
#> `mutate_if()` ignored the following grouping variables:
#> Column `group`
feature group avgExpr logFC statistic auc pval padj pct_in pct_out
LTB 1 3.06 1.30 1810000 0.689 0 0 92.9 53.40
LDHB 1 2.88 1.26 2020000 0.771 0 0 91.7 54.30
IL7R 2 1.72 1.17 1250000 0.706 0 0 59.2 19.40
LTB 2 3.37 1.45 1410000 0.797 0 0 97.4 58.90
FGFBP2 3 3.00 2.79 1530000 0.926 0 0 90.6 6.67
GZMH 3 3.18 2.96 1570000 0.947 0 0 93.4 7.35
FCN1 4 3.35 3.29 1290000 0.992 0 0 98.6 2.34
CST3 4 3.51 3.41 1290000 0.992 0 0 98.8 3.51
CD79B 5 2.70 2.55 952000 0.915 0 0 84.8 5.45
CD79A 5 2.95 2.92 987000 0.948 0 0 89.9 1.23
GZMK 6 2.88 2.62 555000 0.915 0 0 90.2 8.23
NKG7 6 3.02 1.88 435000 0.716 0 0 90.2 30.90
GNLY 7 3.98 3.48 456000 0.963 0 0 98.6 14.30
GZMB 7 3.11 2.79 439000 0.927 0 0 91.5 9.94
CCL5 8 3.68 2.48 368000 0.865 0 0 98.4 33.90
NKG7 8 3.40 2.24 334000 0.785 0 0 96.0 31.70
GNLY 9 3.99 3.39 128000 0.952 0 0 97.4 16.80
TYROBP 9 2.90 2.18 109000 0.810 0 0 89.7 21.10

Save each iteration of Harmony#

Let’s run Harmony and limit the maximum number of iterations to 0, 1, 2, 3, 4, and 5. That way, we can track how the PC coordinates change after each iteration.

harmony_iters <- c(0, 1, 2, 3, 4, 5)
res <- lapply(harmony_iters, function(i) {
  set.seed(42)
  HarmonyMatrix(
    theta            = 0.15,
    data_mat         = pcs,
    meta_data        = meta$donor,
    do_pca           = FALSE,
    verbose          = FALSE,
    max.iter.harmony = i
  )
})
h <- do.call(rbind, lapply(harmony_iters, function(i) {
  x       <- as.data.frame(res[[i + 1]])
  x$iter  <- i
  y       <- str_split_fixed(rownames(x), "_", 2)
  x$donor <- y[,1]
  x$cell  <- y[,2]
  x
}))
h[1:6, c("iter", "donor", "PC1", "PC2", "PC3")]
#>                  iter donor       PC1        PC2         PC3
#> A_AAACATTGCACTAG    0     A  1.767734 -14.418014   0.5740947
#> A_AAACATTGGCTAAC    0     A  1.052417   6.005906 -10.2430982
#> A_AAACATTGTAACCG    0     A -2.031748   2.600434   1.5062893
#> A_AAACCGTGTGGTCA    0     A -1.069945  -3.784165  -0.5634285
#> A_AAACCGTGTTACCT    0     A  1.496703   5.978402 -11.1087743
#> A_AAACGCACACGGGA    0     A -2.889615   3.397147   2.3764580

For each run of Harmony, we will reduce 30 adjusted PCs to 2 dimensions with the UMAP algorithm implemented in the umap R package by Tomasz Konopka.

Since UMAP is a stochastic algorithm, the final layout of the cells on the canvas can be significantly different from one run to the next. We want to avoid these large stochastic changes, because our goal is to create a fluid animation with cells slowly drifting toward their final positions. To do this, we can pass the coordinates from the n-th iteration as the initial positions for UMAP in the n+1-th iteration.

# Settings for UMAP.
umap.settings <- umap.defaults
umap.settings$min_dist <- 0.8

# List of results.
res.umap <- list()

umap.seed <- 43
set.seed(umap.seed)

# Run UMAP on the first iteration of Harmony's adusted PCs.
res.umap[[1]] <- umap(d = res[[1]], config = umap.settings)

for (i in 2:length(res)) {
  print(i)
  # Initialize UMAP with the coordinates from the previous Harmony iteration.
  umap.settings$init <- res.umap[[i - 1]]$layout
  set.seed(umap.seed)
  res.umap[[i]] <- umap(d = res[[i]], config = umap.settings)
}
#> [1] 2
#> [1] 3
#> [1] 4
#> [1] 5
#> [1] 6

d <- do.call(rbind, lapply(seq_along(res.umap), function(i) {
  d       <- as.data.frame(res.umap[[i]]$layout)
  colnames(d) <- c("UMAP1", "UMAP2")
  d       <- cbind.data.frame(d, res[[i]])
  d$id    <- colnames(counts)
  y       <- str_split_fixed(d$id, "_", 2)
  d$donor <- y[,1]
  d$cell  <- y[,2]
  d$iter  <- i - 1
  return(d)
})) %>% group_by(iter) %>% arrange(cell)

# Add a column with the cluster membership for each cell.
d$cluster <- com[as.character(d$id)]
head(d)
#> # A tibble: 6 x 37
#> # Groups:   iter [6]
#>    UMAP1 UMAP2   PC1   PC2   PC3    PC4   PC5    PC6   PC7   PC8   PC9
#>    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>
#> 1 -0.517 11.9   20.1  1.44  2.82 -0.246 1.14   1.45   1.16  2.02 0.452
#> 2 -4.44  11.7   19.3  1.81  2.84  1.75  0.175 -0.448 -2.69  1.03 1.24 
#> 3 -0.395  9.48  19.2  1.84  2.90  1.80  0.160 -0.418 -2.80  1.07 1.29 
#> 4  0.562  9.45  19.2  1.84  2.90  1.80  0.160 -0.418 -2.80  1.07 1.29 
#> 5  3.30  11.8   19.2  1.84  2.90  1.80  0.160 -0.418 -2.80  1.07 1.29 
#> 6  6.15  13.4   19.2  1.84  2.90  1.80  0.160 -0.418 -2.80  1.07 1.29 
#> # … with 26 more variables: PC10 <dbl>, PC11 <dbl>, PC12 <dbl>,
#> #   PC13 <dbl>, PC14 <dbl>, PC15 <dbl>, PC16 <dbl>, PC17 <dbl>,
#> #   PC18 <dbl>, PC19 <dbl>, PC20 <dbl>, PC21 <dbl>, PC22 <dbl>,
#> #   PC23 <dbl>, PC24 <dbl>, PC25 <dbl>, PC26 <dbl>, PC27 <dbl>,
#> #   PC28 <dbl>, PC29 <dbl>, PC30 <dbl>, id <chr>, donor <chr>, cell <chr>,
#> #   iter <dbl>, cluster <fct>

# Set a random order for the cells to avoid overplotting issues.
set.seed(42)
random <- sample(table(d$iter)[1])
d <- d %>% group_by(iter) %>% mutate(random = random) %>% arrange(iter, random)

Harmony iterations colored by donor:

plot of chunk plot-harmony-iterations

Harmony iterations colored by cluster:

plot of chunk plot_cluster

Below, we can see how the mean PC values change with each iteration of the Harmony algorithm. Each line represents the mean of cells that belong to one donor and one cluster of cells.

All cells are not adjusted by the same amount. Instead, each cell gets slightly different adjustments. Let’s look at one example by focusing on clusters 3 and 4 (the columns in the grid below).

Clusters 3 and 4 are each adjusted by different amounts: - PC4 is adjusted more for cluster 3 than cluster 4. - PC7 is adjusted more for cluster 4 than cluster 3.

plot of chunk harmony-iterations-lines

Use gganimate to transition between each iteration#

To create an animation, we can use the gganimate package by Thomas Lin Pedersen to create smooth transitions between each iteration of Harmony:

animate_donor <- function(
  filename, x, y,
  nframes = 15, width = 250, height = 250 / 1.41, res = 150,
  iters = 0:5, hide_legend = FALSE
) {
  this_title <- "Harmony Iteration {closest_state}"
  # p <- plot_donor(subset(d, iter %in% iters), {{x}}, {{y}})
  x <- parse_expr(x)
  y <- parse_expr(y)
  p <- plot_donor(subset(d, iter %in% iters), !!x, !!y)
  if (hide_legend) {
    p <- p + theme(legend.position = "none")
  } else {
    p <- p + ggtitle(this_title)
  }
  p <- p + transition_states(iter, transition_length = 4, state_length = 1) +
    ease_aes("cubic-in-out")
  animation <- animate(
    plot = p,
    nframes = nframes, width = width, height = height, res = res
  )
  dir.create("animation", showWarnings = FALSE)
  print(filename)
  anim_save(filename, animation)
}

nframes <- 200
iters <- 0:3
animate_donor(
  filename = "static/notes/harmony-animation/donor-umap.gif",
  nframes = nframes, iters = iters,
  x = "UMAP1", y = "UMAP2", width = 800, height = 900 / 1.41
)

We can also look at PCs. For example, take a closer look at cells moving along PC1. You might notice that the cells on the right side of the panel move more than the cells on the left side of the panel. This tells us all cells do not get an identical adjustment along PC1. Instead, each cell’s PC1 value is adjusted differently than its neighbors.

PC1 and PC2
PC4 and PC6
PC4 and PC6

By coloring points with gene expression values, we can focus on specific genes:

CD3D
CD8A
GZMK
CD79A
CD19
CD14
FCGR3A
MS4A1

Use HTML and CSS to arrange the GIFs#

Finally, we can use HTML and CSS to arrange the GIF files next to each other on an HTML page.

<style>
.column {
  max-width: 17%;
}
.column2 {
  max-width: 60.5%;
}
</style>

<div> 
  <div class="column2">
    <img src="donor.gif">
  </div>
  <div class="column">
    <img src="CD3D.gif">
    <img src="CD8A.gif">
    <img src="GZMK.gif">
  </div>  
  <div class="column">
    <img src="MS4A1.gif">
    <img src="CD14.gif">
    <img src="FCGR3A.gif">
  </div>  
</div>

See several examples of different arrangements at this link. (Right-click and view source to see the HTML and CSS.)

Once the GIF files are arranged nicely on the HTML page, we can use LICEcap to capture what appears on screen into a GIF file. That way, a single GIF file contains multiple views of the data. Lastly, we can edit the speed, quality, and other features of the captured GIF file after uploading it to ezgif.com.

Read the paper#

To learn more about Harmony, please check out the paper. It has many examples showing how to align multiple different data sets.

See the original data#

In this post, we used the data included in the MUDAN package. The original source of the data is 10x Genomics.

Get the original data published by 10x Genomics:

Edit this page on GitHub