Quickly aggregate your data in R with data.table

2015-01-28

In genomics data, we often have multiple measurements for each gene. Sometimes we want to aggregate those measurements with the mean, median, or sum. The data.table R package can do this quickly with large datasets.

In this note, we compute the average of multiple measurements for each gene in a gene expression matrix.

To see what else you can do with data.table, check out these fantastic cheat sheets:

Summary#

Make random data
Aggregate quickly with data.table
Aggregate slowly with stats::aggregate()

Step 1. Make random data#

random_string <- function(n, chars) {
  replicate(n = n, expr = {
    paste(sample(LETTERS, chars, replace = TRUE), collapse = "")
  })
}

set.seed(42)

# Each gene is represented by 1 or more probes.
n_probes <- 1e5
gene_names <- random_string(n_probes, 3)
sort(table(gene_names), decreasing = TRUE)[1:10]
#> gene_names
#> BWU DIH QLP WAX XVA ZPT FBE HMO LZS NTD 
#>  17  16  16  16  16  16  15  15  15  15

library(data.table)
#> 
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#> 
#>     between, first, last
#> The following object is masked from 'package:purrr':
#> 
#>     transpose
d <- data.table(
  Gene = gene_names,
  Probe = seq_along(gene_names)
)

# We measured genes in a number of samples.
n_samples <- 100
for (i in seq(n_samples)) {
  d[[sprintf("S%s", i)]] <- rnorm(nrow(d))
}

# Now we have a gene expression matrix.
# Notice that gene "AAB" is represented by multiple probes.
d <- d[order(d$Gene)]
d[1:5,1:5]
#>    Gene Probe         S1          S2          S3
#> 1:  AAA 17339  0.6886677 -0.59704276  0.73534953
#> 2:  AAA 19529  0.2430747  0.48148142 -0.07411017
#> 3:  AAA 19915  0.4096048 -0.02989425 -0.91970815
#> 4:  AAA 34545 -0.4604622 -2.02437078 -0.09687003
#> 5:  AAA 81092  0.9990284 -1.22508517  1.32621912

Step 2. Aggregate quickly with data.table#

Now we can easily average the probes for each gene.

system.time({
    d_mean <- d[, lapply(.SD, mean), by = Gene, .SDcols = sprintf("S%s", 1:100)]
})
#>    user  system elapsed 
#>   0.051   0.000   0.051
d_mean[1:5,1:5]
#>    Gene          S1         S2          S3          S4
#> 1:  AAA  0.33549805 -0.9506345  0.04691119 -0.08173554
#> 2:  AAC -0.46102761 -0.2623302 -0.03353125  0.20238075
#> 3:  AAD  0.02861623  0.6479871  0.42838896 -0.28628726
#> 4:  AAE  0.04544300 -0.3214495  0.33347142  0.11855635
#> 5:  AAF -0.25445109 -0.5118745 -0.24236236  0.03974918

Step 3. Aggregate slowly with stats::aggregate()#

The base R function stats::aggregate() can do the same thing, but it is much slower.

dat <- data.frame(d)
system.time({
  d_mean2 <- aggregate(dat[, 3:102], by = list(dat$Gene), mean)
})
#>    user  system elapsed 
#>   9.529   0.060   9.596

The results are identical:

colnames(d_mean2)[1] <- "Gene"
all.equal(d_mean, data.table(d_mean2))
#> [1] TRUE

Feel free to read the source code for this post.