Quickly aggregate your data in R with data.table
2015-01-28
In genomics data, we often have multiple measurements for each gene. Sometimes we want to aggregate those measurements with the mean, median, or sum. The data.table R package can do this quickly with large datasets.
In this note, we compute the average of multiple measurements for each gene in a gene expression matrix.
To see what else you can do with data.table
, check out these
fantastic cheat sheets:
Summary#
- Make random data
- Aggregate quickly with
data.table
- Aggregate slowly with
stats::aggregate()
Step 1. Make random data#
random_string <- function(n, chars) {
replicate(n = n, expr = {
paste(sample(LETTERS, chars, replace = TRUE), collapse = "")
})
}
set.seed(42)
# Each gene is represented by 1 or more probes.
n_probes <- 1e5
gene_names <- random_string(n_probes, 3)
sort(table(gene_names), decreasing = TRUE)[1:10]
#> gene_names
#> BWU DIH QLP WAX XVA ZPT FBE HMO LZS NTD
#> 17 16 16 16 16 16 15 15 15 15
library(data.table)
#>
#> Attaching package: 'data.table'
#> The following objects are masked from 'package:dplyr':
#>
#> between, first, last
#> The following object is masked from 'package:purrr':
#>
#> transpose
d <- data.table(
Gene = gene_names,
Probe = seq_along(gene_names)
)
# We measured genes in a number of samples.
n_samples <- 100
for (i in seq(n_samples)) {
d[[sprintf("S%s", i)]] <- rnorm(nrow(d))
}
# Now we have a gene expression matrix.
# Notice that gene "AAB" is represented by multiple probes.
d <- d[order(d$Gene)]
d[1:5,1:5]
#> Gene Probe S1 S2 S3
#> 1: AAA 17339 0.6886677 -0.59704276 0.73534953
#> 2: AAA 19529 0.2430747 0.48148142 -0.07411017
#> 3: AAA 19915 0.4096048 -0.02989425 -0.91970815
#> 4: AAA 34545 -0.4604622 -2.02437078 -0.09687003
#> 5: AAA 81092 0.9990284 -1.22508517 1.32621912
Step 2. Aggregate quickly with data.table#
Now we can easily average the probes for each gene.
system.time({
d_mean <- d[, lapply(.SD, mean), by = Gene, .SDcols = sprintf("S%s", 1:100)]
})
#> user system elapsed
#> 0.051 0.000 0.051
d_mean[1:5,1:5]
#> Gene S1 S2 S3 S4
#> 1: AAA 0.33549805 -0.9506345 0.04691119 -0.08173554
#> 2: AAC -0.46102761 -0.2623302 -0.03353125 0.20238075
#> 3: AAD 0.02861623 0.6479871 0.42838896 -0.28628726
#> 4: AAE 0.04544300 -0.3214495 0.33347142 0.11855635
#> 5: AAF -0.25445109 -0.5118745 -0.24236236 0.03974918
Step 3. Aggregate slowly with stats::aggregate()#
The base R function stats::aggregate()
can do the same thing, but it is
much slower.
dat <- data.frame(d)
system.time({
d_mean2 <- aggregate(dat[, 3:102], by = list(dat$Gene), mean)
})
#> user system elapsed
#> 9.529 0.060 9.596
The results are identical:
colnames(d_mean2)[1] <- "Gene"
all.equal(d_mean, data.table(d_mean2))
#> [1] TRUE
Feel free to read the source code for this post.