# Quickly aggregate your data with data.table in R

I wrote a function using data.table to replace the default aggregate function in R. It runs about 100 times faster on my data (0.32 seconds instead of 33 seconds). I use it for microarray gene expression data, where I compute the mean expression values for genes that are represented by more than one probe on the microarray.

Below, `rmaExp`

is a matrix of expression values, where each row is a probe on
the array and each column is an experiment. I have 36 experiments and 53,617
probes.

Many genes are represented by more than one probe, so I want to collapse those rows by taking the mean of the probe values.

This is a (slow) way to do it with base R functions:

```
dim(rmaExp)
# [1] 53617 36
length(entrezids)
# [1] 53617
system.time({
dat <- data.frame(rmaExp)
dat$entrezid <- entrezids
dat <- aggregate(dat[ , 1:36], by = list(dat$entrezid), mean, na.rm = TRUE)
rownames(dat) <- dat$Group.1
dat <- dat[ , 2:37]
})
# user system elapsed
# 32.941 0.118 33.058
```

Here’s a function to get the same result, this time using the data.table and reshape2 packages:

(You must load `reshape2`

in order to use `melt`

on a `data.table`

.)

```
#' Take the mean of all columns of a matrix or dataframe, where rows are
#' aggregated by a vector of values. 100 times faster than stats::aggregate.
#'
#' @param dat A numeric matrix or data.frame.
#' @param xs A vector of groups (e.g. gene names).
#' @return A data.table with the aggregated mean for each group.
#' @seealso stats::aggregate
mean_by <- function(dat, xs) {
# Convert to data.table.
dat <- data.table(dat)
# Append the vector of group names as an extra column.
dat$agg_var <- xs
# Melt the data.table so all values are in one column called "value".
dat <- melt(dat, id.vars = "agg_var")
# Cast the data.table back into the original shape, and take the mean.
dat <- dcast.data.table(
dat, agg_var ~ variable, value.var = "value",
fun.aggregate = mean, na.rm = TRUE
)
rownames(dat) <- dat$agg_var
# Delete the extra column.
dat[ , agg_var := NULL]
dat
}
```

On my data, it is about 100 times faster, saving me about 30 seconds waiting for my results. It might not seem like much, but if you run a function many times each day, or if you interactively explore your data, then it makes a big difference.

```
system.time({ dat2 = mean_by(rmaExp, entrezids) })
# user system elapsed
# 0.320 0.092 0.411
```

The results are identical:

```
all(as.matrix(dat1) == as.matrix(dat2))
# [1] TRUE
```