Working with a sparse matrix in R

2020-03-11

Sparse matrices are necessary for dealing with large single-cell RNA-seq datasets. They require less memory than dense matrices, and they allow some computations to be more efficient. In this note, we'll discuss the internals of the dgCMatrix class with examples.

Install and load libraries#

Let's get started by installing and loading the Matrix package, which provides the sparse matrix classes that we use in this note.

install.packages("Matrix")
library(Matrix)

Below, we'll explore two Matrix formats and their corresponding classes:

  • The triplet format in class dgTMatrix
  • The compressed column format in class dgCMatrix

The triplet format in dgTMatrix#

dgTMatrix is a class from the Matrix R package that implements:

general, numeric, sparse matrices in (a possibly redundant) triplet format

The format is easy to understand:

  • Assume all unspecified entries in the matrix are equal to zero.
  • Define the non-zero entries in triplet form (i, j, x) where:
    • i is the row number
    • j is the column number
    • x is the value

That's all there is to it. Let's make one:

m <- Matrix(nrow = 3, ncol = 6, data = 0, sparse = TRUE)
m <- as(m, "dgTMatrix") # by default, Matrix() returns dgCMatrix
m[1,2] <- 10
m[1,3] <- 20
m[3,4] <- 30
m
## 3 x 6 sparse Matrix of class "dgTMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .

And let's see what is inside:

str(m)
## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3] 0 0 2
##   ..@ j       : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:3] 10 20 30
##   ..@ factors : list()

The object has slots i, j, and x.

We can reconstruct the above sparse matrix like this:

d <- data.frame(
  i = m@i + 1,  # m@i is 0-based, not 1-based like everything else in R
  j = m@j + 1,  # m@j is 0-based, not 1-based like everything else in R
  x = m@x
)
d
##   i j  x
## 1 1 2 10
## 2 1 3 20
## 3 3 4 30
sparseMatrix(i = d$i, j = d$j, x = d$x, dims = c(3, 6))
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .

We can convert a sparse matrix to a data frame like this:

as.data.frame(summary(m))
##   i j  x
## 1 1 2 10
## 2 1 3 20
## 3 3 4 30

Since m@x gives us access to the data values, we can easily transform the values with log2():

m@x <- log2(m@x + 1)

Matrix Market files use the triplet format#

Matrix Market files often end with the file extension .mtx.

Write a Matrix Market file:

writeMM(m, "matrix.mtx")
## NULL

Dump the contents of the file:

readLines("matrix.mtx")
## [1] "%%MatrixMarket matrix coordinate real general"
## [2] "3 6 3"                                        
## [3] "1 2 3.4594316186372973"                       
## [4] "1 3 4.392317422778761"                        
## [5] "3 4 4.954196310386875"

This is the Matrix Market file format:

  • The first line is a comment (starts with %%).
  • The next line says there are 3 rows, 6 columns, and 3 non-zero values.
  • The next 3 lines describe the values in triplet format (i, j, x).

Read a Matrix Market file:

m <- readMM("matrix.mtx")
m
## 3 x 6 sparse Matrix of class "dgTMatrix"
##                                      
## [1,] . 3.459432 4.392317 .        . .
## [2,] . .        .        .        . .
## [3,] . .        .        4.954196 . .

Convert from dgTMatrix to dgCMatrix with:

as(m, "dgCMatrix")
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                                      
## [1,] . 3.459432 4.392317 .        . .
## [2,] . .        .        .        . .
## [3,] . .        .        4.954196 . .

The compressed column format in dgCMatrix#

dgCMatrix is a class from the Matrix R package that implements:

general, numeric, sparse matrices in the (sorted) compressed sparse column format

This is the most common type of matrix that we will encounter when we are dealing with scRNA-seq data.

Let's make a sparse matrix in the dgCMatrix format:

library(Matrix)
m <- Matrix(nrow = 3, ncol = 6, data = 0, sparse = TRUE)
m[1,2] <- 10
m[1,3] <- 20
m[3,4] <- 30
m
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .

Let's look inside:

str(m)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3] 0 0 2
##   ..@ p       : int [1:7] 0 0 1 2 3 3 3
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:3] 10 20 30
##   ..@ factors : list()

The object has 6 slots, including Dim, i, x, and p.

Dim has dimensions of the matrix (3 rows, 6 columns):

m@Dim
## [1] 3 6

x has data values sorted column-wise (top to bottom, left to right):

m@x
## [1] 10 20 30

i has row indices for each data value. Note: i is 0-based, not 1-based like everything else in R.

m@i
## [1] 0 0 2

What about p? Unlike j, p does not tell us which column each data value us in.

m@p
## [1] 0 0 1 2 3 3 3

p has the cumulative number of data values as we move from one column to the next column, left to right. The first value is always 0, and the length of p is one more than the number of columns.

We can compute p for any matrix:

c(0, cumsum(colSums(m != 0)))
## [1] 0 0 1 2 3 3 3

Since p is a cumulative sum, we can use diff() to get the number of non-zero entries in each column:

diff(m@p)
## [1] 0 1 1 1 0 0
colSums(m != 0)
## [1] 0 1 1 1 0 0

The length of p is one more than the number of columns:

length(m@p)
## [1] 7
m@Dim[2] + 1
## [1] 7

Given p, we can compute j:

rep(1:m@Dim[2], diff(m@p))
## [1] 2 3 4

Most of the time, it's easier to use summary() to convert a sparse matrix to triplet (i, j, x) format.

summary(m)$j
## [1] 2 3 4

One more example might help to clarify how i, x, and p change as we modify the matrix:

# Add more values to the matrix
m[2,2] <- 50
m[2,3] <- 50
m[2,4] <- 50
m
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] . 50 50 50 . .
## [3,] .  .  . 30 . .
str(m)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:6] 0 1 0 1 1 2
##   ..@ p       : int [1:7] 0 0 2 4 6 6 6
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:6] 10 50 20 50 50 30
##   ..@ factors : list()

We know that p[1] is always 0.

Column 1 has 0 values, so p[2] is 0.

Column 2 has 2 values, so p[3] is 0 + 2 = 2.

Column 3 has 2 values, so p[4] is 2 + 2 = 4.

Column 4 has 2 values, so p[5] is 4 + 2 = 6.

Columns 5 and 6 have 0 values, so p[6] and p[7] are 6 + 0 = 6.

Sparse matrices use less memory than dense matrices#

# A large matrix
set.seed(1)
m <- sparseMatrix(
  i = sample(x = 1e4, size = 1e4),
  j = sample(x = 1e4, size = 1e4),
  x = rnorm(n = 1e4)
)
pryr::object_size(m)
## 162 kB
pryr::object_size(as.matrix(m)) # Dense matrices require much more memory (RAM)
## 800 MB

Compute the sparsity of the matrix:

sparsity <- length(m@x) / m@Dim[1] / m@Dim[2]
sparsity
## [1] 1e-04

When writing Matrix Market files, remember to use gzip compression to save disk space.

writeMM(m, "matrix.mtx")

The uncompressed file size:

bytes_uncompressed <- file.size("matrix.mtx")
scales::number_bytes(bytes_uncompressed)
## [1] "281 KiB"

Compressing the file can save 50% of the disk space:

system("gzip --keep matrix.mtx")
file.size("matrix.mtx.gz") / bytes_uncompressed
## [1] 0.4883699

It takes about the same amount of time to read uncompressed or compressed Matrix Market files:

bench::mark(
  m <- readMM("matrix.mtx"),
  m <- readMM("matrix.mtx.gz")
)
## # A tibble: 2 x 6
##   expression                        min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 m <- readMM("matrix.mtx")      5.85ms   6.88ms      147.     626KB        0
## 2 m <- readMM("matrix.mtx.gz")    6.8ms   7.32ms      131.     626KB        0

writeMMgz#

Since the writeMM() function does not accept a connection object, this does not work:

writeMM(m, gzfile("matrix.mtx.gz")) ## This does not work :(

Instead, we can write our own function:

#' @param x A sparse matrix from the Matrix package.
#' @param file A filename that ends in ".gz".
writeMMgz <- function(x, file) {
  mtype <- "real"
  if (is(x, "ngCMatrix")) {
    mtype <- "integer"
  }
  writeLines(
    c(
      sprintf("%%%%MatrixMarket matrix coordinate %s general", mtype),
      sprintf("%s %s %s", x@Dim[1], x@Dim[2], length(x@x))
    ),
    gzfile(file)
  )
  data.table::fwrite(
    x = summary(x),
    file = file,
    append = TRUE,
    sep = " ",
    row.names = FALSE,
    col.names = FALSE
  )
}

Confirm that it works:

writeMMgz(m, "matrix2.mtx.gz")

all.equal(readMM("matrix.mtx.gz"), readMM("matrix2.mtx.gz"))
## [1] TRUE

Some operations on sparse matrices are fast#

Let's make a dense copy of the 10,000 by 10,000 sparse matrix.

d <- as.matrix(m)

Recall that only 10,000 (0.01%) of the entries in this matrices are non-zero.

Many operations are much faster on sparse matrices:

bench::mark(
  colSums(m),
  colSums(d)
)
## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 colSums(m)  312.4µs  398.5µs    2515.      261KB        0
## 2 colSums(d)   90.9ms   91.9ms      10.9    78.2KB        0
bench::mark(
  rowSums(m),
  rowSums(d)
)
## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 rowSums(m)    322µs    421µs   2310.     234.6KB     2.02
## 2 rowSums(d)    161ms    164ms      6.12    78.2KB     0

Suppose we want to collapse columns by summing groups of columns according to another variable.

set.seed(1)
y <- sample(1:10, size = ncol(m), replace = TRUE)
table(y)
## y
##    1    2    3    4    5    6    7    8    9   10 
##  980  937  972 1018  974  979 1072 1023 1015 1030

Let's turn the variable into a model matrix:

ymat <- model.matrix(~ 0 + factor(y))
colnames(ymat) <- 1:10
head(ymat)
##   1 2 3 4 5 6 7 8 9 10
## 1 0 0 0 0 0 0 0 0 1  0
## 2 0 0 0 1 0 0 0 0 0  0
## 3 0 0 0 0 0 0 1 0 0  0
## 4 1 0 0 0 0 0 0 0 0  0
## 5 0 1 0 0 0 0 0 0 0  0
## 6 0 0 0 0 0 0 1 0 0  0
colSums(ymat)
##    1    2    3    4    5    6    7    8    9   10 
##  980  937  972 1018  974  979 1072 1023 1015 1030

And now we can collapse the columns that belong to each group:

x1 <- m %*% ymat
x2 <- d %*% ymat
all.equal(as.matrix(x1), x2)
## [1] TRUE
all.equal(x1[,1], rowSums(m[,y == 1]))
## [1] TRUE
all.equal(x1[,2], rowSums(m[,y == 2]))
## [1] TRUE
dim(x1)
## [1] 10000    10
head(x1)
## 6 x 10 Matrix of class "dgeMatrix"
##      1 2 3 4 5 6           7           8          9        10
## [1,] 0 0 0 0 0 0  0.00000000  0.00000000 -0.5578692 0.0000000
## [2,] 0 0 0 0 0 0  0.74277916  0.00000000  0.0000000 0.0000000
## [3,] 0 0 0 0 0 0  0.00000000  0.00000000  0.0000000 1.5986887
## [4,] 0 0 0 0 0 0  0.00000000  0.00000000  0.0000000 0.8402201
## [5,] 0 0 0 0 0 0  0.00000000 -0.09295838  0.0000000 0.0000000
## [6,] 0 0 0 0 0 0 -0.05341102  0.00000000  0.0000000 0.0000000

On my machine, this operation on this data is 100 times faster with a sparse matrix than with a dense matrix.

bench::mark(
  m %*% ymat,
  d %*% ymat,
  check = FALSE
)
## # A tibble: 2 x 6
##   expression      min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 m %*% ymat  568.4µs    851µs    972.      1.75MB     4.13
## 2 d %*% ymat   96.7ms    105ms      9.67   781.3KB     0

R packages for working with sparse matrices#

You might consider trying these packages for working with sparse matrices in R:

Learn more#

Find more details about additional matrix formats in this vignettes from the Matrix R package.

And learn more about faster computations with sparse matrices in this vignette.