Working with a sparse matrix in R

2020-03-11

Sparse matrices are necessary for dealing with large single-cell RNA-seq datasets. They require less memory than dense matrices, and they allow some computations to be more efficient. In this note, we'll discuss the internals of the dgCMatrix class with examples.

Install and load libraries#

Let's get started by installing and loading the Matrix package, which provides the sparse matrix classes that we use in this note.

install.packages("Matrix")
library(Matrix)

Below, we'll explore two Matrix formats and their corresponding classes:

  • The triplet format in class dgTMatrix
  • The compressed column format in class dgCMatrix

The triplet format in dgTMatrix#

dgTMatrix is a class from the Matrix R package that implements:

general, numeric, sparse matrices in (a possibly redundant) triplet format

The format is easy to understand:

  • Assume all unspecified entries in the matrix are equal to zero.
  • Define the non-zero entries in triplet form (i, j, x) where:
    • i is the row number
    • j is the column number
    • x is the value

That's all there is to it. Let's make one:

m <- Matrix(nrow = 3, ncol = 6, data = 0, sparse = TRUE)
m <- as(m, "dgTMatrix") # by default, Matrix() returns dgCMatrix
m[1,2] <- 10
m[1,3] <- 20
m[3,4] <- 30
m
## 3 x 6 sparse Matrix of class "dgTMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .

And let's see what is inside:

str(m)
## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3] 0 0 2
##   ..@ j       : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:3] 10 20 30
##   ..@ factors : list()

The object has slots i, j, and x.

We can reconstruct the above sparse matrix like this:

d <- data.frame(
  i = m@i + 1,  # m@i is 0-based, not 1-based like everything else in R
  j = m@j + 1,  # m@j is 0-based, not 1-based like everything else in R
  x = m@x
)
d
##   i j  x
## 1 1 2 10
## 2 1 3 20
## 3 3 4 30
sparseMatrix(i = d$i, j = d$j, x = d$x, dims = c(3, 6))
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .

We can convert a sparse matrix to a data frame like this:

as.data.frame(summary(m))
##   i j  x
## 1 1 2 10
## 2 1 3 20
## 3 3 4 30

Since m@x gives us access to the data values, we can easily transform the values with log2():

m@x <- log2(m@x + 1)

Matrix Market files use the triplet format#

Many sparse Matrix Market files end with the file extension .mtx.

Write a Matrix Market file:

writeMM(m, "matrix.mtx")
## NULL
readLines("matrix.mtx")
## [1] "%%MatrixMarket matrix coordinate real general"
## [2] "3 6 3"                                        
## [3] "1 2 3.4594316186372973"                       
## [4] "1 3 4.392317422778761"                        
## [5] "3 4 4.954196310386875"

Matrix Market file format:

  • The first line is a comment (starts with %%).
  • The next line says there are 3 rows, 6 columns, and 3 non-zero values.
  • The next 3 lines describe the values in triplet format (i, j, x).

Read a Matrix Market file:

m <- readMM("matrix.mtx")
m
## 3 x 6 sparse Matrix of class "dgTMatrix"
##                                      
## [1,] . 3.459432 4.392317 .        . .
## [2,] . .        .        .        . .
## [3,] . .        .        4.954196 . .

Convert from dgTMatrix to dgCMatrix with:

as(m, "dgCMatrix")
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                                      
## [1,] . 3.459432 4.392317 .        . .
## [2,] . .        .        .        . .
## [3,] . .        .        4.954196 . .

The compressed column format in dgCMatrix#

dgCMatrix is a class from the Matrix R package that implements:

general, numeric, sparse matrices in the (sorted) compressed sparse column format

This is the most common type of matrix that we will encounter when we are dealing with scRNA-seq data.

Let's make a sparse matrix in the dgCMatrix format:

library(Matrix)
m <- Matrix(nrow = 3, ncol = 6, data = 0, sparse = TRUE)
m[1,2] <- 10
m[1,3] <- 20
m[3,4] <- 30
m
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .

Let's look inside:

str(m)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3] 0 0 2
##   ..@ p       : int [1:7] 0 0 1 2 3 3 3
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:3] 10 20 30
##   ..@ factors : list()

The object has 6 slots, including Dim, i, x, and p.

Dim has dimensions of the matrix (3 rows, 6 columns):

m@Dim
## [1] 3 6

x has data values sorted column-wise (top to bottom, left to right):

m@x
## [1] 10 20 30

i has row indices for each data value. Note: i is 0-based, not 1-based like everything else in R.

m@i
## [1] 0 0 2

What about p? It does not tell us which column each data value us in.

m@p
## [1] 0 0 1 2 3 3 3

p has the cumulative number of data values as we move from one column to the next column, left to right. The first value is always 0, and the length of p is one more than the number of columns.

We can compute p for any matrix:

c(0, cumsum(colSums(m != 0)))
## [1] 0 0 1 2 3 3 3

Since p is a cumulative sum, we can use diff() to get the number of non-zero entries in each column:

diff(m@p)
## [1] 0 1 1 1 0 0
colSums(m != 0)
## [1] 0 1 1 1 0 0

The length of p is one more than the number of columns:

length(m@p)
## [1] 7
m@Dim[2] + 1
## [1] 7

Given p, we can compute j:

rep(1:m@Dim[2], diff(m@p))
## [1] 2 3 4

Most of the time, it's easier to use summary() to convert a sparse matrix to triplet (i, j, x) format.

summary(m)$j
## [1] 2 3 4

One more example might help to clarify how i, x, and p change as we modify the matrix:

# Add more values to the matrix
m[2,2] <- 50
m[2,3] <- 50
m[2,4] <- 50
m
## 3 x 6 sparse Matrix of class "dgCMatrix"
##                    
## [1,] . 10 20  . . .
## [2,] . 50 50 50 . .
## [3,] .  .  . 30 . .
str(m)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:6] 0 1 0 1 1 2
##   ..@ p       : int [1:7] 0 0 2 4 6 6 6
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$ : NULL
##   .. ..$ : NULL
##   ..@ x       : num [1:6] 10 50 20 50 50 30
##   ..@ factors : list()

We know that p[1] is always 0.

Column 1 has 0 values, so p[2] is 0.

Column 2 has 2 values, so p[3] is 0 + 2 = 2.

Column 3 has 2 values, so p[4] is 2 + 2 = 4.

Column 4 has 2 values, so p[5] is 4 + 2 = 6.

Columns 5 and 6 have 0 values, so p[6] and p[7] are 6 + 0 = 6.

Sparse matrices use less memory than dense matrices#

# A giant matrix
set.seed(1)
m <- sparseMatrix(
  i = sample(x = 1e4, size = 1e4),
  j = sample(x = 1e4, size = 1e4),
  x = rnorm(n = 1e4)
)
pryr::object_size(m)
## 162 kB
pryr::object_size(as.matrix(m)) # Dense matrices require much more memory (RAM)
## 800 MB

Compute the sparsity of the matrix:

sparsity <- length(m@x) / m@Dim[1] / m@Dim[2]
sparsity
## [1] 1e-04

When writing Matrix Market files, remember to use gzip compression to save disk space.

writeMM(m, "matrix.mtx")

The uncompressed file size:

bytes_uncompressed <- file.size("matrix.mtx")
scales::number_bytes(bytes_uncompressed)
## [1] "281 KiB"

Compressing the file can save 50% of the disk space:

system("gzip --keep matrix.mtx")
file.size("matrix.mtx.gz") / bytes_uncompressed
## [1] 0.4883699

It takes about the same amount of time to read uncompressed or compressed Matrix Market files:

bench::mark(
  m <- readMM("matrix.mtx"),
  m <- readMM("matrix.mtx.gz")
)
## # A tibble: 2 x 6
##   expression                        min   median `itr/sec` mem_alloc `gc/sec`
##   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 m <- readMM("matrix.mtx")      5.32ms   5.61ms      172.     626KB        0
## 2 m <- readMM("matrix.mtx.gz")   6.48ms   6.84ms      139.     626KB        0

writeMMgz#

Since the writeMM() function does not accept a connection object, this does not work:

writeMM(m, gzfile("matrix.mtx.gz")) ## This does not work :(

Instead, we can write our own function:

#' @param x A sparse matrix from the Matrix package.
#' @param file A filename that ends in ".gz".
writeMMgz <- function(x, file) {
  mtype <- "real"
  if (is(x, "ngCMatrix")) {
    mtype <- "integer"
  }
  writeLines(
    c(
      sprintf("%%%%MatrixMarket matrix coordinate %s general", mtype),
      sprintf("%s %s %s", x@Dim[1], x@Dim[2], length(x@x))
    ),
    gzfile(file)
  )
  data.table::fwrite(
    x = summary(x),
    file = file,
    append = TRUE,
    sep = " ",
    row.names = FALSE,
    col.names = FALSE
  )
}

Confirm that it works:

writeMMgz(m, "matrix2.mtx.gz")

all.equal(readMM("matrix.mtx.gz"), readMM("matrix2.mtx.gz"))
## [1] TRUE

Learn more#

Find more details about the matrix formats in this vignettes from the Matrix R package.

And learn more about faster computations with sparse matrices in this vignette.