# Working with a sparse matrix in R

2020-03-11

Sparse matrices are necessary for dealing with large single-cell RNA-seq datasets. They require less memory than dense matrices, and they allow some computations to be more efficient. In this note, we'll discuss the internals of the dgCMatrix class with examples.

Let's get started by installing and loading the Matrix package, which provides the sparse matrix classes that we use in this note.

install.packages("Matrix")
library(Matrix)


Below, we'll explore two Matrix formats and their corresponding classes:

• The triplet format in class dgTMatrix
• The compressed column format in class dgCMatrix

# The triplet format in dgTMatrix#

dgTMatrix is a class from the Matrix R package that implements:

general, numeric, sparse matrices in (a possibly redundant) triplet format

The format is easy to understand:

• Assume all unspecified entries in the matrix are equal to zero.
• Define the non-zero entries in triplet form (i, j, x) where:
• i is the row number
• j is the column number
• x is the value

That's all there is to it. Let's make one:

m <- Matrix(nrow = 3, ncol = 6, data = 0, sparse = TRUE)
m <- as(m, "dgTMatrix") # by default, Matrix() returns dgCMatrix
m[1,2] <- 10
m[1,3] <- 20
m[3,4] <- 30
m

## 3 x 6 sparse Matrix of class "dgTMatrix"
##
## [1,] . 10 20  . . .
## [2,] .  .  .  . . .
## [3,] .  .  . 30 . .


And let's see what is inside:

str(m)

## Formal class 'dgTMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:3] 0 0 2
##   ..@ j       : int [1:3] 1 2 3
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$: NULL ## .. ..$ : NULL
##   ..@ x       : num [1:3] 10 20 30
##   ..@ factors : list()


The object has slots i, j, and x.

We can reconstruct the above sparse matrix like this:

d <- data.frame(
i = m@i + 1,  # m@i is 0-based, not 1-based like everything else in R
j = m@j + 1,  # m@j is 0-based, not 1-based like everything else in R
x = m@x
)
d

##   i j  x
## 1 1 2 10
## 2 1 3 20
## 3 3 4 30

sparseMatrix(i = d$i, j = d$j, x = d$x, dims = c(3, 6))  ## 3 x 6 sparse Matrix of class "dgCMatrix" ## ## [1,] . 10 20 . . . ## [2,] . . . . . . ## [3,] . . . 30 . .  We can convert a sparse matrix to a data frame like this: as.data.frame(summary(m))  ## i j x ## 1 1 2 10 ## 2 1 3 20 ## 3 3 4 30  Since m@x gives us access to the data values, we can easily transform the values with log2(): m@x <- log2(m@x + 1)  # Matrix Market files use the triplet format# Matrix Market files often end with the file extension .mtx. Write a Matrix Market file: writeMM(m, "matrix.mtx")  ## NULL  Dump the contents of the file: readLines("matrix.mtx")  ##  "%%MatrixMarket matrix coordinate real general" ##  "3 6 3" ##  "1 2 3.4594316186372973" ##  "1 3 4.392317422778761" ##  "3 4 4.954196310386875"  This is the Matrix Market file format: • The first line is a comment (starts with %%). • The next line says there are 3 rows, 6 columns, and 3 non-zero values. • The next 3 lines describe the values in triplet format (i, j, x). Read a Matrix Market file: m <- readMM("matrix.mtx") m  ## 3 x 6 sparse Matrix of class "dgTMatrix" ## ## [1,] . 3.459432 4.392317 . . . ## [2,] . . . . . . ## [3,] . . . 4.954196 . .  Convert from dgTMatrix to dgCMatrix with: as(m, "dgCMatrix")  ## 3 x 6 sparse Matrix of class "dgCMatrix" ## ## [1,] . 3.459432 4.392317 . . . ## [2,] . . . . . . ## [3,] . . . 4.954196 . .  # The compressed column format in dgCMatrix# dgCMatrix is a class from the Matrix R package that implements: general, numeric, sparse matrices in the (sorted) compressed sparse column format This is the most common type of matrix that we will encounter when we are dealing with scRNA-seq data. Let's make a sparse matrix in the dgCMatrix format: library(Matrix) m <- Matrix(nrow = 3, ncol = 6, data = 0, sparse = TRUE) m[1,2] <- 10 m[1,3] <- 20 m[3,4] <- 30 m  ## 3 x 6 sparse Matrix of class "dgCMatrix" ## ## [1,] . 10 20 . . . ## [2,] . . . . . . ## [3,] . . . 30 . .  Let's look inside: str(m)  ## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots ## ..@ i : int [1:3] 0 0 2 ## ..@ p : int [1:7] 0 0 1 2 3 3 3 ## ..@ Dim : int [1:2] 3 6 ## ..@ Dimnames:List of 2 ## .. ..$ : NULL
##   .. ..$: NULL ## ..@ x : num [1:3] 10 20 30 ## ..@ factors : list()  The object has 6 slots, including Dim, i, x, and p. Dim has dimensions of the matrix (3 rows, 6 columns): m@Dim  ##  3 6  x has data values sorted column-wise (top to bottom, left to right): m@x  ##  10 20 30  i has row indices for each data value. Note: i is 0-based, not 1-based like everything else in R. m@i  ##  0 0 2  What about p? Unlike j, p does not tell us which column each data value us in. m@p  ##  0 0 1 2 3 3 3  p has the cumulative number of data values as we move from one column to the next column, left to right. The first value is always 0, and the length of p is one more than the number of columns. We can compute p for any matrix: c(0, cumsum(colSums(m != 0)))  ##  0 0 1 2 3 3 3  Since p is a cumulative sum, we can use diff() to get the number of non-zero entries in each column: diff(m@p)  ##  0 1 1 1 0 0  colSums(m != 0)  ##  0 1 1 1 0 0  The length of p is one more than the number of columns: length(m@p)  ##  7  m@Dim + 1  ##  7  Given p, we can compute j: rep(1:m@Dim, diff(m@p))  ##  2 3 4  Most of the time, it's easier to use summary() to convert a sparse matrix to triplet (i, j, x) format. summary(m)$j

##  2 3 4


One more example might help to clarify how i, x, and p change as we modify the matrix:

# Add more values to the matrix
m[2,2] <- 50
m[2,3] <- 50
m[2,4] <- 50
m

## 3 x 6 sparse Matrix of class "dgCMatrix"
##
## [1,] . 10 20  . . .
## [2,] . 50 50 50 . .
## [3,] .  .  . 30 . .

str(m)

## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:6] 0 1 0 1 1 2
##   ..@ p       : int [1:7] 0 0 2 4 6 6 6
##   ..@ Dim     : int [1:2] 3 6
##   ..@ Dimnames:List of 2
##   .. ..$: NULL ## .. ..$ : NULL
##   ..@ x       : num [1:6] 10 50 20 50 50 30
##   ..@ factors : list()


We know that p is always 0.

Column 1 has 0 values, so p is 0.

Column 2 has 2 values, so p is 0 + 2 = 2.

Column 3 has 2 values, so p is 2 + 2 = 4.

Column 4 has 2 values, so p is 4 + 2 = 6.

Columns 5 and 6 have 0 values, so p and p are 6 + 0 = 6.

# Sparse matrices use less memory than dense matrices#

# A large matrix
set.seed(1)
m <- sparseMatrix(
i = sample(x = 1e4, size = 1e4),
j = sample(x = 1e4, size = 1e4),
x = rnorm(n = 1e4)
)
pryr::object_size(m)

## 162 kB

pryr::object_size(as.matrix(m)) # Dense matrices require much more memory (RAM)

## 800 MB


Compute the sparsity of the matrix:

sparsity <- length(m@x) / m@Dim / m@Dim
sparsity

##  1e-04


When writing Matrix Market files, remember to use gzip compression to save disk space.

writeMM(m, "matrix.mtx")


The uncompressed file size:

bytes_uncompressed <- file.size("matrix.mtx")
scales::number_bytes(bytes_uncompressed)

##  "281 KiB"


Compressing the file can save 50% of the disk space:

system("gzip --keep matrix.mtx")
file.size("matrix.mtx.gz") / bytes_uncompressed

##  0.4883699


It takes about the same amount of time to read uncompressed or compressed Matrix Market files:

bench::mark(
)

## # A tibble: 2 x 6
##   expression                        min   median itr/sec mem_alloc gc/sec
##   <bch:expr>                   <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 m <- readMM("matrix.mtx")      5.85ms   6.88ms      147.     626KB        0
## 2 m <- readMM("matrix.mtx.gz")    6.8ms   7.32ms      131.     626KB        0


# writeMMgz#

Since the writeMM() function does not accept a connection object, this does not work:

writeMM(m, gzfile("matrix.mtx.gz")) ## This does not work :(


Instead, we can write our own function:

#' @param x A sparse matrix from the Matrix package.
#' @param file A filename that ends in ".gz".
writeMMgz <- function(x, file) {
mtype <- "real"
if (is(x, "ngCMatrix")) {
mtype <- "integer"
}
writeLines(
c(
sprintf("%%%%MatrixMarket matrix coordinate %s general", mtype),
sprintf("%s %s %s", x@Dim, x@Dim, length(x@x))
),
gzfile(file)
)
data.table::fwrite(
x = summary(x),
file = file,
append = TRUE,
sep = " ",
row.names = FALSE,
col.names = FALSE
)
}


Confirm that it works:

writeMMgz(m, "matrix2.mtx.gz")


##  TRUE


# Some operations on sparse matrices are fast#

Let's make a dense copy of the 10,000 by 10,000 sparse matrix.

d <- as.matrix(m)


Recall that only 10,000 (0.01%) of the entries in this matrices are non-zero.

Many operations are much faster on sparse matrices:

bench::mark(
colSums(m),
colSums(d)
)

## # A tibble: 2 x 6
##   expression      min   median itr/sec mem_alloc gc/sec
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 colSums(m)  312.4µs  398.5µs    2515.      261KB        0
## 2 colSums(d)   90.9ms   91.9ms      10.9    78.2KB        0

bench::mark(
rowSums(m),
rowSums(d)
)

## # A tibble: 2 x 6
##   expression      min   median itr/sec mem_alloc gc/sec
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 rowSums(m)    322µs    421µs   2310.     234.6KB     2.02
## 2 rowSums(d)    161ms    164ms      6.12    78.2KB     0


Suppose we want to collapse columns by summing groups of columns according to another variable.

set.seed(1)
y <- sample(1:10, size = ncol(m), replace = TRUE)
table(y)

## y
##    1    2    3    4    5    6    7    8    9   10
##  980  937  972 1018  974  979 1072 1023 1015 1030


Let's turn the variable into a model matrix:

ymat <- model.matrix(~ 0 + factor(y))
colnames(ymat) <- 1:10

##   1 2 3 4 5 6 7 8 9 10
## 1 0 0 0 0 0 0 0 0 1  0
## 2 0 0 0 1 0 0 0 0 0  0
## 3 0 0 0 0 0 0 1 0 0  0
## 4 1 0 0 0 0 0 0 0 0  0
## 5 0 1 0 0 0 0 0 0 0  0
## 6 0 0 0 0 0 0 1 0 0  0

colSums(ymat)

##    1    2    3    4    5    6    7    8    9   10
##  980  937  972 1018  974  979 1072 1023 1015 1030


And now we can collapse the columns that belong to each group:

x1 <- m %*% ymat
x2 <- d %*% ymat
all.equal(as.matrix(x1), x2)

##  TRUE

all.equal(x1[,1], rowSums(m[,y == 1]))

##  TRUE

all.equal(x1[,2], rowSums(m[,y == 2]))

##  TRUE

dim(x1)

##  10000    10

head(x1)

## 6 x 10 Matrix of class "dgeMatrix"
##      1 2 3 4 5 6           7           8          9        10
## [1,] 0 0 0 0 0 0  0.00000000  0.00000000 -0.5578692 0.0000000
## [2,] 0 0 0 0 0 0  0.74277916  0.00000000  0.0000000 0.0000000
## [3,] 0 0 0 0 0 0  0.00000000  0.00000000  0.0000000 1.5986887
## [4,] 0 0 0 0 0 0  0.00000000  0.00000000  0.0000000 0.8402201
## [5,] 0 0 0 0 0 0  0.00000000 -0.09295838  0.0000000 0.0000000
## [6,] 0 0 0 0 0 0 -0.05341102  0.00000000  0.0000000 0.0000000


On my machine, this operation on this data is 100 times faster with a sparse matrix than with a dense matrix.

bench::mark(
m %*% ymat,
d %*% ymat,
check = FALSE
)

## # A tibble: 2 x 6
##   expression      min   median itr/sec mem_alloc gc/sec
##   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
## 1 m %*% ymat  568.4µs    851µs    972.      1.75MB     4.13
## 2 d %*% ymat   96.7ms    105ms      9.67   781.3KB     0


# R packages for working with sparse matrices#

You might consider trying these packages for working with sparse matrices in R: