1 Overview

The ScaledMatrix provides yet another method of running scale() on a matrix. In other words, these three operations are equivalent:

mat <- matrix(rnorm(10000), ncol=10)

smat1 <- scale(mat)
head(smat1)
##            [,1]       [,2]        [,3]        [,4]       [,5]       [,6]
## [1,] -0.4447216 -0.5061862 -0.93273239 -0.70021009 -0.1041289 -0.5140613
## [2,] -1.0760652 -1.0935948 -0.05208514  1.90192347 -1.1631869 -0.1180711
## [3,]  0.5550341  0.7504737  0.70241809  0.56785295 -0.7431139  0.8308381
## [4,]  0.4707374  0.3961490  0.05653078 -0.06971386  2.4573287  1.8473645
## [5,] -0.8563962 -0.4888608 -0.95230484 -2.21512158 -0.8127737  0.9718706
## [6,] -0.3820591 -0.1592651 -0.11079225  1.79141381 -0.1328053 -0.2551211
##            [,7]       [,8]        [,9]      [,10]
## [1,] -0.7176163 -1.2722411  0.36992388  0.9952165
## [2,]  2.1457269 -0.7885428  2.46742784 -0.9335961
## [3,]  0.5324496 -1.8942885  1.19168332  1.2511967
## [4,]  0.5901865 -0.8195014 -0.02596733 -0.1597160
## [5,]  1.1242784 -0.1772805  0.82540602 -0.5039375
## [6,] -0.2338471 -2.2569581  0.19682702 -1.2382445
library(DelayedArray)
smat2 <- scale(DelayedArray(mat))
head(smat2)
## <6 x 10> DelayedMatrix object of type "double":
##             [,1]        [,2]        [,3] ...        [,9]       [,10]
## [1,] -0.44472163 -0.50618617 -0.93273239   .  0.36992388  0.99521648
## [2,] -1.07606523 -1.09359483 -0.05208514   .  2.46742784 -0.93359614
## [3,]  0.55503407  0.75047373  0.70241809   .  1.19168332  1.25119675
## [4,]  0.47073743  0.39614900  0.05653078   . -0.02596733 -0.15971602
## [5,] -0.85639616 -0.48886079 -0.95230484   .  0.82540602 -0.50393748
## [6,] -0.38205910 -0.15926511 -0.11079225   .  0.19682702 -1.23824450
library(ScaledMatrix)
smat3 <- ScaledMatrix(mat, center=TRUE, scale=TRUE)
head(smat3)
## <6 x 10> ScaledMatrix object of type "double":
##             [,1]        [,2]        [,3] ...        [,9]       [,10]
## [1,] -0.44472163 -0.50618617 -0.93273239   .  0.36992388  0.99521648
## [2,] -1.07606523 -1.09359483 -0.05208514   .  2.46742784 -0.93359614
## [3,]  0.55503407  0.75047373  0.70241809   .  1.19168332  1.25119675
## [4,]  0.47073743  0.39614900  0.05653078   . -0.02596733 -0.15971602
## [5,] -0.85639616 -0.48886079 -0.95230484   .  0.82540602 -0.50393748
## [6,] -0.38205910 -0.15926511 -0.11079225   .  0.19682702 -1.23824450

The biggest difference lies in how they behave in downstream matrix operations.

  • smat1 is an ordinary matrix, with the scaled and centered values fully realized in memory. Nothing too unusual here.
  • smat2 is a DelayedMatrix and undergoes block processing whereby chunks are realized and operated on, one at a time. This sacrifices speed for greater memory efficiency by avoiding a copy of the entire matrix. In particular, it preserves the structure of the original mat, e.g., from a sparse or file-backed representation.
  • smat3 is a ScaledMatrix that refactors certain operations so that they can be applied to the original mat without any scaling or centering. This takes advantage of the original data structure to speed up matrix multiplication and row/column sums, albeit at the cost of numerical precision.

2 Matrix multiplication

Given an original matrix \(\mathbf{X}\) with \(n\) columns, a vector of column centers \(\mathbf{c}\) and a vector of column scaling values \(\mathbf{s}\), our scaled matrix can be written as:

\[ \mathbf{Y} = (\mathbf{X} - \mathbf{c} \cdot \mathbf{1}_n^T) \mathbf{S} \]

where \(\mathbf{S} = \text{diag}(s_1^{-1}, ..., s_n^{-1})\). If we wanted to right-multiply it with another matrix \(\mathbf{A}\), we would have:

\[ \mathbf{YA} = \mathbf{X}\mathbf{S}\mathbf{A} - \mathbf{c} \cdot \mathbf{1}_n^T \mathbf{S}\mathbf{A} \]

The right-most expression is simply the outer product of \(\mathbf{c}\) with the column sums of \(\mathbf{SA}\). More important is the fact that we can use the matrix multiplication operator for \(\mathbf{X}\) with \(\mathbf{SA}\), as this allows us to use highly efficient algorithms for certain data representations, e.g., sparse matrices.

library(Matrix)
mat <- rsparsematrix(20000, 10000, density=0.01)
smat <- ScaledMatrix(mat, center=TRUE, scale=TRUE)

blob <- matrix(runif(ncol(mat) * 5), ncol=5)
system.time(out <- smat %*% blob)
##    user  system elapsed 
##   0.021   0.004   0.025
# The slower way with block processing.
da <- scale(DelayedArray(mat))
system.time(out2 <- da %*% blob)
##    user  system elapsed 
##  27.895   8.400  37.776

The same logic applies for left-multiplication and cross-products. This allows us to easily speed up high-level operations involving matrix multiplication by just switching to a ScaledMatrix, e.g., in approximate PCA algorithms from the BiocSingular package.

library(BiocSingular)
set.seed(1000)
system.time(pcs <- runSVD(smat, k=10, BSPARAM=IrlbaParam()))
##    user  system elapsed 
##   6.031   0.313   6.344

3 Other utilities

Row and column sums are special cases of matrix multiplication and can be computed quickly:

system.time(rowSums(smat))
##    user  system elapsed 
##   0.008   0.000   0.007
system.time(rowSums(da))
##    user  system elapsed 
##  20.167   6.715  26.885

Subsetting, transposition and renaming of the dimensions are all supported without loss of the ScaledMatrix representation:

smat[,1:5]
## <20000 x 5> ScaledMatrix object of type "double":
##                   [,1]          [,2]          [,3]          [,4]          [,5]
##     [1,] -0.0008050328 20.1354266434  0.0016278980 -0.0004643620 -0.0119674312
##     [2,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
##     [3,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
##     [4,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
##     [5,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
##      ...             .             .             .             .             .
## [19996,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
## [19997,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
## [19998,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
## [19999,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
## [20000,] -0.0008050328 -0.0092772405  0.0016278980 -0.0004643620 -0.0119674312
t(smat)
## <10000 x 20000> ScaledMatrix object of type "double":
##                   [,1]          [,2]          [,3] ...      [,19999]
##     [1,] -0.0008050328 -0.0008050328 -0.0008050328   . -0.0008050328
##     [2,] 20.1354266434 -0.0092772405 -0.0092772405   . -0.0092772405
##     [3,]  0.0016278980  0.0016278980  0.0016278980   .  0.0016278980
##     [4,] -0.0004643620 -0.0004643620 -0.0004643620   . -0.0004643620
##     [5,] -0.0119674312 -0.0119674312 -0.0119674312   . -0.0119674312
##      ...             .             .             .   .             .
##  [9996,]  0.0053543847  0.0053543847  0.0053543847   .  0.0053543847
##  [9997,] -0.0020172290 -0.0020172290 -0.0020172290   . -0.0020172290
##  [9998,]  0.0002983776  0.0002983776  0.0002983776   .  0.0002983776
##  [9999,]  0.0189216911  0.0189216911  0.0189216911   .  0.0189216911
## [10000,]  0.0066926630  0.0066926630  0.0066926630   .  0.0066926630
##               [,20000]
##     [1,] -0.0008050328
##     [2,] -0.0092772405
##     [3,]  0.0016278980
##     [4,] -0.0004643620
##     [5,] -0.0119674312
##      ...             .
##  [9996,]  0.0053543847
##  [9997,] -0.0020172290
##  [9998,]  0.0002983776
##  [9999,]  7.5965352499
## [10000,]  0.0066926630
rownames(smat) <- paste0("GENE_", 1:20000)
smat
## <20000 x 10000> ScaledMatrix object of type "double":
##                     [,1]          [,2]          [,3] ...      [,9999]
##     GENE_1 -0.0008050328 20.1354266434  0.0016278980   .  0.018921691
##     GENE_2 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
##     GENE_3 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
##     GENE_4 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
##     GENE_5 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
##        ...             .             .             .   .            .
## GENE_19996 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
## GENE_19997 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
## GENE_19998 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
## GENE_19999 -0.0008050328 -0.0092772405  0.0016278980   .  0.018921691
## GENE_20000 -0.0008050328 -0.0092772405  0.0016278980   .  7.596535250
##                [,10000]
##     GENE_1  0.006692663
##     GENE_2  0.006692663
##     GENE_3  0.006692663
##     GENE_4  0.006692663
##     GENE_5  0.006692663
##        ...            .
## GENE_19996 -9.475289275
## GENE_19997  0.006692663
## GENE_19998  0.006692663
## GENE_19999  0.006692663
## GENE_20000  0.006692663

Other operations will cause the ScaledMatrix to collapse to the general DelayedMatrix representation, after which point block processing will be used.

smat + 1
## <20000 x 10000> DelayedMatrix object of type "double":
##                  [,1]       [,2]       [,3] ...   [,9999]  [,10000]
##     GENE_1  0.9991950 21.1354266  1.0016279   .  1.018922  1.006693
##     GENE_2  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
##     GENE_3  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
##     GENE_4  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
##     GENE_5  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
##        ...          .          .          .   .         .         .
## GENE_19996  0.9991950  0.9907228  1.0016279   .  1.018922 -8.475289
## GENE_19997  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
## GENE_19998  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
## GENE_19999  0.9991950  0.9907228  1.0016279   .  1.018922  1.006693
## GENE_20000  0.9991950  0.9907228  1.0016279   .  8.596535  1.006693

4 Caveats

For most part, the implementation of the multiplication assumes that the \(\mathbf{A}\) matrix and the matrix product are small compared to \(\mathbf{X}\). It is also possible to multiply two ScaledMatrixes together if the underlying matrices have efficient operators for their product. However, if this is not the case, the ScaledMatrix offers little benefit for increased overhead.

It is also worth noting that this speed-up is not entirely free. The expression above involves subtracting two matrix with potentially large values, which runs the risk of catastrophic cancellation. The example below demonstrates how ScaledMatrix is more susceptible to loss of precision than a normal DelayedArray:

set.seed(1000)
mat <- matrix(rnorm(1000000), ncol=100000) 
big.mat <- mat + 1e12

# The 'correct' value, unaffected by numerical precision.
ref <- rowMeans(scale(mat))
head(ref)
## [1] -0.0025584703 -0.0008570664 -0.0019225335 -0.0001039903  0.0024761772
## [6]  0.0032943203
# The value from scale'ing a DelayedArray.
library(DelayedArray)
smat2 <- scale(DelayedArray(big.mat))
head(rowMeans(smat2))
## [1] -0.0025583534 -0.0008571123 -0.0019226040 -0.0001039539  0.0024761618
## [6]  0.0032943783
# The value from a ScaledMatrix.
library(ScaledMatrix)
smat3 <- ScaledMatrix(big.mat, center=TRUE, scale=TRUE)
head(rowMeans(smat3))
## [1] -0.00480  0.00848  0.00544 -0.00976 -0.01056  0.01520

In most practical applications, though, this does not seem to be a major concern, especially as most values (e.g., log-normalized expression matrices) lie close to zero anyway.

Session information

sessionInfo()
## R Under development (unstable) (2024-01-16 r85808)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 22.04.4 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.19-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] BiocSingular_1.19.0   ScaledMatrix_1.11.1   DelayedArray_0.29.9  
##  [4] SparseArray_1.3.4     S4Arrays_1.3.5        abind_1.4-5          
##  [7] IRanges_2.37.1        S4Vectors_0.41.3      MatrixGenerics_1.15.0
## [10] matrixStats_1.2.0     BiocGenerics_0.49.1   Matrix_1.6-5         
## [13] BiocStyle_2.31.0     
## 
## loaded via a namespace (and not attached):
##  [1] jsonlite_1.8.8            compiler_4.4.0           
##  [3] BiocManager_1.30.22       crayon_1.5.2             
##  [5] rsvd_1.0.5                Rcpp_1.0.12              
##  [7] DelayedMatrixStats_1.25.1 parallel_4.4.0           
##  [9] jquerylib_0.1.4           BiocParallel_1.37.0      
## [11] yaml_2.3.8                fastmap_1.1.1            
## [13] lattice_0.22-5            R6_2.5.1                 
## [15] XVector_0.43.1            knitr_1.45               
## [17] bookdown_0.37             bslib_0.6.1              
## [19] rlang_1.1.3               cachem_1.0.8             
## [21] xfun_0.42                 sass_0.4.8               
## [23] cli_3.6.2                 zlibbioc_1.49.0          
## [25] digest_0.6.34             grid_4.4.0               
## [27] irlba_2.3.5.1             sparseMatrixStats_1.15.0 
## [29] lifecycle_1.0.4           evaluate_0.23            
## [31] codetools_0.2-19          beachmat_2.19.1          
## [33] rmarkdown_2.25            tools_4.4.0              
## [35] htmltools_0.5.7