scater: Single-cell analysis toolkit for expression in Rscater 1.12.2
This document gives an introduction to and overview of the quality control functionality of the scater package. scater contains tools to help with the analysis of single-cell transcriptomic data, focusing on RNA-seq data. The package features:
SingleCellExperiment class as a data container for interoperability with a wide range of other Bioconductor packages;SingleCellExperiment objectWe assume that you have a matrix containing expression count data summarised at the level of some features (gene, exon, region, etc.).
First, we create a SingleCellExperiment object containing the data, as demonstrated below with some example data ("sc_example_counts") and metadata ("sc_example_cell_info"):
Rows of the object correspond to features, while columns correspond to samples, i.e., cells in the context of single-cell ’omics data.
library(scater)
data("sc_example_counts")
data("sc_example_cell_info")
example_sce <- SingleCellExperiment(
assays = list(counts = sc_example_counts),
colData = sc_example_cell_info
)
example_sce
## class: SingleCellExperiment
## dim: 2000 40
## metadata(0):
## assays(1): counts
## rownames(2000): Gene_0001 Gene_0002 ... Gene_1999 Gene_2000
## rowData names(0):
## colnames(40): Cell_001 Cell_002 ... Cell_039 Cell_040
## colData names(4): Cell Mutation_Status Cell_Cycle Treatment
## reducedDimNames(0):
## spikeNames(0):
We usually expect (raw) count data to be labelled as "counts" in the assays, which can be easily retrieved with the counts accessor.
Getters and setters are also provided for exprs, tpm, cpm, fpkm and versions of these with the prefix norm_.
str(counts(example_sce))
Row and column-level metadata are easily accessed (or modified) as shown below.
There are also dedicated getters and setters for spike-in specifiers (isSpike); size factor values (sizeFactors); and reduced dimensionality results (reducedDim).
example_sce$whee <- sample(LETTERS, ncol(example_sce), replace=TRUE)
colData(example_sce)
## DataFrame with 40 rows and 5 columns
## Cell Mutation_Status Cell_Cycle Treatment whee
## <character> <character> <character> <character> <character>
## Cell_001 Cell_001 positive S treat1 W
## Cell_002 Cell_002 positive G0 treat1 O
## Cell_003 Cell_003 negative G1 treat1 W
## Cell_004 Cell_004 negative S treat1 Z
## Cell_005 Cell_005 negative G1 treat2 W
## ... ... ... ... ... ...
## Cell_036 Cell_036 negative G0 treat1 S
## Cell_037 Cell_037 negative G0 treat1 R
## Cell_038 Cell_038 negative G0 treat2 S
## Cell_039 Cell_039 negative G1 treat1 R
## Cell_040 Cell_040 negative G0 treat2 Q
rowData(example_sce)$stuff <- runif(nrow(example_sce))
rowData(example_sce)
## DataFrame with 2000 rows and 1 column
## stuff
## <numeric>
## Gene_0001 0.80085068452172
## Gene_0002 0.202237056568265
## Gene_0003 0.491804076358676
## Gene_0004 0.091127741150558
## Gene_0005 0.540604811627418
## ... ...
## Gene_1996 0.25665649631992
## Gene_1997 0.414263135753572
## Gene_1998 0.864102468360215
## Gene_1999 0.0602115865331143
## Gene_2000 0.778497750638053
Subsetting is very convenient with this class, as both data and metadata are processed in a synchronized manner. For example, we can filter out features (genes) that are not expressed in any cells:
keep_feature <- rowSums(counts(example_sce) > 0) > 0
example_sce <- example_sce[keep_feature,]
More details about the SingleCellExperiment class can be found in the documentation for SingleCellExperiment package.
We calculate counts-per-million using the aptly-named calculateCPM function.
The output is most appropriately stored as an assay named "cpm" in the assays of the SingleCellExperiment object.
cpm(example_sce) <- calculateCPM(example_sce)
Another option is to use the normalize function, which calculates log2-transformed normalized expression values.
This is done by dividing each count by its size factor (or scaled library size, if no size factors are defined), adding a pseudo-count and log-transforming.
The resulting values can be interpreted on the same scale as log-transformed counts, and are stored in "logcounts".
example_sce <- normalize(example_sce)
assayNames(example_sce)
## [1] "counts" "cpm" "logcounts"
Note that exprs is a synonym for logcounts when accessing or setting data.
This is done for backwards compatibility with older verions of scater.
identical(exprs(example_sce), logcounts(example_sce))
## [1] TRUE
Of course, users can construct any arbitrary matrix of the same dimensions as the count matrix and store it as an assay.
assay(example_sce, "is_expr") <- counts(example_sce)>0
The calcAverage function will compute the average count for each gene after scaling each cell’s counts by its size factor.
If size factors are not available, it will compute a size factor from the library size.
head(calcAverage(example_sce))
## Gene_0001 Gene_0002 Gene_0003 Gene_0004 Gene_0005 Gene_0006
## 305.551749 325.719897 183.090462 162.143201 1.231123 187.167913
Count matrices stored as CSV files or equivalent can be easily read into R session using read.table from utils or fread from the data.table package.
It is advisable to coerce the resulting object into a matrix before storing it in a SingleCellExperiment object.
For large data sets, the matrix can be read in chunk-by-chunk with progressive coercion into a sparse matrix from the Matrix package.
This is performed using readSparseCounts and reduces memory usage by not explicitly storing zeroes in memory.
Data from 10X Genomics experiments can be read in using the read10xCounts function from the DropletUtils package.
This will automatically generate a SingleCellExperiment with a sparse matrix, see the documentation for more details.
scater also provides wrapper functions readSalmonResults or readKallistoResults to import transcript abundances from the kallisto and Salmon pseudo-aligners.
This is done using methods from the tximport package.
SCESet classAs of July 2017, scater has switched from the SCESet class previously defined within the package to the more widely applicable SingleCellExperiment class.
From Bioconductor 3.6 (October 2017), the release version of scater will use SingleCellExperiment.
SingleCellExperiment is a more modern and robust class that provides a common data structure used by many single-cell Bioconductor packages.
Advantages include support for sparse data matrices and the capability for on-disk storage of data to minimise memory usage for large single-cell datasets.
It should be straight-forward to convert existing scripts based on SCESet objects to SingleCellExperiment objects, with key changes outlined immediately below.
toSingleCellExperiment and updateSCESet (for backwards compatibility) can be used to convert an old SCESet object to a SingleCellExperiment object;SingleCellExperiment object with the function SingleCellExperiment (actually less fiddly than creating a new SCESet);scater functions have been refactored to take SingleCellExperiment objects, so once data is in a SingleCellExperiment object, the user experience is almost identical to that with the SCESet class.Users may need to be aware of the following when updating their own scripts:
colnames function (instead of sampleNames or cellNames for an SCESet object);rownames function (instead of featureNames);phenoData in an SCESet, corresponds to colData in a SingleCellExperiment object and is accessed/assigned with the colData function (this replaces the pData function);$ operator (e.g. sce$total_counts);featureData in an SCESet, corresponds to rowData in a SingleCellExperiment object and is accessed/assigned with the rowData function (this replaces the fData function);plotScater, which produces a cumulative expression, overview plot, replaces
the generic plot function for SCESet objects.sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.9-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.9-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scater_1.12.2 SingleCellExperiment_1.6.0
## [3] SummarizedExperiment_1.14.0 DelayedArray_0.10.0
## [5] BiocParallel_1.18.0 matrixStats_0.54.0
## [7] Biobase_2.44.0 GenomicRanges_1.36.0
## [9] GenomeInfoDb_1.20.0 IRanges_2.18.0
## [11] S4Vectors_0.22.0 BiocGenerics_0.30.0
## [13] ggplot2_3.1.1 knitr_1.23
## [15] BiocStyle_2.12.0
##
## loaded via a namespace (and not attached):
## [1] bitops_1.0-6 destiny_2.14.0
## [3] xts_0.11-2 tools_3.6.0
## [5] R6_2.4.0 irlba_2.3.3
## [7] vipor_0.4.5 lazyeval_0.2.2
## [9] colorspace_1.4-1 nnet_7.3-12
## [11] withr_2.1.2 sp_1.3-1
## [13] smoother_1.1 tidyselect_0.2.5
## [15] gridExtra_2.3 curl_3.3
## [17] compiler_3.6.0 BiocNeighbors_1.2.0
## [19] labeling_0.3 bookdown_0.10
## [21] scales_1.0.0 lmtest_0.9-37
## [23] DEoptimR_1.0-8 robustbase_0.93-5
## [25] proxy_0.4-23 stringr_1.4.0
## [27] digest_0.6.19 foreign_0.8-71
## [29] rmarkdown_1.13 rio_0.5.16
## [31] XVector_0.24.0 pkgconfig_2.0.2
## [33] htmltools_0.3.6 TTR_0.23-4
## [35] ggthemes_4.2.0 rlang_0.3.4
## [37] readxl_1.3.1 DelayedMatrixStats_1.6.0
## [39] zoo_1.8-5 dplyr_0.8.1
## [41] zip_2.0.2 car_3.0-2
## [43] RCurl_1.95-4.12 magrittr_1.5
## [45] BiocSingular_1.0.0 GenomeInfoDbData_1.2.1
## [47] Matrix_1.2-17 Rcpp_1.0.1
## [49] ggbeeswarm_0.6.0 munsell_0.5.0
## [51] abind_1.4-5 viridis_0.5.1
## [53] scatterplot3d_0.3-41 stringi_1.4.3
## [55] yaml_2.2.0 carData_3.0-2
## [57] MASS_7.3-51.4 zlibbioc_1.30.0
## [59] Rtsne_0.15 plyr_1.8.4
## [61] grid_3.6.0 forcats_0.4.0
## [63] crayon_1.3.4 lattice_0.20-38
## [65] haven_2.1.0 cowplot_0.9.4
## [67] hms_0.4.2 pillar_1.4.0
## [69] igraph_1.2.4.1 ranger_0.11.2
## [71] boot_1.3-22 reshape2_1.4.3
## [73] glue_1.3.1 evaluate_0.13
## [75] laeken_0.5.0 data.table_1.12.2
## [77] BiocManager_1.30.4 vcd_1.4-4
## [79] VIM_4.8.0 cellranger_1.1.0
## [81] gtable_0.3.0 purrr_0.3.2
## [83] assertthat_0.2.1 xfun_0.7
## [85] openxlsx_4.1.0 rsvd_1.0.0
## [87] RcppEigen_0.3.3.5.0 e1071_1.7-1
## [89] class_7.3-15 viridisLite_0.3.0
## [91] tibble_2.1.1 beeswarm_0.2.3