scmap package vignette
As more and more scRNA-seq datasets become available, comparing them becomes essential. A central application is to compare datasets of similar biological origin collected by different labs to ensure that the annotation and the analysis are consistent. Moreover, as very large references, e.g. the Human Cell Atlas (HCA), become available, an important application will be to project cells from a new sample (e.g. from a diseased tissue) onto the reference to characterize differences in composition, or to detect new cell types.
scmap is a method for projecting cells from a scRNA-seq experiment onto the cell types or individual cells identified in a different experiment. A copy of the scmap manuscript is available on bioRxiv.
SingleCellExperiment class
scmap is built on top of Bioconductor's SingleCellExperiment class. Please read the corresponding vignettes on how to create a SingleCellExperiment from your own data. Here we show a small example of how to do that, but note that it is not a comprehensive guide.
scmap input
If you already have a SingleCellExperiment object, then proceed to the next chapter.
If you have a matrix or a data frame containing expression data, then you first need to create a SingleCellExperiment object containing your data. For illustrative purposes we will use an example expression matrix provided with scmap. The dataset (yan) represents FPKM gene expression of 90 cells derived from human embryos. The authors (Yan et al.) defined the developmental stages of all cells in the original publication (ann data frame). We will use these stages in the projection later.
library(SingleCellExperiment)
library(scmap)
head(ann)
## cell_type1
## Oocyte..1.RPKM. zygote
## Oocyte..2.RPKM. zygote
## Oocyte..3.RPKM. zygote
## Zygote..1.RPKM. zygote
## Zygote..2.RPKM. zygote
## Zygote..3.RPKM. zygote
yan[1:3, 1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## C9orf152 0.0 0.0 0.0
## RPS11 1219.9 1021.1 931.6
## ELMO2 7.0 12.2 9.3
Note that the cell type information has to be stored in the cell_type1 column of the colData slot of the SingleCellExperiment object.
Now let’s create a SingleCellExperiment object of the yan dataset:
sce <- SingleCellExperiment(assays = list(normcounts = as.matrix(yan)), colData = ann)
logcounts(sce) <- log2(normcounts(sce) + 1)
# use gene names as feature symbols
rowData(sce)$feature_symbol <- rownames(sce)
isSpike(sce, "ERCC") <- grepl("^ERCC-", rownames(sce))
## Warning: 'isSpike<-' is deprecated.
## Use 'isSpike<-' instead.
## See help("Deprecated")
## Warning: 'spikeNames' is deprecated.
## See help("Deprecated")
## Warning: 'isSpike' is deprecated.
## See help("Deprecated")
# remove features with duplicated names
sce <- sce[!duplicated(rownames(sce)), ]
sce
## class: SingleCellExperiment
## dim: 20214 90
## metadata(0):
## assays(2): normcounts logcounts
## rownames(20214): C9orf152 RPS11 ... CTSC AQP7
## rowData names(1): feature_symbol
## colnames(90): Oocyte..1.RPKM. Oocyte..2.RPKM. ...
## Late.blastocyst..3..Cell.7.RPKM. Late.blastocyst..3..Cell.8.RPKM.
## colData names(1): cell_type1
## reducedDimNames(0):
## spikeNames(1): ERCC
## altExpNames(0):
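Before building the object it can help to check that the annotation lines up with the expression matrix. The following is a minimal base-R sketch and not part of scmap: the expression values are taken from the three genes printed above, while the cell names (cellA, cellB) and the toy annotation are made up for illustration.

```r
# Toy expression table (genes x cells); cell names are hypothetical
expr <- data.frame(
  cellA = c(0, 1219.9, 7),
  cellB = c(0, 1021.1, 12.2),
  row.names = c("C9orf152", "RPS11", "ELMO2")
)

# Toy annotation: one row per cell, labels in the cell_type1 column
ann_toy <- data.frame(
  cell_type1 = c("zygote", "zygote"),
  row.names = c("cellA", "cellB"),
  stringsAsFactors = FALSE
)

# The rows of the annotation must describe the columns of the matrix, in order
stopifnot(identical(colnames(expr), rownames(ann_toy)))

# Same transformation as used for the logcounts slot above
log_expr <- log2(as.matrix(expr) + 1)
log_expr
```

If the check fails, reorder or subset the annotation before passing it as colData.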
Once we have a SingleCellExperiment object we can run scmap. Firstly, we need to select the most informative features (genes) from our input dataset:
sce <- selectFeatures(sce, suppress_plot = FALSE)
## Warning in linearModel(object, n_features): Your object does not contain
## counts() slot. Dropouts were calculated using logcounts() slot...
## Warning: 'isSpike' is deprecated.
## See help("Deprecated")
## Warning: 'spikeNames' is deprecated.
## See help("Deprecated")
## Warning: 'isSpike' is deprecated.
## See help("Deprecated")
Features highlighted in red will be used in the further analysis (projection).
Features are stored in the scmap_features column of the rowData slot of the input object. By default scmap selects \(500\) features (this number can be changed by setting the n_features parameter):
table(rowData(sce)$scmap_features)
##
## FALSE TRUE
## 19714 500
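The warning above mentions a linear model and dropouts. A rough sketch of that kind of selection is given here, assuming that genes are ranked by their residuals from a linear fit of log dropout rate against log mean expression; it runs on simulated data, all variable names are ours, and it is an illustration of the idea rather than scmap's internal code.

```r
set.seed(42)
n_genes <- 200
n_cells <- 50
n_features <- 20

# Simulated counts (genes x cells); each gene gets its own mean (made-up data)
gene_means <- runif(n_genes, min = 0.2, max = 3)
expr <- matrix(
  rpois(n_genes * n_cells, lambda = gene_means),
  nrow = n_genes,
  dimnames = list(paste0("gene", seq_len(n_genes)), NULL)
)

stats <- data.frame(
  gene = rownames(expr),
  log_mean = log2(rowMeans(expr)),
  log_dropout = log2(100 * rowMeans(expr == 0)),
  stringsAsFactors = FALSE
)
# Keep genes for which both logarithms are finite
stats <- stats[is.finite(stats$log_mean) & is.finite(stats$log_dropout), ]

# Fit the expression-dropout relation and rank genes by residual
fit <- lm(log_dropout ~ log_mean, data = stats)
stats$residual <- residuals(fit)
selected <- head(stats$gene[order(stats$residual, decreasing = TRUE)], n_features)
selected
```

Genes with large positive residuals have more dropouts than their mean expression would suggest, which is why they are treated as informative in this sketch.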
The scmap-cluster index of a reference dataset is created by finding the median gene expression for each cluster. By default scmap uses the cell_type1 column of the colData slot in the reference to identify clusters. Other columns can be selected manually by adjusting the cluster_col parameter:
sce <- indexCluster(sce)
The function indexCluster automatically writes the scmap_cluster_index item of the metadata slot of the reference dataset.
head(metadata(sce)$scmap_cluster_index)
## zygote 2cell 4cell 8cell 16cell blast
## ABCB4 5.788589 6.2258580 5.935134 0.6667119 0.000000 0.000000
## ABCC6P1 7.863625 7.7303559 8.322769 7.4303689 4.759867 0.000000
## ABT1 0.320773 0.1315172 0.000000 5.9787977 6.100671 4.627798
## ACCSL 7.922318 8.4274290 9.662611 4.5869260 1.768026 0.000000
## ACOT11 0.000000 0.0000000 0.000000 6.4677243 7.147798 4.057444
## ACOT9 4.877394 4.2196038 5.446969 4.0685468 3.827819 0.000000
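To make the "median gene expression for each cluster" concrete, here is a small base-R sketch on a made-up matrix. The gene names, cell names and values are hypothetical, and the real index is built only from the selected features.

```r
# Toy log-expression matrix: 4 genes x 6 cells (made-up values)
expr <- matrix(
  c(5, 6, 7, 0, 1, 0,
    0, 0, 1, 4, 5, 6,
    2, 2, 2, 2, 2, 2,
    9, 1, 5, 3, 3, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)
clusters <- c("zygote", "zygote", "zygote", "2cell", "2cell", "2cell")

# One column per cluster, holding the per-gene median over that cluster's cells
toy_index <- sapply(unique(clusters), function(cl) {
  apply(expr[, clusters == cl, drop = FALSE], 1, median)
})
toy_index
```

The result has the same shape as the index printed above: genes in rows, clusters in columns.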
One can also visualise the index:
heatmap(as.matrix(metadata(sce)$scmap_cluster_index))
Once the scmap-cluster index has been generated we can use it to project our dataset to itself (just for illustrative purposes). This can be done with one index at a time, but scmap also allows for simultaneous projection to multiple indexes if they are provided as a list:
scmapCluster_results <- scmapCluster(
projection = sce,
index_list = list(
yan = metadata(sce)$scmap_cluster_index
)
)
scmap-cluster projects the query dataset to all references defined in index_list. The resulting cell label assignments are merged into one matrix:
head(scmapCluster_results$scmap_cluster_labs)
## yan
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "2cell"
## [5,] "2cell"
## [6,] "2cell"
Corresponding similarities are stored in the scmap_cluster_siml item:
head(scmapCluster_results$scmap_cluster_siml)
## yan
## [1,] 0.9947609
## [2,] 0.9951257
## [3,] 0.9955916
## [4,] 0.9934012
## [5,] 0.9953694
## [6,] 0.9871041
scmap also provides combined results across all reference datasets (for each cell, the label with the largest similarity across the reference datasets is chosen):
head(scmapCluster_results$combined_labs)
## [1] "zygote" "zygote" "zygote" "2cell" "2cell" "2cell"
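With a single reference the combined labels simply repeat the per-reference labels. The rule of taking the label with the largest similarity only matters with several references, which the following base-R sketch illustrates. The two references (refA, refB), their labels and their similarities are made up, and the handling of cells with no similarity at all is our assumption.

```r
# Hypothetical per-reference labels and similarities for three query cells
labs <- cbind(refA = c("zygote", "2cell", "unassigned"),
              refB = c("zygote", "4cell", "blast"))
siml <- cbind(refA = c(0.95, 0.80, NA),
              refB = c(0.90, 0.85, 0.75))

combined <- vapply(seq_len(nrow(labs)), function(i) {
  # Assumption: a cell with no similarity in any reference stays unassigned
  if (all(is.na(siml[i, ]))) return("unassigned")
  labs[i, which.max(siml[i, ])]  # which.max() skips NA values
}, character(1))
combined
```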
The results of scmap-cluster can be visualized as a Sankey diagram to show how cell-clusters are matched (getSankey() function). Note that the Sankey diagram will only be informative if both the query and the reference datasets have been clustered, but it is not necessary to have meaningful labels assigned to the query (cluster1, cluster2 etc. is sufficient):
plot(
getSankey(
colData(sce)$cell_type1,
scmapCluster_results$scmap_cluster_labs[,'yan'],
plot_height = 400
)
)
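When a plot is not convenient, much the same information can be read from a cross-tabulation of original against assigned labels. The sketch below uses base R only; the two label vectors are made up and stand in for colData(sce)$cell_type1 and the scmap_cluster_labs column.

```r
# Hypothetical original and assigned labels for six cells
reference <- c("zygote", "zygote", "2cell", "2cell", "4cell", "4cell")
assigned  <- c("zygote", "zygote", "2cell", "unassigned", "4cell", "4cell")

# Each non-zero entry corresponds to one flow between a pair of labels
flows <- table(reference = reference, assigned = assigned)
flows

# Fraction of cells whose assigned label matches the original one
agreement <- mean(reference == assigned)
agreement
```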
In contrast to scmap-cluster, scmap-cell projects cells of the input dataset to the individual cells of the reference and not to the cell clusters.
scmap-cell contains a k-means step, which makes it stochastic, i.e. running it multiple times will give slightly different results. Therefore, we fix a random seed so that a user can reproduce our results exactly:
set.seed(1)
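The effect of the seed can be checked with plain k-means from base R. This is a small sketch on simulated points, unrelated to the yan data: resetting the same seed before each run gives identical cluster assignments.

```r
set.seed(123)
x <- matrix(rnorm(200), ncol = 2)  # 100 made-up points in two dimensions

set.seed(1)
run1 <- kmeans(x, centers = 3)$cluster
set.seed(1)
run2 <- kmeans(x, centers = 3)$cluster

# TRUE when both runs assign every point to the same cluster
same <- identical(run1, run2)
same
```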
In scmap-cell, the index is created by a product quantiser algorithm in such a way that every cell in the reference is identified with a set of sub-centroids found via k-means clustering based on a subset of the features.
sce <- indexCell(sce)
Unlike the scmap-cluster index, the scmap-cell index contains information about each cell and therefore cannot be easily visualised. The scmap-cell index consists of two items:
names(metadata(sce)$scmap_cell_index)
## [1] "subcentroids" "subclusters"
subcentroids contains the coordinates of the subcentroids of the low-dimensional subspaces defined by the selected features and by the k and M parameters of the product quantiser algorithm (see ?indexCell).
length(metadata(sce)$scmap_cell_index$subcentroids)
## [1] 50
dim(metadata(sce)$scmap_cell_index$subcentroids[[1]])
## [1] 10 9
metadata(sce)$scmap_cell_index$subcentroids[[1]][,1:5]
## 1 2 3 4 5
## ZAR1L 0.072987697 0.2848353 0.33713297 0.26694708 0.3051086
## SERPINF1 0.179135680 0.3784345 0.35886481 0.39453521 0.4326297
## GRB2 0.439712934 0.4246024 0.23308320 0.43238208 0.3247221
## GSTP1 0.801498298 0.1464230 0.14880665 0.19900079 0.0000000
## ABCC6P1 0.005544482 0.4358565 0.46276591 0.40280401 0.3989602
## ARGFX 0.341212258 0.4284664 0.07629512 0.47961460 0.1296112
## DCT 0.004323311 0.1943568 0.32117489 0.21259776 0.3836451
## C15orf60 0.006681366 0.1862540 0.28346531 0.01123282 0.1096438
## SVOPL 0.003004345 0.1548237 0.33551596 0.12691677 0.2525819
## NLRP9 0.101524942 0.3223963 0.40624639 0.30465156 0.4640308
In the case of our yan dataset:
- The yan dataset contains \(N = 90\) cells.
- \(f = 500\) features were selected (scmap default).
- M was calculated as \(f / 10 = 50\) (scmap default for \(f \le 1000\)). M is the number of low-dimensional subspaces.
- k was calculated as \(k = \sqrt{N} \approx 9\) (scmap default).
subclusters contains, for every low-dimensional subspace, the indexes of the subcentroids to which a given cell belongs:
dim(metadata(sce)$scmap_cell_index$subclusters)
## [1] 50 90
metadata(sce)$scmap_cell_index$subclusters[1:5,1:5]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM. Zygote..1.RPKM.
## [1,] 6 6 6 6
## [2,] 5 5 5 5
## [3,] 5 5 5 5
## [4,] 3 3 3 3
## [5,] 6 6 6 6
## Zygote..2.RPKM.
## [1,] 6
## [2,] 5
## [3,] 5
## [4,] 3
## [5,] 6
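The structure of the two index items can be reproduced with a simplified product quantiser in base R. The sketch below is run on random data with the same sizes as above (N = 90, f = 500). It assumes consecutive blocks of features as subspaces and rounds \(\sqrt{N}\) down; both are our simplifications, and the code is not scmap's implementation.

```r
set.seed(1)
N <- 90                  # number of cells
f <- 500                 # number of selected features
M <- f / 10              # number of subspaces: 50
k <- floor(sqrt(N))      # number of subcentroids per subspace: 9

# Made-up expression matrix (features x cells)
expr <- matrix(
  runif(f * N), nrow = f,
  dimnames = list(paste0("gene", seq_len(f)), paste0("cell", seq_len(N)))
)

# Assumption: each subspace is a consecutive block of f / M = 10 features
chunks <- split(seq_len(f), rep(seq_len(M), each = f / M))

# Run k-means on the cells separately within every subspace
fits <- lapply(chunks, function(rows) {
  kmeans(t(expr[rows, , drop = FALSE]), centers = k, iter.max = 50)
})

# One matrix per subspace: features in rows, subcentroids in columns
subcentroids <- lapply(fits, function(fit) t(fit$centers))
# For every subspace (row), the subcentroid each cell (column) belongs to
subclusters <- t(sapply(fits, function(fit) fit$cluster))

length(subcentroids)
dim(subcentroids[[1]])
dim(subclusters)
```

Each cell is thus summarised by M small integers instead of f expression values, which is what makes searching the index fast.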
Once the scmap-cell indexes have been generated we can use them to project the yan dataset onto itself (again, just for illustrative purposes). This can be done with one index at a time, but scmap also allows simultaneous projection to multiple indexes if they are provided as a list:
scmapCell_results <- scmapCell(
sce,
list(
yan = metadata(sce)$scmap_cell_index
)
)
scmapCell_results contains results of projection for each reference dataset in a list:
names(scmapCell_results)
## [1] "yan"
For each dataset there are two matrices. The cells matrix contains the top 10 (scmap default) cell IDs of the cells of the reference dataset that a given cell of the projection dataset is closest to:
scmapCell_results$yan$cells[,1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 3 3
## [4,] 11 11 11
## [5,] 5 5 5
## [6,] 6 6 6
## [7,] 7 7 7
## [8,] 12 8 12
## [9,] 9 9 9
## [10,] 10 10 10
The similarities matrix contains the corresponding cosine similarities:
scmapCell_results$yan$similarities[,1:3]
## Oocyte..1.RPKM. Oocyte..2.RPKM. Oocyte..3.RPKM.
## [1,] 0.9742737 0.9736593 0.9748542
## [2,] 0.9742274 0.9737083 0.9748995
## [3,] 0.9742274 0.9737083 0.9748995
## [4,] 0.9693955 0.9684169 0.9697731
## [5,] 0.9698173 0.9688538 0.9701976
## [6,] 0.9695394 0.9685904 0.9699759
## [7,] 0.9694336 0.9686058 0.9699198
## [8,] 0.9694091 0.9684312 0.9697699
## [9,] 0.9692544 0.9684312 0.9697358
## [10,] 0.9694336 0.9686058 0.9699198
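For readers unfamiliar with the measure, cosine similarity between a query cell and each reference cell can be computed exactly in a few lines of base R. The vectors below are made up, and the function name is ours. Note that scmap-cell searches its index rather than comparing against every reference cell directly, so this sketch only illustrates the quantity being reported.

```r
# Cosine similarity between two numeric vectors (helper defined for this sketch)
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up expression of one query cell and three reference cells (4 genes)
query <- c(1, 2, 0, 3)
reference <- cbind(
  cellA = c(2, 4, 0, 6),  # same direction as the query
  cellB = c(3, 0, 1, 0),
  cellC = c(0, 0, 5, 0)   # no overlap with the query
)

sims <- apply(reference, 2, cosine_similarity, b = query)
ranked <- sort(sims, decreasing = TRUE)  # nearest neighbours first
ranked
```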
If cell cluster annotation is available for the reference datasets, then in addition to finding the top 10 nearest neighbours, scmap-cell can also annotate cells of the projection dataset using the labels of the reference. It does so by looking at the top 3 nearest neighbours (scmap default): if they all belong to the same cluster in the reference and their maximum similarity is higher than a threshold (\(0.5\) is the scmap default), the projection cell is assigned to the corresponding reference cluster:
scmapCell_clusters <- scmapCell2Cluster(
scmapCell_results,
list(
as.character(colData(sce)$cell_type1)
)
)
scmap-cell results are in the same format as the ones provided by scmap-cluster (see above):
head(scmapCell_clusters$scmap_cluster_labs)
## yan
## [1,] "zygote"
## [2,] "zygote"
## [3,] "zygote"
## [4,] "unassigned"
## [5,] "unassigned"
## [6,] "unassigned"
Corresponding similarities are stored in the scmap_cluster_siml item:
head(scmapCell_clusters$scmap_cluster_siml)
## yan
## [1,] 0.9742737
## [2,] 0.9737083
## [3,] 0.9748995
## [4,] NA
## [5,] NA
## [6,] NA
head(scmapCell_clusters$combined_labs)
## [1] "zygote" "zygote" "zygote" "unassigned" "unassigned"
## [6] "unassigned"
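The assignment rule described above (top 3 neighbours must agree, maximum similarity must exceed the threshold) can be written out as a small base-R function. This is a sketch of the rule as stated in the text: the function name, its argument names and the example inputs are hypothetical and are not taken from scmap.

```r
# Label one query cell from its nearest neighbours (illustrative helper)
assign_label <- function(neighbour_labels, neighbour_sims,
                         n_neighbours = 3, threshold = 0.5) {
  # Take the n_neighbours most similar reference cells
  top <- order(neighbour_sims, decreasing = TRUE)[seq_len(n_neighbours)]
  labs <- unique(neighbour_labels[top])
  if (length(labs) == 1 && max(neighbour_sims[top]) > threshold) {
    labs
  } else {
    "unassigned"
  }
}

# All three nearest neighbours agree and are similar enough
agree <- assign_label(c("zygote", "zygote", "zygote", "2cell"),
                      c(0.97, 0.96, 0.95, 0.90))
# The nearest neighbours come from different clusters
mixed <- assign_label(c("zygote", "2cell", "zygote"),
                      c(0.97, 0.96, 0.95))
# The neighbours agree but none is similar enough
weak <- assign_label(c("blast", "blast", "blast"),
                     c(0.40, 0.30, 0.20))
c(agree, mixed, weak)
```

A cell whose nearest neighbours come from more than one cluster is left unassigned even when the similarities are high, which is one way labels such as those in rows 4 to 6 above can arise.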
plot(
getSankey(
colData(sce)$cell_type1,
scmapCell_clusters$scmap_cluster_labs[,"yan"],
plot_height = 400
)
)
sessionInfo()
## R version 3.6.1 (2019-07-05)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.10-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.10-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] scmap_1.8.0 SingleCellExperiment_1.8.0
## [3] SummarizedExperiment_1.16.0 DelayedArray_0.12.0
## [5] BiocParallel_1.20.0 matrixStats_0.55.0
## [7] Biobase_2.46.0 GenomicRanges_1.38.0
## [9] GenomeInfoDb_1.22.0 IRanges_2.20.0
## [11] S4Vectors_0.24.0 BiocGenerics_0.32.0
## [13] googleVis_0.6.4 BiocStyle_2.14.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_0.2.5 xfun_0.10 reshape2_1.4.3
## [4] purrr_0.3.3 lattice_0.20-38 colorspace_1.4-1
## [7] htmltools_0.4.0 yaml_2.2.0 rlang_0.4.1
## [10] e1071_1.7-2 pillar_1.4.2 glue_1.3.1
## [13] plyr_1.8.4 GenomeInfoDbData_1.2.2 stringr_1.4.0
## [16] zlibbioc_1.32.0 munsell_0.5.0 gtable_0.3.0
## [19] codetools_0.2-16 evaluate_0.14 labeling_0.3
## [22] knitr_1.25 class_7.3-15 Rcpp_1.0.2
## [25] scales_1.0.0 BiocManager_1.30.9 jsonlite_1.6
## [28] XVector_0.26.0 ggplot2_3.2.1 digest_0.6.22
## [31] stringi_1.4.3 bookdown_0.14 dplyr_0.8.3
## [34] grid_3.6.1 tools_3.6.1 bitops_1.0-6
## [37] magrittr_1.5 proxy_0.4-23 lazyeval_0.2.2
## [40] RCurl_1.95-4.12 tibble_2.1.3 randomForest_4.6-14
## [43] crayon_1.3.4 pkgconfig_2.0.3 Matrix_1.2-17
## [46] assertthat_0.2.1 rmarkdown_1.16 R6_2.4.0
## [49] compiler_3.6.1