Raúl Aguirre-Gamboa and Niek de Klein
2018-11-29
DeconCell is an R package containing models for predicting the proportions of circulating immune cell subpopulations from bulk gene expression data of whole blood. The models were built with an elastic net and trained on 95 healthy Dutch volunteers from the 500FG cohort, for whom 73 circulating cell subpopulations were quantified by FACS, as described in our previous publication. For additional details on methods and results, please see our manuscript.
library(devtools)
#install_github("molgenis/systemsgenetics/Decon2/DeconCell")
Let's load and pre-process our example data: 5 samples with more than ~40k genes quantified. Because these are raw gene read counts, we need to bring them to an approximately normal distribution and account for differences in library size. To do this we use the dCell.expProcessing function, which performs a TMM normalization (as described in the edgeR package), a log2(counts + 1) transformation, and a per-gene scaling (z-transformation).
library(DeconCell)
library(edgeR)
## Loading required package: limma
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.6
## ✔ tidyr 0.8.1 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ghibli)
data("count.table")
dCell.exp <- dCell.expProcessing(count.table, trim = TRUE)
## [INFO] Total of 95.33 % genes from dCell are found
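The three preprocessing steps described above can be illustrated with a minimal base R sketch on simulated counts. This is not the package's implementation: the object names (`toy.counts`, `toy.cpm`, `z.scores`) are hypothetical, and plain counts-per-million scaling is swapped in for TMM so that the example runs without edgeR.

```r
# Simulated count matrix: genes in rows, samples in columns (not package data).
set.seed(1)
toy.counts <- matrix(rpois(6 * 4, lambda = 50), nrow = 6,
                     dimnames = list(paste0("gene_", 1:6),
                                     paste0("sample_", 1:4)))

# Step 1: account for library size.
# ASSUMPTION: simple counts-per-million is used here as a stand-in for TMM;
# no TMM scaling factors are estimated.
toy.cpm <- sweep(toy.counts, 2, colSums(toy.counts), "/") * 1e6

# Step 2: log2(x + 1) transformation.
toy.log <- log2(toy.cpm + 1)

# Step 3: z-transformation per gene. scale() works on columns, so the
# matrix is transposed before and after to standardise each row (gene).
z.scores <- t(scale(t(toy.log)))

round(rowMeans(z.scores), 10)      # each gene is centred on 0
round(apply(z.scores, 1, sd), 10)  # each gene has unit standard deviation
```

After the last step every gene has mean 0 and standard deviation 1 across samples, which puts all genes on a comparable scale regardless of their absolute expression level.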
data("dCell.models")
prediction <- dCell.predict(dCell.exp, dCell.models, res.type = "median")
head(prediction$dCell.prediction)
## Granulocytes B cells (CD19+) CD4+ T cells CD8+ T cells
## sample_1 44.91095 3.153132 25.36669 10.813036
## sample_2 53.30947 1.956458 18.57895 6.745036
## sample_3 55.40200 2.166409 16.92569 8.206814
## sample_4 46.32392 2.812778 18.36118 10.352024
## sample_5 50.43606 2.802476 23.81135 10.867829
## DN (CD4- CD8-) NK dim (CD56+ CD16+) Monocytes (CD14+) Lymphocytes
## sample_1 2.004035 5.291112 6.513887 49.23558
## sample_2 2.360521 2.729987 7.173571 39.85655
## sample_3 1.607821 5.548666 6.952072 35.70784
## sample_4 2.220237 4.314080 7.280391 46.70237
## sample_5 1.899609 4.232615 5.949453 44.31698
## CD45RO- CD45RA+ T cells CD4+ Naive CD45RA+ CD27+
## sample_1 18.01465 12.798075
## sample_2 13.06061 8.380361
## sample_3 14.06929 9.410064
## sample_4 11.51714 6.446478
## sample_5 15.79403 11.325952
## CD4+ Naive CD45RO- CD27+ Intermediate monocytes (CD14+CD16+)
## sample_1 12.011534 0.4710940
## sample_2 8.893022 0.3789746
## sample_3 9.919128 0.5770685
## sample_4 6.564064 0.5420663
## sample_5 11.708560 0.3962083
## CD8+ Naive CD45RA+ CD27+ CD8+ EM CD45RA- CD27-
## sample_1 6.073705 0.5267965
## sample_2 5.313881 0.3584265
## sample_3 4.973362 0.5235741
## sample_4 3.487296 0.7166726
## sample_5 5.746650 0.4761111
## CD8+ Naive CD45RO- CD27+ IgD+ IgM+ IgD+ IgM- IgD- IgM-
## sample_1 5.886285 1.2265014 0.10616696 0.3411916
## sample_2 4.965455 0.5653783 0.09401095 0.2959995
## sample_3 4.773495 0.7531481 0.09178040 0.2705341
## sample_4 3.053851 0.9357196 0.10162866 0.3536773
## sample_5 5.874639 1.0058478 0.10054349 0.3275862
## NaiveB cells (IgD+ IgM+ CD27-) Memory B cells (IgD+ IgM+ CD27+)
## sample_1 1.0623573 0.1551021
## sample_2 0.4593947 0.1342676
## sample_3 0.6334488 0.1274751
## sample_4 0.7733437 0.1677380
## sample_5 0.8082422 0.1734491
## CD24+ CD38+ T cells (CD3+ CD56-)
## sample_1 1.5904003 37.68790
## sample_2 0.8324206 29.79643
## sample_3 1.0491996 27.48733
## sample_4 1.2188883 31.18692
## sample_5 1.3042505 35.74774
## Transitional B cells (CD24++ CD38++)
## sample_1 0.04773371
## sample_2 0.02381854
## sample_3 0.04111926
## sample_4 0.04219308
## sample_5 0.04440786
## Naive mature B cells (CD24+ CD38+ CD27- IgM+)
## sample_1 1.0327650
## sample_2 0.4454947
## sample_3 0.6089401
## sample_4 0.7447791
## sample_5 0.7631637
## CD24+ CD38+ CD27+ IgM+ IgM-
## sample_1 0.1729348 0.3656486
## sample_2 0.1580562 0.2593598
## sample_3 0.1484528 0.2625270
## sample_4 0.1917561 0.3357884
## sample_5 0.1997702 0.2985248
## Natural effector (CD24+ CD38+ IgD+ IgM+) NK cells (CD3- CD56+)
## sample_1 0.1403407 5.695924
## sample_2 0.1235010 3.061471
## sample_3 0.1145708 5.912784
## sample_4 0.1502340 4.619413
## sample_5 0.1576725 4.601317
## IgD- CD5+ IgD+ CD5+ Prol CD4+ Tconv Prol CD4+ Treg Treg HLA-DR+
## sample_1 0.3566105 1.1540013 0.3985216 0.3633504 0.7115257
## sample_2 0.3483865 0.6902216 0.2966801 0.3701060 0.6368375
## sample_3 0.3125134 0.7004153 0.2959206 0.3084731 0.6737390
## sample_4 0.4003549 0.9634371 0.3237194 0.3684755 1.1603987
## sample_5 0.3578214 0.9811607 0.3869628 0.4391961 0.8005395
head(prediction$Evaluation)
## Granulocytes B cells (CD19+) CD4+ T cells
## 96.84721 99.00771 96.86988
## CD8+ T cells DN (CD4- CD8-) NK dim (CD56+ CD16+)
## 97.95798 100.00000 98.99515
data("cell.proportions")
library(reshape2)
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
library(ggplot2)
data("dCell.names")
pData <- data.frame(PearsonCor= diag(cor(cell.proportions, prediction$dCell.prediction)),
CTs = dCell.names[colnames(cell.proportions), "finalName"],
Subpop = dCell.names[colnames(cell.proportions), "broadSubpopulations"])
ggplot(pData, aes(y=PearsonCor , x= CTs, fill=Subpop))+
geom_bar(stat="identity", alpha=0.8)+
geom_hline(yintercept = 0.5, alpha=0.5, color="red")+
coord_flip()+
scale_fill_brewer(palette = "Dark2")+
theme_bw()+
theme(legend.position = "bottom", text=element_text(size=11, family="Helvetica"))
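The `diag(cor(...))` call used above returns one correlation per cell type only when both matrices list the cell types in the same column order. The following sketch, on simulated data with hypothetical names (`measured`, `predicted`, `ct_a`, ...), shows one way to align the columns by name before taking the diagonal; it is an illustration, not part of DeconCell.

```r
set.seed(42)

# Simulated "measured" proportions: 20 samples, 3 hypothetical cell types.
measured <- matrix(rnorm(20 * 3), ncol = 3,
                   dimnames = list(NULL, c("ct_a", "ct_b", "ct_c")))

# Simulated predictions: measured values plus noise, columns deliberately shuffled.
predicted <- measured + matrix(rnorm(20 * 3, sd = 0.3), ncol = 3)
predicted <- predicted[, c("ct_c", "ct_a", "ct_b")]

# Align both matrices on their shared column names before correlating.
shared <- intersect(colnames(measured), colnames(predicted))
per.ct.cor <- diag(cor(measured[, shared], predicted[, shared]))

# Equivalent column-by-column computation, which avoids building the
# full cross-correlation matrix.
per.ct.cor.loop <- sapply(shared, function(ct) cor(measured[, ct], predicted[, ct]))

all.equal(unname(per.ct.cor), unname(per.ct.cor.loop))
```

Indexing both matrices with the same `shared` vector guarantees that the diagonal pairs each cell type with itself, even if the two inputs were produced in different column orders.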
An important feature of DeconCell is its capacity to generate new models that can later predict the proportions of cell types within a bulk tissue using only gene expression data derived from the bulk tissue itself. To illustrate this we will use the publicly available data from the DeconRNASeq package. As this package states:
"Our demo uses a simulated example data set, which can be accessed using the code given below"
library(DeconRNASeq)
## Loading required package: limSolve
##
## Attaching package: 'limSolve'
## The following object is masked from 'package:ggplot2':
##
## resolution
## Loading required package: pcaMethods
## Loading required package: Biobase
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:dplyr':
##
## combine, intersect, setdiff, union
## The following object is masked from 'package:limma':
##
## plotMA
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind,
## colMeans, colnames, colSums, dirname, do.call, duplicated,
## eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
## paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
## Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
## table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
##
## Attaching package: 'pcaMethods'
## The following object is masked from 'package:stats':
##
## loadings
## Loading required package: grid
data(multi_tissue)
## remove columns that are not needed.
datasets <- x.data[,2:11]
signatures <- x.signature.filtered.optimal[,2:6]
proportions <- fraction
exp <- datasets
As the package indicates, this is "real data" that has been mixed in silico, so the proportions of each of the components (here, tissues) that make up the "bulk" expression are known. The DeconRNASeq vignette describes the data as follows:
"For the mixtures, there are 28745 genes. And we have 10 samples. In silico mixed data were simulated using ([2]) data, with disparate proportions drawn from random numbers. The mixing proportions used by each type of tissue are shown in the following. It should also be noted that we investigated the influence of extremely low numbers of contaminating cell types (<2 percent)."
We will run DeconCell for each of the proportions using 60% of the samples for training our models.
set.seed(1121)
sampled.train <- sample(colnames(exp), size = 6, replace = FALSE)
#use the rest of the samples for testing the models
sampled.test <- colnames(exp)[which(colnames(exp) %in% sampled.train == FALSE)]
new.dCell.models <- dCell.run(exp = exp[,sampled.train],
proportions = proportions[sampled.train,],
iterations = 5)
## INFO A total of 5 cell types are considered
## INFO Samples indexed
##
## INFO Starting eNet iterations
##
## INFO Starting eNet iterations
##
## INFO Starting eNet iterations
##
## INFO Starting eNet iterations
##
## INFO Starting eNet iterations
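The split above indexes the expression matrix by column and the proportions table by row, so both must carry the same sample identifiers. A minimal sketch of that bookkeeping on simulated data follows; all object names (`toy.expr`, `toy.props`, `mix_1`, ...) are hypothetical and the data are random, not the DeconRNASeq example.

```r
set.seed(1121)

sample.ids <- paste0("mix_", 1:10)  # hypothetical sample names

# Simulated expression: genes in rows, samples in columns.
toy.expr <- matrix(rnorm(50 * 10), nrow = 50,
                   dimnames = list(paste0("gene_", 1:50), sample.ids))

# Simulated proportions: samples in rows, tissues in columns, rows sum to 1.
toy.props <- matrix(runif(10 * 5), nrow = 10,
                    dimnames = list(sample.ids, paste0("tissue_", 1:5)))
toy.props <- toy.props / rowSums(toy.props)

# 60% of the samples for training, the remaining 40% for testing.
train.ids <- sample(sample.ids, size = 6, replace = FALSE)
test.ids  <- setdiff(sample.ids, train.ids)

expr.train  <- toy.expr[, train.ids]
props.train <- toy.props[train.ids, ]
expr.test   <- toy.expr[, test.ids]
props.test  <- toy.props[test.ids, ]

# Sanity checks: sets are disjoint and samples line up between inputs.
stopifnot(length(intersect(train.ids, test.ids)) == 0)
stopifnot(identical(colnames(expr.train), rownames(props.train)))
stopifnot(identical(colnames(expr.test), rownames(props.test)))
```

Checking that the identifiers line up before training guards against silently pairing a sample's expression profile with another sample's proportions.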
Now we will use the dCell.predict function with the newly created models to predict the proportions in the held-out test set (sampled.test). In the DeconRNASeq vignette, the authors use the root mean square error (RMSE), which corresponds to the standard deviation of the residuals, as a measure of prediction performance.
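As a small worked illustration of the RMSE, the sketch below uses made-up observed and predicted proportions (the names `observed`, `predicted.toy` and `rmse` are hypothetical). It also shows that the RMSE equals the population standard deviation of the residuals only when the residuals average to zero; with a systematic bias, the squared RMSE is the residual variance plus the squared bias.

```r
# Made-up values for one tissue across four samples (illustrative only).
observed      <- c(0.10, 0.20, 0.30, 0.40)
predicted.toy <- c(0.12, 0.18, 0.33, 0.37)

rmse <- function(x, x.pred) sqrt(mean((x - x.pred)^2))

res.toy  <- observed - predicted.toy  # -0.02, 0.02, -0.03, 0.03
rmse.toy <- rmse(observed, predicted.toy)

# Population SD of the residuals (divides by n; sd() would divide by n - 1).
pop.sd <- sqrt(mean((res.toy - mean(res.toy))^2))

# The residuals average to zero here, so the two quantities agree.
all.equal(rmse.toy, pop.sd)

# Adding a constant bias of 0.05 to the predictions: RMSE^2 = variance + bias^2.
biased      <- predicted.toy + 0.05
res.biased  <- observed - biased
rmse.biased <- rmse(observed, biased)
decomposed  <- sqrt(mean((res.biased - mean(res.biased))^2) + mean(res.biased)^2)
all.equal(rmse.biased, decomposed)
```

Because the RMSE is expressed in the same units as the proportions themselves, it can be read directly as a typical prediction error.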
test.prediction <- dCell.predict(exp[,sampled.test],
dCell.models= new.dCell.models$deconCell.models.per.CT,
res.type = "median", custom = TRUE)
# we use custom=TRUE to keep the original names of the proportions.
# reshape the data for plotting
pData <- reshape2::melt(as.matrix(proportions[sampled.test,]))
pData$Predicted <- reshape2::melt(as.matrix(test.prediction$dCell.prediction))$value
## Function to calculate the Root Mean Square Error
rmse.calculate <- function(x, x.pred){
sqrt(mean((x - x.pred)^2))
}
tissues <- as.character(unique(pData$Var2))
rmse.per.tissue <- sapply(tissues, function(x){rmse.calculate(pData$value[which(pData$Var2 == x)], pData$Predicted[which(pData$Var2 == x)])})
pData$RMSE <- rmse.per.tissue[as.character(pData$Var2)]
pData$RMSE <- paste0("RMSE= ", format(pData$RMSE,digits= 3))
decon.cell.tissue.plot <- ggplot(pData, aes(x= value, y=Predicted))+
geom_point(alpha= 0.9, size=1.5, aes(color= Var2))+
facet_grid(facets = ~Var2+RMSE, scales = "free")+
geom_smooth(method='lm', lwd=0.5,aes(color= Var2, alpha= 0.5))+
ylab("Decon-cell predicted \n tissue proportions")+
xlab("Tissue proportions")+
scale_color_manual(values = ghibli_palette("KikiMedium")[1:5])+
theme_bw()+
theme(text = element_text(family = "Helvetica", size = 10), legend.position = "none")
plot(decon.cell.tissue.plot)
As seen in the plot above, we were able to accurately predict the proportions of the tissues within the bulk expression data by using the models generated with dCell.run.