DeconCell quick start and tutorial

Raúl Aguirre-Gamboa and Niek de Klein
2018-11-29

Introduction

DeconCell is an R package containing models for predicting the proportions of circulating immune cell subpopulations from bulk gene expression data of whole blood. The models were built with an elastic net and trained on 95 healthy Dutch volunteers from the 500FG cohort, with FACS quantification of 73 circulating cell subpopulations, as described in our previous publication. For additional details on methods and results, please see our manuscript.

Install the package from GitHub

library(devtools)
#install_github("molgenis/systemsgenetics/Decon2/DeconCell")

Pre-processing example data

Let's load and pre-process our example data: 5 samples with over ~40k genes quantified. Because these are gene read counts, we need to bring the data closer to a normal-like distribution and account for library sizes. To do this, we use the dCell.expProcessing function. It performs a TMM normalization (as described in the edgeR package), a log2(counts + 1) transformation, and a scaling step (z-transformation) per gene.

library(DeconCell)
library(edgeR)
## Loading required package: limma
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ghibli)


data("count.table")
dCell.exp <- dCell.expProcessing(count.table, trim = TRUE)
## [INFO]    Total of 95.33 % genes from dCell are found
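
To make the pre-processing steps described above more concrete, here is a minimal base R sketch on a made-up count matrix. It is not the code inside dCell.expProcessing: plain library-size scaling to counts per million is used as a simple stand-in for TMM normalization (which edgeR provides), and every object name below is hypothetical.

```r
set.seed(42)

# Toy count matrix: 6 genes (rows) x 5 samples (columns); values are invented
toy.counts <- matrix(rpois(30, lambda = 50), nrow = 6,
                     dimnames = list(paste0("gene_", 1:6),
                                     paste0("sample_", 1:5)))

# Step 1 (stand-in for TMM): scale each sample by its library size
lib.sizes <- colSums(toy.counts)
toy.cpm <- sweep(toy.counts, 2, lib.sizes, "/") * 1e6

# Step 2: log2(x + 1) to compress the dynamic range
toy.log <- log2(toy.cpm + 1)

# Step 3: z-transform per gene; scale() works on columns, so transpose twice
toy.z <- t(scale(t(toy.log)))

# After scaling, every gene has mean ~0 and standard deviation 1 across samples
round(rowMeans(toy.z), 10)
apply(toy.z, 1, sd)
```

The per-gene scaling puts all genes on a comparable scale regardless of their absolute expression level. A gene with identical values in every sample would yield NaN here, so constant genes would need to be removed beforehand.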

Prediction of cell proportions

data("dCell.models")
prediction <- dCell.predict(dCell.exp, dCell.models, res.type = "median")
head(prediction$dCell.prediction)
##          Granulocytes B cells (CD19+) CD4+ T cells CD8+ T cells
## sample_1     44.91095        3.153132     25.36669    10.813036
## sample_2     53.30947        1.956458     18.57895     6.745036
## sample_3     55.40200        2.166409     16.92569     8.206814
## sample_4     46.32392        2.812778     18.36118    10.352024
## sample_5     50.43606        2.802476     23.81135    10.867829
##          DN (CD4- CD8-) NK dim (CD56+ CD16+) Monocytes (CD14+) Lymphocytes
## sample_1       2.004035             5.291112          6.513887    49.23558
## sample_2       2.360521             2.729987          7.173571    39.85655
## sample_3       1.607821             5.548666          6.952072    35.70784
## sample_4       2.220237             4.314080          7.280391    46.70237
## sample_5       1.899609             4.232615          5.949453    44.31698
##          CD45RO- CD45RA+ T cells CD4+ Naive CD45RA+ CD27+
## sample_1                18.01465                12.798075
## sample_2                13.06061                 8.380361
## sample_3                14.06929                 9.410064
## sample_4                11.51714                 6.446478
## sample_5                15.79403                11.325952
##          CD4+ Naive CD45RO- CD27+ Intermediate monocytes (CD14+CD16+)
## sample_1                12.011534                           0.4710940
## sample_2                 8.893022                           0.3789746
## sample_3                 9.919128                           0.5770685
## sample_4                 6.564064                           0.5420663
## sample_5                11.708560                           0.3962083
##          CD8+ Naive CD45RA+ CD27+ CD8+ EM CD45RA- CD27-
## sample_1                 6.073705             0.5267965
## sample_2                 5.313881             0.3584265
## sample_3                 4.973362             0.5235741
## sample_4                 3.487296             0.7166726
## sample_5                 5.746650             0.4761111
##          CD8+ Naive CD45RO- CD27+ IgD+ IgM+  IgD+ IgM- IgD- IgM-
## sample_1                 5.886285 1.2265014 0.10616696 0.3411916
## sample_2                 4.965455 0.5653783 0.09401095 0.2959995
## sample_3                 4.773495 0.7531481 0.09178040 0.2705341
## sample_4                 3.053851 0.9357196 0.10162866 0.3536773
## sample_5                 5.874639 1.0058478 0.10054349 0.3275862
##          NaiveB cells (IgD+ IgM+ CD27-) Memory B cells (IgD+ IgM+ CD27+)
## sample_1                      1.0623573                        0.1551021
## sample_2                      0.4593947                        0.1342676
## sample_3                      0.6334488                        0.1274751
## sample_4                      0.7733437                        0.1677380
## sample_5                      0.8082422                        0.1734491
##          CD24+ CD38+ T cells (CD3+ CD56-)
## sample_1   1.5904003             37.68790
## sample_2   0.8324206             29.79643
## sample_3   1.0491996             27.48733
## sample_4   1.2188883             31.18692
## sample_5   1.3042505             35.74774
##          Transitional B cells (CD24++ CD38++)
## sample_1                           0.04773371
## sample_2                           0.02381854
## sample_3                           0.04111926
## sample_4                           0.04219308
## sample_5                           0.04440786
##          Naive mature B cells (CD24+ CD38+ CD27- IgM+)
## sample_1                                     1.0327650
## sample_2                                     0.4454947
## sample_3                                     0.6089401
## sample_4                                     0.7447791
## sample_5                                     0.7631637
##          CD24+ CD38+ CD27+ IgM+      IgM-
## sample_1              0.1729348 0.3656486
## sample_2              0.1580562 0.2593598
## sample_3              0.1484528 0.2625270
## sample_4              0.1917561 0.3357884
## sample_5              0.1997702 0.2985248
##          Natural effector (CD24+ CD38+ IgD+ IgM+)  NK cells (CD3- CD56+)
## sample_1                                 0.1403407              5.695924
## sample_2                                 0.1235010              3.061471
## sample_3                                 0.1145708              5.912784
## sample_4                                 0.1502340              4.619413
## sample_5                                 0.1576725              4.601317
##          IgD- CD5+ IgD+ CD5+ Prol CD4+ Tconv Prol CD4+ Treg Treg HLA-DR+
## sample_1 0.3566105 1.1540013       0.3985216      0.3633504    0.7115257
## sample_2 0.3483865 0.6902216       0.2966801      0.3701060    0.6368375
## sample_3 0.3125134 0.7004153       0.2959206      0.3084731    0.6737390
## sample_4 0.4003549 0.9634371       0.3237194      0.3684755    1.1603987
## sample_5 0.3578214 0.9811607       0.3869628      0.4391961    0.8005395
head(prediction$Evaluation)
##         Granulocytes      B cells (CD19+)         CD4+ T cells 
##             96.84721             99.00771             96.86988 
##         CD8+ T cells       DN (CD4- CD8-) NK dim (CD56+ CD16+) 
##             97.95798            100.00000             98.99515

Correlation coefficient between predicted and measured values

data("cell.proportions")
library(reshape2)
## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths
library(ggplot2)
data("dCell.names")
pData <- data.frame(PearsonCor= diag(cor(cell.proportions, prediction$dCell.prediction)), 
                    CTs = dCell.names[colnames(cell.proportions), "finalName"], 
                    Subpop = dCell.names[colnames(cell.proportions), "broadSubpopulations"])

ggplot(pData, aes(y=PearsonCor , x= CTs, fill=Subpop))+
  geom_bar(stat="identity", alpha=0.8)+
  geom_hline(yintercept = 0.5, alpha=0.5, color="red")+
  coord_flip()+
  scale_fill_brewer(palette = "Dark2")+
  theme_bw()+
  theme(legend.position = "bottom", text=element_text(size=11, family="Helvetica"))
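
The diag(cor(...)) idiom used above returns one Pearson correlation per cell type only if both matrices hold the same columns in the same order. The following sketch illustrates this on invented data; the matrices and names are hypothetical and unrelated to the package data.

```r
set.seed(7)

# Toy "measured" proportions: 10 samples x 4 cell types (invented values)
measured <- matrix(rnorm(40), nrow = 10,
                   dimnames = list(NULL, c("A", "B", "C", "D")))

# Toy "predicted" values: measured plus noise, same column names
predicted <- measured + matrix(rnorm(40, sd = 0.5), nrow = 10)

# Reorder by name so that column i of both matrices is the same cell type
predicted <- predicted[, colnames(measured)]

# cor() on two matrices gives all pairwise column correlations;
# the diagonal holds the matching pairs
per.column <- diag(cor(measured, predicted))

# The same values computed one column at a time
explicit <- sapply(colnames(measured),
                   function(ct) cor(measured[, ct], predicted[, ct]))

all.equal(unname(per.column), unname(explicit))
```

If the column order differs between the two matrices, the diagonal silently pairs unrelated cell types, so matching by name before calling cor() is a cheap safeguard.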

Generate DeconCell models for predicting new cell proportions using gene expression data.

An important feature of DeconCell is its capacity to generate novel models that can later predict the proportions of cell types within a bulk tissue using solely the gene expression of the bulk tissue itself. To illustrate this, we will make use of the publicly available data from the DeconRNASeq package. As this package states:

"Our demo uses a simulated example data set, which can be accessed using the code given below"

library(DeconRNASeq)
## Loading required package: limSolve

## 
## Attaching package: 'limSolve'

## The following object is masked from 'package:ggplot2':
## 
##     resolution

## Loading required package: pcaMethods

## Loading required package: Biobase

## Loading required package: BiocGenerics

## Loading required package: parallel

## 
## Attaching package: 'BiocGenerics'

## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB

## The following objects are masked from 'package:dplyr':
## 
##     combine, intersect, setdiff, union

## The following object is masked from 'package:limma':
## 
##     plotMA

## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs

## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind,
##     colMeans, colnames, colSums, dirname, do.call, duplicated,
##     eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
##     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
##     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
##     table, tapply, union, unique, unsplit, which, which.max,
##     which.min

## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.

## 
## Attaching package: 'pcaMethods'

## The following object is masked from 'package:stats':
## 
##     loadings

## Loading required package: grid
data(multi_tissue)

## remove columns that are not needed.
datasets <- x.data[,2:11]
signatures <- x.signature.filtered.optimal[,2:6]
proportions <- fraction
exp <- datasets

As the package indicates, this is "real data" that has been mixed in silico, so the proportions of each of the different cell types composing the "bulk" expression are known:

"For the mixtures, there are 28745 genes. And we have 10 samples. In silico mixed data were simulated using ([2]) data, with disparate proportions drawn from random numbers. The mixing proportions used by each type of tissue are shown in the following. It should also be noted that we investigated the influence of extremely low numbers of contaminating cell types (<2 percent)."

We will run DeconCell for each of the proportions, using 60% of the samples (6 of 10) to train our models.

set.seed(1121)
sampled.train <- sample(colnames(exp), size = 6, replace = FALSE)
# use the rest of the samples for testing the models
sampled.test <- colnames(exp)[!(colnames(exp) %in% sampled.train)]

new.dCell.models <- dCell.run(exp = exp[,sampled.train], 
                              proportions = proportions[sampled.train,], 
                              iterations = 5)
## INFO  A total of  5 cell types are considered 
## INFO Samples indexed 
## 
## INFO  Starting eNet iterations 
## 
## INFO  Starting eNet iterations 
## 
## INFO  Starting eNet iterations 
## 
## INFO  Starting eNet iterations 
## 
## INFO  Starting eNet iterations
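
The call above selects training samples by column in the expression matrix and by row in the proportions matrix, so both need to carry the same sample names. As a hedged illustration on invented data (all object names here are hypothetical), the split and the alignment can be checked like this:

```r
set.seed(1121)

# Toy expression matrix: genes in rows, samples in columns (invented values)
toy.exp <- matrix(rnorm(50), nrow = 5,
                  dimnames = list(paste0("gene_", 1:5), paste0("s", 1:10)))

# Toy proportions: samples in rows, cell types in columns (invented values)
toy.props <- matrix(runif(30), nrow = 10,
                    dimnames = list(paste0("s", 1:10),
                                    c("ct_1", "ct_2", "ct_3")))

# 60/40 split by sample name
train <- sample(colnames(toy.exp), size = 6, replace = FALSE)
test  <- setdiff(colnames(toy.exp), train)

# Subset both objects with the same name vector to keep them aligned
exp.train   <- toy.exp[, train]
props.train <- toy.props[train, ]

# Both subsets now list the training samples in the same order
identical(colnames(exp.train), rownames(props.train))
# Train and test sets do not overlap
length(intersect(train, test)) == 0
```

Indexing by name rather than by position keeps the two objects consistent even when their original sample order differs.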

Testing the prediction of the new models.

Now we will use the dCell.predict function with the newly created models to predict the proportions in the held-out test set (sampled.test). In the DeconRNASeq vignette, the authors use the Root Mean Square Error (RMSE), which is the standard deviation of the residuals, as a measure of prediction performance.

test.prediction <- dCell.predict(exp[,sampled.test],
                                 dCell.models= new.dCell.models$deconCell.models.per.CT, 
                                 res.type = "median", custom = TRUE)
# we use custom=TRUE to keep the original names of the proportions.

# reshape the data for plotting 
pData <- reshape2::melt(as.matrix(proportions[sampled.test,]))
pData$Predicted <- reshape2::melt(as.matrix(test.prediction$dCell.prediction))$value

## Function to calculate the Root Mean Square Error
rmse.calculate <- function(x, x.pred){
  sqrt(mean((x - x.pred)^2))
}

tissues <- as.character(unique(pData$Var2))
rmse.per.tissue <- sapply(tissues, function(x) {
  idx <- pData$Var2 == x
  rmse.calculate(pData$value[idx], pData$Predicted[idx])
})

pData$RMSE <- rmse.per.tissue[as.character(pData$Var2)]
pData$RMSE <- paste0("RMSE= ", format(pData$RMSE,digits= 3))

decon.cell.tissue.plot <- ggplot(pData, aes(x= value, y=Predicted))+
                          geom_point(alpha= 0.9, size=1.5, aes(color= Var2))+
                          facet_grid(facets = ~Var2+RMSE, scales = "free")+
                          geom_smooth(method = 'lm', lwd = 0.5, alpha = 0.5, aes(color = Var2))+
                          ylab("Decon-cell predicted \n tissue proportions")+
                          xlab("Tissue proportions")+
                          scale_color_manual(values = ghibli_palette("KikiMedium")[1:5])+
                          theme_bw()+
                          theme(text = element_text(family = "Helvetica", size = 10), legend.position = "none")


plot(decon.cell.tissue.plot)

As seen in the plot above, we have been able to accurately predict the proportions of the tissues within the bulk expression data by using the models generated with dCell.run.
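
To make the RMSE used in this section more tangible, here is a small worked example on invented numbers. The helper repeats the formula of rmse.calculate; the vectors are hypothetical. It also shows in what sense the RMSE is a standard deviation of the residuals: the two agree when the residuals average to zero and the variance is divided by n rather than n - 1.

```r
# Same formula as rmse.calculate above
rmse <- function(x, x.pred) sqrt(mean((x - x.pred)^2))

# Invented measured and predicted proportions for four samples
measured  <- c(0.10, 0.20, 0.30, 0.40)
predicted <- c(0.12, 0.18, 0.33, 0.37)

# Residuals: -0.02, 0.02, -0.03, 0.03 (their mean is zero)
res <- measured - predicted

# Squared residuals average to 0.00065, so the RMSE is about 0.0255
rmse(measured, predicted)

# Standard deviation of the residuals with an n denominator
pop.sd <- sqrt(mean((res - mean(res))^2))
all.equal(rmse(measured, predicted), pop.sd)

# R's sd() divides by n - 1, so it is slightly larger (about 0.0294)
sd(res)
```

When predictions are systematically too high or too low, the residual mean is no longer zero and the RMSE exceeds the residual standard deviation, because it also captures that bias.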