sleuth is a tool for the analysis and comparison of multiple related RNA-Seq experiments.
To use sleuth, RNA-Seq data must first be quantified with kallisto, which is a program for very fast RNA-Seq quantification based on pseudo-alignment. An important feature of kallisto is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. kallisto can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. sleuth has also been designed to be lightweight and fast, and therefore RNA-Seq analysis with kallisto and sleuth is tractable on a laptop computer in a matter of minutes.
The model sleuth uses for performing differential analysis is a general linear model with error in the response. Formally, in the case of two conditions being assayed, for a transcript \(t\) in a sample \(i\), the (log) “true” unobserved abundance \(y_{t,i}\), measured in read counts, is modeled by
\[ y_{t,i} = \beta_{t,0} + \beta_{t,1} x_{t,i} + \epsilon_{t,i} \]
where \(x_{t,i}\) is an indicator variable describing the condition, \(\beta_{t,0}\) and \(\beta_{t,1}\) are parameters of the model, and \(\epsilon_{t,i}\) is biological “noise”. However, conditioned on \(x_{t,i}\) and \(y_{t,i}\), the estimated number of counts from the observations in the experiment is given by
\[ d_{t,i} = y_{t,i} + \zeta_{t,i} \]
where \(\zeta_{t,i}\) represents technical “noise”, i.e. uncertainty in the measurement due to effects other than biological variability. The sleuth model incorporates the assumptions that the expectation \(E(\zeta_{t,i}|y_{t,i}) = 0\), that \(E(d_{t,i}) = \beta_{t,0} + \beta_{t,1} x_{t,i}\), and that the response error is additive, i.e. if the biological variance \(V(\epsilon_{t,i}) = \sigma_t^2\) and the technical variance \(V(\zeta_{t,i}|y_{t,i}) = \tau_t^2\), then the variance \(V(d_{t,i}) = \sigma_t^2 + \tau_t^2\). sleuth makes use of the bootstraps from kallisto to estimate the \(\tau_t^2\), and after subtracting the estimated technical variance the \(\sigma_t^2\) are estimated via a shrinkage procedure similar to that used in limma voom.
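The additive variance formula is an instance of the law of total variance: under the stated assumptions \(E(d_{t,i} \mid y_{t,i}) = y_{t,i}\) and \(V(d_{t,i} \mid y_{t,i}) = \tau_t^2\), so
\[ V(d_{t,i}) = V\big(E(d_{t,i} \mid y_{t,i})\big) + E\big(V(d_{t,i} \mid y_{t,i})\big) = V(y_{t,i}) + \tau_t^2 = \sigma_t^2 + \tau_t^2. \]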
More generally, sleuth can be used to analyze multiple conditions organized in complex experimental designs. In the general case the unobserved model is given by
\[ Y_t = X_t \beta_t + \epsilon_t \]
where \(t\) is a transcript, \(Y_t\) is a vector of length \(n\) where \(n\) is the number of samples, \(X_t\) is an \(n \times p\) matrix where \(p\) is the number of covariates, \(\beta_t\) represents the effect sizes and is a vector of length \(p\), and \(\epsilon_t\) is a vector of length \(n\). The observed response is described by
\[ D_t = Y_t + \zeta_t \]
where \(\zeta_t\) and \(D_t\) are vectors of length \(n\).
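As a concrete illustration of this notation, in the two-condition design described above with three samples per condition (\(n = 6\), \(p = 2\)), the design matrix consists of an intercept column and a condition indicator column:
\[ X_t = \begin{pmatrix} 1 & 0 \\ 1 & 0 \\ 1 & 0 \\ 1 & 1 \\ 1 & 1 \\ 1 & 1 \end{pmatrix}, \qquad \beta_t = \begin{pmatrix} \beta_{t,0} \\ \beta_{t,1} \end{pmatrix}, \]
so that row \(i\) of \(X_t \beta_t\) recovers \(\beta_{t,0} + \beta_{t,1} x_{t,i}\) from the two-condition model.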
sleuth has been designed to facilitate the exploration of RNA-Seq data by utilizing the Shiny web application framework by RStudio. The worked example below illustrates how to load data into sleuth and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully “open source”.
To install sleuth start R and first install rhdf5
by typing:
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
Then install devtools by typing
install.packages("devtools")
and install sleuth by typing
devtools::install_github("pachterlab/sleuth")
Next load sleuth with
library("sleuth")
To explain how to use sleuth we provide an example based on the data in the “Cuffdiff2 paper”.
The human fibroblast RNA-Seq data for the paper is available on GEO at accession GSE37704. The samples to be analyzed are the six samples LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repB, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HOXA1 knockdown) that will be compared with sleuth.
To analyze the data, first download the raw reads, install kallisto, and then quantify the data with bootstraps as described here. This step can be skipped for the purposes of the vignette by downloading the kallisto processed data directly by clicking here.
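For reference, the quantification commands look roughly as follows (a sketch, not the exact commands used to produce the example data; the transcriptome file, index name, output paths, and bootstrap count are placeholders to adapt to your setup):

```shell
# Build the transcriptome index (done once per transcriptome).
kallisto index -i transcripts.idx transcripts.fa.gz

# Quantify one sample, generating 100 bootstraps for sleuth
# (-b controls the number of bootstrap samples).
kallisto quant -i transcripts.idx -o results/SRR493366/kallisto -b 100 \
  SRR493366_1.fastq.gz SRR493366_2.fastq.gz
```

Each sample is quantified into its own output directory so that the resulting layout matches the one assumed below.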
The first step in a sleuth analysis is to specify where the kallisto results are stored. Begin by storing the base directory of the results in a variable,
base_dir <- "~/Downloads/cuffdiff2_data_kallisto_results"
Next get the list of sample IDs with
sample_id <- dir(file.path(base_dir, "results"))
The result can be displayed by typing
sample_id
## [1] "SRR493366" "SRR493367" "SRR493368" "SRR493369" "SRR493370" "SRR493371"
In the box above, lines beginning with ## show the output of the command (in what follows we include the output that should appear with each command).
A list of paths to the kallisto results indexed by the sample IDs is collated with
kal_dirs <- sapply(sample_id, function(id) file.path(base_dir, "results", id, "kallisto"))
kal_dirs
## SRR493366
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493366/kallisto"
## SRR493367
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493367/kallisto"
## SRR493368
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493368/kallisto"
## SRR493369
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493369/kallisto"
## SRR493370
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493370/kallisto"
## SRR493371
## "~/Downloads/cuffdiff2_data_kallisto_results/results/SRR493371/kallisto"
The next step is to load an auxiliary table that describes the experimental design and the relationship between the kallisto directories and the samples:
s2c <- read.table(file.path(base_dir, "hiseq_info.txt"), header = TRUE, stringsAsFactors = FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c
## sample condition
## 1 SRR493366 scramble
## 2 SRR493367 scramble
## 3 SRR493368 scramble
## 4 SRR493369 HOXA1KD
## 5 SRR493370 HOXA1KD
## 6 SRR493371 HOXA1KD
Now the “sleuth object” can be constructed. This requires three commands that (1) load the kallisto processed data into the object, (2) estimate parameters for the sleuth response error measurement model, and (3) perform differential analysis (testing). On a laptop the three steps should take about 2 minutes altogether.
First type
so <- sleuth_prep(kal_dirs, s2c, ~ condition)
## reading in kallisto results
## ......
## normalizing est_counts
## 42193 targets passed the filter
## normalizing tpm
## normalizing bootstrap samples
then
so <- sleuth_fit(so)
## summarizing bootstraps
## fitting measurement error models
## shrinkage estimation
## computing variance of betas
and finally
so <- sleuth_test(so, which_beta = 'conditionscramble')
In general, one can see the possible tests that could be performed by using the which_beta parameter in sleuth_test and examining the coefficients:
models(so)
## [ full ]
## formula: ~condition
## coefficients:
## (Intercept)
## conditionscramble
## tests:
## conditionscramble
At this point the sleuth object constructed from the kallisto runs has information about the data, the experimental design, the kallisto estimates, the model fit, and the testing. In other words it contains the entire analysis of the data. There is, however, one optional piece of information that can be useful to add: when reading the kallisto output sleuth has no information about genes, but this can be added, allowing for searching and analysis by gene instead of by transcript.
Since the example was constructed with the ENSEMBL human transcriptome, we will add gene names from ENSEMBL using biomaRt (there are other ways to do this as well):
First, install biomaRt with
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
Then collect gene names with
mart <- biomaRt::useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
## Creating a generic function for 'nchar' from package 'base' in package 'S4Vectors'
and add them into the sleuth table with
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
"external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(kal_dirs, s2c, ~ condition, target_mapping = t2g)
so <- sleuth_fit(so)
so <- sleuth_test(so, which_beta = 'conditionscramble')
This addition of metadata to transcript IDs is very general, and can be used to add in other information.
The best way to view the results is to generate the Shiny webpage that allows for exploratory data analysis:
sleuth_live(so)
To generate a table of results for analysis within R type
results_table <- sleuth_results(so, 'conditionscramble')