fgsea is an R package for fast preranked gene set enrichment analysis (GSEA). Its speed comes from an algorithm that calculates the GSEA statistic cumulatively, which allows random samples to be reused across gene sets of different sizes. See the preprint for algorithmic details.
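As a rough illustration of the statistic being computed, here is a minimal base R sketch of the standard preranked GSEA running-sum enrichment score. It is not fgsea's cumulative algorithm, only the textbook definition for a single gene set. The function name `enrichmentScore` and the toy data are hypothetical and introduced purely for illustration.

```r
# Toy sketch of the preranked GSEA running-sum statistic (NOT fgsea's internals).
# `stats` is a named numeric vector; `geneSet` is a character vector of gene names.
enrichmentScore <- function(stats, geneSet, gseaParam = 1) {
  stats <- sort(stats, decreasing = TRUE)       # rank genes from high to low
  inSet <- names(stats) %in% geneSet            # TRUE where a ranked gene is a hit
  stopifnot(any(inSet), !all(inSet))
  weights <- abs(stats)^gseaParam
  hitStep <- ifelse(inSet, weights / sum(weights[inSet]), 0)  # step up on hits
  missStep <- ifelse(inSet, 0, 1 / sum(!inSet))               # step down on misses
  running <- cumsum(hitStep - missStep)         # running-sum walk along the ranking
  running[which.max(abs(running))]              # largest deviation from zero
}

# Made-up statistics for six hypothetical genes.
toyStats <- c(g1 = 3, g2 = 2, g3 = 1, g4 = -1, g5 = -2, g6 = -3)
esTop <- enrichmentScore(toyStats, c("g1", "g2"))     # set at the top of the ranking
esBottom <- enrichmentScore(toyStats, c("g5", "g6"))  # set at the bottom
```

A set concentrated at the top of the ranking gets a positive score, and a set concentrated at the bottom gets a negative one.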
Loading the package, example pathways, and gene-level statistics:
library(fgsea)
library(data.table)  # fgsea() returns a data.table
data(examplePathways)
data(exampleRanks)
Running fgsea:
fgseaRes <- fgsea(pathways = examplePathways,
                  stats = exampleRanks,
                  minSize = 15,
                  maxSize = 500,
                  nperm = 10000)
The resulting table contains enrichment scores and p-values:
head(fgseaRes[order(pval), ])
##                                 pathway         pval        padj        ES
## 1:                  5990980_Cell_Cycle 0.0001241157 0.001931221 0.5388497
## 2:         5990979_Cell_Cycle,_Mitotic 0.0001270810 0.001931221 0.5594755
## 3:    5991210_Signaling_by_Rho_GTPases 0.0001328904 0.001931221 0.4238512
## 4:                     5991454_M_Phase 0.0001379501 0.001931221 0.5576247
## 5: 5991023_Metabolism_of_carbohydrates 0.0001393534 0.001931221 0.4944766
## 6:        5991209_RHO_GTPase_Effectors 0.0001396258 0.001931221 0.5248796
##         NES nMoreExtreme size                             leadingEdge
## 1: 2.678919            0  369   66336,66977,12442,107995,66442,19361,
## 2: 2.742692            0  317   66336,66977,12442,107995,66442,12571,
## 3: 2.011171            0  231 66336,66977,20430,104215,233406,107995,
## 4: 2.552523            0  173   66336,66977,12442,107995,66442,52276,
## 5: 2.237706            0  160    11676,21991,15366,58250,12505,20527,
## 6: 2.368627            0  157 66336,66977,20430,104215,233406,107995,
It takes about ten seconds to get results with significant hits after FDR correction:
sum(fgseaRes[, padj < 0.01])
## [1] 76
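To make the FDR correction step concrete, here is a small base R sketch of how adjusted p-values relate to raw ones. It assumes the Benjamini-Hochberg procedure, which is the adjustment the fgsea documentation describes for the padj column. The p-values and the 0.04 cutoff are made up for illustration.

```r
# Made-up raw p-values for five hypothetical pathways.
toyPvals <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# Benjamini-Hochberg adjustment with base R's p.adjust().
toyPadj <- p.adjust(toyPvals, method = "BH")

# Count pathways that pass an (arbitrary) adjusted threshold.
nSignificant <- sum(toyPadj < 0.04)
```

Adjusted values are never smaller than the raw ones, so fewer pathways pass a given cutoff after correction.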
One can make an enrichment plot for a pathway:
library(ggplot2)  # provides labs()
plotEnrichment(examplePathways[["5991130_Programmed_Cell_Death"]],
               exampleRanks) + labs(title="Programmed Cell Death")
Or make a table plot for a bunch of selected pathways:
topPathwaysUp <- fgseaRes[ES > 0][head(order(pval), n=10), pathway]
topPathwaysDown <- fgseaRes[ES < 0][head(order(pval), n=10), pathway]
topPathways <- c(topPathwaysUp, rev(topPathwaysDown))
plotGseaTable(examplePathways[topPathways], exampleRanks, fgseaRes,
              gseaParam = 0.5)
Please be aware that the fgsea function takes about O(nk^{3/2}) time, where n is the number of permutations and k is the maximal pathway size. Setting the maxSize parameter to a value of about 500 is therefore strongly recommended.
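The effect of this scaling can be estimated with a little arithmetic. The sketch below takes the O(nk^{3/2}) estimate at face value and compares relative costs. Constant factors are unknown, so only the ratios mean anything, and the helper name `relativeCost` is hypothetical.

```r
# Relative cost under the n * k^(3/2) estimate; absolute values are meaningless.
relativeCost <- function(nperm, maxSize) nperm * maxSize^1.5

baseline <- relativeCost(nperm = 10000, maxSize = 500)

# Raising maxSize from 500 to 2000 (4x) costs about 4^1.5 = 8 times more.
ratioLargerSets <- relativeCost(nperm = 10000, maxSize = 2000) / baseline

# Raising nperm from 10000 to 40000 (4x) costs about 4 times more.
ratioMorePerms <- relativeCost(nperm = 40000, maxSize = 500) / baseline
```

Under this estimate, increasing the maximal pathway size is more expensive than increasing the number of permutations by the same factor.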
Also, fgsea is parallelized using the BiocParallel package. By default, the first registered backend returned by bpparam() is used. To tweak the parallelization, one can either specify the BPPARAM parameter used for bplapply or set the nproc parameter, which is shorthand for setting BPPARAM=MulticoreParam(workers = nproc).
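The two ways of controlling parallelization might look as follows. This is a sketch that assumes fgsea and BiocParallel are installed and that fgsea accepts the nperm, nproc and BPPARAM arguments used elsewhere in this document. The worker count of 2 and nperm = 100 are arbitrary choices made to keep the example quick.

```r
library(fgsea)
library(BiocParallel)

data(examplePathways)
data(exampleRanks)

# Option 1: the nproc shorthand (2 workers is an arbitrary choice).
resNproc <- fgsea(examplePathways, exampleRanks,
                  minSize = 15, maxSize = 500, nperm = 100, nproc = 2)

# Option 2: pass a BiocParallel backend explicitly. SerialParam() runs
# everything in the current process, which can help when debugging.
resSerial <- fgsea(examplePathways, exampleRanks,
                   minSize = 15, maxSize = 500, nperm = 100,
                   BPPARAM = SerialParam())
```

Both calls analyze the same pathways, so the result tables have the same number of rows. The p-values can differ because they are estimated from random permutations.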
For convenience, there is a reactomePathways function that obtains pathways from Reactome for a given set of genes. The reactome.db package must be installed.
pathways <- reactomePathways(names(exampleRanks))
fgseaRes <- fgsea(pathways, exampleRanks, nperm=1000, maxSize=500)
head(fgseaRes)
##                                                      pathway        pval
## 1:                                  Interleukin-6 signaling 0.002040816
## 2:                                                Apoptosis 0.001557632
## 3:                                               Hemostasis 0.006613757
## 4:                          Intrinsic Pathway for Apoptosis 0.003521127
## 5: Cleavage of Growing Transcript in the Termination Region 0.482014388
## 6:                                      PKB-mediated events 0.478417266
##          padj         ES        NES nMoreExtreme size
## 1: 0.03105125 -0.8129902 -1.8460927            0    6
## 2: 0.03105125  0.5237963  2.0569452            0   66
## 3: 0.06861772  0.2985258  1.4202376            4  257
## 4: 0.04764410  0.6872693  2.2375711            1   28
## 5: 0.78769211 -0.2451371 -0.9957147          200   44
## 6: 0.78458331  0.3248924  1.0119083          265   24
##                              leadingEdge
## 1:               20848,12402,16195,16194
## 2:  58801,14958,97165,22352,12043,14103,
## 3:  71946,16184,14062,16185,22339,20720,
## 4:  58801,12043,12367,14940,14942,12018,
## 5: 54451,67337,66118,433702,54196,53817,
## 6: 13685,66508,54170,105787,13631,11651,
One can also start from .rnk and .gmt files, as in the original GSEA:
rnk.file <- system.file("extdata", "naive.vs.th1.rnk", package="fgsea")
gmt.file <- system.file("extdata", "mouse.reactome.gmt", package="fgsea")
Loading ranks:
ranks <- read.table(rnk.file,
                    header=TRUE, colClasses = c("character", "numeric"))
ranks <- setNames(ranks$t, ranks$ID)
str(ranks)
## Named num [1:12000] -63.3 -49.7 -43.6 -41.5 -33.3 ...
## - attr(*, "names")= chr [1:12000] "170942" "109711" "18124" "12775" ...
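The same loading pattern can be tried on a tiny hand-made file. The sketch below assumes a tab-separated file with a header row and the columns ID and t, matching the column names used in the code above. The statistic values are made up.

```r
# Write a tiny ranks file: header row, then gene ID and statistic per line.
toyRnk <- tempfile(fileext = ".rnk")
writeLines(c("ID\tt",
             "170942\t-2.5",
             "109711\t0.3",
             "18124\t4.1"), toyRnk)

# Read IDs as character so numeric-looking gene IDs are not converted.
toyRanks <- read.table(toyRnk, header = TRUE,
                       colClasses = c("character", "numeric"))

# Turn the two columns into a named numeric vector, the form fgsea expects.
toyRanks <- setNames(toyRanks$t, toyRanks$ID)
```

Reading the ID column as character matters because the gene identifiers double as vector names and must match the identifiers in the pathway lists.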
Loading pathways:
pathways <- gmtPathways(gmt.file)
str(head(pathways))
## List of 6
## $ 1221633_Meiotic_Synapsis : chr [1:64] "12189" "13006" "15077" "15078" ...
## $ 1368092_Rora_activates_gene_expression : chr [1:9] "11865" "12753" "12894" "18143" ...
## $ 1368110_Bmal1:Clock,Npas2_activates_circadian_gene_expression : chr [1:16] "11865" "11998" "12753" "12952" ...
## $ 1445146_Translocation_of_Glut4_to_the_Plasma_Membrane : chr [1:55] "11461" "11465" "11651" "11652" ...
## $ 186574_Endocrine-committed_Ngn3+_progenitor_cells : chr [1:4] "18012" "18088" "18506" "53626"
## $ 186589_Late_stage_branching_morphogenesis_pancreatic_bud_precursor_cells: chr [1:4] "11925" "15205" "21410" "246086"
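For readers unfamiliar with the .gmt format, the following base R sketch shows how such a file maps onto a named list like the one printed above. It assumes the usual GMT layout: one gene set per line, tab-separated, with the set name first, a description second, and gene IDs after that. It is an illustration of the format, not the implementation of gmtPathways, and the set names are hypothetical.

```r
# Write a tiny GMT file with two made-up gene sets.
toyGmt <- tempfile(fileext = ".gmt")
writeLines(c("setA\tdescription of set A\t11865\t12753\t12894",
             "setB\tdescription of set B\t18012\t18088"), toyGmt)

# Split each line on tabs.
fields <- strsplit(readLines(toyGmt), "\t", fixed = TRUE)

# Drop the name and description fields, keeping only the gene IDs.
toyPathways <- lapply(fields, function(x) x[-(1:2)])

# Use the first field of each line as the list name.
names(toyPathways) <- vapply(fields, `[`, character(1), 1)
```

The result is a named list of character vectors, the same shape as the pathways object used in this document.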
And running fgsea:
fgseaRes <- fgsea(pathways, ranks, minSize=15, maxSize=500, nperm=1000)
head(fgseaRes)
##                                                                                     pathway
## 1:                                                                1221633_Meiotic_Synapsis
## 2:                                   1445146_Translocation_of_Glut4_to_the_Plasma_Membrane
## 3: 442533_Transcriptional_Regulation_of_Adipocyte_Differentiation_in_3T3-L1_Pre-adipocytes
## 4:                                                                  508751_Circadian_Clock
## 5:                                               5334727_Mus_musculus_biological_processes
## 6:                                        573389_NoRC_negatively_regulates_rRNA_expression
##         pval      padj         ES        NES nMoreExtreme size
## 1: 0.5361345 0.7124806  0.2885754  0.9355472          318   27
## 2: 0.6748366 0.8222483  0.2387284  0.8431415          412   39
## 3: 0.1185185 0.2776338 -0.3640706 -1.3350019           47   31
## 4: 0.7902946 0.8808833  0.2516324  0.7290777          455   17
## 5: 0.3795181 0.5866431  0.2469065  1.0477230          251  106
## 6: 0.4142114 0.6266551  0.3607407  1.0452072          238   17
##                              leadingEdge
## 1:               15270,12189,71846,19357
## 2:  17918,19341,20336,22628,22627,20619,
## 3: 20602,327987,59024,67381,70208,12537,
## 4:                     20893,59027,19883
## 5:  60406,19361,15270,20893,12189,68240,
## 6:              60406,20018,245688,20017