This document explores using Hail 0.2 with R via basilisk. The computations follow the GWAS tutorial in the hail documentation. We do not reproduce all the computations there, and we add some material on R-python interfacing. Note that the actual computations on large data are done in Spark, but we do not interact directly with Spark at any point in this document.
Most of the computations are done via reticulate calls to python; access to the hail environment is provided through basilisk. We also take advantage of R markdown's capacity to execute python code directly. If an R chunk computes x, a python chunk can refer to it as r.x. If a python chunk computes r.x, an R chunk can refer to this value as x.
BiocHail
BiocHail should be installed as follows:
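A sketch using the standard Bioconductor installer:
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("BiocHail")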
As of version 1.0.0, a JDK for Java version <= 11.0 is necessary to use the version of Hail that is installed with the package. This package should be usable on MacOS with suitable Java support. If Java version >= 8.x is used, warnings from Apache Spark may be observed. To the best of our knowledge, the conditions to which the warnings pertain do not affect program performance.
In this section we import the 1000 genomes VCF slice distributed by the hail project. hail_init uses basilisk, which ensures that a specific version of hail and its dependencies are available in an isolated virtual environment.
Here is a curiosity of R-hail interaction. Note that the following chunk computes mt, a MatrixTable representation of the 1000 genomes data, but our attempt to print it in markdown fails.
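A minimal sketch of that chunk (assuming the package helpers hail_init and get_1kg; get_1kg is taken here to cache and return the tutorial data):
library(BiocHail)
hl <- hail_init()   # basilisk-provisioned hail module reference
mt <- get_1kg(hl)   # assumption: BiocHail helper returning the 1kg MatrixTable
mt                  # autoprinting yields only the python object reference
print(mt$show())    # show() writes to the python console and returns NULL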
## <hail.matrixtable.MatrixTable object at 0x7fbd5a37c4c0>
## NULL
We can use python syntax in a python R markdown chunk to see what we want. We use the prefix r. to find references defined in our R session (in which the vignette is compiled).
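For example, the display below can be produced by a chunk along these lines (a sketch following the hail tutorial):
# python chunk!
r.mt.rows().select().show(5)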
## +---------------+------------+
## | locus | alleles |
## +---------------+------------+
## | locus<GRCh37> | array<str> |
## +---------------+------------+
## | 1:904165 | ["G","A"] |
## | 1:909917 | ["G","A"] |
## | 1:986963 | ["C","T"] |
## | 1:1563691 | ["T","G"] |
## | 1:1707740 | ["T","G"] |
## +---------------+------------+
## showing top 5 rows
The sample IDs:
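These can be shown analogously (a sketch):
# python chunk!
r.mt.s.show(5)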
## +-----------+
## | s |
## +-----------+
## | str |
## +-----------+
## | "HG00096" |
## | "HG00099" |
## | "HG00105" |
## | "HG00118" |
## | "HG00129" |
## +-----------+
## showing top 5 rows
Some methods return data immediately useful in R.
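For example, the MatrixTable count method returns a python tuple, which reticulate converts to an R list (a sketch):
mt$count()   # list of (number of variants, number of samples)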
## [[1]]
## [1] 10961
##
## [[2]]
## [1] 284
We can thus define a function dim that behaves in a familiar way with hail MatrixTable instances, along with some others.
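A minimal sketch of dim, assuming S3 dispatch on the class that reticulate assigns to the MatrixTable reference:
dim.hail.matrixtable.MatrixTable <- function(x) {
  unlist(x$count())   # (variants, samples) as a numeric vector
}
dim(mt)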
## [1] 10961 284
# S3 methods so ncol() and nrow() also dispatch on MatrixTable references
ncol.hail.matrixtable.MatrixTable <- function(x) {
  dim(x)[2]
}
nrow.hail.matrixtable.MatrixTable <- function(x) {
  dim(x)[1]
}
nrow(mt)
## [1] 10961
These can be useful on their own, or when calling python methods.
column fields
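The sample-level annotations can be read with hail's import_table and keyed by sample identifier (a sketch; anno_path is a hypothetical local path to the tutorial's 1kg_annotations.txt):
tab <- hl$import_table(anno_path, impute = TRUE)$key_by("Sample")  # anno_path: hypothetical
tab$describe()
tab$show()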
## ----------------------------------------
## Global fields:
## None
## ----------------------------------------
## Row fields:
## 'Sample': str
## 'Population': str
## 'SuperPopulation': str
## 'isFemale': bool
## 'PurpleHair': bool
## 'CaffeineConsumption': int32
## ----------------------------------------
## Key: ['Sample']
## ----------------------------------------
## +-----------+------------+-----------------+----------+------------+---------------------+
## | Sample | Population | SuperPopulation | isFemale | PurpleHair | CaffeineConsumption |
## +-----------+------------+-----------------+----------+------------+---------------------+
## | str | str | str | bool | bool | int32 |
## +-----------+------------+-----------------+----------+------------+---------------------+
## | "HG00096" | "GBR" | "EUR" | False | False | 4 |
## | "HG00097" | "GBR" | "EUR" | True | True | 4 |
## | "HG00098" | "GBR" | "EUR" | False | False | 5 |
## | "HG00099" | "GBR" | "EUR" | True | False | 4 |
## | "HG00100" | "GBR" | "EUR" | True | False | 5 |
## | "HG00101" | "GBR" | "EUR" | False | True | 1 |
## | "HG00102" | "GBR" | "EUR" | True | True | 6 |
## | "HG00103" | "GBR" | "EUR" | False | True | 5 |
## | "HG00104" | "GBR" | "EUR" | True | False | 5 |
## | "HG00105" | "GBR" | "EUR" | False | False | 4 |
## +-----------+------------+-----------------+----------+------------+---------------------+
## showing top 10 rows
We combine the tab defined above with the MatrixTable instance, using python code that reaches into R via the r. prefix.
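A sketch following the tutorial:
# python chunk!
r.mt = r.mt.annotate_cols(pheno = r.tab[r.mt.s])
r.mt.col.describe()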
## --------------------------------------------------------
## Type:
## struct {
## s: str,
## pheno: struct {
## Population: str,
## SuperPopulation: str,
## isFemale: bool,
## PurpleHair: bool,
## CaffeineConsumption: int32
## }
## }
## --------------------------------------------------------
## Source:
## <hail.matrixtable.MatrixTable object at 0x7fbd601f53d0>
## Index:
## ['column']
## --------------------------------------------------------
Aggregation methods can be used to obtain contingency tables or descriptive statistics.
First, we get the frequencies of superpopulation membership:
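One way to do this (a sketch; the counter aggregator returns a python dict, which reticulate converts to a named list):
mt$aggregate_cols(hl$agg$counter(mt$pheno$SuperPopulation))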
## $AFR
## [1] 76
##
## $AMR
## [1] 34
##
## $EAS
## [1] 72
##
## $EUR
## [1] 47
##
## $SAS
## [1] 55
Then statistics on caffeine consumption:
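A sketch; the stats aggregator returns a Struct whose fields and methods are visible via names:
st <- mt$aggregate_cols(hl$agg$stats(mt$pheno$CaffeineConsumption))
names(st)
st$mean
st$stdev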
## [1] "annotate" "drop" "get" "items" "keys" "max"
## [7] "mean" "min" "n" "select" "stdev" "sum"
## [13] "values"
## [1] 4.415493
## [1] 1.577763
The significance of the aggregation functions is that the computations are performed by Spark, on potentially huge distributed data structures.
Now we aggregate over rows (SNPs). We’ll use python directly:
from pprint import pprint # python chunk!
snp_counts = r.mt.aggregate_rows(
    r.hl.agg.counter(r.hl.Struct(ref=r.mt.alleles[0], alt=r.mt.alleles[1])))
pprint(snp_counts)
## {Struct(ref='A', alt='C'): 454,
## Struct(ref='T', alt='C'): 1879,
## Struct(ref='C', alt='G'): 150,
## Struct(ref='C', alt='A'): 496,
## Struct(ref='A', alt='T'): 76,
## Struct(ref='A', alt='G'): 1944,
## Struct(ref='G', alt='A'): 2387,
## Struct(ref='T', alt='A'): 79,
## Struct(ref='T', alt='G'): 468,
## Struct(ref='C', alt='T'): 2436,
## Struct(ref='G', alt='T'): 480,
## Struct(ref='G', alt='C'): 112}
Hail uses the concept of ‘entries’ for matrix elements, and each ‘entry’ is a ‘struct’ with potentially many fields.
Here we’ll use R to compute a histogram of sequencing depths over all samples and variants.
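A sketch following the tutorial's hl.agg.hist call, binning DP into 30 bins on [0, 30]:
p_hist <- mt$aggregate_entries(hl$agg$hist(mt$DP, 0L, 30L, 30L))
names(p_hist)
length(p_hist$bin_edges)   # 31 edges ...
length(p_hist$bin_freq)    # ... bound 30 bins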
## [1] "annotate" "bin_edges" "bin_freq" "drop" "get" "items"
## [7] "keys" "n_larger" "n_smaller" "select" "values"
## [1] 31
## [1] 30
midpts <- function(x) diff(x)/2 + x[-length(x)]  # bin midpoints from bin edges
dpdf <- data.frame(x=midpts(p_hist$bin_edges), y=p_hist$bin_freq)
ggplot(dpdf, aes(x=x, y=y)) + geom_col() + ggtitle("DP") + ylab("Frequency")
An exercise: produce a function mt_hist that produces a histogram of measures from any of the relevant VCF components of a MatrixTable instance.
Note also all the aggregators available:
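One way to list them from R (a sketch):
sort(names(hl$agg))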
## [1] "aggregators" "all" "any"
## [4] "approx_cdf" "approx_median" "approx_quantiles"
## [7] "array_agg" "array_sum" "call_stats"
## [10] "collect" "collect_as_set" "corr"
## [13] "count" "count_where" "counter"
## [16] "downsample" "explode" "filter"
## [19] "fold" "fraction" "group_by"
## [22] "hardy_weinberg_test" "hist" "inbreeding"
## [25] "info_score" "linreg" "max"
## [28] "mean" "min" "ndarray_sum"
## [31] "product" "stats" "sum"
## [34] "take"
We’d also note that hail has a direct interface to ggplot2.
A high-level function adds quality metrics to the MatrixTable.
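A sketch following the tutorial, where hl.sample_qc attaches a sample_qc column field:
mt <- hl$sample_qc(mt)
mt$col$describe()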
## --------------------------------------------------------
## Type:
## struct {
## s: str,
## pheno: struct {
## Population: str,
## SuperPopulation: str,
## isFemale: bool,
## PurpleHair: bool,
## CaffeineConsumption: int32
## },
## sample_qc: struct {
## dp_stats: struct {
## mean: float64,
## stdev: float64,
## min: float64,
## max: float64
## },
## gq_stats: struct {
## mean: float64,
## stdev: float64,
## min: float64,
## max: float64
## },
## call_rate: float64,
## n_called: int64,
## n_not_called: int64,
## n_filtered: int64,
## n_hom_ref: int64,
## n_het: int64,
## n_hom_var: int64,
## n_non_ref: int64,
## n_singleton: int64,
## n_snp: int64,
## n_insertion: int64,
## n_deletion: int64,
## n_transition: int64,
## n_transversion: int64,
## n_star: int64,
## r_ti_tv: float64,
## r_het_hom_var: float64,
## r_insertion_deletion: float64
## }
## }
## --------------------------------------------------------
## Source:
## <hail.matrixtable.MatrixTable object at 0x7fbd5abb8e80>
## Index:
## ['column']
## --------------------------------------------------------
The call rate histogram is given by:
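A sketch: collect the per-sample call rates into R and plot them (the bin count is arbitrary):
crdf <- data.frame(callrate = mt$sample_qc$call_rate$collect())
ggplot(crdf, aes(x = callrate)) + geom_histogram(bins = 30)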
We’ll use the python code given for filtering, in which per-sample mean read depth and call rate must exceed (arbitrarily chosen) thresholds.
# python chunk!
r.mt = r.mt.filter_cols((r.mt.sample_qc.dp_stats.mean >= 4) & (r.mt.sample_qc.call_rate >= 0.97))
print('After filter, %d/284 samples remain.' % r.mt.count_cols())
## After filter, 250/284 samples remain.
Again we use the tutorial's python code, filtering genotype entries according to allele balance (ab):
# python chunk!
ab = r.mt.AD[1] / r.hl.sum(r.mt.AD)  # allele balance: alt depth / total depth
filter_condition_ab = ((r.mt.GT.is_hom_ref() & (ab <= 0.1)) |
                       (r.mt.GT.is_het() & (ab >= 0.25) & (ab <= 0.75)) |
                       (r.mt.GT.is_hom_var() & (ab >= 0.9)))
fraction_filtered = r.mt.aggregate_entries(r.hl.agg.fraction(~filter_condition_ab))
print(f'Filtering {fraction_filtered * 100:.2f}% entries out of downstream analysis.')
## Filtering 3.60% entries out of downstream analysis.
Note that filtering entries does not change MatrixTable shape.
## [1] 10961 250
Allele frequencies, Hardy-Weinberg equilibrium test results, and other summaries are obtained using the variant_qc function.
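A sketch:
mt <- hl$variant_qc(mt)
mt$row$describe()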
## --------------------------------------------------------
## Type:
## struct {
## locus: locus<GRCh37>,
## alleles: array<str>,
## rsid: str,
## qual: float64,
## filters: set<str>,
## info: struct {
## AC: array<int32>,
## AF: array<float64>,
## AN: int32,
## BaseQRankSum: float64,
## ClippingRankSum: float64,
## DP: int32,
## DS: bool,
## FS: float64,
## HaplotypeScore: float64,
## InbreedingCoeff: float64,
## MLEAC: array<int32>,
## MLEAF: array<float64>,
## MQ: float64,
## MQ0: int32,
## MQRankSum: float64,
## QD: float64,
## ReadPosRankSum: float64,
## set: str
## },
## variant_qc: struct {
## dp_stats: struct {
## mean: float64,
## stdev: float64,
## min: float64,
## max: float64
## },
## gq_stats: struct {
## mean: float64,
## stdev: float64,
## min: float64,
## max: float64
## },
## AC: array<int32>,
## AF: array<float64>,
## AN: int32,
## homozygote_count: array<int32>,
## call_rate: float64,
## n_called: int64,
## n_not_called: int64,
## n_filtered: int64,
## n_het: int64,
## n_non_ref: int64,
## het_freq_hwe: float64,
## p_value_hwe: float64,
## p_value_excess_het: float64
## }
## }
## --------------------------------------------------------
## Source:
## <hail.matrixtable.MatrixTable object at 0x7fbd5a2ebdf0>
## Index:
## ['row']
## --------------------------------------------------------
A built-in procedure for testing for association between the (simulated) caffeine consumption measure and genotype will be used.
The following commands eliminate variants with minor allele frequency less than 0.01, along with those with small \(p\)-values in tests of Hardy-Weinberg equilibrium.
# python chunk!
r.mt = r.mt.filter_rows(r.mt.variant_qc.AF[1] > 0.01)
r.mt = r.mt.filter_rows(r.mt.variant_qc.p_value_hwe > 1e-6)
r.mt.count()
## (7844, 250)
Now we perform a naive test of association. The Manhattan plot generated by hail can be displayed for interaction using bokeh. We comment this out for now; it should be possible to embed the bokeh display in this document, but the details are not ready to hand.
# python chunk!
r.gwas = r.hl.linear_regression_rows(y=r.mt.pheno.CaffeineConsumption,
                                     x=r.mt.GT.n_alt_alleles(),
                                     covariates=[1.0])
# r.pl = r.hl.plot.manhattan(r.gwas.p_value)
# import bokeh
# bokeh.plotting.show(r.pl)
## 2023-04-25 16:22:18.542 Hail: INFO: linear_regression_rows: running on 250 samples for 1 response variable y,
## with input variable x, and 1 additional covariate...
## 2023-04-25 16:22:20.379 Hail: INFO: wrote table with 7844 rows in 2 partitions to /tmp/persist_table9bfF3tKMXy
## Total size: 520.43 KiB
## * Rows: 520.42 KiB
## * Globals: 11.00 B
## * Smallest partition: 3782 rows (250.69 KiB)
## * Largest partition: 4062 rows (269.73 KiB)
The “QQ plot” that helps evaluate the adequacy of the analysis can be formed using hl.plot.qq for very large applications; here we collect the results for plotting in R. First we estimate \(\lambda_{GC}\):
pv = gwas$p_value$collect()  # p-values into an R vector
x2 = stats::qchisq(1-pv, 1)  # corresponding chi-squared(1) statistics
lam = stats::median(x2, na.rm=TRUE)/stats::qchisq(.5, 1)  # genomic inflation factor
lam
## [1] 3.558453
And the QQ plot, via qqplot:
qqplot(-log10(ppoints(length(pv))), -log10(pv), xlim=c(0,8), ylim=c(0,8),
ylab="-log10 p", xlab="expected")
abline(0,1)
There is hardly any point to examining a Manhattan plot in this situation. But let’s see how it might be done. We’ll use igvR to get an interactive and extensible display.
locs <- gwas$locus$collect()                 # hail locus structs into R
conts <- sapply(locs, function(x) x$contig)  # chromosome names
pos <- sapply(locs, function(x) x$position)  # base-pair positions
library(igvR)
mytab <- data.frame(chr=as.character(conts), pos=pos, pval=pv)
gt <- GWASTrack("simp", mytab, chrom.col=1, pos.col=2, pval.col=3)
igv <- igvR()
setGenome(igv, "hg19")
displayTrack(igv, gt)
An approach to assessing population stratification is provided as hwe_normalized_pca. See the hail methods docs for details. We avoid the tuple assignment used in the tutorial document.
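A sketch: the (eigenvalues, scores, loadings) tuple returned by hwe_normalized_pca is converted by reticulate to an R list, so no unpacking is needed:
pcastuff <- hl$hwe_normalized_pca(mt$GT, k = 10L)
Note that, as in the tutorial, the scores must also be annotated onto the MatrixTable columns before the adjusted regression below can reference mt.scores.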
## 2023-04-25 16:22:24.945 Hail: INFO: hwe_normalize: found 7836 variants after filtering out monomorphic sites.
## 2023-04-25 16:22:27.659 Hail: INFO: pca: running PCA with 10 components...
## 2023-04-25 16:22:32.964 Hail: INFO: wrote table with 0 rows in 0 partitions to /tmp/persist_tableuDcJUb0wJG
## Total size: 21.29 KiB
## * Rows: 0.00 B
## * Globals: 21.29 KiB
## * Smallest partition: N/A
## * Largest partition: N/A
We’ll collect the key information and plot.
sc <- pcastuff[[2]]$scores$collect()
pc1 = sapply(sc, "[", 1)
pc2 = sapply(sc, "[", 2)
fac = mt$pheno$SuperPopulation$collect()
myd = data.frame(pc1, pc2, pop=fac)
library(ggplot2)
ggplot(myd, aes(x=pc1, y=pc2, colour=factor(pop))) + geom_point()
Now repeat the association test with adjustments for population of origin and gender.
# python chunk!
r.gwas2 = r.hl.linear_regression_rows(
    y=r.mt.pheno.CaffeineConsumption,
    x=r.mt.GT.n_alt_alleles(),
    covariates=[1.0, r.mt.pheno.isFemale, r.mt.scores[0],
                r.mt.scores[1], r.mt.scores[2]])
## 2023-04-25 16:22:39.071 Hail: INFO: linear_regression_rows: running on 250 samples for 1 response variable y,
## with input variable x, and 5 additional covariates...
## 2023-04-25 16:22:40.037 Hail: INFO: wrote table with 7844 rows in 2 partitions to /tmp/persist_tablevx31og6NQ2
## Total size: 519.18 KiB
## * Rows: 519.17 KiB
## * Globals: 11.00 B
## * Smallest partition: 3782 rows (250.09 KiB)
## * Largest partition: 4062 rows (269.08 KiB)
New value of \(\lambda_{GC}\):
pv = gwas2$p_value$collect()
x2 = stats::qchisq(1-pv,1)
lam = stats::median(x2, na.rm=TRUE)/stats::qchisq(.5,1)
lam
## [1] 1.08359
A Manhattan plot for chr8:
locs <- gwas2$locus$collect()
conts <- sapply(locs, function(x) x$contig)
pos <- sapply(locs, function(x) x$position)
mytab <- data.frame(chr=as.character(conts), pos=pos, pval=pv)
ggplot(mytab[mytab$chr=="8",], aes(x=pos, y=-log10(pval))) + geom_point()
The tutorial document proceeds with some illustrations of arbitrary aggregations. We will skip these for now.
Additional vignettes will address further topics. The session information for this document follows.
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.4.2 BiocHail_1.0.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 dir.expiry_1.8.0 xfun_0.39
## [4] bslib_0.4.2 lattice_0.21-8 vctrs_0.6.2
## [7] tools_4.3.0 generics_0.1.3 curl_5.0.0
## [10] parallel_4.3.0 tibble_3.2.1 fansi_1.0.4
## [13] RSQLite_2.3.1 highr_0.10 blob_1.2.4
## [16] pkgconfig_2.0.3 Matrix_1.5-4 dbplyr_2.3.2
## [19] lifecycle_1.0.3 compiler_4.3.0 farver_2.1.1
## [22] munsell_0.5.0 htmltools_0.5.5 sass_0.4.5
## [25] yaml_2.3.7 pillar_1.9.0 jquerylib_0.1.4
## [28] cachem_1.0.7 magick_2.7.4 basilisk_1.12.0
## [31] tidyselect_1.2.0 digest_0.6.31 dplyr_1.1.2
## [34] purrr_1.0.1 bookdown_0.33 labeling_0.4.2
## [37] rprojroot_2.0.3 fastmap_1.1.1 grid_4.3.0
## [40] here_1.0.1 colorspace_2.1-0 cli_3.6.1
## [43] magrittr_2.0.3 utf8_1.2.3 withr_2.5.0
## [46] filelock_1.0.2 scales_1.2.1 bit64_4.0.5
## [49] rmarkdown_2.21 httr_1.4.5 bit_4.0.5
## [52] reticulate_1.28 png_0.1-8 memoise_2.0.1
## [55] evaluate_0.20 knitr_1.42 basilisk.utils_1.12.0
## [58] BiocFileCache_2.8.0 rlang_1.1.0 Rcpp_1.0.10
## [61] glue_1.6.2 DBI_1.1.3 BiocManager_1.30.20
## [64] BiocGenerics_0.46.0 jsonlite_1.8.4 R6_2.5.1