In this document we illustrate some approaches to
working with UK Biobank summary statistics.
Be sure that the Python module ukbb_pan_ancestry
has been installed where reticulate can find it.
(We don’t use basilisk as of 12/24/2022 because of
issues in the Terra Spark cluster.)
If the above command indicates that BiocHail is not available, see the Installation section near the end of this document.
We have produced a representation of summary statistics for a sample of 9888 loci. This 5GB resource can be retrieved and cached with the following code:
hl = hail_init()
ss = get_ukbb_sumstat_10kloci_mt(hl) # can take about a minute to unzip 5GB
ss$count() # but if a persistent MatrixTable is at the location given
## [[1]]
## [1] 9888
##
## [[2]]
## [1] 7271
To get a description of available content, we need a python chunk:
## ----------------------------------------
## Global fields:
## None
## ----------------------------------------
## Column fields:
## 'trait_type': str
## 'phenocode': str
## 'pheno_sex': str
## 'coding': str
## 'modifier': str
## 'pheno_data': array<struct {
## n_cases: int32,
## n_controls: int32,
## heritability: float64,
## saige_version: str,
## inv_normalized: bool,
## pop: str
## }>
## 'description': str
## 'description_more': str
## 'coding_description': str
## 'category': str
## 'n_cases_full_cohort_both_sexes': int64
## 'n_cases_full_cohort_females': int64
## 'n_cases_full_cohort_males': int64
## ----------------------------------------
## Row fields:
## 'locus': locus<GRCh37>
## 'alleles': array<str>
## 'gene': str
## 'annotation': str
## ----------------------------------------
## Entry fields:
## 'summary_stats': array<struct {
## AF_Allele2: float64,
## imputationInfo: float64,
## BETA: float64,
## SE: float64,
## `p.value.NA`: float64,
## `AF.Cases`: float64,
## `AF.Controls`: float64,
## Pvalue: float64,
## low_confidence: bool
## }>
## ----------------------------------------
## Column key: ['trait_type', 'phenocode', 'pheno_sex', 'coding', 'modifier']
## Row key: ['locus', 'alleles']
## ----------------------------------------
Here’s a basic description of the summary stats table, with code that works in terra.bio:
hl = bare_hail()
hl$init(idempotent=TRUE, spark_conf=list(
'spark.hadoop.fs.gs.requester.pays.mode'= 'CUSTOM',
'spark.hadoop.fs.gs.requester.pays.buckets'= 'ukb-diverse-pops-public',
'spark.hadoop.fs.gs.requester.pays.project.id'= Sys.getenv("GOOGLE_PROJECT")))
We need to use a python chunk to get the output, using gs:// storage references.
We’ll collect the column metadata to start to understand details of annotation of phenotypes.
sscol = ss$cols()$collect() # OK because we are just working over column metadata
ss1 = sscol[[1]]
names(ss1)
## [1] "annotate" "category"
## [3] "coding" "coding_description"
## [5] "description" "description_more"
## [7] "drop" "get"
## [9] "items" "keys"
## [11] "modifier" "n_cases_full_cohort_both_sexes"
## [13] "n_cases_full_cohort_females" "n_cases_full_cohort_males"
## [15] "pheno_data" "pheno_sex"
## [17] "phenocode" "select"
## [19] "trait_type" "values"
## [1] "30600"
## [1] "Albumin"
We’ve introduced a function that extracts selected fields for a given phenotype and accommodates the fact that results may be available only for specific populations.
## trait_type phenocode description modifier coding_description coding n_cases
## 1 biomarkers 30600 Albumin irnt NA 5759
## 2 biomarkers 30600 Albumin irnt NA 856
## 3 biomarkers 30600 Albumin irnt NA 7694
## 4 biomarkers 30600 Albumin irnt NA 2340
## 5 biomarkers 30600 Albumin irnt NA 367192
## 6 biomarkers 30600 Albumin irnt NA 1364
## n_controls heritability pop
## 1 NA 0.25412190 AFR
## 2 NA 0.11268439 AMR
## 3 NA 0.24110706 CSA
## 4 NA 0.06126386 EAS
## 5 NA 0.06449071 EUR
## 6 NA 0.20458513 MID
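The one-row-per-population layout above comes from expanding the pheno_data array, which holds one struct per population. The following is a minimal sketch of that expansion in plain Python, assuming a column record has already been converted to a dict. The helper name flatten_pheno and the sample record are hypothetical (values are copied from the output above, with only two populations kept); this is not the BiocHail function itself.

```python
# Hypothetical column record mirroring the schema shown earlier;
# only two populations are included to keep the example short.
record = {
    "trait_type": "biomarkers",
    "phenocode": "30600",
    "description": "Albumin",
    "modifier": "irnt",
    "pheno_data": [
        {"n_cases": 5759, "n_controls": None, "heritability": 0.25412190, "pop": "AFR"},
        {"n_cases": 367192, "n_controls": None, "heritability": 0.06449071, "pop": "EUR"},
    ],
}

# Fields shared by every population-specific row of a phenotype.
SHARED_FIELDS = ("trait_type", "phenocode", "description", "modifier")


def flatten_pheno(rec):
    """Return one flat dict per population found in rec['pheno_data']."""
    shared = {k: rec.get(k) for k in SHARED_FIELDS}
    rows = []
    # A phenotype with no population-level results yields no rows.
    for pop_entry in rec.get("pheno_data") or []:
        row = dict(shared)
        row.update(pop_entry)
        rows.append(row)
    return rows


rows = flatten_pheno(record)
for row in rows:
    print(row["pop"], row["n_cases"], row["heritability"])
```

Because the number of rows follows the length of pheno_data, phenotypes analyzed in fewer populations simply produce fewer rows.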
The extraction can be iterated over all the elements of sscol to produce a comprehensive searchable table. Here’s a small illustration:
We’ll trim the material to be worked with by sampling both rows and columns. (2023.01.08: In future revisions we will be able to control the seed for random sampling.)
## [[1]]
## [1] 107
##
## [[2]]
## [1] 78
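The counts above are consistent with keeping roughly 1% of rows and of columns; that rate is inferred from the counts and is not a documented setting. The sketch below uses plain Python, not Hail, to show why such counts vary from run to run when each item is kept independently, and how a fixed seed makes the selection reproducible. The function name and seed values are hypothetical.

```python
import random


def bernoulli_sample(n_items, p, seed=None):
    """Keep each index in range(n_items) independently with probability p."""
    rng = random.Random(seed)
    return [i for i in range(n_items) if rng.random() < p]


# 9888 loci and 7271 phenotype columns, as reported by ss$count()
rows_a = bernoulli_sample(9888, 0.01, seed=1234)
rows_b = bernoulli_sample(9888, 0.01, seed=1234)
cols = bernoulli_sample(7271, 0.01, seed=5678)

# Sizes fluctuate around 1% of the input rather than hitting it exactly.
print(len(rows_a), len(cols))

# The same seed reproduces the same selection.
print(rows_a == rows_b)
```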
Row metadata are simple to collect:
## [1] "alleles" "annotate" "annotation" "drop" "gene"
## [6] "get" "items" "keys" "locus" "select"
## [11] "values"
## [1] "contig" "parse" "position" "reference_genome"
## [1] "1"
## contig position
## "1" "26753527"
The summary statistics themselves reside in the entries of the MatrixTable. Entries can be expensive to collect, so filtering methods beyond random sampling must be mastered. But here is a basic view.
## [1] 8346
## [1] "alleles" "annotate"
## [3] "annotation" "category"
## [5] "coding" "coding_description"
## [7] "description" "description_more"
## [9] "drop" "gene"
## [11] "get" "items"
## [13] "keys" "locus"
## [15] "modifier" "n_cases_full_cohort_both_sexes"
## [17] "n_cases_full_cohort_females" "n_cases_full_cohort_males"
## [19] "pheno_data" "pheno_sex"
## [21] "phenocode" "select"
## [23] "summary_stats" "trait_type"
## [25] "values"
The summary_stats component holds the association p-values, which appear to be stored on a log scale (possibly log10; the base is not confirmed here).
## [1] 1
## [1] "AF.Cases" "AF.Controls" "AF_Allele2" "BETA"
## [5] "Pvalue" "SE" "annotate" "drop"
## [9] "get" "imputationInfo" "items" "keys"
## [13] "low_confidence" "p.value.NA" "select" "values"
## [1] -0.340249
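Since the base of the logarithm is uncertain, one quick check is to back-transform a stored value under each candidate base. The following is a minimal sketch in Python; the function name is hypothetical, and neither interpretation is confirmed by this document.

```python
import math


def back_transform(stored, base="e"):
    """Convert a log-scale p-value back to the probability scale."""
    if base == "e":
        return math.exp(stored)
    if base == "10":
        return 10.0 ** stored
    raise ValueError("base must be 'e' or '10'")


stored = -0.340249  # the Pvalue entry printed above

p_ln = back_transform(stored, "e")      # if the value is a natural log
p_log10 = back_transform(stored, "10")  # if the value is a base-10 log
print(round(p_ln, 4), round(p_log10, 4))
```

Both candidates give a valid probability for this entry, so a single value cannot settle the question; the convention should be confirmed against the dataset's own documentation.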
BiocHail should be installed as follows:
## R version 4.3.0 RC (2023-04-13 r84269)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.17-bioc/R/lib/libRblas.so
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.27 ggplot2_3.4.2 BiocHail_1.0.0 BiocStyle_2.28.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 dir.expiry_1.8.0 xfun_0.39
## [4] bslib_0.4.2 htmlwidgets_1.6.2 lattice_0.21-8
## [7] crosstalk_1.2.0 vctrs_0.6.2 tools_4.3.0
## [10] generics_0.1.3 curl_5.0.0 parallel_4.3.0
## [13] tibble_3.2.1 fansi_1.0.4 RSQLite_2.3.1
## [16] highr_0.10 blob_1.2.4 pkgconfig_2.0.3
## [19] Matrix_1.5-4 dbplyr_2.3.2 lifecycle_1.0.3
## [22] compiler_4.3.0 farver_2.1.1 munsell_0.5.0
## [25] htmltools_0.5.5 sass_0.4.5 yaml_2.3.7
## [28] pillar_1.9.0 jquerylib_0.1.4 ellipsis_0.3.2
## [31] cachem_1.0.7 magick_2.7.4 basilisk_1.12.0
## [34] tidyselect_1.2.0 digest_0.6.31 dplyr_1.1.2
## [37] purrr_1.0.1 bookdown_0.33 labeling_0.4.2
## [40] rprojroot_2.0.3 fastmap_1.1.1 grid_4.3.0
## [43] here_1.0.1 colorspace_2.1-0 cli_3.6.1
## [46] magrittr_2.0.3 utf8_1.2.3 withr_2.5.0
## [49] filelock_1.0.2 scales_1.2.1 bit64_4.0.5
## [52] rmarkdown_2.21 httr_1.4.5 bit_4.0.5
## [55] reticulate_1.28 png_0.1-8 memoise_2.0.1
## [58] evaluate_0.20 knitr_1.42 basilisk.utils_1.12.0
## [61] BiocFileCache_2.8.0 rlang_1.1.0 Rcpp_1.0.10
## [64] glue_1.6.2 DBI_1.1.3 BiocManager_1.30.20
## [67] BiocGenerics_0.46.0 jsonlite_1.8.4 R6_2.5.1