1. Introduction

Intra-tumor heterogeneity (ITH) is now thought to be a key factor contributing to therapeutic failure and drug resistance, and it has attracted increasing attention in cancer research. Here, we present an R package, MesKit, for characterizing cancer genomic ITH and inferring the evolutionary history of tumors. MesKit provides a wide range of analyses, including ITH evaluation, enrichment, signature, and clonal evolution analysis, via implementation of well-established computational and statistical methods. The source code and documentation are freely available on GitHub (https://github.com/Niinleslie/MesKit). We also developed a Shiny application to provide easier analysis and visualization.

1.1 Citation

In R console, enter citation("MesKit").

MesKit: a tool kit for dissecting cancer evolution from multi-region derived tumor biopsies via somatic mutations (Submitted)

2. Prepare input Data

To analyze with MesKit, you need to provide:

  • A MAF file of multi-region samples from patients (*.maf / *.maf.gz). Required
  • Cancer cell fraction (CCF) data of somatic mutations. Optional but recommended
  • A segmentation file. Optional
  • The GISTIC outputs. Optional

Note: Patient_ID and Tumor_Sample_Barcode values should be consistent across all input files.

2.1 MAF file

Mutation Annotation Format (MAF) files are tab-delimited text files with aggregated mutation information from VCF files. In addition to the standard MAF columns, the input MAF file (*.maf or *.maf.gz) of MesKit should contain columns named Patient_ID and Tumor_ID. Allowed values for the Variant_Classification column can be found at the Mutation Annotation Format Page.
The following fields are required in MAF files used with MesKit.

Mandatory fields:

Hugo_Symbol, Chromosome, Start_Position, End_Position, Variant_Classification, Variant_Type, Reference_Allele, Tumor_Seq_Allele2, Ref_allele_depth, Alt_allele_depth, VAF, Tumor_Sample_Barcode, Patient_ID, Tumor_ID

Note: Multi-region samples from a single tumor are indicated with the same Tumor_ID, such as “primary”, “metastasis” and “lymph”. In addition, values in the Hugo_Symbol field are not necessarily from the HUGO database.

Example MAF file

##   Hugo_Symbol Chromosome Start_Position End_Position Variant_Classification
## 1      CFAP74          1        1880545      1880545                 Intron
## 2      TFAP2A          6       10159520     10159520                    IGR
## 3      IGSF21          1       18605309     18605309                 Intron
##   Variant_Type Reference_Allele Tumor_Seq_Allele2 Ref_allele_depth
## 1          SNP                C                 A               16
## 2          SNP                T                 A               29
## 3          SNP                A                 C              144
##   Alt_allele_depth    VAF Tumor_Sample_Barcode Patient_ID Tumor_ID
## 1                3 0.1578                   T1    HCC5647  Primary
## 2               17 0.3695                   T1    HCC6952  Primary
## 3               19 0.1165                   T4    HCC8031  Primary
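Before loading a file, it can be useful to confirm that all mandatory fields are present. The following is a minimal base R sketch and is not part of MesKit; the helper name checkMafColumns and the toy data frame are hypothetical, and the field names follow the header of the example MAF file above.

```r
# Mandatory MAF fields, spelled as in the example MAF output above.
required_cols <- c(
  "Hugo_Symbol", "Chromosome", "Start_Position", "End_Position",
  "Variant_Classification", "Variant_Type", "Reference_Allele",
  "Tumor_Seq_Allele2", "Ref_allele_depth", "Alt_allele_depth", "VAF",
  "Tumor_Sample_Barcode", "Patient_ID", "Tumor_ID"
)

# Hypothetical helper: return the mandatory fields missing from a data frame.
checkMafColumns <- function(maf_df) {
  setdiff(required_cols, colnames(maf_df))
}

# Toy table that deliberately lacks Tumor_ID and several other fields.
toy <- data.frame(
  Hugo_Symbol = "CFAP74", Chromosome = "1",
  Start_Position = 1880545, End_Position = 1880545,
  Patient_ID = "HCC5647", Tumor_Sample_Barcode = "T1"
)
missing_cols <- checkMafColumns(toy)
print(missing_cols)
```

An empty result would indicate that every mandatory field is present.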

2.2 CCF files

By default, there are six mandatory fields in the input CCF file: Patient_ID, Tumor_Sample_Barcode, Chromosome, Start_Position, CCF and CCF_Std/CCF_CI_High (required when identifying clonal/subclonal mutations). The Chromosome field of mafFile and ccfFile should be in the same format (both numeric or both starting with “chr”). Notably, if the CCF file contains variants other than SNVs, Reference_Allele and Tumor_Seq_Allele2 should also be included in the input CCF file.

Example CCF file

##   Patient_ID Tumor_Sample_Barcode Chromosome Start_Position       CCF
## 1    HCC5647                   T4         22       43190575 0.6112993
## 2    HCC5647                   T5         22       43190575 0.6239556
## 3    HCC5647                   T3         22       43190575 0.5121414
## 4    HCC5647                   T1         22       43190575 0.6891924
## 5    HCC5647                   T4          5      178224614 0.7668806
##      CCF_Std
## 1 0.19713556
## 2 0.19523997
## 3 0.17751275
## 4 0.21622254
## 5 0.09085722
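The requirement that the MAF and CCF files share one chromosome naming style can be checked with a few lines of base R. This is a sketch only; the helper names hasChrPrefix and sameChromStyle are hypothetical and not part of MesKit, and a file that mixes both styles is simply treated as not prefixed.

```r
# Hypothetical helper: TRUE if every chromosome label starts with "chr".
hasChrPrefix <- function(chrom) {
  all(grepl("^chr", as.character(chrom)))
}

# Hypothetical helper: do two files use the same chromosome naming style?
sameChromStyle <- function(maf_chrom, ccf_chrom) {
  hasChrPrefix(maf_chrom) == hasChrPrefix(ccf_chrom)
}

maf_chrom <- c("1", "6", "1")    # Chromosome values from the example MAF file
ccf_chrom <- c("22", "22", "5")  # Chromosome values from the example CCF file

ok  <- sameChromStyle(maf_chrom, ccf_chrom)                  # both numeric
bad <- sameChromStyle(maf_chrom, paste0("chr", ccf_chrom))   # mixed styles
print(c(ok = ok, bad = bad))
```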

2.3 Segmentation files

The segmentation file is a tab-delimited file with the following 6 or 7 columns:

  • Patient_ID - ID of a patient
  • Tumor_Sample_Barcode - Tumor sample barcode of samples
  • Chromosome - chromosome name or ID
  • Start_Position - genomic start position of segments (1-indexed)
  • End_Position - genomic end position of segments (1-indexed)
  • Segment_Mean/CopyNumber - segment mean value or absolute integer copy number

Note: Positions are in base pair units.

Example Segmentation file

##   Patient_ID Tumor_Sample_Barcode Chromosome Start_Position End_Position
## 1    HCC5647                   T1          1         138488      6479452
## 2    HCC5647                   T1          1        6504488    120906360
## 3    HCC5647                   T1          1      144921930    157805992
## 4    HCC5647                   T1          1      157809143    160394321
## 5    HCC5647                   T1          1      160604266    165877230
##   CopyNumber
## 1          2
## 2          2
## 3          6
## 4          2
## 5          8

3. Installation

Via GitHub

Install the latest version of this package from the GitHub repository (https://github.com/Niinleslie/MesKit) in the R console.

4. Start with the Maf object

The readMaf function creates Maf/MafList objects by reading MAF files and cancer cell fraction (CCF) data (optional but recommended). The refBuild parameter sets the human reference genome version ("hg18", "hg19" or "hg38").

5. Mutational landscape

5.1 Mutational profile

To explore genomic alterations during cancer progression with a multi-region sequencing approach, MesKit provides the classifyMut function to categorize mutations. The classification is based on the sharing pattern or the clonal status (CCF data is required) of mutations, which can be specified with the class option. Additionally, classByTumor can be used to reveal the mutational profile within tumors.

The plotMutProfile function visualizes the mutational profile of tumor samples.

5.2 CNA profile

The plotCNA function characterizes the CNA landscape across samples based on copy number data from segmentation algorithms. In addition, MesKit provides options to integrate GISTIC2 results, which can be obtained from http://gdac.broadinstitute.org. Please make sure that the genome version underlying these results is consistent with the refBuild of the Maf/MafList object.

## --Processing LIHC_amp_genes.conf_99.txt
## --Processing LIHC_del_genes.conf_99.txt
## --Processing LIHC_all_lesions.conf_99.txt

6. Measurement of intra-tumor heterogeneity

6.1 Within tumors

6.1.1 MATH score calculation

The mathScore function estimates ITH per sample using the mutant-allele tumor heterogeneity (MATH) approach [1]. Typically, the higher the MATH score, the more heterogeneous the tumor sample [2, 3]. For multi-region sequencing (MRS) data, this function can estimate the MATH score within a tumor based on the merged VAF when withTumor = TRUE.
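As an illustration of the underlying quantity, the MATH score of Mroz and Rocco is 100 times the median absolute deviation (MAD) of the VAFs divided by their median. The sketch below uses base R only; the function name mathSketch and the toy VAF vectors are hypothetical, and it is not claimed that mathScore applies exactly the same filtering or scaling.

```r
# MATH = 100 * MAD / median of the VAFs in one sample.
# stats::mad() applies the usual 1.4826 scaling constant by default.
mathSketch <- function(vaf) {
  100 * mad(vaf) / median(vaf)
}

# Toy data: a tight VAF distribution versus a widely spread one.
vaf_homog <- c(0.30, 0.31, 0.29, 0.30, 0.32, 0.28)
vaf_heter <- c(0.05, 0.10, 0.30, 0.45, 0.20, 0.60)

math_homog <- mathSketch(vaf_homog)
math_heter <- mathSketch(vaf_heter)
print(c(homogeneous = math_homog, heterogeneous = math_heter))
```

The widely spread VAFs yield the larger score, in line with the interpretation given above.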

6.1.2 AUC of CCF

The ccfAUC function estimates tumor heterogeneity via the area under the curve (AUC) of the cumulative density of all cancer cell fractions per tumor. Tumors with higher AUC values are considered to be more heterogeneous [4]. If the MAF data contains information from more than one patient, you can subset by specifying the corresponding Patient ID.

## [1] "AUC.value"        "CCF.density.plot"
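The idea behind the AUC measure can be sketched in base R: compute the empirical cumulative distribution of the CCF values on a grid over [0, 1] and integrate it with the trapezoidal rule. The function name ccfAucSketch, the grid resolution and the toy CCF vectors are assumptions made for illustration; the exact procedure used by ccfAUC may differ.

```r
# Area under the empirical cumulative distribution of CCF values on [0, 1].
ccfAucSketch <- function(ccf, grid = seq(0, 1, by = 0.01)) {
  cum <- ecdf(ccf)(grid)  # fraction of mutations with CCF <= each grid point
  # Trapezoidal rule over the grid.
  sum(diff(grid) * (head(cum, -1) + tail(cum, -1)) / 2)
}

# Toy data: mostly clonal mutations versus a mix of subclonal and clonal ones.
ccf_clonal <- c(0.95, 1.00, 0.98, 0.97, 1.00)
ccf_mixed  <- c(0.10, 0.25, 0.40, 0.95, 1.00)

auc_clonal <- ccfAucSketch(ccf_clonal)
auc_mixed  <- ccfAucSketch(ccf_mixed)
print(c(clonal = auc_clonal, mixed = auc_mixed))
```

Subclonal mutations shift the cumulative curve to the left, which increases the area under it.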

6.1.3 VAF clustering

The vafCluster function clusters variant allele frequencies (VAFs) based on a Gaussian finite mixture model. This function produces a density distribution plot, depicting clusters of mutations with different VAFs. We consider the number of clusters as a simple indicator of ITH. In addition, this function can read a segmentation file to adjust for CNA and LOH.

6.2 Between tumors

6.2.1 Fixation index

The fixation index (Fst) is a classical metric of population genetics, which can be used to measure the genetic divergence between regions [5, 6, 7]. The calFst function calculates the proportion of the total variance in allele frequency caused by frequency differences between populations.

## $Fst.avg
## [1] 0.1281842
## 
## $Fst.pair
##           T1        T2         T3         T4        T5
## T1 1.0000000 0.1296320 0.11644355 0.12351694 0.1556410
## T2 0.1296320 1.0000000 0.14284173 0.14535748 0.1663971
## T3 0.1164435 0.1428417 1.00000000 0.04238276 0.1389207
## T4 0.1235169 0.1453575 0.04238276 1.00000000 0.1207090
## T5 0.1556410 0.1663971 0.13892069 0.12070896 1.0000000
## 
## $Fst.plot

6.2.2 Nei’s genetic distance

Nei’s genetic distance is used in population genetics to assess the similarity between populations, taking heterogeneity within populations into account. The calNeiDist function calculates Nei’s distance of CCF among multi-region samples from the same patient for each sample pair [8].

## $Nei.dist.avg
## [1] 0.3439051
## 
## $Nei.dist
##           T1        T2        T3        T4        T5
## T1 1.0000000 0.3510825 0.2769709 0.3103000 0.4612432
## T2 0.3510825 1.0000000 0.3260584 0.3492942 0.4917281
## T3 0.2769709 0.3260584 1.0000000 0.1393114 0.3798806
## T4 0.3103000 0.3492942 0.1393114 1.0000000 0.3531822
## T5 0.4612432 0.4917281 0.3798806 0.3531822 1.0000000
## 
## $Nei.plot
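For reference, Nei’s standard genetic distance between two populations is D = -ln(Jxy / sqrt(Jx * Jy)). The base R sketch below applies this formula to CCF values by treating each mutation as a biallelic locus with frequencies (CCF, 1 - CCF). That treatment, the function name neiDistSketch and the toy data are assumptions of this sketch and may not match the internals of calNeiDist.

```r
# Nei's standard genetic distance between two samples, given per-mutation CCFs.
# Assumption: each mutation is a biallelic locus with frequencies (CCF, 1 - CCF).
neiDistSketch <- function(x, y) {
  jxy <- sum(x * y + (1 - x) * (1 - y))
  jx  <- sum(x^2 + (1 - x)^2)
  jy  <- sum(y^2 + (1 - y)^2)
  -log(jxy / sqrt(jx * jy))
}

# Toy CCFs of the same four mutations in two regions.
ccf_T1 <- c(1.0, 0.8, 0.6, 0.0)
ccf_T2 <- c(1.0, 0.7, 0.0, 0.5)

d_self <- neiDistSketch(ccf_T1, ccf_T1)  # identical samples
d_pair <- neiDistSketch(ccf_T1, ccf_T2)  # divergent samples
print(c(self = d_self, pair = d_pair))
```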

7. Clonal origins inference

Metastasis remains poorly understood despite its critical clinical significance, and a better understanding of the metastatic process can offer supplementary information for clinical treatment.

7.1 Pairwise CCF comparison

Distinct patterns of monoclonal versus polyclonal seeding can be identified based on the cancer cell fraction (CCF) of somatic mutations between sample/tumor pairs. The compareCCF function returns a result list of pairwise CCF values of mutations that are identified across samples from a single patient. Recently, this method has been widely used to deduce the potential metastatic route between different paired tumor lesions [9, 10].

7.2 Jaccard similarity index

The Jaccard similarity index (JSI) can be used to calculate mutational similarity between regions, and is defined as the ratio of shared variants to all variants for sample pairs [11]. Users can distinguish monoclonal versus polyclonal seeding in metastases (including lymph node metastases and distant metastases) via the compareJSI function; higher JSI values indicate a higher possibility of polyclonal seeding [12].

## [1] "JSI.multi" "JSI.pair"
## [1] 0.439441
##           T1         T2        T3        T4         T5
## T1 1.0000000 0.58333333 0.5757576 0.5416667 0.08771930
## T2 0.5833333 1.00000000 0.4736842 0.4810127 0.07692308
## T3 0.5757576 0.47368421 1.0000000 0.8734177 0.19565217
## T4 0.5416667 0.48101266 0.8734177 1.0000000 0.20454545
## T5 0.0877193 0.07692308 0.1956522 0.2045455 1.00000000
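The definition given above, shared variants divided by all variants of a sample pair, can be written directly with set operations. This is a base R sketch with a hypothetical function name (jsiSketch) and made-up mutation identifiers; compareJSI may apply additional filtering that is not reproduced here.

```r
# Jaccard similarity: shared variants / all variants observed in either sample.
jsiSketch <- function(mut_a, mut_b) {
  length(intersect(mut_a, mut_b)) / length(union(mut_a, mut_b))
}

# Toy mutation identifiers for two regions.
T1 <- c("m1", "m2", "m3")
T2 <- c("m1", "m2", "m4")

jsi <- jsiSketch(T1, T2)  # 2 shared variants out of 4 distinct variants
print(jsi)
```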

7.3 Neutral evolution

The subclonal mutant allele frequencies of a tumor follow a simple power-law distribution predicted by neutral growth [13]. Users can evaluate whether a tumor follows neutral evolution or is under strong selection via the testNeutral function. Tumors with R2 >= R2.threshold (Default: 0.98) are considered to follow neutral evolution. In addition, this function can generate the model fitting plot of each sample if the argument plot is set to TRUE.
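Under the neutral model, the cumulative number of mutations with frequency of at least f grows linearly with 1/f, so the goodness of fit of a linear regression can serve as the test statistic. The following base R sketch illustrates this idea on synthetic VAFs; the function name neutralFitSketch, the frequency window (0.1 to 0.3) and the toy data are assumptions for illustration and are not taken from testNeutral.

```r
# R^2 of a linear fit of the cumulative mutation count M(f) against 1/f.
neutralFitSketch <- function(vaf, fmin = 0.1, fmax = 0.3) {
  f <- sort(vaf[vaf >= fmin & vaf <= fmax], decreasing = TRUE)
  inv_f <- 1 / f - 1 / fmax  # inverse allele frequency, shifted to start near 0
  cum_n <- seq_along(f)      # number of mutations with VAF >= f
  fit <- lm(cum_n ~ inv_f)
  summary(fit)$r.squared
}

# Synthetic VAFs: roughly evenly spaced inverse frequencies (a 1/f^2-like
# spectrum) with a small deterministic perturbation.
inv_grid <- seq(1 / 0.3, 1 / 0.1, length.out = 50) + 0.05 * sin(1:50)
vaf <- 1 / inv_grid

r2 <- neutralFitSketch(vaf)
follows_neutral <- r2 >= 0.98  # threshold quoted in the text above
print(c(R2 = r2, neutral = follows_neutral))
```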

8. Phylogenetic tree analysis

8.1 Phylogenetic tree construction

With MesKit, phylogenetic tree construction for each individual is based on the binary presence/absence matrix of mutations across all tumor regions.
Based on the Maf object, the getPhyloTree function reconstructs the phylogenetic tree with one of several methods, including “NJ” (Neighbor-Joining), “MP” (maximum parsimony), “ML” (maximum likelihood), “FASTME.ols” and “FASTME.bal”, which can be selected with the method parameter. The phylogenetic tree is stored in a phyloTree/phyloTreeList object, and it can later be used for functional exploration, mutational signature analysis and tree visualization.
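To make the input of tree construction concrete, the sketch below builds a binary presence/absence matrix from a long table of mutation calls using base R. The toy data, the mutation identifiers and the all-zero NORMAL row used as a stand-in for a germline root are assumptions of this sketch, not a description of how getPhyloTree builds its matrix.

```r
# Toy long-format mutation calls: one row per (mutation, sample) observation.
muts <- data.frame(
  Mutation = c("m1", "m1", "m1", "m2", "m2", "m3"),
  Tumor_Sample_Barcode = c("T1", "T2", "T3", "T1", "T2", "T3")
)

# Cross-tabulate samples against mutations and binarize the counts.
tab <- table(muts$Tumor_Sample_Barcode, muts$Mutation)
bin_mat <- matrix(as.integer(tab > 0), nrow = nrow(tab),
                  dimnames = dimnames(tab))

# Assumption: add an all-zero row representing a normal sample as the root.
bin_mat <- rbind(bin_mat, NORMAL = 0L)
print(bin_mat)
```

Here m1 is present in all regions (a trunk mutation), m2 is shared by T1 and T2, and m3 is private to T3.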

8.2 Compare Phylogenetic trees

Comparison between phylogenetic trees can reveal consensus patterns of tumor evolution. The compareTree function computes distances between phylogenetic trees constructed through different methods, and it returns a vector containing four distances computed by treedist from the phangorn R package. See treedist for details.

## Both tree have 2 same branches
## $compare.dist
##     Symmetric.difference       KF-branch distance          Path difference 
##                 2.000000                67.877750                 2.828427 
## Weighted path difference 
##               234.424473 
## 
## $compare.plot

8.4 Mutational characteristics analysis

The sequence context of the base substitutions can be retrieved from the corresponding reference genome to construct a mutation matrix with counts for all 96 trinucleotide changes using “mut_matrix”. Subsequently, the spectrum of the 6 base substitution types can be plotted with “plot_spectrum”, and the spectrum can be split into several sample groups.
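The number 96 arises from 6 substitution types combined with 4 possible 5' neighbors and 4 possible 3' neighbors. A short base R sketch enumerating these categories follows; the label format (for example A[C>T]G) is a common convention chosen here for illustration and is not necessarily the one used in MesKit outputs.

```r
# Enumerate the 96 trinucleotide substitution categories: 6 x 4 x 4.
subs  <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
bases <- c("A", "C", "G", "T")

ctx <- expand.grid(five = bases, three = bases, sub = subs,
                   stringsAsFactors = FALSE)

# Label format assumed for illustration: 5' base, [substitution], 3' base.
tri_labels <- paste0(ctx$five, "[", ctx$sub, "]", ctx$three)
print(length(tri_labels))
print(head(tri_labels))
```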

The mutTrunkBranch function calculates the fraction of branch/trunk mutations occurring in each of the six base substitution types (C>A, C>G, C>T, T>A, T>C, T>G) and conducts two-sided Fisher’s exact tests. C>T mutations can be further classified into C>T at CpG sites and C>T at other sites by setting CT = TRUE. This function provides the option plot to print the distribution of branch/trunk mutations. Substitution types with a significant difference between trunk and branch mutations are marked with asterisks.

## [1] "mutTrunkBranch.res"  "mutTrunkBranch.plot"
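A two-sided Fisher’s exact test for a single substitution type can be sketched with base R as follows. The 2 x 2 counts are invented for illustration, and the way mutTrunkBranch arranges its contingency tables internally is an assumption of this sketch.

```r
# Toy counts: C>T versus all other substitutions among trunk and branch mutations.
counts <- matrix(
  c(40, 60,   # trunk:  C>T, other
    15, 85),  # branch: C>T, other
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Trunk", "Branch"), c("C>T", "Other"))
)

# Two-sided Fisher's exact test on the 2 x 2 table.
res <- fisher.test(counts, alternative = "two.sided")
p_value <- res$p.value
print(p_value)
```

A small p-value suggests that the proportion of C>T substitutions differs between trunk and branch mutations.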

The triMatrix function returns a mutation count matrix of 96 trinucleotides based on somatic SNVs per sample. The fitSignatures function takes this matrix and a known signature matrix as input, finds the optimal nonnegative linear combination of mutational signatures that reconstructs the matrix, and calculates the cosine similarity based on somatic SNVs. The signature matrix can be selected ("cosmic_v2", "exome_cosmic_v3" or "nature2013") or provided by users via signaturesRef.
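The fitting step can be illustrated on a reduced toy problem: find nonnegative weights so that a weighted sum of signatures approximates the observed profile, then compare the two with cosine similarity. The sketch uses 4 categories instead of 96, two made-up signatures, and box-constrained optimization via optim as a simple stand-in for a dedicated nonnegative least squares solver; none of this is taken from the fitSignatures implementation.

```r
# Cosine similarity between two profiles.
cosineSim <- function(a, b) {
  sum(a * b) / sqrt(sum(a^2) * sum(b^2))
}

# Toy signature matrix (4 categories x 2 hypothetical signatures) and counts.
sigs <- cbind(S1 = c(0.7, 0.1, 0.1, 0.1),
              S2 = c(0.1, 0.1, 0.1, 0.7))
observed <- c(30, 10, 10, 50)

# Nonnegative least squares via box-constrained optimization (weights >= 0).
loss <- function(w) sum((observed - sigs %*% w)^2)
fit <- optim(c(1, 1), loss, method = "L-BFGS-B", lower = 0)

reconstructed <- as.vector(sigs %*% fit$par)
sim <- cosineSim(observed, reconstructed)
print(round(fit$par, 2))
print(sim)
```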

The plotMutSigProfile function can be used to visualize the 96-trinucleotide mutational profile or the reconstructed mutational profile.

9. Phylogenetic tree visualization

Users can visualize phylogenetic trees with annotations through the plotPhyloTree function. By specifying the branchCol parameter, the branches can be colored according to either the classification of mutations or putative known signatures. The argument show.bootstrap is provided to show the support values of internal nodes. Additionally, via the mutHeatmap function, users can get a better depiction of mutational patterns in the tumor phylogeny based on MRS. Either a binary mutation heatmap (indicating the presence or absence of mutations) or a CCF heatmap can be displayed by running the mutHeatmap function.

10. References

  1. Mroz, Edmund A, and James W Rocco. MATH, a novel measure of intratumor genetic heterogeneity, is high in poor-outcome classes of head and neck squamous cell carcinoma. Oral oncology vol. 49,3 (2013): 211-5.

  2. Mroz EA, Tward AD, Pickering CR, et al. High intratumor genetic heterogeneity is related to worse outcome in patients with head and neck squamous cell carcinoma. Cancer, 2013, 119(16):3034-3042. https://doi.org/10.1002/cncr.28150

  3. Mroz EA, Tward AM, Hammon RJ, Ren Y, Rocco JW (2015) Intra-tumor Genetic Heterogeneity and Mortality in Head and Neck Cancer: Analysis of Data from The Cancer Genome Atlas. PLoS Med 12(2): e1001786. https://doi.org/10.1371/journal.pmed.1001786

  4. Charoentong, P. et al. Pan-cancer immunogenomic analyses reveal genotype-immunophenotype relationships and predictors of response to checkpoint blockade. Cell Rep. 18, 248–262 (2017)

  5. Weir, B.S. and Cockerham, C.C. Estimating F-Statistics for the Analysis of Population Structure. Evolution, 1984, 38, 1358-1370. http://dx.doi.org/10.2307/2408641

  6. Bhatia G, Patterson N, Sankararaman S, et al. Estimating and interpreting FST: The impact of rare variants. Genome Research, 2013, 23(9):1514-1521.

  7. Holsinger KE, Weir BS. Genetics in geographically structured populations: defining, estimating and interpreting FST. Nature Reviews Genetics, 2009, 10(9):639-650.

  8. Puente XS, Pinyol M, Quesada V, et al. Whole-genome sequencing identifies recurrent mutations in chronic lymphocytic leukaemia. Nature, 2011, 475(7354):101-105.

  9. Christina Curtis et al. Quantitative evidence for early metastatic seeding in colorectal cancer, Nature Genetics (2019). http://dx.doi.org/10.1038/s41588-019-0423-x

  10. Ruidong Xue et al. Genomic and Transcriptomic Profiling of Combined Hepatocellular and Intrahepatic Cholangiocarcinoma Reveals Distinct Molecular Subtypes. Cancer Cell, 2019, 35(6):932-947.

  11. Makohon-Moore, A., Zhang, M., Reiter, J. et al. Limited heterogeneity of known driver gene mutations among the metastases of individual patients with pancreatic cancer. Nat Genet 49, 358–366 (2017). https://doi.org/10.1038/ng.3764

  12. Hu Z, Li Z, Ma Z & Curtis C. Pan-cancer analysis of clonality and the timing of systemic spread in paired primary tumors and metastases. bioRxiv (2019). http://dx.doi.org/10.1101/825240

  13. Williams, M., Werner, B., Barnes, C. et al. Identification of neutral tumor evolution across cancer types. Nat Genet 48, 238–244 (2016). https://doi.org/10.1038/ng.3489

11. Session Info

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.12-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.12-bioc/R/lib/libRlapack.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
##  [1] grid      parallel  stats4    stats     graphics  grDevices utils    
##  [8] datasets  methods   base     
## 
## other attached packages:
##  [1] ComplexHeatmap_2.6.0              BSgenome.Hsapiens.UCSC.hg19_1.4.3
##  [3] BSgenome_1.58.0                   rtracklayer_1.50.0               
##  [5] Biostrings_2.58.0                 XVector_0.30.0                   
##  [7] GenomicRanges_1.42.0              GenomeInfoDb_1.26.0              
##  [9] clusterProfiler_3.18.0            org.Hs.eg.db_3.12.0              
## [11] AnnotationDbi_1.52.0              IRanges_2.24.0                   
## [13] S4Vectors_0.28.0                  Biobase_2.50.0                   
## [15] BiocGenerics_0.36.0               MesKit_1.0.0                     
## 
## loaded via a namespace (and not attached):
##   [1] fgsea_1.16.0                colorspace_1.4-1           
##   [3] rjson_0.2.20                ellipsis_0.3.1             
##   [5] ggridges_0.5.2              mclust_5.4.6               
##   [7] circlize_0.4.10             qvalue_2.22.0              
##   [9] GlobalOptions_0.1.2         clue_0.3-57                
##  [11] farver_2.0.3                graphlayouts_0.7.1         
##  [13] ggrepel_0.8.2               bit64_4.0.5                
##  [15] scatterpie_0.1.5            splines_4.0.3              
##  [17] GOSemSim_2.16.0             knitr_1.30                 
##  [19] polyclip_1.10-0             jsonlite_1.7.1             
##  [21] Rsamtools_2.6.0             Cairo_1.5-12.2             
##  [23] cluster_2.1.0               GO.db_3.12.0               
##  [25] png_0.1-7                   ggforce_0.3.2              
##  [27] BiocManager_1.30.10         compiler_4.0.3             
##  [29] rvcheck_0.1.8               Matrix_1.2-18              
##  [31] tweenr_1.0.1                htmltools_0.5.0            
##  [33] tools_4.0.3                 igraph_1.2.6               
##  [35] gtable_0.3.0                glue_1.4.2                 
##  [37] GenomeInfoDbData_1.2.4      reshape2_1.4.4             
##  [39] DO.db_2.9                   dplyr_1.0.2                
##  [41] fastmatch_1.1-0             Rcpp_1.0.5                 
##  [43] enrichplot_1.10.0           vctrs_0.3.4                
##  [45] ape_5.4-1                   nlme_3.1-150               
##  [47] ggraph_2.0.3                xfun_0.18                  
##  [49] stringr_1.4.0               lifecycle_0.2.0            
##  [51] phangorn_2.5.5              XML_3.99-0.5               
##  [53] DOSE_3.16.0                 zlibbioc_1.36.0            
##  [55] MASS_7.3-53                 scales_1.1.1               
##  [57] tidygraph_1.2.0             MatrixGenerics_1.2.0       
##  [59] SummarizedExperiment_1.20.0 RColorBrewer_1.1-2         
##  [61] yaml_2.2.1                  memoise_1.1.0              
##  [63] gridExtra_2.3               ggplot2_3.3.2              
##  [65] downloader_0.4              stringi_1.5.3              
##  [67] RSQLite_2.2.1               BiocParallel_1.24.0        
##  [69] shape_1.4.5                 rlang_0.4.8                
##  [71] pkgconfig_2.0.3             matrixStats_0.57.0         
##  [73] bitops_1.0-6                pracma_2.2.9               
##  [75] evaluate_0.14               lattice_0.20-41            
##  [77] purrr_0.3.4                 GenomicAlignments_1.26.0   
##  [79] labeling_0.4.2              cowplot_1.1.0              
##  [81] shadowtext_0.0.7            bit_4.0.4                  
##  [83] tidyselect_1.1.0            plyr_1.8.6                 
##  [85] magrittr_1.5                R6_2.4.1                   
##  [87] magick_2.5.0                generics_0.0.2             
##  [89] DelayedArray_0.16.0         DBI_1.1.0                  
##  [91] pillar_1.4.6                mgcv_1.8-33                
##  [93] RCurl_1.98-1.2              tibble_3.0.4               
##  [95] crayon_1.3.4                KernSmooth_2.23-17         
##  [97] rmarkdown_2.5               viridis_0.5.1              
##  [99] GetoptLong_1.0.4            data.table_1.13.2          
## [101] blob_1.2.1                  digest_0.6.27              
## [103] tidyr_1.1.2                 munsell_0.5.0              
## [105] viridisLite_0.3.0           quadprog_1.5-8