When working on your own genome project or when using publicly available
genomes for comparative analyses, it is critical to assess the quality of your
data. Over the past years, several tools have been developed and several
metrics have been proposed to assess the quality of a genome assembly and
annotation. cogeqc helps users interpret their genome assembly statistics
by comparing them with statistics on publicly available genomes on the NCBI.
Additionally, cogeqc provides an interface to BUSCO (Simão et al. 2015),
a popular tool to assess gene space completeness. Graphical functions are
available to make publication-ready plots that summarize the results of
quality control.
You can install cogeqc from Bioconductor with the following code:
if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
}
BiocManager::install("cogeqc")
# Load package after installation
library(cogeqc)
When analyzing and interpreting genome assembly statistics, it is often
useful to place your stats in a context by comparing them with stats from genomes
of closely-related or even the same species. cogeqc provides users with
an interface to the NCBI Datasets API, which can be used to retrieve summary
stats for genomes on NCBI. In this section, we will guide you on how to
retrieve such information and use it as a reference to interpret your data.
To obtain a data frame of summary statistics for NCBI genomes of a particular
taxon, you will use the function get_genome_stats(). In the taxon parameter,
you must specify the taxon from which data will be extracted. This can be done
either by passing a character scalar with taxon name or by passing a numeric
scalar with NCBI Taxonomy ID. For example, the code below demonstrates two
ways of extracting stats on maize (Zea mays) genomes on NCBI:
# Example 1: get stats for all maize genomes using taxon name
maize_stats <- get_genome_stats(taxon = "Zea mays")
head(maize_stats)
#> accession source species_taxid species_name
#> 1 GCA_000005005.6 GENBANK 4577 Zea mays
#> 2 GCA_000223545.1 GENBANK 4577 Zea mays
#> 3 GCA_000275765.1 GENBANK 4577 Zea mays
#> 4 GCA_001644905.2 GENBANK 381124 Zea mays subsp. mays
#> 5 GCA_001984235.2 GENBANK 381124 Zea mays subsp. mays
#> 6 GCA_001990705.1 GENBANK 381124 Zea mays subsp. mays
#> species_common_name species_ecotype species_strain species_isolate
#> 1 maize <NA> NA <NA>
#> 2 maize <NA> NA <NA>
#> 3 maize <NA> NA <NA>
#> 4 maize <NA> NA <NA>
#> 5 maize <NA> NA <NA>
#> 6 maize <NA> NA <NA>
#> species_cultivar assembly_level assembly_status
#> 1 B73 Chromosome current
#> 2 Palomero Toluqueno EDMX-2231 Scaffold current
#> 3 B73 Contig current
#> 4 W22 Chromosome current
#> 5 EP1 Chromosome current
#> 6 F7 Chromosome current
#> assembly_name assembly_type submission_date
#> 1 B73 RefGen_v4 haploid 2017-02-07
#> 2 ZeaMays_PT_EDMX2233_1.0 haploid 2011-08-16
#> 3 ZmaysB73_wgs_1.0 haploid 2012-07-03
#> 4 Zm-W22-REFERENCE-NRGENE-2.0 haploid 2017-02-02
#> 5 Zm-EP1-REFERENCE-TUM-1.0 haploid 2017-02-09
#> 6 Zm-F7-REFERENCE-TUM-1.0 haploid 2017-02-09
#> submitter
#> 1 maizesequence
#> 2 Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO) CINVESTAV Irapuato
#> 3 maizesequence
#> 4 W22 Sequencing Consortium
#> 5 Technical University of Munich, Plant Breeding
#> 6 Technical University of Munich, Plant Breeding
#> sequencing_technology atypical refseq_category chromosome_count
#> 1 PacBio FALSE <NA> 10
#> 2 454 GS20; 454 Titanium; Sanger TRUE <NA> NA
#> 3 454 GS FLX TRUE <NA> NA
#> 4 Illumina HiSeq; 10x Genomics FALSE <NA> 10
#> 5 Illumina FALSE <NA> 10
#> 6 Illumina FALSE <NA> 10
#> sequence_length ungapped_length contig_count contig_N50 contig_L50
#> 1 2134373047 2103640169 2787 1279870 506
#> 2 177051422 177051324 196705 1129 44576
#> 3 1335069 1335069 1844 949 434
#> 4 2133880228 2093255169 68134 72426 8765
#> 5 2455259639 2429778437 130125 82504 8782
#> 6 2392801755 2367184004 117311 96966 7318
#> scaffold_count scaffold_N50 scaffold_L50 GC_percent annotation_provider
#> 1 596 10679170 62 46.5 maizesequence
#> 2 196697 1129 44576 45.5 <NA>
#> 3 NA NA NA 45.0 <NA>
#> 4 191 222590201 5 46.5 <NA>
#> 5 60567 249966684 5 46.5 <NA>
#> 6 62610 237276974 5 46.5 <NA>
#> annotation_release_date gene_count_total gene_count_coding
#> 1 2017-02-07 39320 39320
#> 2 <NA> NA NA
#> 3 <NA> NA NA
#> 4 <NA> NA NA
#> 5 <NA> NA NA
#> 6 <NA> NA NA
#> gene_count_noncoding gene_count_pseudogene gene_count_other CC_ratio
#> 1 NA NA NA 278.7
#> 2 NA NA NA NA
#> 3 NA NA NA NA
#> 4 NA NA NA 6813.4
#> 5 NA NA NA 13012.5
#> 6 NA NA NA 11731.1
str(maize_stats)
#> 'data.frame': 95 obs. of 36 variables:
#> $ accession : chr "GCA_000005005.6" "GCA_000223545.1" "GCA_000275765.1" "GCA_001644905.2" ...
#> $ source : chr "GENBANK" "GENBANK" "GENBANK" "GENBANK" ...
#> $ species_taxid : int 4577 4577 4577 381124 381124 381124 381124 381124 4579 381124 ...
#> $ species_name : chr "Zea mays" "Zea mays" "Zea mays" "Zea mays subsp. mays" ...
#> $ species_common_name : chr "maize" "maize" "maize" "maize" ...
#> $ species_ecotype : chr NA NA NA NA ...
#> $ species_strain : logi NA NA NA NA NA NA ...
#> $ species_isolate : chr NA NA NA NA ...
#> $ species_cultivar : chr "B73" "Palomero Toluqueno EDMX-2231" "B73" "W22" ...
#> $ assembly_level : Factor w/ 4 levels "Complete","Chromosome",..: 2 3 4 2 2 2 2 2 3 3 ...
#> $ assembly_status : chr "current" "current" "current" "current" ...
#> $ assembly_name : chr "B73 RefGen_v4" "ZeaMays_PT_EDMX2233_1.0" "ZmaysB73_wgs_1.0" "Zm-W22-REFERENCE-NRGENE-2.0" ...
#> $ assembly_type : chr "haploid" "haploid" "haploid" "haploid" ...
#> $ submission_date : chr "2017-02-07" "2011-08-16" "2012-07-03" "2017-02-02" ...
#> $ submitter : chr "maizesequence" "Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO) CINVESTAV Irapuato" "maizesequence" "W22 Sequencing Consortium" ...
#> $ sequencing_technology : chr "PacBio" "454 GS20; 454 Titanium; Sanger" "454 GS FLX" "Illumina HiSeq; 10x Genomics" ...
#> $ atypical : logi FALSE TRUE TRUE FALSE FALSE FALSE ...
#> $ refseq_category : chr NA NA NA NA ...
#> $ chromosome_count : int 10 NA NA 10 10 10 10 10 NA NA ...
#> $ sequence_length : num 2.13e+09 1.77e+08 1.34e+06 2.13e+09 2.46e+09 ...
#> $ ungapped_length : num 2.10e+09 1.77e+08 1.34e+06 2.09e+09 2.43e+09 ...
#> $ contig_count : int 2787 196705 1844 68134 130125 117311 405730 182074 140894 102656 ...
#> $ contig_N50 : int 1279870 1129 949 72426 82504 96966 10873 39757 31764 71122 ...
#> $ contig_L50 : int 506 44576 434 8765 8782 7318 42688 16302 7912 8373 ...
#> $ scaffold_count : int 596 196697 NA 191 60567 62610 43299 3538 107418 48268 ...
#> $ scaffold_N50 : int 10679170 1129 NA 222590201 249966684 237276974 215148664 2707071 107689 2995073 ...
#> $ scaffold_L50 : int 62 44576 NA 5 5 5 5 233 499 168 ...
#> $ GC_percent : num 46.5 45.5 45 46.5 46.5 46.5 46 46.5 45.5 46 ...
#> $ annotation_provider : chr "maizesequence" NA NA NA ...
#> $ annotation_release_date: chr "2017-02-07" NA NA NA ...
#> $ gene_count_total : int 39320 NA NA NA NA NA NA NA NA NA ...
#> $ gene_count_coding : int 39320 NA NA NA NA NA NA NA NA NA ...
#> $ gene_count_noncoding : int NA NA NA NA NA NA NA NA NA NA ...
#> $ gene_count_pseudogene : int NA NA NA NA NA NA NA NA NA NA ...
#> $ gene_count_other : int NA NA NA NA NA NA NA NA NA NA ...
#> $ CC_ratio : num 279 NA NA 6813 13012 ...
# Example 2: get stats for all maize genomes using NCBI Taxonomy ID
maize_stats2 <- get_genome_stats(taxon = 4577)
# Checking if objects are the same
identical(maize_stats, maize_stats2)
#> [1] TRUE
As you can see, there are 95 maize genomes on the NCBI. You can also include filters in your searches by passing a list of key-value pairs with keys in list names and values in elements. For instance, to obtain only chromosome-scale and annotated maize genomes, you would run:
# Get chromosome-scale maize genomes with annotation
## Create list of filters
filt <- list(
    filters.has_annotation = "true",
    filters.assembly_level = "chromosome"
)
filt
#> $filters.has_annotation
#> [1] "true"
#>
#> $filters.assembly_level
#> [1] "chromosome"
## Obtain data
filtered_maize_genomes <- get_genome_stats(taxon = "Zea mays", filters = filt)
dim(filtered_maize_genomes)
#> [1] 4 36
For a full list of filtering parameters and possible arguments, see the API documentation.
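If you have already retrieved the unfiltered data frame, a similar subset can be approximated locally with base R. The sketch below uses a small mock data frame (the accessions and values are made up; only the column names come from the output above) and treats a non-missing gene_count_total as a proxy for "annotated". That proxy is an assumption and may not match the API's has_annotation filter exactly.

```r
# Mock reference data frame; accessions and values are made up for illustration
mock_stats <- data.frame(
    accession = c("GCA_A", "GCA_B", "GCA_C", "GCA_D"),
    assembly_level = c("Chromosome", "Scaffold", "Chromosome", "Contig"),
    gene_count_total = c(39320, NA, NA, 41000)
)

# Keep chromosome-scale assemblies that report a gene count
# (assumption: a non-missing gene count implies that an annotation exists)
is_chromosome <- mock_stats$assembly_level == "Chromosome"
is_annotated <- !is.na(mock_stats$gene_count_total)
local_filtered <- mock_stats[is_chromosome & is_annotated, ]

local_filtered$accession
```

Filtering through the API avoids downloading records you do not need, whereas local filtering lets you try several subsets without querying NCBI again.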
Now, suppose you sequenced a genome, obtained assembly and annotation stats, and want to compare them to NCBI genomes to identify potential issues. Examples of situations you may encounter include:
The genome you assembled is huge and you think there might be a problem with your assembly.
Your gene annotation pipeline predicted n genes, but you are not sure if this number is reasonable compared to other assemblies of the same species or closely-related species.
To compare user-defined summary stats with NCBI stats, you will use
the function compare_genome_stats(). This function will include the values
you observed for each statistic into a distribution (based on NCBI stats) and
return the percentile and rank of your observed values in each distribution.
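The idea can be sketched in base R as follows. This is only an illustration with simulated values, not the package's actual implementation. It assumes that the percentile is the proportion of values (observed value included) that are less than or equal to the observed one, and that rank 1 corresponds to the largest value. The helper name percentile_and_rank() is hypothetical.

```r
# Hypothetical helper: place one observed value in a reference distribution
percentile_and_rank <- function(reference, observed) {
    # Pool the observed value with the non-missing reference values
    pooled <- c(observed, reference[!is.na(reference)])
    # Proportion of pooled values that are <= the observed value
    percentile <- mean(pooled <= observed)
    # Rank in decreasing order: 1 means the largest value in the pool
    rank_desc <- rank(-pooled, ties.method = "min")[1]
    data.frame(percentile = percentile, rank = rank_desc)
}

# Simulated reference distribution of gene counts (made-up values)
reference_counts <- c(32000, 35000, 39000, 41000, NA, 44000)
result <- percentile_and_rank(reference_counts, observed = 40000)
result
```

With these made-up values, the observed count of 40000 is the third largest of six pooled values, so the sketch returns a rank of 3 and a percentile of about 0.67. For statistics where smaller values are better, an increasing rank may be more natural.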
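As an example, let’s go back to the maize stats we obtained in the previous section. Suppose you sequenced a new maize genome and observed the following values: a sequence length of 2.4 Gb, a total of 50,000 genes, and a CC ratio of 2 (the ratio of contig count to chromosome count; Wang and Wang 2022).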
To compare your observed values with those for publicly available maize genomes,
you need to store them in a data frame. The column accession is mandatory,
and any other column will be matched against columns in the data frame obtained
with get_genome_stats(). Thus, make sure column names in your data frame
match column names in the reference data frame. Then, you can compare both
data frames as below:
# Check column names in the data frame of stats for maize genomes on the NCBI
names(maize_stats)
#> [1] "accession" "source"
#> [3] "species_taxid" "species_name"
#> [5] "species_common_name" "species_ecotype"
#> [7] "species_strain" "species_isolate"
#> [9] "species_cultivar" "assembly_level"
#> [11] "assembly_status" "assembly_name"
#> [13] "assembly_type" "submission_date"
#> [15] "submitter" "sequencing_technology"
#> [17] "atypical" "refseq_category"
#> [19] "chromosome_count" "sequence_length"
#> [21] "ungapped_length" "contig_count"
#> [23] "contig_N50" "contig_L50"
#> [25] "scaffold_count" "scaffold_N50"
#> [27] "scaffold_L50" "GC_percent"
#> [29] "annotation_provider" "annotation_release_date"
#> [31] "gene_count_total" "gene_count_coding"
#> [33] "gene_count_noncoding" "gene_count_pseudogene"
#> [35] "gene_count_other" "CC_ratio"
# Create a simulated data frame of stats for a maize genome
my_stats <- data.frame(
    accession = "my_lovely_maize",
    sequence_length = 2.4 * 1e9,
    gene_count_total = 50000,
    CC_ratio = 2
)
# Compare stats
compare_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)
#> accession variable percentile rank
#> 1 my_lovely_maize sequence_length 0.97916667 3
#> 2 my_lovely_maize gene_count_total 1.00000000 1
#> 3 my_lovely_maize CC_ratio 0.02898551 2
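Because columns are matched by name, it can help to confirm beforehand that every column in your data frame also exists in the reference. The sketch below is one way to do this in base R. The user data frame is hypothetical and deliberately contains a misspelled column (gene_count instead of gene_count_total), and the reference names are a hand-copied subset of the names(maize_stats) output shown above.

```r
# Subset of the reference column names, copied from names(maize_stats) above
reference_cols <- c("accession", "sequence_length", "gene_count_total", "CC_ratio")

# Hypothetical user data frame with one misspelled column name
user_stats <- data.frame(
    accession = "my_genome",
    sequence_length = 2.4e9,
    gene_count = 50000
)

# Columns that have no counterpart in the reference
unmatched <- setdiff(names(user_stats), reference_cols)
if (length(unmatched) > 0) {
    message("Columns not found in the reference: ",
            paste(unmatched, collapse = ", "))
}
```

In a real session you would replace reference_cols with names(maize_stats) and fix any reported column before running the comparison.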
To have a visual representation of the summary stats obtained with
get_genome_stats(), you will use the function plot_genome_stats().
# Summarize genome stats in a plot
plot_genome_stats(ncbi_stats = maize_stats)
Finally, you can pass your data frame of observed stats to highlight your values (as red points) in the distributions.
plot_genome_stats(ncbi_stats = maize_stats, user_stats = my_stats)
One of the most common metrics to assess gene space completeness is
BUSCO (Benchmarking Universal Single-Copy Orthologs) (Simão et al. 2015).
cogeqc allows users to run BUSCO from an R session and visualize results
graphically. BUSCO summary statistics will help you assess which assemblies
have high quality based on the percentage of complete BUSCOs.
To run BUSCO from R, you will use the function run_busco(). Note: You must have BUSCO installed and in your PATH to use run_busco(). You can check if BUSCO is installed by running busco_is_installed(). If you don’t have it already, you can manually install it or use a conda virtual environment with the Bioconductor package Herper (Paul, Carroll, and Barrows 2021).
Here, we will use an example FASTA file containing the first 1,000 lines of the Herbaspirillum seropedicae SmR1 genome (GCA_000143225), which was downloaded from Ensembl Bacteria. We will run BUSCO using burkholderiales_odb10 as the lineage dataset. To view all available datasets, run list_busco_datasets().
# Path to FASTA file
sequence <- system.file("extdata", "Hse_subset.fa", package = "cogeqc")
# Path to directory where BUSCO datasets will be stored
download_path <- paste0(tempdir(), "/datasets")
# Run BUSCO if it is installed
if (busco_is_installed()) {
    run_busco(sequence, outlabel = "Hse", mode = "genome",
              lineage = "burkholderiales_odb10",
              outpath = tempdir(), download_path = download_path)
}
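Independently of cogeqc, you can check from base R whether an executable is visible on your PATH. The sketch below is a minimal illustration and not necessarily how busco_is_installed() works internally. The executable name busco is an assumption about how BUSCO was installed on your system.

```r
# Sys.which() returns an empty string when the executable is not found
# (assumption: the BUSCO executable is named "busco")
busco_path <- Sys.which("busco")

if (nzchar(busco_path)) {
    message("BUSCO found at: ", busco_path)
} else {
    message("BUSCO not found on PATH; install it before calling run_busco()")
}
```

If the executable is reported as missing even though BUSCO is installed, the PATH seen by your R session may differ from the one in your terminal.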
The output will be stored in the directory specified in outpath. You can read and parse BUSCO’s output with the function read_busco(). For example, let’s read the output of a BUSCO run using the genome of the green alga Ostreococcus tauri. The output directory is /extdata.
# Path to output directory
output_dir <- system.file("extdata", package = "cogeqc")
busco_summary <- read_busco(output_dir)
busco_summary
#> Class Frequency Lineage
#> 1 Complete_SC 1412 chlorophyta_odb10
#> 2 Complete_duplicate 4 chlorophyta_odb10
#> 3 Fragmented 35 chlorophyta_odb10
#> 4 Missing 68 chlorophyta_odb10
This is an example output for a BUSCO run with a single FASTA file. You can also specify a directory containing multiple FASTA files in the sequence argument of run_busco(). This way, BUSCO will be run in batch mode. Let’s see what the output of BUSCO in batch mode looks like:
data(batch_summary)
batch_summary
#> Class Frequency Lineage File
#> 1 Complete_SC 98.5 burkholderiales_odb10 Hse.fa
#> 2 Complete_SC 98.8 burkholderiales_odb10 Hru.fa
#> 3 Complete_duplicate 0.7 burkholderiales_odb10 Hse.fa
#> 4 Complete_duplicate 0.7 burkholderiales_odb10 Hru.fa
#> 5 Fragmented 0.4 burkholderiales_odb10 Hse.fa
#> 6 Fragmented 0.3 burkholderiales_odb10 Hru.fa
#> 7 Missing 0.4 burkholderiales_odb10 Hse.fa
#> 8 Missing 0.2 burkholderiales_odb10 Hru.fa
Compared with the previous data frame, this one has an additional column, File, which contains information on the FASTA file. Note also that the Frequency values here appear to be percentages rather than counts, as they sum to 100 for each file. The example dataset batch_summary contains the output of run_busco() using a directory containing two genomes (Herbaspirillum seropedicae SmR1 and Herbaspirillum rubrisubalbicans M1) as the value of the sequence argument.
After using run_busco() and parsing its output with read_busco(), users can visualize summary statistics with plot_busco().
# Single FASTA file - Ostreococcus tauri
plot_busco(busco_summary)
# Batch mode - Herbaspirillum seropedicae and H. rubrisubalbicans
plot_busco(batch_summary)
We usually consider genomes with >90% of complete BUSCOs as having high quality. Thus, we can conclude that the three genomes analyzed here are high-quality genomes.
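The arithmetic behind this conclusion can be sketched as follows, using the values printed in the two summaries above. The sketch assumes that Frequency holds counts in the single-file summary and percentages in the batch summary, which is inferred from the printed values. The short genome labels are chosen here for illustration.

```r
# Counts copied from busco_summary (single FASTA file, Ostreococcus tauri)
single <- data.frame(
    Class = c("Complete_SC", "Complete_duplicate", "Fragmented", "Missing"),
    Frequency = c(1412, 4, 35, 68)
)

# Percentage of complete BUSCOs = (single-copy + duplicated) / total
is_complete <- single$Class %in% c("Complete_SC", "Complete_duplicate")
otauri_complete <- 100 * sum(single$Frequency[is_complete]) / sum(single$Frequency)

# Percentages copied from batch_summary (complete single-copy + duplicated)
complete_pct <- c(
    Otauri = otauri_complete,
    Hse = 98.5 + 0.7,
    Hru = 98.8 + 0.7
)

round(complete_pct, 1)

# Which genomes pass the 90% threshold?
complete_pct > 90
```

All three genomes exceed the 90% threshold: about 93.2% for O. tauri, 99.2% for H. seropedicae, and 99.5% for H. rubrisubalbicans.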
This document was created under the following conditions:
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.3.0 RC (2023-04-13 r84269)
#> os Ubuntu 22.04.2 LTS
#> system x86_64, linux-gnu
#> ui X11
#> language (EN)
#> collate C
#> ctype en_US.UTF-8
#> tz America/New_York
#> date 2023-04-25
#> pandoc 2.7.3 @ /usr/bin/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> ape 5.7-1 2023-03-13 [2] CRAN (R 4.3.0)
#> aplot 0.1.10 2023-03-08 [2] CRAN (R 4.3.0)
#> beeswarm 0.4.0 2021-06-01 [2] CRAN (R 4.3.0)
#> BiocGenerics 0.46.0 2023-04-25 [2] Bioconductor
#> BiocManager 1.30.20 2023-02-24 [2] CRAN (R 4.3.0)
#> BiocStyle * 2.28.0 2023-04-25 [2] Bioconductor
#> Biostrings 2.68.0 2023-04-25 [2] Bioconductor
#> bitops 1.0-7 2021-04-24 [2] CRAN (R 4.3.0)
#> bookdown 0.33 2023-03-06 [2] CRAN (R 4.3.0)
#> bslib 0.4.2 2022-12-16 [2] CRAN (R 4.3.0)
#> cachem 1.0.7 2023-02-24 [2] CRAN (R 4.3.0)
#> cli 3.6.1 2023-03-23 [2] CRAN (R 4.3.0)
#> cogeqc * 1.4.0 2023-04-25 [1] Bioconductor
#> colorspace 2.1-0 2023-01-23 [2] CRAN (R 4.3.0)
#> crayon 1.5.2 2022-09-29 [2] CRAN (R 4.3.0)
#> digest 0.6.31 2022-12-11 [2] CRAN (R 4.3.0)
#> dplyr 1.1.2 2023-04-20 [2] CRAN (R 4.3.0)
#> evaluate 0.20 2023-01-17 [2] CRAN (R 4.3.0)
#> fansi 1.0.4 2023-01-22 [2] CRAN (R 4.3.0)
#> farver 2.1.1 2022-07-06 [2] CRAN (R 4.3.0)
#> fastmap 1.1.1 2023-02-24 [2] CRAN (R 4.3.0)
#> generics 0.1.3 2022-07-05 [2] CRAN (R 4.3.0)
#> GenomeInfoDb 1.36.0 2023-04-25 [2] Bioconductor
#> GenomeInfoDbData 1.2.10 2023-04-17 [2] Bioconductor
#> ggbeeswarm 0.7.1 2022-12-16 [2] CRAN (R 4.3.0)
#> ggfun 0.0.9 2022-11-21 [2] CRAN (R 4.3.0)
#> ggplot2 3.4.2 2023-04-03 [2] CRAN (R 4.3.0)
#> ggplotify 0.1.0 2021-09-02 [2] CRAN (R 4.3.0)
#> ggtree 3.8.0 2023-04-25 [2] Bioconductor
#> glue 1.6.2 2022-02-24 [2] CRAN (R 4.3.0)
#> gridGraphics 0.5-1 2020-12-13 [2] CRAN (R 4.3.0)
#> gtable 0.3.3 2023-03-21 [2] CRAN (R 4.3.0)
#> highr 0.10 2022-12-22 [2] CRAN (R 4.3.0)
#> htmltools 0.5.5 2023-03-23 [2] CRAN (R 4.3.0)
#> igraph 1.4.2 2023-04-07 [2] CRAN (R 4.3.0)
#> IRanges 2.34.0 2023-04-25 [2] Bioconductor
#> jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.3.0)
#> jsonlite 1.8.4 2022-12-06 [2] CRAN (R 4.3.0)
#> knitr 1.42 2023-01-25 [2] CRAN (R 4.3.0)
#> labeling 0.4.2 2020-10-20 [2] CRAN (R 4.3.0)
#> lattice 0.21-8 2023-04-05 [3] CRAN (R 4.3.0)
#> lazyeval 0.2.2 2019-03-15 [2] CRAN (R 4.3.0)
#> lifecycle 1.0.3 2022-10-07 [2] CRAN (R 4.3.0)
#> magrittr 2.0.3 2022-03-30 [2] CRAN (R 4.3.0)
#> munsell 0.5.0 2018-06-12 [2] CRAN (R 4.3.0)
#> nlme 3.1-162 2023-01-31 [3] CRAN (R 4.3.0)
#> patchwork 1.1.2 2022-08-19 [2] CRAN (R 4.3.0)
#> pillar 1.9.0 2023-03-22 [2] CRAN (R 4.3.0)
#> pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.3.0)
#> plyr 1.8.8 2022-11-11 [2] CRAN (R 4.3.0)
#> purrr 1.0.1 2023-01-10 [2] CRAN (R 4.3.0)
#> R6 2.5.1 2021-08-19 [2] CRAN (R 4.3.0)
#> Rcpp 1.0.10 2023-01-22 [2] CRAN (R 4.3.0)
#> RCurl 1.98-1.12 2023-03-27 [2] CRAN (R 4.3.0)
#> reshape2 1.4.4 2020-04-09 [2] CRAN (R 4.3.0)
#> rlang 1.1.0 2023-03-14 [2] CRAN (R 4.3.0)
#> rmarkdown 2.21 2023-03-26 [2] CRAN (R 4.3.0)
#> S4Vectors 0.38.0 2023-04-25 [2] Bioconductor
#> sass 0.4.5 2023-01-24 [2] CRAN (R 4.3.0)
#> scales 1.2.1 2022-08-20 [2] CRAN (R 4.3.0)
#> sessioninfo 1.2.2 2021-12-06 [2] CRAN (R 4.3.0)
#> stringi 1.7.12 2023-01-11 [2] CRAN (R 4.3.0)
#> stringr 1.5.0 2022-12-02 [2] CRAN (R 4.3.0)
#> tibble 3.2.1 2023-03-20 [2] CRAN (R 4.3.0)
#> tidyr 1.3.0 2023-01-24 [2] CRAN (R 4.3.0)
#> tidyselect 1.2.0 2022-10-10 [2] CRAN (R 4.3.0)
#> tidytree 0.4.2 2022-12-18 [2] CRAN (R 4.3.0)
#> treeio 1.24.0 2023-04-25 [2] Bioconductor
#> utf8 1.2.3 2023-01-31 [2] CRAN (R 4.3.0)
#> vctrs 0.6.2 2023-04-19 [2] CRAN (R 4.3.0)
#> vipor 0.4.5 2017-03-22 [2] CRAN (R 4.3.0)
#> withr 2.5.0 2022-03-03 [2] CRAN (R 4.3.0)
#> xfun 0.39 2023-04-20 [2] CRAN (R 4.3.0)
#> XVector 0.40.0 2023-04-25 [2] Bioconductor
#> yaml 2.3.7 2023-01-23 [2] CRAN (R 4.3.0)
#> yulab.utils 0.0.6 2022-12-20 [2] CRAN (R 4.3.0)
#> zlibbioc 1.46.0 2023-04-25 [2] Bioconductor
#>
#> [1] /tmp/Rtmpoc9WwJ/Rinstabb2d307bf37d
#> [2] /home/biocbuild/bbs-3.17-bioc/R/site-library
#> [3] /home/biocbuild/bbs-3.17-bioc/R/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
Paul, Matt, Thomas Carroll, and Doug Barrows. 2021. Herper: The Herper Package Is a Simple Toolset to Install and Manage Conda Packages and Environments from R. https://github.com/RockefellerUniversity/Herper.
Simão, Felipe A, Robert M Waterhouse, Panagiotis Ioannidis, Evgenia V Kriventseva, and Evgeny M Zdobnov. 2015. “BUSCO: Assessing Genome Assembly and Annotation Completeness with Single-Copy Orthologs.” Bioinformatics 31 (19): 3210–2.
Wang, Peng, and Fei Wang. 2022. “A Proposed Metric Set for Evaluation of Genome Assembly Quality.” Trends in Genetics.