This report was generated on 2024-07-20 01:20:23 using CompareM2 v2.8.2 based at /glittertind/home/carl/comparem2. Content is dynamically rendered based on the results located at /glittertind/home/carl/comparem2/tests/E._faecium.
Table 1: Overview of the samples analysed in this batch. Because mashtree has run, the samples are arranged by the order of the mashtree output.
Here is an overview of the number of result files that have been
found for each analysis. A report section is only rendered if relevant
result files are present for that analysis. Each section can be
triggered to run by calling comparem2 with a trailing
--until <section>
Table 2: Overview of sections that are rendered in this report. “n / expected” shows the number of analysis files found versus how many are expected to be present. Sections are only rendered if relevant files exist. Analyses that perform comparisons between samples generally output only one set of results, independent of the number of input files.
rule assembly_stats
Table 3: Assembly statistics are provided by assembly-stats. N50 is the length of the shortest contig that, together with all longer contigs, covers at least half of the total assembly length.
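The N50 definition above can be sketched in a few lines of Python. This is only an illustration of the definition, not the code that assembly-stats runs; the function name and the example contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig downwards until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


# Hypothetical contig lengths, for illustration only (total length 300).
example_lengths = [100, 80, 50, 30, 20, 20]
print(n50(example_lengths))
```

With these lengths, the two longest contigs (100 + 80) are the first to reach half of the total, so the N50 is the length of the second contig.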
rule sequence_lengths
Fig. 1: Visualization of the length of each fasta record for each sample. The colors show the mean GC content for each contig (fasta record).
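As a rough sketch of the quantities plotted in the figure, the length and GC fraction of each fasta record could be computed as follows. The parser and the function name are hypothetical and assume a plain, uncompressed fasta stream; the pipeline's own implementation may differ.

```python
import io


def fasta_lengths_and_gc(handle):
    """Yield (record_id, length, gc_fraction) for each record in a FASTA stream."""

    def summarize(record_id, chunks):
        seq = "".join(chunks).upper()
        gc_count = seq.count("G") + seq.count("C")
        # GC fraction is taken over all characters, including any N.
        gc_fraction = gc_count / len(seq) if seq else 0.0
        return record_id, len(seq), gc_fraction

    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if record_id is not None:
                yield summarize(record_id, chunks)
            # Keep only the first word of the header as the record id.
            record_id = (line[1:].split() or [""])[0]
            chunks = []
        elif line and record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield summarize(record_id, chunks)


# Hypothetical two-contig assembly, for illustration only.
demo = io.StringIO(">contig_1 example\nACGT\nGG\n>contig_2\nAATT\n")
for record in fasta_lengths_and_gc(demo):
    print(record)
```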
rule busco
Table 4: BUSCO results. “BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs.” The following columns are given as percentages [%]: C: Complete, S: Complete and single-copy, D: Complete and duplicated, F: Fragmented, M: Missing. The column n is a count: the total number of BUSCO groups searched. For each sample, only the best lineage match (in terms of completeness) is shown.
Fig. 2: BUSCO results visualized. Legend: S: Complete and single-copy; D: Complete and duplicated; F: Fragmented; M: Missing. For each sample, only the best lineage match (in terms of completeness) is shown.
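The BUSCO categories relate to each other in a simple way: C is the sum of S and D, and C, F and M together account for all n groups searched. A minimal sketch of that arithmetic, assuming raw counts per category are available (the function name and the counts below are hypothetical):

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts to percentages of all groups searched."""
    n = single + duplicated + fragmented + missing
    if n == 0:
        raise ValueError("no BUSCO groups searched")

    def pct(count):
        return round(100 * count / n, 1)

    return {
        "C": pct(single + duplicated),  # complete = single-copy + duplicated
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": n,  # a count, not a percentage
    }


# Hypothetical counts, for illustration only.
print(busco_percentages(single=390, duplicated=4, fragmented=3, missing=5))
```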
rule gtdbtk
GTDB uses several public
repositories with reference sequences and assigns the most likely name
by measuring the average nucleotide identity (ANI) and relative
evolutionary divergence (RED).
Table 6: Species classification provided by the GTDB-Tk classify_wf workflow.
rule mlst
Table 7: MLST (Multi-Locus Sequence Typing) results. Called with mlst, which incorporates components of the PubMLST database.
Mlst automatically detects the best scheme for typing, one sample at
a time. If you don’t agree with the automatic detection, you can enforce
a single scheme across all samples by (re)running comparem2 with the
added command-line argument:
--config mlst_scheme=hpylori --forcerun mlst
Replace hpylori with the mlst scheme you wish to use. You can find a
full list of available schemes in the file
“results_comparem2/mlst/mlst_schemes.txt”.
Using Abricate, the assemblies are scanned for known genes in the NCBI, CARD, PlasmidFinder and VFDB antimicrobial resistance and virulence factor databases. Here, only the results from CARD and VFDB are shown.
Table 9: Table of VFDB virulence factor calls: “An integrated and comprehensive online resource for curating information about virulence factors of bacterial pathogens”.
rule prokka
Table 10: Overview of the number of different gene types. Called using the Prokka genome annotator.
rule kegg_pathway
For each genome, the amino-acid sequences called by prokka (prodigal)
are searched against the Uniref100-KO database.
This is the same database that CheckM2 uses. For the results produced
for this analysis, the alignment criteria are stricter (>=85%
coverage and >=50% identity). Using the “enricher” function from
clusterProfiler, Benjamini-Hochberg adjusted p-values for the pathway
enrichment of the called genes are computed.
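To make the statistics more concrete, here is a minimal Python sketch of an over-representation test followed by Benjamini-Hochberg adjustment. It assumes the usual hypergeometric formulation of pathway enrichment and is not the clusterProfiler code itself; all function names and numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(overlap, pathway_size, sample_size, background_size):
    """P(X >= overlap) when drawing sample_size genes from the background."""
    total = comb(background_size, sample_size)
    upper = min(pathway_size, sample_size)
    hits = sum(
        comb(pathway_size, i) * comb(background_size - pathway_size, sample_size - i)
        for i in range(overlap, upper + 1)
    )
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Go from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four pathways, for illustration only.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

A pathway would then be reported as significant when its adjusted p-value falls below the chosen threshold.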
Fig. 3: Summary of the KEGG-ortholog based pathway enrichment analysis results. The KEGG pathway hierarchy consists of a number of pathway classes that are listed on the vertical axis. n denotes the number of pathways from that class that are significantly enriched in each sample.
Table 11: Results from the KEGG-ortholog based pathway enrichment analysis produced with clusterProfiler::enricher. Only significant results are shown. The KOs can be entered directly into KEGG mapper search by setting mode to “Reference”.
rule dbcan
Table 12: Overview of dbcan CAZyme results. Called using run_dbcan.
Fig. 4: Overview of carbohydrate-active enzymes (CAZymes). The count (color) signifies the number of CAZymes in each sample that can degrade each of the listed substrates.
rule panaroo
Panaroo, the pan
genome pipeline, computes the number of orthologous genes in a number of
core/pan spectrum partitions.
The core genome denotes the genes which are conserved between all samples (intersection), whereas the pan genome is the union of all genes across all samples.
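In set terms, the core genome is an intersection and the pan genome is a union. A minimal sketch, assuming each sample is represented by a set of gene (cluster) names; the sample and gene names below are hypothetical:

```python
def core_and_pan(genes_by_sample):
    """Return (core, pan) gene sets from a mapping of sample name to genes."""
    gene_sets = [set(genes) for genes in genes_by_sample.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # present in every sample
    pan = set.union(*gene_sets)          # present in at least one sample
    return core, pan


# Hypothetical gene presence per sample, for illustration only.
samples = {
    "sample_a": {"gyrA", "recA", "vanA"},
    "sample_b": {"gyrA", "recA", "esp"},
    "sample_c": {"gyrA", "recA"},
}
core, pan = core_and_pan(samples)
print(sorted(core), sorted(pan))
```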
Table 13: Distribution of genes in different core/pan spectrum partitions.
Fig. 5: Genes shared between samples. Each vertical line represents a gene, and all lines have the same width regardless of the size of the gene. The genes are colored by the number of samples sharing them.
rule snp_dists
Counts the number of differences between
any pair of samples on the core genome produced by panaroo. SNP distances
do not approximate the evolutionary distance, as they are not adjusted
for different probabilities of transitions and transversions etc.
Rather, they give a ballpark indication of the difference between the
samples. Note that the SNP distances are highly sensitive to
the core/pan genome size ratio.
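The counting itself is simple and can be sketched as below. The sketch assumes that the core genome is available as equal-length aligned sequences and that only positions where both samples carry A, C, G or T are compared; both are assumptions for illustration, and the sequences are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, alphabet="ACGT"):
    """Count aligned positions where both bases are in alphabet and differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in alphabet and b in alphabet
    )


def pairwise_snp_distances(alignment):
    """Return {(sample_x, sample_y): distance} for every pair of samples."""
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }


# Hypothetical core genome alignment, for illustration only.
alignment = {
    "s1": "ACGTACGT",
    "s2": "ACGTACGA",
    "s3": "AC-TTCGA",
}
print(pairwise_snp_distances(alignment))
```

Because gaps and ambiguous bases are skipped, a smaller core genome leaves fewer comparable positions, which is one reason the distances depend on the core/pan genome size ratio.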
Table 14: Pairwise SNP distances between all samples.
Fig. 6: Pairwise SNP distances between all samples.
rule mashtree
Mashtree computes an
approximation of ANI using the MinHash distance measure. From these
distances, a phylogenetic tree is then created using the
neighbor-joining algorithm. The plotted tree is not rooted.
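The distance step can be illustrated with a small MinHash sketch. It uses the Mash distance formula D = -(1/k) ln(2j / (1 + j)), where j is the estimated Jaccard index of the two k-mer sets. Everything else here is a simplification: SHA-1 stands in for the hash function, k-mers are not canonicalised, and the sequences and parameter values are hypothetical. The neighbor-joining step is not shown.

```python
import hashlib
from math import log


def sketch(seq, k, size):
    """Return the `size` smallest k-mer hashes of seq (a bottom-s sketch)."""
    seq = seq.upper()
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    hashes = {
        int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")
        for kmer in kmers
    }
    return sorted(hashes)[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(jaccard, k):
    """Mash distance: roughly 1 - ANI for closely related genomes."""
    if jaccard <= 0:
        return 1.0
    return max(0.0, -log(2 * jaccard / (1 + jaccard)) / k)


# Hypothetical sequences and parameters, for illustration only.
k, size = 5, 50
a = sketch("ACGTTGCAAGGCTTAACCGGTTAGC", k, size)
b = sketch("ACGTTGCAAGGCTTAACCGGTAAGC", k, size)
print(mash_distance(jaccard_estimate(a, b, size), k))
```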
Fig. 7: Approximation of a phylogenetic tree calculated with mashtree. The horizontal axis is an approximation to 1-ANI. The tree is not rooted.
CompareM2 v2.8.2 genomes-to-report pipeline. Copyright (C) 2019-2024 C. M. Kobel GNU GPL v3