This report was generated on 2024-07-20 03:32:22 using CompareM2 v2.8.2 based at /glittertind/home/carl/comparem2. Content is dynamically rendered based on the results located at /glittertind/home/carl/comparem2/tests/strachan_campylo.

Samples

Table 1: Overview of the samples analysed in this batch. Because mashtree has run, the samples are arranged by the order of the mashtree output.

Report sections

Here is an overview of the number of result files that have been found for each analysis. A report section is only rendered if relevant result files are present for that analysis. Each section can be triggered to run by calling comparem2 with a trailing --until <section>.

Table 2: Overview of sections that are rendered in this report. “n / expected” shows the number of analysis files found versus how many are expected to be present. Sections are only rendered if relevant files exist. Analyses that perform comparisons between samples generally output only one set of results, regardless of the number of input files.


Assembly statistics

rule assembly_stats

Table 3: Assembly statistics are provided by assembly-stats. N50 indicates the length of the smallest contig that (together with the longer contigs) covers at least half of the total assembly length.
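The N50 definition above can be made concrete with a short sketch. It illustrates the definition only and is not the code that assembly-stats runs; the function name n50 and the example contig lengths are hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    summed length of all contigs.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Hypothetical contig lengths for one sample (total length 300).
example_lengths = [100, 80, 50, 30, 20, 10, 10]
print(n50(example_lengths))
```

In this example the two longest contigs (100 and 80) together cover 180 of 300 bases, which passes the halfway mark of 150, so the N50 is 80.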


Contig sizes and GC-content

rule sequence_lengths

Fig. 1: Visualization of the length of each fasta record for each sample. The colors show the mean GC content for each contig (fasta record).
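The two quantities plotted per fasta record, length and GC content, can be computed as in the following minimal sketch. The helper names parse_fasta and length_and_gc and the example records are hypothetical; the pipeline's own implementation may differ, for example in how ambiguous bases are handled.

```python
def parse_fasta(text):
    """Return a dict mapping each record name to its full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the record name.
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def length_and_gc(sequence):
    """Return (length, GC fraction) for one sequence."""
    seq = sequence.upper()
    gc_count = seq.count("G") + seq.count("C")
    gc_fraction = gc_count / len(seq) if seq else 0.0
    return len(seq), gc_fraction


# Hypothetical two-contig assembly.
fasta_text = ">contig_1 example\nATGC\nGG\n>contig_2\nAAAA\n"
for record_name, record_seq in parse_fasta(fasta_text).items():
    print(record_name, length_and_gc(record_seq))
```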


BUSCO

rule busco

Table 4: BUSCO results. “BUSCO estimates the completeness and redundancy of processed genomic data based on universal single-copy orthologs.” The following columns are given as percentages [%]: C: Complete, S: Complete and single-copy, D: Complete and duplicated, F: Fragmented, M: Missing. The column n gives the total number of BUSCO groups searched. For each sample, only the best lineage match (in terms of completeness) is shown.

Fig. 2: BUSCO results visualized. Legend: S: Complete and single-copy; D: Complete and duplicated; F: Fragmented; M: Missing. For each sample, only the best lineage match (in terms of completeness) is shown.


CheckM2

rule checkm2

Table 5: CheckM2 results.


GTDB taxonomical classification

rule gtdbtk

GTDB uses several public repositories with reference sequences and assigns the most likely name by measuring the average nucleotide identity (ANI) and relative evolutionary divergence (RED).

Table 6: Species classification provided by the GTDB-tk classify_wf workflow.


MLST

rule mlst

Table 7: MLST (Multi Locus Sequence Typing) results. Called with mlst, which incorporates components of the PubMLST database.

How to customize the mlst analysis

mlst automatically detects the best typing scheme, one sample at a time. If you don’t agree with the automatic detection, you can enforce a single scheme across all samples by (re)running comparem2 with the added command-line arguments: --config mlst_scheme=hpylori --forcerun mlst. Replace hpylori with the mlst scheme you wish to use. You can find a full list of available schemes in “results_comparem2/mlst/mlst_schemes.txt”.
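As a minimal sketch, the rerun described above can be scripted by assembling the argument list in Python. The function name mlst_rerun_command and the optional available_schemes check are hypothetical conveniences; only the comparem2 arguments themselves are taken from the text above.

```python
def mlst_rerun_command(scheme, available_schemes=None):
    """Build the comparem2 argument list that enforces one mlst scheme.

    available_schemes is an optional collection of valid scheme names
    (hypothetical helper input); when given, an unknown scheme is rejected.
    """
    if available_schemes is not None and scheme not in available_schemes:
        raise ValueError(f"unknown mlst scheme: {scheme!r}")
    return [
        "comparem2",
        "--config", f"mlst_scheme={scheme}",
        "--forcerun", "mlst",
    ]


# Print the command line for the scheme used as an example in the text.
print(" ".join(mlst_rerun_command("hpylori")))
```

The returned list can be handed to subprocess.run if the rerun should be launched from Python rather than typed in a shell.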


Antimicrobial Resistance

Using Abricate, the assemblies are scanned for known resistance genes in the NCBI, CARD, PlasmidFinder and VFDB antimicrobial resistance and virulence factor databases. Only the results from CARD and VFDB are shown here.

CARD

Table 8: Table of CARD (Comprehensive Antibiotic Resistance Database) results.


VFDB

Table 9: Table of VFDB virulence factor calls: “An integrated and comprehensive online resource for curating information about virulence factors of bacterial pathogens”.


Genomic annotation

rule prokka

Table 10: Overview of the number of different gene types. Called using the Prokka genome annotator.


KEGG pathway enrichment analysis

rule kegg_pathway

For each genome, the amino-acid sequences called by prokka-prodigal are searched against the UniRef100-KO database. This is the same database that CheckM2 uses. For the results produced for this analysis, the alignment criteria are stricter (>=85% coverage and >=50% identity). Using the “enricher” function from clusterProfiler, Benjamini-Hochberg adjusted p-values for the pathway enrichment of the called genes are computed.
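The Benjamini-Hochberg adjustment mentioned above can be sketched in a few lines. This is a plain-Python illustration of the standard procedure, not clusterProfiler's code; the function name benjamini_hochberg and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    With m tests, the p-value of rank r (1 = smallest) is scaled by
    m / r. Walking from the largest rank down and keeping a running
    minimum makes the adjusted values monotone and capped at 1.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical raw enrichment p-values for four pathways.
raw_pvalues = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw_pvalues))
```

For these four values the adjusted p-values are approximately 0.02, 0.04, 0.04 and 0.02, so all four pathways would pass a 0.05 threshold.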

Fig. 3: Summary of the KEGG-ortholog based pathway enrichment analysis results. The KEGG pathway hierarchy consists of a number of pathway classes that are listed on the vertical axis. n denotes the number of pathways from that class that are significantly enriched in each sample.

Table 11: Results from the KEGG-ortholog based pathway enrichment analysis produced with clusterProfiler::enricher. Only significant results are shown. The KOs can be entered directly into KEGG mapper search by setting mode to “Reference”.


CAZymes from dbCAN

rule dbcan

Table 12: Overview of dbCAN CAZyme results. Called using run_dbcan.

Fig. 4: Overview of carbohydrate-active enzymes (CAZymes). The count (color) signifies the number of CAZymes in each sample that can degrade each of the listed substrates.


Pan and Core genome

rule panaroo

Panaroo, the pan-genome pipeline, computes the number of orthologous genes in a number of core/pan spectrum partitions.

The core genome denotes the genes which are conserved between all samples (intersection), whereas the pan genome is the union of all genes across all samples.
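The intersection/union definition above maps directly onto set operations. The sketch below assumes that genes have already been clustered into orthologous groups with shared labels, which is the hard part that Panaroo handles; the function name core_and_pan and the gene labels are hypothetical.

```python
def core_and_pan(genes_by_sample):
    """Return (core, pan) gene sets for a mapping of sample -> genes.

    core: genes present in every sample (intersection).
    pan:  genes present in at least one sample (union).
    """
    gene_sets = [set(genes) for genes in genes_by_sample.values()]
    if not gene_sets:
        return set(), set()
    return set.intersection(*gene_sets), set.union(*gene_sets)


# Hypothetical presence lists for three samples.
example = {
    "sample_a": {"gyrA", "recA", "tetO"},
    "sample_b": {"gyrA", "recA"},
    "sample_c": {"gyrA", "recA", "cmeB"},
}
core, pan = core_and_pan(example)
print(sorted(core), sorted(pan))
```

Here the core genome contains the two genes found in all three samples, while the pan genome contains all four distinct genes.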

Table 13: Distribution of genes in different core/pan spectrum partitions.

Fig. 5: Genes shared between samples. Each vertical line represents a gene, and all lines have the same width regardless of the size of the gene. The genes are colored by the number of samples sharing them.


SNP distances

rule snp_dists

Counts the number of differences between any pair of samples on the core genome produced by panaroo. SNP distances do not approximate the evolutionary distance, as they are not adjusted for the different probabilities of transitions and transversions, etc. Rather, they give a ballpark indication of the difference between the samples. Note that the SNP distances are highly sensitive to the core/pan genome size ratio.
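Counting pairwise differences on an alignment can be sketched as follows. The sketch assumes that only columns where both sequences carry an unambiguous A, C, G or T are compared, which is an assumption about counting behaviour and not something this report states; the function names and the toy alignment are hypothetical.

```python
from itertools import combinations

VALID_BASES = set("ACGT")


def snp_distance(seq_a, seq_b):
    """Count aligned columns where two sequences differ.

    Columns containing a gap or an ambiguous base in either sequence
    are skipped (assumption, see the lead-in).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    return sum(
        1
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in VALID_BASES and b in VALID_BASES
    )


def pairwise_snp_distances(alignment):
    """Return {(sample_1, sample_2): distance} for every sample pair."""
    return {
        (s1, s2): snp_distance(alignment[s1], alignment[s2])
        for s1, s2 in combinations(sorted(alignment), 2)
    }


# Hypothetical core-genome alignment of three samples.
toy_alignment = {"s1": "ACGTAC", "s2": "ACTTAC", "s3": "AC-TAA"}
print(pairwise_snp_distances(toy_alignment))
```

Because every difference counts equally, the result is a raw count and not a model-corrected evolutionary distance, in line with the caveat above.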

Table 14: Pairwise SNP distances between all samples.

Fig. 6: Pairwise SNP distances between all samples.


Mashtree phylogeny

rule mashtree

Mashtree computes an approximation of ANI using the MinHash distance measure. From these distances, a phylogenetic tree is then created using the neighbor-joining algorithm. The plotted tree is not rooted.
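The MinHash distance idea can be sketched as follows: each genome is reduced to the smallest hashes of its k-mers, the Jaccard index j is estimated from two such sketches, and the Mash distance is taken as -ln(2j / (1 + j)) / k. This is a simplified illustration and not Mashtree's implementation: it uses SHA-1 from the standard library, ignores the reverse strand, and all function names, the default k-mer size and the sketch size are assumptions.

```python
import hashlib
import math


def kmer_hashes(sequence, k):
    """Return the set of 64-bit hashes of all k-mers in a sequence."""
    seq = sequence.upper()
    return {
        int.from_bytes(hashlib.sha1(seq[i:i + k].encode()).digest()[:8], "big")
        for i in range(len(seq) - k + 1)
    }


def sketch(sequence, k=21, size=1000):
    """Bottom-s sketch: the `size` smallest k-mer hashes."""
    return set(sorted(kmer_hashes(sequence, k))[:size])


def jaccard_estimate(sketch_a, sketch_b, size=1000):
    """Estimate the Jaccard index from two bottom-s sketches."""
    union_bottom = sorted(sketch_a | sketch_b)[:size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in sketch_a and h in sketch_b)
    return shared / len(union_bottom)


def mash_distance(jaccard, k=21):
    """Convert a Jaccard estimate into a Mash distance (about 1 - ANI)."""
    if jaccard <= 0:
        return 1.0
    return max(0.0, -math.log(2 * jaccard / (1 + jaccard)) / k)


# Hypothetical toy sequences; real genomes are millions of bases long.
seq_a = "ACGTTGCAAGGCTTAC" * 5
seq_b = "ACGTTGCAAGGCTTAC" * 4 + "ACGTTGCATGGCTTAC"
j = jaccard_estimate(sketch(seq_a, k=7), sketch(seq_b, k=7))
print(mash_distance(j, k=7))
```

A matrix of such pairwise distances is the kind of input a neighbor-joining step turns into an unrooted tree.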

Fig. 7: Approximation of a phylogenetic tree calculated with mashtree. The horizontal axis is an approximation to 1-ANI. The tree is not rooted.


CompareM2 v2.8.2 genomes-to-report pipeline. Copyright (C) 2019-2024 C. M. Kobel GNU GPL v3