cgMLST report for joined analysis

User carlus executed the pipeline.

Overview

  • Last update on 2020-11-24.

  • Database contains 12 samples.

  • NA samples were of bad quality (having more than 5 per cent missing or dubiously called alleles), see Allele QC for details. Bad samples: NA

  • Recent updates (last 14 days) for clusters: None

Provided samples per partner

Provided samples per date

Matches between partners

Number of clusters that match between partners
          partnerA  partnerB  partnerC
partnerA         4         4         4
partnerB         4         4         4
partnerC         4         4         4

Number of samples with matches in same cluster
          partnerA  partnerB  partnerC
partnerA         8         8         8
partnerB         8         8         8
partnerC         8         8         8
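
The first table can be read as a pairwise count of shared clusters. The following is a minimal sketch only: the records and cluster names are hypothetical, and it assumes that a cluster "matches" between two partners when it holds at least one sample from each of them, which the report does not state explicitly.

```python
from collections import defaultdict
from itertools import product

# Hypothetical records: (sample, sender, cluster name). Not taken from the report.
records = [
    ("s1", "partnerA", "Homer"),
    ("s2", "partnerB", "Homer"),
    ("s3", "partnerA", "Sappho"),
    ("s4", "partnerC", "Sappho"),
    ("s5", "partnerB", "Pindar"),
]

def shared_cluster_counts(records):
    """Count, for each ordered partner pair, the clusters holding samples of both."""
    senders_per_cluster = defaultdict(set)
    for _sample, sender, cluster in records:
        senders_per_cluster[cluster].add(sender)
    partners = sorted({sender for _sample, sender, _cluster in records})
    counts = {pair: 0 for pair in product(partners, repeat=2)}
    for senders in senders_per_cluster.values():
        # Every pair of senders present in this cluster shares it.
        for pair in product(sorted(senders), repeat=2):
            counts[pair] += 1
    return partners, counts

partners, counts = shared_cluster_counts(records)
for row in partners:
    print(row, *(counts[(row, col)] for col in partners))
```

Under this assumption the diagonal simply counts the clusters to which a partner contributed at least one sample.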

Cluster summary

Shown is a summary of each cluster (containing at least 2 samples within allele distance 10).

  • You can identify clusters of interest by searching for recent additions of samples, counts per sending institution, etc., or a combination thereof.
  • You can check for updates of a specific cluster by searching for its number, name or representative sample.
  • More information about the samples in each cluster can be found in the table Samples in clusters below or in the tab Cluster details.

Column description

  • Cluster name: stable naming scheme (ancient Greek writers)
  • Serovar: The serovar (assigned to all samples in a cluster, where available)
  • Cluster name long: Combination of cluster information elements, {species}{serovar}{year_of_oldest_sample}_{cluster_name}
  • Representative sample: Oldest sample in the cluster; if more than one sample has the same date, the first in alphabetical order is chosen (this is meant to be a rather stable identifier)
  • timestamp_mostrecent, mostrecent_*: Date when the most recent sample was analysed
  • timestamp_mostancient: Date when the oldest sample in the cluster was analysed
  • age: Number of days since the most recent sample was analysed
  • min_diff: Minimum allele distance between samples from different senders
  • which_min_diff: Which sender combination the min_diff value originates from
  • min_same: Minimum allele distance between samples from same sender
  • external_cluster_name: Clusters that match the representative sample of an externally defined cluster (e.g. EPI-cluster)
  • representative_sample_externalcluster: Representative sample of the externally defined cluster (e.g. EPI-cluster)
  • Details: Link to the cluster report with more details on the cluster
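
Several of the columns above follow directly from the sample dates, senders and pairwise allele distances. The sketch below illustrates the definitions as written; it is not the pipeline's own code, and the sample names, senders, dates, distances, species and serovar strings are all hypothetical.

```python
from datetime import date
from itertools import combinations

# Hypothetical cluster members: sample -> (sender, analysis date).
members = {
    "sample_b": ("partnerA", date(2020, 11, 2)),
    "sample_a": ("partnerB", date(2020, 11, 2)),
    "sample_c": ("partnerA", date(2020, 11, 20)),
}
# Hypothetical pairwise allele distances, keyed by alphabetically sorted pair.
distances = {
    ("sample_a", "sample_b"): 4,
    ("sample_a", "sample_c"): 7,
    ("sample_b", "sample_c"): 2,
}

def summarise_cluster(members, distances, species, serovar, cluster_name, today):
    # Representative: oldest sample; ties are broken alphabetically by name.
    representative = min(members, key=lambda s: (members[s][1], s))
    oldest = members[representative][1]
    most_recent = max(day for _sender, day in members.values())
    min_diff, which_min_diff, min_same = None, None, None
    for a, b in combinations(sorted(members), 2):
        dist = distances[(a, b)]
        sender_a, sender_b = members[a][0], members[b][0]
        if sender_a == sender_b:
            if min_same is None or dist < min_same:
                min_same = dist
        elif min_diff is None or dist < min_diff:
            min_diff = dist
            which_min_diff = tuple(sorted((sender_a, sender_b)))
    return {
        "representative_sample": representative,
        "cluster_name_long": f"{species}{serovar}{oldest.year}_{cluster_name}",
        "age": (today - most_recent).days,
        "min_diff": min_diff,
        "which_min_diff": which_min_diff,
        "min_same": min_same,
    }

summary = summarise_cluster(
    members, distances, "Salmonella", "Enteritidis", "Homer", date(2020, 11, 24)
)
print(summary)
```

Here sample_a and sample_b share the oldest date, so the alphabetical tie-break selects sample_a as representative.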

Samples in clusters

Shown is a table of all samples and their cluster association. Samples that are not found within an allele distance of 10 are termed Orphans.
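
The split into clusters and orphans can be illustrated with single-linkage grouping at a fixed threshold. This is a sketch under stated assumptions: the samples and distances are hypothetical, the threshold is treated as inclusive (distance ≤ 10), and a simple union-find stands in for whatever clustering routine the pipeline actually uses.

```python
def single_linkage_clusters(samples, distances, threshold=10):
    """Group samples connected by any chain of pairwise distances <= threshold."""
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving keeps lookups short
            s = parent[s]
        return s

    for (a, b), dist in distances.items():
        if dist <= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    # Groups of two or more samples are clusters; singletons are orphans.
    clusters = sorted(sorted(g) for g in groups.values() if len(g) >= 2)
    orphans = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, orphans

# Hypothetical data: s1 and s3 are 14 alleles apart but are linked through s2.
samples = ["s1", "s2", "s3", "s4", "s5"]
distances = {
    ("s1", "s2"): 3,
    ("s2", "s3"): 9,
    ("s1", "s3"): 14,
    ("s3", "s4"): 25,
    ("s4", "s5"): 11,
}
clusters, orphans = single_linkage_clusters(samples, distances)
print(clusters, orphans)
```

Because linkage is by chain, two samples in the same cluster may be further apart than the threshold, as s1 and s3 are here.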

Column description

  • Sample: Sample name
  • Sender: Information about the sending organization
  • Timestamp: Time of cgMLST analysis
  • Cluster_name: stable cluster nomenclature for clustering at an allele distance of 10 (ancient Greek writers)
  • Representative_sample: Oldest sample in the cluster; if more than one sample has the same date, the first in alphabetical order is chosen (this is meant to be a rather stable identifier)
  • external_cluster_name: Samples that match the representative sample of an externally defined cluster (e.g. EPI-cluster) within a distance threshold of 10
  • representative_sample_externalcluster: Representative sample of the externally defined cluster (e.g. EPI-cluster)
  • Cluster_name_10, Cluster_name_3: stable cluster nomenclature for clustering at distances 10, 3, etc. (numeric or Greek letter)
  • cluster_code: Combination of the different cluster names at different allele distance thresholds: 10.10.3

Orphans

List of orphan samples (not related to any other sample in the dataset) and their minimum distance to the clusters' representative samples.

Note that the orphan numbers are not stable and may change.

The distance of each orphan sample to any of the cluster representative samples is shown in section Distance from samples to clusters.
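
The minimum distance reported per orphan amounts to picking the closest cluster representative. A minimal sketch, with hypothetical orphan names, representative names and distances:

```python
# Hypothetical allele distances from orphan samples to cluster representatives.
orphan_to_representative = {
    "orphan_1": {"rep_Homer": 23, "rep_Sappho": 41},
    "orphan_2": {"rep_Homer": 67, "rep_Sappho": 18},
}

def nearest_cluster(distances_by_orphan):
    """Return, per orphan, the closest representative and its allele distance."""
    nearest = {}
    for orphan, dists in distances_by_orphan.items():
        # Ties on distance are broken alphabetically by representative name.
        representative = min(dists, key=lambda rep: (dists[rep], rep))
        nearest[orphan] = (representative, dists[representative])
    return nearest

nearest = nearest_cluster(orphan_to_representative)
print(nearest)
```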

Inter-cluster relation

Shown here are the distances between representative samples of each cluster.

Distance matrix between clusters

Distance table between clusters

Inter-cluster relation as a single-linkage tree

The representative samples of each cluster were clustered using single-linkage hierarchical clustering. The coloring follows from a distance threshold of 10.

Distance from samples to clusters

Shown here is a distance table which contains distances between all samples in column Sample and selected representative samples of each cluster in column Cluster_representative.

Note that distances larger than 50 were omitted for performance reasons.
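
The table layout and the cutoff can be sketched as a flattening step over a sample-by-representative distance matrix. The matrix values and names below are hypothetical; only the column names Sample and Cluster_representative and the cutoff of 50 come from the report.

```python
def distance_table(matrix, representatives, max_distance=50):
    """Flatten nested distances to (Sample, Cluster_representative, distance) rows."""
    rows = []
    for sample in sorted(matrix):
        for rep in representatives:
            dist = matrix[sample][rep]
            if dist <= max_distance:  # distances larger than the cutoff are omitted
                rows.append((sample, rep, dist))
    return rows

# Hypothetical distances between samples and cluster representatives.
matrix = {
    "s1": {"rep_a": 0, "rep_b": 62},
    "s2": {"rep_a": 7, "rep_b": 50},
}
rows = distance_table(matrix, ["rep_a", "rep_b"])
for row in rows:
    print(*row)
```

A sample whose distance to every representative exceeds the cutoff would not appear in the table at all.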

Allele QC

The allele quality of each sample can be assessed from the fraction of missing loci. More details are available in the subsequent columns.

  • Samples with fraction of missing loci < 0.05 are marked green.
  • Samples with fraction of missing loci between 0.05 and 0.1 are marked yellow.
  • Samples with fraction of missing loci > 0.1 are marked red.
## [1] "No allele stats provided"
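
The colour scheme above corresponds to a simple threshold function. In this sketch the function name is hypothetical, and a fraction of exactly 0.1, which the report leaves unspecified, is assumed to count as red.

```python
def qc_colour(fraction_missing):
    """Map the fraction of missing loci to the report's traffic-light colour."""
    if fraction_missing < 0.05:
        return "green"
    if fraction_missing < 0.1:
        return "yellow"
    # Assumption: the boundary value 0.1 is treated like values above it.
    return "red"

for fraction in (0.02, 0.07, 0.25):
    print(fraction, qc_colour(fraction))
```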

Abbreviations

  • EXC - alleles which have exact matches (100% DNA identity) with previously identified alleles

  • INF - inferred new alleles using Prodigal CDS predictions

  • LNF - loci not found. No alleles were found for the number of loci in the schema shown. This means that, for those loci, there were no BLAST hits or they were not within the BSR threshold for allele assignment.

  • PLOT - possible loci on the tip of the query genome contigs (see image below). A locus is classified as PLOT when the CDS of the query genome has a BLAST hit with a known larger allele that covers the CDS sequence entirely and the unaligned regions of the larger allele exceeds one of the query genome contigs ends. This could be an artifact caused by genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus misclassification, loci in such situations are classified as PLOT.

  • NIPH - non-informative paralogous hit (see image below). When ≥ 2 CDSs in the query genome match one locus in the schema with a BSR > 0.6, that locus is classified as NIPH. This suggests that such a locus can have paralogous (or orthologous) loci in the query genome and should be removed from the analysis due to the potential uncertainty in allele assignment (for example, due to the presence of multiple copies of the same mobile genetic element (MGE) or as a consequence of gene duplication followed by pseudogenization). A high number of NIPH may also indicate a poorly assembled genome due to a high number of smaller contigs which result in partial CDS predictions. These partial CDSs may contain conserved domains that match multiple loci. This classification takes precedence over PLOT classification.

  • NIPHEM - similar to NIPH classification (NIPH with exact match), but specifically referring to exact matches. Whenever > 1 CDS matches different alleles of the same locus with 100% DNA similarity during the first DNA sequence comparison, the NIPHEM tag is attributed. The loci classified as NIPHEM are included in NIPH statistics file column, but represent a distinct classification in the MLST profile.

  • ALM - alleles 20% larger than length mode of the distribution of the matched loci (CDS length > (locus length mode + locus length mode * 0.2)) (see image below). This determination is based on the currently identified set of alleles for a given locus.

  • ASM - similar to ALM but for alleles 20% smaller than length mode distribution of the matched loci (CDS length < (locus length mode - locus length mode * 0.2)).

See the chewBBACA documentation for more details.
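
The ALM and ASM inequalities can be written out directly. This is a sketch of the stated formulas only, not of the allele caller's implementation; the function name and the example allele lengths are hypothetical.

```python
from statistics import mode

def classify_length(cds_length, allele_lengths, fraction=0.2):
    """Return "ALM", "ASM" or None according to the length-mode inequalities."""
    length_mode = mode(allele_lengths)
    if cds_length > length_mode + length_mode * fraction:
        return "ALM"  # more than 20% larger than the length mode
    if cds_length < length_mode - length_mode * fraction:
        return "ASM"  # more than 20% smaller than the length mode
    return None

# Hypothetical lengths of the alleles currently known for one locus (mode 900).
known_lengths = [900, 900, 903, 897, 900]
for cds_length in (1100, 700, 950):
    print(cds_length, classify_length(cds_length, known_lengths))
```

Because the mode is taken over the currently identified alleles, the same CDS could be classified differently once more alleles have been added for that locus.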

Config and parameters

All parameters are defined in the file config.yaml:

## $clustering_method
## [1] "single"
## 
## $distance_threshold
## [1] 10
## 
## $grapetree_distance_method
## [1] 3
## 
## $cluster_names_reservoir
## [1] "/home/carlus/BfR/Snakemake/chewieSnake/scripts/cluster_names_reservoir.txt"
## 
## $subcluster_distance_thresholds
## [1] 3
## 
## $subcluster_name_types
## [1] "greek-alphabet"
## 
## $external_cluster_names
## [1] "/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/chewiesnake_join/epiclusters.tsv"
## 
## $serovar_info
## [1] "/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/chewiesnake_join/sample_serovars.tsv"
## 
## $do_serovar_info
## [1] TRUE

The individual tools and their versions are listed in the .yaml files in the directory env.

Help

GrapeTree instructions

  • Download / copy minimum spanning tree
  • Open GrapeTree
  • Load tree file
  • Load one or more metadata tsv files with the sample name in the first column (e.g. bakcharak summary data, geospatial/temporal metadata)
  • For example, subset the data to a particular clade and export it as JSON for sharing

Other tools

  • iTOL
  • phandango