All data is located in /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26.
Last update on 2020-12-03.
Comparison database is /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/comparison_allele_profiles.tsv.
Last update of comparison database on 2020-11-27.
Comparison database contains 11 samples.
External database contains 4 samples.
Lowest distance between query and comparison data: 0
Number of matches between query and comparison data
Scheme contains 3000 loci
0 samples were of bad quality (more than 5 percent missing or dubiously called alleles); see Allele statistics for details.
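The quality criterion above can be sketched as follows. This is only an illustration, not chewieSnake's own code: it assumes that a "missing or dubiously called" allele is any call carrying one of the classification tags listed below (LNF, PLOT, NIPH, NIPHEM, ALM, ASM), and the function names and the toy profile are hypothetical.

```python
# Assumption: these tags mark missing or dubious calls; EXC/INF calls count as valid.
FAILED_TAGS = {"LNF", "PLOT", "NIPH", "NIPHEM", "ALM", "ASM"}


def missing_fraction(profile):
    """Return the fraction of loci whose call is one of the failure tags."""
    failed = sum(1 for call in profile if call in FAILED_TAGS)
    return failed / len(profile)


def is_bad_quality(profile, max_missing=0.05):
    """Flag a sample when more than max_missing of its loci failed."""
    return missing_fraction(profile) > max_missing


# Hypothetical 10-locus profile with two failed loci (LNF and NIPH).
profile = ["1", "INF-7", "LNF", "3", "NIPH", "2", "4", "1", "9", "12"]
print(missing_fraction(profile))
print(is_bad_quality(profile))
```

With a scheme of 3000 loci, a sample would be flagged under this rule once more than 150 loci carry such a tag.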
Note that only the allele statistics of the external data are shown here. For the allele statistics of the entire dataset refer to the chewieSnake report of the comparison db.
EXC - alleles which have exact matches (100% DNA identity) with previously identified alleles
INF - inferred new alleles using Prodigal CDS predictions
LNF - loci not found. No alleles were found for the number of loci in the schema shown. This means that, for those loci, there were no BLAST hits or they were not within the BSR threshold for allele assignment.
PLOT - possible loci on the tip of the query genome contigs (see image below). A locus is classified as PLOT when the CDS of the query genome has a BLAST hit with a known larger allele that covers the CDS sequence entirely and the unaligned regions of the larger allele exceed one of the query genome contig ends. This could be an artifact caused by genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus misclassification, loci in such situations are classified as PLOT.
NIPH - non-informative paralogous hit (see image below). When ≥ 2 CDSs in the query genome match one locus in the schema with a BSR > 0.6, that locus is classified as NIPH. This suggests that such a locus can have paralogous (or orthologous) loci in the query genome and should be removed from the analysis due to the potential uncertainty in allele assignment (for example, due to the presence of multiple copies of the same mobile genetic element (MGE) or as a consequence of gene duplication followed by pseudogenization). A high number of NIPH may also indicate a poorly assembled genome due to a high number of smaller contigs, which result in partial CDS predictions. These partial CDSs may contain conserved domains that match multiple loci. This classification takes precedence over PLOT classification.
NIPHEM - similar to NIPH classification (NIPH with exact match), but specifically referring to exact matches. Whenever > 1 CDS matches different alleles of the same locus with 100% DNA similarity during the first DNA sequence comparison, the NIPHEM tag is attributed. The loci classified as NIPHEM are included in the NIPH column of the statistics file, but represent a distinct classification in the MLST profile.
ALM - alleles more than 20% larger than the length mode of the distribution of the matched locus (CDS length > (locus length mode + locus length mode * 0.2)) (see image below). This determination is based on the currently identified set of alleles for a given locus.
ASM - similar to ALM, but for alleles more than 20% smaller than the length mode of the distribution of the matched locus (CDS length < (locus length mode - locus length mode * 0.2)).
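The two length rules for ALM and ASM can be written out directly. The sketch below is an illustration under assumptions, not chewBBACA's implementation: the function name, the `"OK"` label for in-range lengths and the example allele lengths are hypothetical, and the default of 0.2 mirrors the `size_threshold` value shown in the configuration.

```python
from statistics import mode


def size_class(cds_length, allele_lengths, size_threshold=0.2):
    """Classify a CDS length against the length mode of a locus's known alleles."""
    length_mode = mode(allele_lengths)
    if cds_length > length_mode + length_mode * size_threshold:
        return "ALM"  # allele larger than the length mode by more than the threshold
    if cds_length < length_mode - length_mode * size_threshold:
        return "ASM"  # allele smaller than the length mode by more than the threshold
    return "OK"  # hypothetical label: length is within the accepted range


# Hypothetical locus whose known alleles have a length mode of 900 bp,
# so the accepted range is 720 to 1080 bp.
known_lengths = [900, 900, 903, 897, 900]
print(size_class(1100, known_lengths))
print(size_class(700, known_lengths))
print(size_class(950, known_lengths))
```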
See here for more details.
The allele distance table compares query samples vs. samples from the comparisonDB.
All distances above 500 have been omitted.
Shown are all matches between query data and comparisonDB data with an allele distance smaller than 30.
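The selection of close matches can be sketched as a simple filter over pairwise distances. This is a hedged illustration only: the dictionary layout, the function name and the sample names are assumptions, and the default of 30 mirrors the `joining_threshold` value shown in the configuration.

```python
def close_matches(distances, query_ids, comparison_ids, max_distance=30):
    """Return (query, comparison, distance) tuples with distance below max_distance.

    distances maps (query_id, comparison_id) to an allele distance.
    """
    hits = []
    for q in query_ids:
        for c in comparison_ids:
            d = distances[(q, c)]
            if d < max_distance:
                hits.append((q, c, d))
    # Closest matches first.
    return sorted(hits, key=lambda hit: hit[2])


# Hypothetical distances between two query and two comparisonDB samples.
distances = {
    ("q1", "c1"): 412,
    ("q1", "c2"): 0,
    ("q2", "c1"): 12,
    ("q2", "c2"): 30,
}
for hit in close_matches(distances, ["q1", "q2"], ["c1", "c2"]):
    print(hit)
```

Note that a distance of exactly 30 is excluded here, following the wording "smaller than 30".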
The pairwise distance matrix between all query and closely related comparisonDB samples has been clustered and colored to visualize similarity of groups of closely related strains.
The figure can be zoomed and exported, and on hover it displays the distance value (z).
The clustering is based on the allele distance matrix.
The allele distance matrix was hierarchically clustered using single-linkage clustering, and the resulting clustering was divided into sub-clusters at the chosen thresholds 1, 5, 10, 20, 50, 100, 200 and 1000. At each threshold, samples are assigned to a cluster, resulting in a cluster member number. Searching the following table shows which samples cluster together at a chosen allele threshold.
The combined cluster member numbers at all chosen thresholds correspond to the cluster address.
Note that the cluster member numbers (cluster address) may be subject to change every time the clustering is run again!
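How a cluster address arises from threshold-wise single-linkage clusters can be sketched in plain Python. This is not the pipeline's code. It relies on the fact that cutting a single-linkage dendrogram at height t gives the connected components of the graph that links all pairs with distance <= t. The function names, the toy distances, the dot-separated address format and the ordering from smallest to largest threshold are all assumptions made for illustration.

```python
def single_linkage_clusters(samples, dist, threshold):
    """Assign cluster numbers by linking every pair with distance <= threshold."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the cluster representative (with path halving).
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for i, a in enumerate(samples):
        for b in samples[i + 1:]:
            if dist[frozenset((a, b))] <= threshold:
                parent[find(a)] = find(b)

    # Number clusters in order of first appearance in the sample list.
    numbering, labels = {}, {}
    for s in samples:
        root = find(s)
        numbering.setdefault(root, len(numbering) + 1)
        labels[s] = numbering[root]
    return labels


def cluster_addresses(samples, dist, thresholds):
    """Join the cluster number at each threshold into one address string."""
    per_threshold = [single_linkage_clusters(samples, dist, t) for t in thresholds]
    return {s: ".".join(str(labels[s]) for labels in per_threshold) for s in samples}


# Hypothetical allele distances between four samples.
samples = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0,
    frozenset(("A", "C")): 8,
    frozenset(("B", "C")): 9,
    frozenset(("A", "D")): 150,
    frozenset(("B", "D")): 160,
    frozenset(("C", "D")): 155,
}
addresses = cluster_addresses(samples, dist, [1, 10, 200])
print(addresses)
```

Because the numbers depend on the order in which samples are processed, this sketch also shows why cluster member numbers can change between runs when samples are added or reordered.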
The entire tree as well as the individual trees can be exported and visualized - together with relevant metadata - in interactive tools such as grapetree, phandango etc.
Distance threshold | Cluster count |
---|---|
1 | 4 |
5 | 4 |
10 | 4 |
20 | 4 |
50 | 4 |
100 | 4 |
200 | 4 |
1000 | 4 |
In the following, the clustering analysis is visualized as a single-linkage tree. At a cluster threshold of 10, the tree is divided into sub-clusters (cluster types), indicated by the coloring of the branches.
Shown here is the single-linkage tree with the branch length defined by the allele distance and with labels colored by comparisonDB (black) and query (blue) data.
Shown here are the tree structures of the single-linkage tree for every sub-cluster defined by an allele distance of 10. Clusters containing only a single sample are excluded.
Note that the cluster numbers can be looked up in the cluster membership table above - in the field threshold_10. Furthermore, in the colored tree visualized above, every neighboring cluster is colored with a different color - starting with cluster 1 at the bottom.
### Cluster 2
### Cluster 3
### Cluster 4
The following clusters - only containing a single sample - were not plotted:
A minimum spanning tree was computed based on the allele profiles using grapetree.
Note that you can visualize the tree more interactively and combine it with metadata using tools such as grapetree or phandango. Large plots may not be optimally plotted here.
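For readers unfamiliar with minimum spanning trees, the idea can be illustrated with the classic Prim algorithm applied to a pairwise distance table. This is a generic sketch and not grapetree's implementation, which computes its distances from the allele profiles with its own methods; the function name and the toy distances are hypothetical.

```python
def prim_mst(samples, dist):
    """Build a minimum spanning tree with Prim's algorithm.

    dist maps frozenset({a, b}) to the distance between samples a and b.
    Returns a list of (sample_in_tree, new_sample, distance) edges.
    """
    in_tree = [samples[0]]
    remaining = set(samples[1:])
    edges = []
    while remaining:
        # Pick the cheapest edge leaving the tree; ties are broken by sample name.
        a, b = min(
            ((u, v) for u in in_tree for v in remaining),
            key=lambda edge: (dist[frozenset(edge)], edge),
        )
        edges.append((a, b, dist[frozenset((a, b))]))
        in_tree.append(b)
        remaining.remove(b)
    return edges


# Hypothetical allele distances between four samples.
samples = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 0,
    frozenset(("A", "C")): 8,
    frozenset(("B", "C")): 9,
    frozenset(("A", "D")): 150,
    frozenset(("B", "D")): 160,
    frozenset(("C", "D")): 155,
}
for edge in prim_mst(samples, dist):
    print(edge)
```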
## [1] "Version 3"
All data is available at /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26.
Name | File |
---|---|
Distance matrix | cgmlst_joined/distance_matrix.tsv |
Filtered distance matrix | cgmlst_joined/distance_matrix_filtered.tsv |
Minimum spanning tree of all comparisonDB and query samples | cgmlst_joined/grapetree_mstree.tre |
Minimum spanning tree of related comparisonDB and query samples | cgmlst_joined/grapetree_mstree_filtered.tre |
All parameters are defined in the file config.yaml:
## [1] "#Created by chewieSnake.py v3.0.0"
## [2] "#chewieSnake git version is 2.0.0-166-g562e89e"
## [3] "workdir:"
## [4] " \"/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26\""
## [5] "samples:"
## [6] " \"/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/assemblies/samples.tsv\""
## [7] "parameters:"
## [8] " threads: 10"
## [9] " remove_frameshifts: False"
## [10] " chewbbaca:"
## [11] " bsr_theshold: 0.6"
## [12] " size_threshold: 0.2"
## [13] " scheme_dir: \"/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/enterobase_senterica_cgmlst\""
## [14] " prodigal_training_file: \"/home/carlus/BfR/Snakemake/chewieSnake/chewBBACA/CHEWBBACA/prodigal_training_files/Salmonella_enterica.trn\""
## [15] " clustering:"
## [16] " clustering_method: \"single\""
## [17] " distance_threshold: 10"
## [18] " address_range: \"1,5,10,20,50,100,200,1000\""
## [19] " comparison_allele_database: \"/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/comparison_allele_profiles.tsv\""
## [20] " joining_threshold: 30"
## [21] " grapetree_distance_method: 3"
## [22] " frameshift_removal:"
## [23] " remove_frameshifts: False"
## [24] " mode: relative"
## [25] " threshold: 0.1"
The individual tools and their versions are listed in the .yaml files in the directory env.