User carlus executed the pipeline.
Last update on 2020-11-27.
Database contains 3 samples.
In the last 30 days 4 samples were added. The ten most recently added samples are GMI15-001-DNA, GMI15-002-DNA, GMI-17-001-DNA, GMI-17-002-DNA.
0 samples were of bad quality (more than 5 percent missing or dubiously called alleles); see Allele statistics for details. Bad samples: none.
Clustering the allele distances reveals the following number of clusters at different distance thresholds:
Distance threshold | Cluster count |
---|---|
1 | 3 |
5 | 3 |
10 | 3 |
20 | 3 |
50 | 3 |
100 | 3 |
200 | 3 |
1000 | 3 |
See section Clustering for more details.
EXC - alleles which have exact matches (100% DNA identity) with previously identified alleles
INF - inferred new alleles using Prodigal CDS predictions
LNF - loci not found. No alleles were found for the number of loci in the schema shown. This means that, for those loci, there were no BLAST hits or they were not within the BSR threshold for allele assignment.
PLOT - possible loci on the tip of the query genome contigs (see image below). A locus is classified as PLOT when the CDS of the query genome has a BLAST hit with a known larger allele that covers the CDS sequence entirely and the unaligned regions of the larger allele extend beyond one of the query genome contig ends. This could be an artifact caused by genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus misclassification, loci in such situations are classified as PLOT.
NIPH - non-informative paralogous hit (see image below). When ≥ 2 CDSs in the query genome match one locus in the schema with a BSR > 0.6, that locus is classified as NIPH. This suggests that such a locus can have paralogous (or orthologous) loci in the query genome and should be removed from the analysis due to the potential uncertainty in allele assignment (for example, due to the presence of multiple copies of the same mobile genetic element (MGE) or as a consequence of gene duplication followed by pseudogenization). A high number of NIPH may also indicate a poorly assembled genome with a high number of smaller contigs, which result in partial CDS predictions. These partial CDSs may contain conserved domains that match multiple loci. This classification takes precedence over PLOT classification.
NIPHEM - similar to NIPH classification (NIPH with exact match), but specifically referring to exact matches. Whenever > 1 CDS matches different alleles of the same locus with 100% DNA similarity during the first DNA sequence comparison, the NIPHEM tag is attributed. The loci classified as NIPHEM are included in NIPH statistics file column, but represent a distinct classification in the MLST profile.
ALM - alleles 20% larger than length mode of the distribution of the matched loci (CDS length > (locus length mode + locus length mode * 0.2)) (see image below). This determination is based on the currently identified set of alleles for a given locus.
ASM - similar to ALM but for alleles 20% smaller than length mode distribution of the matched loci (CDS length < (locus length mode - locus length mode * 0.2)).
See here for more details.
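The ALM and ASM rules above reduce to two comparisons against the locus length mode. The following is a minimal sketch of that arithmetic only, not chewBBACA's implementation; the function name `classify_by_length` and the example lengths are hypothetical, and the default of 0.2 mirrors the 20% stated above.

```python
def classify_by_length(cds_length, locus_length_mode, size_threshold=0.2):
    """Return 'ALM', 'ASM', or None for a CDS relative to the locus length mode.

    Hypothetical helper: it only restates the two inequalities from the text.
    """
    upper = locus_length_mode + locus_length_mode * size_threshold
    lower = locus_length_mode - locus_length_mode * size_threshold
    if cds_length > upper:
        return "ALM"  # allele larger than the length mode by more than the threshold
    if cds_length < lower:
        return "ASM"  # allele smaller than the length mode by more than the threshold
    return None       # length is within the accepted range


# Example with an assumed length mode of 1000 bp.
for length in (1300, 700, 1050):
    print(length, classify_by_length(length, 1000))
```

Note that both comparisons are strict, so a CDS whose length lies exactly on a bound is not flagged.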
Fraction of loci found - Sum of EXC and INF
Fraction of loci missing - 1 minus Sum of EXC and INF
Loci removed by length - Loci removed by the frameshift filter (if applied). These are loci that strongly deviate from the median locus length, as specified in the parameters. This filter is stricter than ALM and ASM and does not appear in Fraction of loci missing
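As a rough illustration of how these fractions relate to the 5 percent quality criterion, the sketch below computes them from per-sample classification counts. The function names, the example counts, and the assumption that the quality check uses exactly "1 minus the sum of EXC and INF" are illustrative, not taken from the pipeline code.

```python
def locus_fractions(counts, n_loci):
    """Return (fraction found, fraction missing) for one sample.

    counts maps classification tags (EXC, INF, LNF, ...) to locus counts.
    """
    found = (counts.get("EXC", 0) + counts.get("INF", 0)) / n_loci
    return found, 1.0 - found


def is_bad_quality(counts, n_loci, max_missing=0.05):
    """Flag a sample whose fraction of missing loci exceeds max_missing (assumed rule)."""
    _, missing = locus_fractions(counts, n_loci)
    return missing > max_missing


# Hypothetical counts for a 3000-locus scheme.
sample = {"EXC": 2900, "INF": 40, "LNF": 30, "NIPH": 20, "ASM": 10}
found, missing = locus_fractions(sample, 3000)
print(round(found, 3), round(missing, 3), is_bad_quality(sample, 3000))
```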
Table of all pairwise allele distances.
All distances are displayed.
The number of loci in the scheme is 3000. Allele differences should be interpreted relative to the number of loci!
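To make the relation between an allele distance and the scheme size concrete, here is a small sketch that counts differing loci between two allele profiles and expresses the result relative to the number of loci. It assumes that loci missing in either profile are skipped; how the pipeline actually treats missing data depends on the configured grapetree distance method, so treat `allele_distance` and the toy profiles as hypothetical.

```python
def allele_distance(profile_a, profile_b):
    """Count loci where both profiles have a called allele and the alleles differ.

    None marks a missing or uncalled locus (assumed convention).
    """
    if len(profile_a) != len(profile_b):
        raise ValueError("profiles must cover the same loci")
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


# Toy profiles over five loci; the fourth locus is missing in the first sample.
sample_1 = [1, 2, 3, None, 5]
sample_2 = [1, 4, 3, 7, 6]
d = allele_distance(sample_1, sample_2)
print(d, d / len(sample_1))  # absolute distance and distance relative to the number of loci
```

With 3000 loci, a distance of 10 corresponds to roughly 0.3% of the scheme, whereas the same distance in a much smaller scheme would be a far larger share.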
The pairwise distance matrix has been clustered and colored to visualize similarity of groups of closely related strains.
The clustering is based on the allele distance matrix.
The allele distance matrix was hierarchically clustered using single-linkage clustering, and the resulting clustering was divided into sub-clusters at the chosen thresholds: 1, 5, 10, 20, 50, 100, 200, 1000. At each threshold, samples are assigned to a cluster, resulting in a cluster member number. Searching the following table shows which samples cluster together at a chosen allele threshold.
The combined cluster member numbers at all chosen thresholds correspond to the cluster address.
Note that the cluster member numbers (cluster address) may be subject to change every time the clustering is run again!
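The idea of threshold-based single-linkage clusters and cluster addresses can be sketched as follows. This is not the pipeline's code: it uses a union-find over all pairs with distance at or below the threshold (which gives the same groups as cutting a single-linkage dendrogram at that height), and the dot-separated address format, the sample names, and the distances are assumptions for illustration.

```python
def single_linkage_clusters(names, dist, threshold):
    """Map each sample to a cluster number at the given threshold.

    Samples connected by a chain of pairwise distances <= threshold share a cluster.
    """
    parent = {n: n for n in names}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if dist[a][b] <= threshold:
                parent[find(a)] = find(b)

    labels, result = {}, {}
    for n in names:
        root = find(n)
        labels.setdefault(root, len(labels) + 1)  # number clusters in order of first appearance
        result[n] = labels[root]
    return result


def cluster_addresses(names, dist, thresholds):
    """Join the cluster numbers at all thresholds into one address string (assumed format)."""
    per_threshold = [single_linkage_clusters(names, dist, t) for t in thresholds]
    return {n: ".".join(str(m[n]) for m in per_threshold) for n in names}


def symmetric(pairs):
    """Build a nested, symmetric distance lookup from {(a, b): distance}."""
    d = {}
    for (a, b), v in pairs.items():
        d.setdefault(a, {})[b] = v
        d.setdefault(b, {})[a] = v
    return d


# Hypothetical samples and allele distances.
names = ["A", "B", "C", "D"]
dist = symmetric({("A", "B"): 3, ("A", "C"): 40, ("B", "C"): 45,
                  ("A", "D"): 500, ("B", "D"): 510, ("C", "D"): 480})
print(cluster_addresses(names, dist, [1, 5, 50, 1000]))
```

Because the cluster numbers here follow the order in which samples are listed, reordering or adding samples renumbers the clusters, which is the same instability the note above warns about.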
The entire tree as well as the individual trees can be exported and visualized - together with relevant metadata - in interactive tools such as grapetree, phandango etc. The files were exported to the directory /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/cgmlst/exported_trees
Distance threshold | Cluster count |
---|---|
1 | 3 |
5 | 3 |
10 | 3 |
20 | 3 |
50 | 3 |
100 | 3 |
200 | 3 |
1000 | 3 |
In the following, the clustering analysis is visualized as a single-linkage tree. At a cluster threshold of 10, the tree is divided into sub-clusters (cluster types), indicated by the coloring of the branches.
Shown here are the tree structures of the single-linkage tree for every sub-cluster defined by an allele distance of 10. Clusters containing only a single sample are excluded.
Note that the cluster numbers can be looked up in the cluster membership table above - in the field threshold_10. Furthermore, in the colored tree visualized above, every neighboring cluster is colored with a different color - starting with cluster 1 at the bottom.
The following clusters - only containing single samples - were not plotted: 1,2,3
A minimum spanning tree was computed based on the allele profiles using grapetree.
Note that you can visualize the tree more interactively and combine it with metadata using tools such as grapetree or phandango. Large plots may not be optimally plotted here.
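For readers unfamiliar with minimum spanning trees, the sketch below builds one from a pairwise distance lookup using the classical Prim algorithm. It is only meant to illustrate the concept; grapetree uses its own tree-building methods, which may produce a different tree, and the sample names and distances are hypothetical.

```python
def prim_mst(names, dist):
    """Return MST edges as (sample_in_tree, new_sample, distance) using Prim's algorithm."""
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Cheapest edge from any sample in the tree to any sample outside it.
        d, a, b = min(
            (dist[a][b], a, b)
            for a in in_tree
            for b in names
            if b not in in_tree
        )
        edges.append((a, b, d))
        in_tree.add(b)
    return edges


# Hypothetical symmetric allele distances between four samples.
dist = {
    "A": {"B": 3, "C": 40, "D": 500},
    "B": {"A": 3, "C": 45, "D": 510},
    "C": {"A": 40, "B": 45, "D": 480},
    "D": {"A": 500, "B": 510, "C": 480},
}
for edge in prim_mst(["A", "B", "C", "D"], dist):
    print(edge)
```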
All data is available at /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26.
The exported clustering files are in /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/cgmlst/exported_trees
name | file | Link |
---|---|---|
distance matrix (as provided by grapetree) | /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/cgmlst/distance_matrix.tsv | distance matrix (as provided by grapetree) |
distance matrix (with header and tab separated) | /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/reports/allele_distance_matrix.tsv | distance matrix (with header and tab separated) |
Minimum spanning tree | /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/cgmlst/grapetree_mstree.tre | Minimum spanning tree |
name | file | Link |
---|---|---|
cluster_addresses.tsv | /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/cgmlst/exported_trees/cluster_addresses.tsv | cluster_addresses.tsv |
clustering_global.tre | /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/test_2020-11-26/cgmlst/exported_trees/clustering_global.tre | clustering_global.tre |
All parameters are defined in the file config.yaml:
## $threads
## [1] 10
##
## $remove_frameshifts
## [1] FALSE
##
## $chewbbaca
## $chewbbaca$bsr_theshold
## [1] 0.6
##
## $chewbbaca$size_threshold
## [1] 0.2
##
## $chewbbaca$scheme_dir
## [1] "/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/enterobase_senterica_cgmlst"
##
## $chewbbaca$prodigal_training_file
## [1] "/home/carlus/BfR/Snakemake/chewieSnake/chewBBACA/CHEWBBACA/prodigal_training_files/Salmonella_enterica.trn"
##
##
## $clustering
## $clustering$clustering_method
## [1] "single"
##
## $clustering$distance_threshold
## [1] 10
##
## $clustering$address_range
## [1] "1,5,10,20,50,100,200,1000"
##
## $clustering$comparison_allele_database
## [1] "None"
##
## $clustering$joining_threshold
## [1] 30
##
## $clustering$grapetree_distance_method
## [1] 3
##
##
## $frameshift_removal
## $frameshift_removal$remove_frameshifts
## [1] FALSE
##
## $frameshift_removal$mode
## [1] "relative"
##
## $frameshift_removal$threshold
## [1] 0.1
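One plausible reading of the `frameshift_removal` settings above (mode "relative", threshold 0.1) is that an allele is flagged when its length deviates from the median locus length by more than 10% of that median. The sketch below encodes that reading; the function name, the treatment of an "absolute" mode as a threshold in bases, and the example lengths are assumptions rather than the pipeline's documented behavior.

```python
from statistics import median


def deviates_from_median(allele_length, locus_lengths, threshold=0.1, mode="relative"):
    """Return True if the allele length deviates too far from the median locus length.

    mode="relative": threshold is a fraction of the median (assumed meaning).
    mode="absolute": threshold is a number of bases (assumed meaning).
    """
    med = median(locus_lengths)
    deviation = abs(allele_length - med)
    if mode == "relative":
        return deviation > threshold * med
    if mode == "absolute":
        return deviation > threshold
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical allele lengths already known for one locus (median 1000).
known_lengths = [900, 1000, 1000, 1100]
print(deviates_from_median(1150, known_lengths))  # 15% away from the median
print(deviates_from_median(1050, known_lengths))  # 5% away from the median
```

Under this reading a 10% tolerance is tighter than the 20% used for ALM and ASM, consistent with the statement that the filter is stricter.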
The individual tools and their versions are listed in the .yaml files in the directory env.