User carlus executed the pipeline.
Last update on 2020-11-24.
Database contains 12 samples.
NA samples were of bad quality (more than 5 per cent missing or dubiously called alleles); see Allele statistics for details. Bad samples: NA
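The 5 per cent rule above can be sketched as a simple filter. This is a minimal illustration, not the pipeline's own code: the function name `flag_bad_samples`, the input layout (a mapping from sample name to fraction of missing or dubious alleles) and the example values are all assumptions.

```python
def flag_bad_samples(missing_fraction, max_missing=0.05):
    """Return the names of samples whose fraction of missing or
    dubiously called alleles is strictly greater than max_missing."""
    return sorted(
        name for name, fraction in missing_fraction.items()
        if fraction > max_missing
    )


# Hypothetical per-sample fractions, not taken from this report.
example = {"sample_1": 0.01, "sample_2": 0.08, "sample_3": 0.05}
bad = flag_bad_samples(example)
print(len(bad), "bad samples:", bad)
```

A sample sitting exactly at the threshold (`sample_3` here) is not flagged, because the report speaks of "more than" 5 per cent.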
Recent updates (last 14 days) for clusters: None
| | partnerA | partnerB | partnerC |
---|---|---|---|
partnerA | 4 | 4 | 4 |
partnerB | 4 | 4 | 4 |
partnerC | 4 | 4 | 4 |
| | partnerA | partnerB | partnerC |
---|---|---|---|
partnerA | 8 | 8 | 8 |
partnerB | 8 | 8 | 8 |
partnerC | 8 | 8 | 8 |
Shown is a summary of each cluster (containing at least 2 samples within an allele distance of 10).
Column description
Shown is a table of all samples and their cluster association. Samples that are not found within an allele distance of 10 are termed Orphans.
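How samples end up in clusters or as orphans can be sketched with single-linkage grouping at a fixed threshold. The sketch below is an assumption-laden illustration, not the pipeline's implementation: the function name, the input layout (a dictionary of pairwise distances) and the inclusive comparison (`<= threshold`) are all hypothetical choices.

```python
from itertools import combinations


def single_linkage_clusters(distances, samples, threshold=10):
    """Group samples that are connected by chains of pairwise
    distances <= threshold; unconnected samples are orphans."""
    parent = {s: s for s in samples}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(samples, 2):
        d = distances.get((a, b), distances.get((b, a)))
        if d is not None and d <= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for s in samples:
        groups.setdefault(find(s), []).append(s)
    clusters = sorted(sorted(g) for g in groups.values() if len(g) >= 2)
    orphans = sorted(g[0] for g in groups.values() if len(g) == 1)
    return clusters, orphans


# Hypothetical allele distances, not taken from this report.
samples = ["A", "B", "C", "D"]
distances = {
    ("A", "B"): 3, ("B", "C"): 9, ("A", "C"): 12,
    ("A", "D"): 40, ("B", "D"): 38, ("C", "D"): 35,
}
clusters, orphans = single_linkage_clusters(distances, samples)
print(clusters, orphans)
```

In this example A and C are 12 alleles apart but still share a cluster, because both are within 10 of B. That chaining is the defining property of single linkage.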
Column description
List of orphan samples (not related to any other sample in dataset) and their minimum distance to the cluster’s representative samples.
Note that the orphan numbers are not stable and may change.
The distance of each orphan sample to any of the cluster representative samples is shown in section Distance from samples to clusters.
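The minimum distance reported for an orphan can be thought of as a lookup over its distances to the cluster representatives. A minimal sketch, assuming a nested dictionary as input; the function name, cluster names and values are hypothetical.

```python
def nearest_cluster(orphan_distances):
    """Return (cluster, distance) for the closest cluster representative."""
    cluster = min(orphan_distances, key=orphan_distances.get)
    return cluster, orphan_distances[cluster]


# Hypothetical distances from orphans to cluster representatives.
to_representatives = {
    "orphan_1": {"cluster_1": 23, "cluster_2": 41},
    "orphan_2": {"cluster_1": 37, "cluster_2": 15},
}
nearest = {
    orphan: nearest_cluster(per_cluster)
    for orphan, per_cluster in to_representatives.items()
}
print(nearest)
```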
Shown here are the distances between representative samples of each cluster.
The representative samples of each cluster were clustered using single-linkage hierarchical clustering. The coloring follows from a distance threshold of 10.
Shown here is a distance table which contains distances between all samples in column Sample and selected representative samples of each cluster in column Cluster_representative.
Note that distances larger than 50 were omitted for performance reasons.
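The shape of such a table, including the cut-off at 50, can be sketched as follows. Only the column names `Sample` and `Cluster_representative` and the value 50 come from the report; the function name, input layout and example values are assumptions.

```python
def long_distance_table(distances, representatives, max_distance=50):
    """Build (Sample, Cluster_representative, distance) rows,
    omitting every pair whose distance is larger than max_distance."""
    rows = []
    for sample, per_representative in sorted(distances.items()):
        for representative in representatives:
            d = per_representative.get(representative)
            if d is not None and d <= max_distance:
                rows.append((sample, representative, d))
    return rows


# Hypothetical distances, not taken from this report.
distances = {
    "sample_1": {"rep_1": 0, "rep_2": 64},
    "sample_2": {"rep_1": 7, "rep_2": 50},
}
table = long_distance_table(distances, ["rep_1", "rep_2"])
for row in table:
    print(*row, sep="\t")
```

The pair `sample_1` / `rep_2` (distance 64) is dropped, while a distance of exactly 50 is kept, since only values larger than 50 are omitted.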
The allele quality of each sample can be assessed from the fraction of missing loci. More details are available in the subsequent columns.
## [1] "No allele stats provided"
EXC - alleles which have exact matches (100% DNA identity) with previously identified alleles
INF - inferred new alleles using Prodigal CDS predictions
LNF - loci not found. No alleles were found for the number of loci in the schema shown. This means that, for those loci, there were no BLAST hits or they were not within the BSR threshold for allele assignment.
PLOT - possible loci on the tip of the query genome contigs (see image below). A locus is classified as PLOT when the CDS of the query genome has a BLAST hit with a known larger allele that covers the CDS sequence entirely and the unaligned regions of the larger allele exceed one of the query genome contig ends. This could be an artifact caused by genome fragmentation resulting in a shorter CDS prediction by Prodigal. To avoid locus misclassification, loci in such situations are classified as PLOT.
NIPH - non-informative paralogous hit (see image below). When ≥ 2 CDSs in the query genome match one locus in the schema with a BSR > 0.6, that locus is classified as NIPH. This suggests that such a locus can have paralogous (or orthologous) loci in the query genome and should be removed from the analysis due to the potential uncertainty in allele assignment (for example, due to the presence of multiple copies of the same mobile genetic element (MGE) or as a consequence of gene duplication followed by pseudogenization). A high number of NIPH may also indicate a poorly assembled genome due to a high number of smaller contigs which result in partial CDS predictions. These partial CDSs may contain conserved domains that match multiple loci. This classification takes precedence over PLOT classification.
NIPHEM - NIPH with exact match: similar to the NIPH classification, but referring specifically to exact matches. Whenever > 1 CDS matches different alleles of the same locus with 100% DNA similarity during the first DNA sequence comparison, the NIPHEM tag is attributed. The loci classified as NIPHEM are included in the NIPH column of the statistics file, but represent a distinct classification in the MLST profile.
ALM - alleles more than 20% larger than the length mode of the distribution of the matched loci (CDS length > (locus length mode + locus length mode * 0.2)) (see image below). This determination is based on the currently identified set of alleles for a given locus.
ASM - similar to ALM, but for alleles more than 20% smaller than the length mode of the distribution of the matched loci (CDS length < (locus length mode - locus length mode * 0.2)).
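The two length rules for ALM and ASM can be written out directly from the formulas above. This is only a sketch of those formulas, not the allele caller's code: the function name, the `"within range"` label and the tie-breaking rule for several equally frequent lengths (the smallest is used) are assumptions.

```python
from statistics import multimode


def length_class(cds_length, known_allele_lengths, tolerance=0.2):
    """Classify a CDS length against the length mode of a locus."""
    # Assumption: with several equally frequent lengths, use the smallest.
    mode = min(multimode(known_allele_lengths))
    if cds_length > mode + mode * tolerance:
        return "ALM"
    if cds_length < mode - mode * tolerance:
        return "ASM"
    return "within range"


# Hypothetical allele lengths for one locus; the mode is 900.
known = [900, 900, 903, 897]
for length in (1100, 700, 1080):
    print(length, length_class(length, known))
```

With a mode of 900 the accepted range is 720 to 1080 inclusive, so 1080 is not tagged, while 1100 and 700 fall outside.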
See here for more details.
All data is available at /home/carlus/BfR/Projects/Pipeline_test/chewieSnake/chewiesnake_join/joined_db.
name | file | Link |
---|---|---|
distance matrix (as provided by grapetree) | distance_matrix.tsv | distance matrix (as provided by grapetree) |
distance matrix (with header and tab separated) | allele_distance_matrix.tsv | distance matrix (with header and tab separated) |
Minimum spanning tree | grapetree_mstree.tre | Minimum spanning tree |
cluster summary file | cluster_summary.tsv | cluster summary file |
samples in clusters | sample_cluster_information.tsv | samples in clusters |
orphan samples | sample_orphan_information.tsv | orphan samples |
clustering_intercluster.file | clustering_intercluster.tre | clustering_intercluster.file |
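A tab-separated distance matrix with a header can be loaded with the standard library alone. The sketch below assumes a layout that the report does not confirm: the first row holds the column sample names after a leading label cell, each following row starts with a sample name, and all distances are integers. The function name and the inline example data are hypothetical.

```python
import csv
import io


def read_distance_matrix(handle):
    """Read a tab-separated distance matrix into nested dictionaries:
    matrix[row_sample][column_sample] -> distance."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    columns = header[1:]  # skip the label cell above the row names
    matrix = {}
    for row in reader:
        if not row:
            continue  # tolerate blank lines
        matrix[row[0]] = {
            column: int(value) for column, value in zip(columns, row[1:])
        }
    return matrix


# Hypothetical two-sample matrix standing in for the real file.
example = io.StringIO("sample\tA\tB\nA\t0\t4\nB\t4\t0\n")
matrix = read_distance_matrix(example)
print(matrix["A"]["B"])
```

For a file on disk, the same function could be called as `read_distance_matrix(open("allele_distance_matrix.tsv", newline=""))`, provided the file follows the assumed layout.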
All parameters are defined in the file config.yaml:
## $clustering_method
## [1] "single"
##
## $distance_threshold
## [1] 10
##
## $grapetree_distance_method
## [1] 3
##
## $cluster_names_reservoir
## [1] "/home/carlus/BfR/Snakemake/chewieSnake/scripts/cluster_names_reservoir.txt"
##
## $subcluster_distance_thresholds
## [1] 3
##
## $subcluster_name_types
## [1] "greek-alphabet"
##
## $external_cluster_names
## [1] "/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/chewiesnake_join/epiclusters.tsv"
##
## $serovar_info
## [1] "/home/carlus/BfR/Projects/Pipeline_test/chewieSnake/chewiesnake_join/sample_serovars.tsv"
##
## $do_serovar_info
## [1] TRUE
The individual tools and their versions are listed in the .yaml files in the directory env.