AQUAMIS is a pipeline for routine assembly and quality assessment of microbial isolate sequencing experiments. It is based on Snakemake and includes tools such as fastp, confindr, kraken2, shovill, mlst, mash, quast and busco.
It reads untrimmed fastq data from your Illumina sequencing experiments as paired .fastq.gz files. These are then trimmed, assembled and polished. Besides generating ready-to-use contigs, AQUAMIS selects the closest reference genome from NCBI RefSeq and produces an intuitive, detailed report on your data and assemblies so that you can evaluate their reliability for further analyses. The report relies on reference-based and reference-free measures such as coverage depth, gene content, genome completeness and contamination, assembly length and many more. Based on the experience from thousands of sequencing experiments, threshold sets for different species have been defined to detect potentially poor results.
The AQUAMIS project website is https://gitlab.com/bfr_bioinformatics/AQUAMIS
There, you can find the latest version, source code and documentation.
You can find an example AQUAMIS report here.
You can install AQUAMIS by installing the Bioconda package, by installing the Docker container or by cloning this repository and installing all dependencies with conda. AQUAMIS relies on the conda package manager for all dependencies. Please set up conda on your system as explained here. It is advised to use mamba instead of conda for resolving all software requirements (install it via conda install mamba first or with the provided installation script).
Placeholder | Path |
---|---|
<path_to_conda> | is the conda installation folder, type conda info --base to retrieve its absolute path, typically ~/anaconda3 or ~/miniconda3 |
<path_to_envs> | is the folder that holds your conda environments, typically <path_to_conda>/envs |
<path_to_installation> | is the parent folder of the AQUAMIS repository |
<path_to_aquamis> | is the base folder of the AQUAMIS repository, i.e. <path_to_installation>/AQUAMIS |
<path_to_databases> | is the parent folder of your databases, by default, AQUAMIS uses <path_to_aquamis>/reference_db, but you are free to choose a custom location |
<path_to_data> | is the working directory for an AQUAMIS analysis typically containing a subfolder <path_to_data>/fastq with your fastq read files |
To install the latest stable version of AQUAMIS, please clone the git repository on your system.
cd <path_to_installation>
git clone https://gitlab.com/bfr_bioinformatics/AQUAMIS.git
AQUAMIS relies on the package manager conda for all dependencies. Please set up conda on your system as explained here.
Next, please execute the setup script with the appropriate options to install the conda dependency manager mamba, create the conda environment aquamis and install external databases within the default folder <path_to_aquamis>/reference_db. The available options are listed with:
<path_to_aquamis>/scripts/aquamis_setup.sh --help
Alternatively, please initialize a conda base environment containing snakemake and mamba (mamba is faster in resolving dependencies), then:
mamba env create -f <path_to_aquamis>/envs/aquamis.yaml
This creates an environment named aquamis containing all dependencies. It is found under <path_to_conda>/envs/aquamis.
For custom database paths, please see the chapter Database setup.
mamba create -n aquamis -c conda-forge -c bioconda aquamis
Please complement the Bioconda installation with reference databases of your choice or via the setup script (see previous chapter From Source). This setup script is also available from within the Bioconda installation after conda activate aquamis or, for direct execution, from the Bioconda installation path <path_to_envs>/aquamis/opt/aquamis/scripts/aquamis_setup.sh. mamba is the recommended conda package dependency resolver.
Prerequisite: Install the Docker engine for your favourite operating system (e.g. Ubuntu Linux).
Download the latest version of AQUAMIS from Docker Hub and note down the Docker Image ID on your system (hereafter referred to as $docker_image_id) with the shell commands:
docker pull bfrbioinformatics/aquamis:latest
docker image list | grep "aquamis" | grep "latest" | awk '{ print $3 }'
To process data and write results, Docker needs a volume mapping from a host directory containing your sequence data (<path_to_data>) to the Docker container (/AQUAMIS/analysis). Your sample list (samples.tsv) needs to be located within <path_to_data> and contain relative paths to your NGS reads in the same or another child directory. You may generate a Docker-compatible sample list in your host directory (<path_to_data>/samples.tsv) by executing create_sampleSheet.sh from the container with the following terminal commands:
host:<path_to_data>$ ls fastq/
sample1_R1.fastq sample1_R2.fastq sample2_R1.fastq sample2_R2.fastq
docker run --rm \
-v <path_to_data>:/AQUAMIS/analysis \
-e HOST_PATH=<path_to_data> \
-e LOCAL_USER_ID=$(id -u $USER) \
--entrypoint bash \
$docker_image_id \
/AQUAMIS/scripts/create_sampleSheet.sh --mode ncbi \
--fastxDir /AQUAMIS/analysis/fastq \
--outDir /AQUAMIS/analysis
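To make the idea of a sample list with relative read paths more concrete, here is a minimal Python sketch that pairs R1/R2 files and writes them to a tab-separated file. It is not the logic of create_sampleSheet.sh: the column names sample, fq1 and fq2, the `_R1`/`_R2` file naming pattern and both function names are assumptions made for illustration, so please treat the output of the provided script as authoritative.

```python
import csv
import re
from pathlib import Path

# Assumed naming pattern: <sample>_R1.fastq[.gz] and <sample>_R2.fastq[.gz]
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<read>[12])\.fastq(\.gz)?$")


def pair_reads(fastq_dir):
    """Group read files by sample name: {sample: {"1": path, "2": path}}."""
    pairs = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            pairs.setdefault(match["sample"], {})[match["read"]] = path
    return pairs


def write_sample_sheet(data_dir, fastq_subdir="fastq", sheet_name="samples.tsv"):
    """Write paths relative to data_dir and return the samples with both reads."""
    data_dir = Path(data_dir)
    pairs = pair_reads(data_dir / fastq_subdir)
    # Samples lacking either R1 or R2 are skipped rather than written half-empty.
    complete = [s for s, reads in pairs.items() if set(reads) == {"1", "2"}]
    with open(data_dir / sheet_name, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["sample", "fq1", "fq2"])  # hypothetical header
        for sample in complete:
            reads = pairs[sample]
            writer.writerow([
                sample,
                reads["1"].relative_to(data_dir).as_posix(),
                reads["2"].relative_to(data_dir).as_posix(),
            ])
    return complete
```

Because the paths are written relative to the data directory, the same file stays valid when that directory is mounted under /AQUAMIS/analysis inside the container.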
With the following command, AQUAMIS is started within the Docker container and will process any options appended:
docker run --rm \
-v <path_to_data>:/AQUAMIS/analysis \
-e HOST_PATH=<path_to_data> \
-e LOCAL_USER_ID=$(id -u $USER) \
$docker_image_id \
--condaprefix /opt/conda/envs \
--sample_list /AQUAMIS/analysis/samples.tsv \
--working_directory /AQUAMIS/analysis \
--<any_other_AQUAMIS_options>
Notes on Docker images: The Docker image conda represents the latest build from the GitLab repository. The Docker image latest contains additional reference databases (also provided via the setup script) as well as a set of test data files (fastq) for validation purposes.
Notes on Docker usage: The container path /AQUAMIS/analysis is fixed and may not be altered. Any subdirectories of <path_to_data> will be available as subdirectories under /AQUAMIS/analysis/. Our container is able to write results with the Linux user and group ID of your choice (UID and GID, respectively) to blend into your host file permission setup. With the above option -e LOCAL_USER_ID=$(id -u $USER), the UID of the currently executing user is inherited; change it according to your needs. The absolute host path mapped to the container has to be provided as the environment variable $HOST_PATH, too. It is used, automatically via the wrapper argument --docker, to correct file paths in the result JSON files of each sample so that they match the host perspective.
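If you launch the container from a script, the docker run call shown earlier can be assembled as an argument list. The following Python sketch only rebuilds that command from this section; the function names docker_run_args and format_command are hypothetical, and nothing is executed until you pass the list to subprocess yourself.

```python
import os
import shlex


def docker_run_args(image_id, host_data_dir, aquamis_options=(), user_id=None):
    """Return the argv list for the 'docker run' call described in this section."""
    if user_id is None:
        user_id = os.getuid()  # same value as $(id -u $USER) for the current user
    host_data_dir = os.path.abspath(host_data_dir)  # HOST_PATH has to be absolute
    return [
        "docker", "run", "--rm",
        "-v", f"{host_data_dir}:/AQUAMIS/analysis",  # fixed container path
        "-e", f"HOST_PATH={host_data_dir}",
        "-e", f"LOCAL_USER_ID={user_id}",
        image_id,
        "--condaprefix", "/opt/conda/envs",
        "--sample_list", "/AQUAMIS/analysis/samples.tsv",
        "--working_directory", "/AQUAMIS/analysis",
        *aquamis_options,
    ]


def format_command(args):
    """Render the argv list as a copy-pasteable shell command for inspection."""
    return shlex.join(args)
```

Passing a list to `subprocess.run(args, check=True)` avoids shell quoting problems with paths that contain spaces, and `format_command` lets you review the command before running it.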
To run AQUAMIS, activate the conda environment aquamis and call the wrapper script:
conda activate aquamis
python3 aquamis.py --help
usage: aquamis.py [-h, --help] [-V, --version]
-l SAMPLE_LIST -d WORKING_DIRECTORY
[-s, --snakefile SNAKEFILE]
[-r, --run_name RUN_NAME] [--docker DOCKER]
[--qc_thresholds QC_THRESHOLDS] [--json_schema JSON_SCHEMA]
[--json_filter JSON_FILTER] [-m, --mashdb MASHDB]
[--mash_kmersize MASH_KMERSIZE]
[--mash_sketchsize MASH_SKETCHSIZE] [--kraken2db KRAKEN2DB]
[--taxlevel_qc TAXLEVEL_QC] [--read_length READ_LENGTH]
[--taxonkit_db TAXONKIT_DB]
[--confindr_database CONFINDR_DATABASE]
[--min_trimmed_length MIN_TRIMMED_LENGTH]
[--assembler ASSEMBLER]
[--shovill_output_options SHOVILL_OUTPUT_OPTIONS]
[--shovill_depth SHOVILL_DEPTH] [--shovill_ram SHOVILL_RAM]
[--shovill_tmpdir SHOVILL_TMPDIR]
[--shovill_extraopts SHOVILL_EXTRAOPTS]
[--shovill_modules SHOVILL_MODULES]
[--mlst_scheme MLST_SCHEME] [-t, --threads THREADS]
[--threads_sample THREADS_SAMPLE] [-c, --condaprefix CONDAPREFIX]
[-n, --dryrun] [-f, --force RULE]
[--forceall] [--fix_fails] [--unlock]
[--no_assembly] [--ephemeral] [--remove_temp] [--use_conda]
[--conda_frontend]
optional arguments:
-h, --help show this help message and exit
-V, --version Print program version.
-l SAMPLE_LIST, --sample_list SAMPLE_LIST
List of samples to assemble, format as defined by ...
-d WORKING_DIRECTORY, --working_directory WORKING_DIRECTORY
Working directory
-s SNAKEFILE, --snakefile SNAKEFILE
Path to Snakefile of bakcharak pipeline; default: path
to Snakefile in same directory
-r RUN_NAME, --run_name RUN_NAME
Name of the sequencing run for all samples in the
sample list, e.g.
"210401_M02387_0709_000000000-HBXX6", or a self-chosen
analysis title
--docker DOCKER Mapped volume path of the host system
--qc_thresholds QC_THRESHOLDS
Definition of thresholds in JSON file; default:
<path_to_aquamis>/thresholds/AQUAMIS_thresholds.json
--json_schema JSON_SCHEMA
JSON schema used for validation; default:
<path_to_aquamis>/resources/AQUAMIS_schema_v20210226.json
--json_filter JSON_FILTER
Definition of thresholds in JSON file; default:
<path_to_aquamis>/thresholds/AQUAMIS_schema_filter_v20210226.json
-m MASHDB, --mashdb MASHDB
Path to reference mash database; default:
<path_to_aquamis>/reference_db/mash/mashDB.msh
--mash_kmersize MASH_KMERSIZE
kmer size for mash, must match size of database;
default: 21
--mash_sketchsize MASH_SKETCHSIZE
sketch size for mash, must match size of database;
default: 1000
--kraken2db KRAKEN2DB
Path to kraken2 database; default:
<path_to_aquamis>/reference_db/kraken2
--taxlevel_qc TAXLEVEL_QC
Taxonomic level for kraken2 classification quality
control. Choose S for species or G for genus; default:
"G"
--read_length READ_LENGTH
Read length to be used in bracken abundance
estimation; default: 150
--taxonkit_db TAXONKIT_DB
Path to taxonkit_db; default:
<path_to_aquamis>/reference_db/taxonkit
--confindr_database CONFINDR_DATABASE
Path to confindr databases; default:
<path_to_aquamis>/reference_db/confindr
--min_trimmed_length MIN_TRIMMED_LENGTH
Minimum length of a read to keep; default: 15
--assembler ASSEMBLER
Assembler to use in shovill, choose from megahit
velvet skesa spades; default: "spades"
--shovill_output_options SHOVILL_OUTPUT_OPTIONS
Extra output options for shovill; default: ""
--shovill_depth SHOVILL_DEPTH
Sub-sample --R1/--R2 to this depth. Disable with
--depth 0; default: 100
--shovill_ram SHOVILL_RAM
Limit amount of RAM provided to shovill; default: 16
--shovill_tmpdir SHOVILL_TMPDIR
Fast temporary directory; default: "/tmp/shovill"
--shovill_extraopts SHOVILL_EXTRAOPTS
Extra options for shovill; default: ""
--shovill_modules SHOVILL_MODULES
Module options for shovill; choose from --noreadcorr
--trim --nostitch --nocorr --noreadcorr; default: "--
noreadcorr"
--mlst_scheme MLST_SCHEME
Extra options for MLST; default: ""
-t THREADS, --threads THREADS
Number of Threads to use. Ideally multiple of 10;
default: 10
--threads_sample THREADS_SAMPLE
Number of Threads to use per sample; default: 1
-c CONDAPREFIX, --condaprefix CONDAPREFIX
Path of default conda environment, enables recycling
built environments; default: "<path_to_aquamis>/conda_env"
-n, --dryrun Snakemake dryrun. Only calculate graph without
executing anything
--forceall Snakemake force. Force recalculation of all steps
-f RULE, --force RULE Snakemake force. Force recalculation of output (rule
or file) speciefied here
--fix_fails Re-run snakemake after failure removing failed samples
--unlock Unlock a snakemake execution folder if it had been
interrupted
--no_assembly Only trimming and kraken analysis
--ephemeral Snakemake All-Rule: Remove all temporary data except
result JSONs and Reports
--remove_temp Remove large temporary files. May lead to slower re-
runs but saves disk space.
--use_conda Utilize the Snakemake "--useconda" option, i.e. Smk
rules require execution with a specific conda env
--conda_frontend Do not use mamba but conda as frontend to create
individual conda environments.
For example:
<path_to_aquamis>/aquamis.py -l <path_to_data>/samples.tsv -s <path_to_aquamis>/Snakefile -c <path_to_envs> -m <path_to_databases>/mash/mash_db.msh -d <path_to_data>
You can also run Snakemake directly:
snakemake -p --conda-prefix <path_to_envs> --keep-going --configfile <path_to_data>/config.yaml --snakefile <path_to_aquamis>/Snakefile --use-conda
AQUAMIS is built to be used routinely. To ensure maximum comparability of the results, a default config.yaml file is generated when calling the aquamis.py wrapper script. The wrapper itself only allows configuring basic functionalities. The config.yaml can be initialized by starting AQUAMIS with the dry-run flag -n. Then, you can alter it to configure AQUAMIS in more detail.
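The two-step workflow described here (dry run first, then edit config.yaml) could be scripted roughly as follows. This is a sketch under assumptions: it presumes the aquamis environment is already active so that the current Python interpreter can run the wrapper, it takes the config.yaml location <path_to_data>/config.yaml from the Snakemake example in this document, and the function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path


def aquamis_command(aquamis_dir, data_dir, dryrun=False, extra_options=()):
    """Build the wrapper call using only options listed in the help text."""
    data_dir = Path(data_dir)
    cmd = [
        sys.executable, str(Path(aquamis_dir) / "aquamis.py"),
        "--sample_list", str(data_dir / "samples.tsv"),
        "--working_directory", str(data_dir),
    ]
    if dryrun:
        cmd.append("--dryrun")  # long form of -n
    return cmd + list(extra_options)


def initialize_config(aquamis_dir, data_dir):
    """Run a dry run and return the path of the generated config.yaml."""
    subprocess.run(aquamis_command(aquamis_dir, data_dir, dryrun=True), check=True)
    config = Path(data_dir) / "config.yaml"  # assumed location, see lead-in
    if not config.exists():
        raise FileNotFoundError(f"expected {config} after the dry run")
    return config
```

After `initialize_config` returns, you would edit the file by hand and then start the real analysis, for example with `subprocess.run(aquamis_command(aquamis_dir, data_dir), check=True)`.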
AQUAMIS provides an interactive, browser-based report that shows the most important measures of your data at first sight. All tables in the report can be sorted and filtered. The Short Summary Table shows the key values for a quick estimate of the success of your sequencing experiment and the assembly. The Detailed Assembly Table gives many additional measures. Thresholds is a copy of the applied threshold definition file. In addition to the tables, many measures are provided as graphical feedback: Plots per Run are generated for one complete sequencing experiment, whereas Plots per Sample each show measures on one specific dataset.
In addition, all results are stored in JSON format in the subfolders /json/pre_assembly, /json/post_assembly and /json/post_qc of your current working directory <path_to_data>. The content of the /json/pre_assembly files is a subset of /json/post_assembly and combines trimming, contamination assessment and read-based taxonomic classification results prior to the assembly stage. /json/pre_assembly represents the final digest when assembly is omitted by enforcing the Snakemake rule all_trimming_only, whereas /json/post_qc represents the final digest of the full pipeline including the quality assessment. Each JSON file is named after its corresponding sample and has the following high-level structure:
.
├── sample/
│   ├── analysis
│   ├── summary
│   └── qc_assessment
└── pipelines/
    ├── fastp
    ├── confindr
    ├── kraken2/
    │   ├── read_based
    │   └── contig_based
    ├── shovill
    ├── samstats
    ├── mlst
    ├── mash
    ├── quast
    ├── busco
    └── aquamis
The node…
* sample/analysis holds metadata on the sample fastq data paths, times of analyses, version info, database hashes and analysis parameters of each performed AQUAMIS call.
* sample/summary combines selected results of all modules, representing the Detailed Assembly Table, and is also available as a single line per sample in <path_to_data>/reports/summary_report.tsv.
* sample/qc_assessment holds QC evaluations based on the thresholds defined in the file AQUAMIS_thresholds.json. For reference, a copy of the latter definition file is available for queries in the tab Thresholds of the assembly report.
* pipelines/ stores the detailed results of each bioinformatic module/tool in a full-take approach.
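As a quick sanity check of a result file against the high-level structure described above, something like the following Python sketch may help. The node names are copied from the tree in this document; whether every node is present in every run is an assumption, and files in /json/pre_assembly are expected to report missing nodes because they are a subset. The function names are hypothetical.

```python
import json

# Node names taken from the high-level structure shown in this document.
EXPECTED_NODES = {
    "sample": {"analysis", "summary", "qc_assessment"},
    "pipelines": {"fastp", "confindr", "kraken2", "shovill", "samstats",
                  "mlst", "mash", "quast", "busco", "aquamis"},
}


def load_result(json_path):
    """Parse one per-sample result file."""
    with open(json_path, encoding="utf-8") as handle:
        return json.load(handle)


def missing_nodes(result):
    """List 'parent/child' nodes from the documented tree that are absent."""
    missing = []
    for parent, children in EXPECTED_NODES.items():
        node = result.get(parent)
        if not isinstance(node, dict):
            missing.append(parent)  # the whole branch is absent
            continue
        missing.extend(f"{parent}/{child}"
                       for child in sorted(children) if child not in node)
    return missing
```

For example, `missing_nodes(load_result(path))` on a post_qc file should ideally return an empty list, while a pre_assembly file would be expected to list the assembly-related nodes.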
For easy data mining of multiple sample JSON files in R, please follow the methods used in the markdown cells Import Sample JSONs and Deserialize and read_data of <path_to_aquamis>/scripts/write_report.Rmd using the R packages jsonlite, rrapply and purrr.
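If you prefer Python over R, a comparable approach might look like the sketch below. It is not a port of write_report.Rmd: it simply reads every sample JSON in a folder and flattens the sample/summary node into one row per sample, using dotted key names. The function names and the dotted naming scheme are assumptions for illustration.

```python
import json
from pathlib import Path


def flatten(node, prefix=""):
    """Flatten nested dicts to {'a.b.c': value}; lists and scalars stay as leaves."""
    flat = {}
    for key, value in node.items():
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat


def collect_summaries(json_dir):
    """Return one flat dict per sample JSON found in json_dir, sorted by file name."""
    rows = []
    for path in sorted(Path(json_dir).glob("*.json")):
        with open(path, encoding="utf-8") as handle:
            result = json.load(handle)
        row = {"file": path.stem}  # files are named after their sample
        row.update(flatten(result.get("sample", {}).get("summary", {})))
        rows.append(row)
    return rows
```

The resulting list of dicts can be handed to `csv.DictWriter` or to a data frame library of your choice, for example with `<path_to_data>/json/post_qc` as input folder.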
The ConFindr installation already provides databases for Listeria, Salmonella and E. coli. Additional databases for Campylobacter, Bacillus, Brucella and Staphylococcus can be found here:
cd <path_to_databases> # free to choose
wget --output-document confindr_db.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/confindr_db.tar.gz
tar -xzvf confindr_db.tar.gz -C <path_to_databases>
Specify the path <path_to_databases>/confindr in the --confindr_database flag.
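The wget and tar steps used for the database downloads in this document follow one pattern, which could be scripted in Python roughly as shown below. This is a sketch using only the standard library; the function names are hypothetical, and the URL and archive name are whatever you pass in (for instance the ones given for the ConFindr archive above).

```python
import tarfile
import urllib.request
from pathlib import Path


def extract_archive(archive, target_dir):
    """Unpack a .tar.gz into target_dir and return the top-level entry names."""
    # Newer Python versions offer extraction filters; use the safe one if present.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir, **kwargs)
    return sorted(p.name for p in Path(target_dir).iterdir())


def fetch_and_extract(url, target_dir, archive_name):
    """Download url to target_dir/archive_name (unless present) and unpack it."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    archive = target_dir / archive_name
    if not archive.exists():  # skip the download on re-runs
        urllib.request.urlretrieve(url, archive)
    return extract_archive(archive, target_dir)
```

Keeping the download and the extraction separate makes it possible to re-run the extraction on an archive you obtained by other means, e.g. on a machine without internet access.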
You may also consider using the species agnostic rMLST database described here.
We propose using the latest minikraken2 and associated bracken database; see here for details. Alternatively, you can download a legacy version:
cd <path_to_databases> # free to choose
wget --output-document minikraken2.tgz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/minikraken2.tar.gz
tar -zxvf minikraken2.tgz
Specify the path <path_to_databases>/minikraken2 in the --kraken2db flag.
For later identification of the database used in an analysis, we calculated SHA256 hashes of various published TAR archives and the k-mer database within (hash.k2d). These can be reviewed in the JSON <path_to_aquamis>/resources/kraken2_db_hashes.json.
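To compare your own database against these published values, you could compute the SHA256 hash of hash.k2d yourself. The Python sketch below reads the file in chunks so that large databases do not need to fit into memory. Since the internal layout of kraken2_db_hashes.json is not described here, the lookup is a plain text search, which is an assumption; both function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_is_listed(hash_value, hashes_json):
    """Check whether the digest occurs anywhere in the hash listing (text search)."""
    listing = Path(hashes_json).read_text(encoding="utf-8").lower()
    return hash_value.lower() in listing
```

A typical call would be `hash_is_listed(sha256_of_file(db_dir / "hash.k2d"), hashes_json)`, where `db_dir` is your kraken2 database folder and `hashes_json` points to the file named above.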
cd <path_to_databases> && mkdir <path_to_databases>/taxonkit # free to choose
wget --output-document taxdump.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/taxdump.tar.gz # 54MB or ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz -C <path_to_databases>/taxonkit/
Specify the path <path_to_databases>/taxonkit in the --taxonkit_db flag.
cd <path_to_databases> && mkdir <path_to_databases>/mash # free to choose
wget --output-document mashDB.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/mashDB.tar.gz
tar -xzvf mashDB.tar.gz -C <path_to_databases>/mash/
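Specify the path to the Mash database in <path_to_databases>/mash in the --mashdb flag; note that the default value, <path_to_aquamis>/reference_db/mash/mashDB.msh, points to the .msh file itself.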
cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs/busco/ # exact path depends on conda installation
wget --output-document bacteria.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/bacteria.tar.gz
tar -xzvf bacteria.tar.gz
To detect the path of your Quast environment and associated Python library path, you may type:
find <path_to_envs>/aquamis -name quast
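The same lookup can be done from Python, which may be convenient inside a setup script. The sketch below searches an environment folder recursively for directories named quast_libs, i.e. the folder used in the BUSCO and Augustus steps of this document; unlike the find call above it matches that directory name only, and the function name is hypothetical.

```python
from pathlib import Path


def find_dirs(env_dir, name="quast_libs"):
    """Return all directories called `name` below env_dir, sorted by path."""
    return sorted(p for p in Path(env_dir).rglob(name) if p.is_dir())
```

For example, `find_dirs("<path_to_envs>/aquamis")` with your real environment path substituted would list the candidate folders; if more than one is returned, the one inside site-packages is likely the relevant one.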
Augustus is an additional dependency to Quast v5 that should be downloaded and installed automatically. In case there is a network issue, please install it manually by typing:
cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs # exact path depends on conda installation
wget -O augustus.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/augustus.tar.gz
tar -xzvf augustus.tar.gz
Test data is provided by downloading the following tarball:
wget --output-document test_data.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/test_data.tar.gz
tar -xzvf test_data.tar.gz -C <path_to_data>/
cd <path_to_data>
<path_to_aquamis>/scripts/create_sampleSheet.sh --help
You can find the AQUAMIS report for this test data set here.
A report with demonstration data for failed QC samples can be found here.
Please consult the AQUAMIS project website for questions.
If this does not help, please feel free to consult: