AQUAMIS - Assembly-based QUAlity assessment for Microbial Isolate Sequencing

Description

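AQUAMIS is a pipeline for routine assembly and quality assessment of microbial isolate sequencing experiments. It is based on Snakemake and integrates tools such as fastp, ConFindr, Kraken2, shovill, mlst, mash, QUAST and BUSCO, whose results appear under pipelines/ in the JSON output described below.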

AQUAMIS reads untrimmed fastq data from your Illumina sequencing experiments as paired .fastq.gz files. These are then trimmed, assembled and polished. Besides generating ready-to-use contigs, AQUAMIS selects the closest reference genome from NCBI RefSeq and produces an intuitive, detailed report on your data and assemblies so that you can evaluate their reliability for further analyses. The report relies on reference-based and reference-free measures such as coverage depth, gene content, genome completeness and contamination, assembly length and many more. Based on the experience from thousands of sequencing experiments, threshold sets for different species have been defined to detect potentially poor results.

Website

The AQUAMIS project website is https://gitlab.com/bfr_bioinformatics/AQUAMIS

There, you can find the latest version, source code and documentation.

You can find an example AQUAMIS report here.

Installation

You can install AQUAMIS from the Bioconda package, from the Docker container, or by cloning this repository and installing all dependencies with conda. AQUAMIS relies on the conda package manager for all dependencies. Please set up conda on your system as explained here. We advise using mamba instead of conda to resolve the software requirements; install it first via conda install mamba or with the provided installation script <path_to_aquamis>/scripts/aquamis_setup.sh.

Path Placeholders in this manual

Placeholder Path
<path_to_conda> is the conda installation folder, type conda info --base to retrieve its absolute path, typically ~/anaconda3 or ~/miniconda3
<path_to_envs> is the folder that holds your conda environments, typically <path_to_conda>/envs
<path_to_installation> is the parent folder of the AQUAMIS repository
<path_to_aquamis> is the base folder of the AQUAMIS repository, i.e. <path_to_installation>/AQUAMIS
<path_to_databases> is the parent folder of your databases, by default, AQUAMIS uses <path_to_aquamis>/reference_db, but you are free to choose a custom location
<path_to_data> is the working directory for an AQUAMIS analysis typically containing a subfolder <path_to_data>/fastq with your fastq read files

From Source

To install the latest stable version of AQUAMIS, please clone the git repository on your system.

cd <path_to_installation>
git clone https://gitlab.com/bfr_bioinformatics/AQUAMIS.git

AQUAMIS relies on the package manager conda for all dependencies. Please set up conda on your system as explained here.

Next, please run the setup script with the appropriate options; --help lists them:

<path_to_aquamis>/scripts/aquamis_setup.sh --help

The script can install the conda dependency manager mamba, create the conda environment aquamis and install external databases within the default folder <path_to_aquamis>/reference_db.

Manual Conda Environment Setup

Alternatively, please initialize a conda base environment containing snakemake and mamba (mamba is faster in resolving dependencies), then:

mamba env create -f <path_to_aquamis>/envs/aquamis.yaml

This creates an environment named aquamis containing all dependencies. It is found under <path_to_conda>/envs/aquamis.

For custom database paths, please see the chapter Database setup.

From Bioconda

mamba create -n aquamis -c conda-forge -c bioconda aquamis

Please complement the Bioconda installation with reference databases of your choice or via the setup script (see previous chapter From Source). This setup script is also available from within the Bioconda installation after conda activate aquamis or, for direct execution, from the Bioconda installation path <path_to_envs>/aquamis/opt/aquamis/scripts/aquamis_setup.sh. mamba is the recommended conda package dependency resolver.

From Docker

Prerequisite: Install the Docker engine for your favourite operating system (e.g. Ubuntu Linux).

Download the latest version of AQUAMIS from Docker Hub and note down the Docker Image ID on your system (hereafter referred to as $docker_image_id) with the shell commands:

docker pull bfrbioinformatics/aquamis:latest
docker image list | grep "aquamis" | grep "latest" | awk '{ print $3 }'

To process data and write results, Docker needs a volume mapping from a host directory containing your sequence data (<path_to_data>) to the Docker container (/AQUAMIS/analysis). Your sample list (samples.tsv) needs to be located within <path_to_data> and contain relative paths to your NGS reads in the same or another child directory. You may generate a Docker-compatible sample list in your host directory (<path_to_data>/samples.tsv) by executing the create_sampleSheet.sh from the container with the following terminal commands:

host:<path_to_data>$ ls fastq/
sample1_R1.fastq   sample1_R2.fastq   sample2_R1.fastq   sample2_R2.fastq
docker run --rm \
  -v <path_to_data>:/AQUAMIS/analysis \
  -e HOST_PATH=<path_to_data> \
  -e LOCAL_USER_ID=$(id -u $USER) \
  --entrypoint bash \
  $docker_image_id \
  /AQUAMIS/scripts/create_sampleSheet.sh --mode ncbi \
  --fastxDir /AQUAMIS/analysis/fastq \
  --outDir /AQUAMIS/analysis
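The pairing logic behind such a sample list can be sketched in Python. This is not the implementation of create_sampleSheet.sh: the file name pattern (`<sample>_R1.fastq` / `<sample>_R2.fastq`, optionally gzipped) and the header names `sample`, `fq1` and `fq2` are assumptions made for illustration, so please treat the sheet written by the bundled script as authoritative. The sketch only shows how mates are grouped and how paths are kept relative to <path_to_data>.

```python
import re
from pathlib import Path

# Assumption: reads are named <sample>_R1.fastq(.gz) and <sample>_R2.fastq(.gz).
READ_PATTERN = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])\.fastq(\.gz)?$")


def pair_reads(fastq_dir, base_dir):
    """Return sorted (sample, R1, R2) rows with paths relative to base_dir."""
    mates = {}
    for path in sorted(Path(fastq_dir).iterdir()):
        match = READ_PATTERN.match(path.name)
        if match:
            relative = path.relative_to(base_dir).as_posix()
            mates.setdefault(match["sample"], {})[match["mate"]] = relative
    # Refuse to write a sheet if a sample lacks one of its two mates.
    incomplete = sorted(s for s, m in mates.items() if set(m) != {"1", "2"})
    if incomplete:
        raise ValueError(f"samples without both mates: {incomplete}")
    return [(s, m["1"], m["2"]) for s, m in sorted(mates.items())]


def write_sample_sheet(rows, out_file):
    """Write a tab-separated sheet; the header names are hypothetical."""
    lines = ["sample\tfq1\tfq2"] + ["\t".join(row) for row in rows]
    Path(out_file).write_text("\n".join(lines) + "\n")
```

Raising an error for unpaired files is a deliberate choice here: a duplicated or missing mate in the fastq folder is easier to fix before the pipeline starts than after.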

With the following command, AQUAMIS is started within the Docker container and will process any options appended:

docker run --rm \
  -v <path_to_data>:/AQUAMIS/analysis \
  -e HOST_PATH=<path_to_data> \
  -e LOCAL_USER_ID=$(id -u $USER) \
  $docker_image_id \
  --condaprefix /opt/conda/envs \
  --sample_list /AQUAMIS/analysis/samples.tsv \
  --working_directory /AQUAMIS/analysis \
  --<any_other_AQUAMIS_options>

Notes on Docker images: The Docker image conda represents the latest build from the GitLab repository. The Docker image latest contains additional reference databases (also provided via the setup script) as well as a set of test data files (fastq) for validation purposes.

Notes on Docker usage: The container path /AQUAMIS/analysis is fixed and may not be altered. Any subdirectories of <path_to_data> will be available as subdirectories under /AQUAMIS/analysis/. Our container is able to write results with the Linux user and group ID of your choice (UID and GID, respectively) to blend into your host file permission setup. The above option -e LOCAL_USER_ID=$(id -u $USER) passes the UID of the currently executing user; change it according to your needs. The absolute host path mapped to the container must also be provided as the environment variable $HOST_PATH. It is used to correct file paths in the result JSON files of each sample to match the host perspective, which is done automatically via the wrapper argument --docker.
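If you start the container from a script, it may help to assemble the docker run call as an argument list so that the volume mapping, $HOST_PATH and the user ID stay consistent. The following Python sketch uses only the options shown in the command above; the function name `docker_command` and the example host path and image ID are hypothetical.

```python
import os
import shlex

CONTAINER_DIR = "/AQUAMIS/analysis"  # fixed container path, see the notes above


def docker_command(host_path, image_id, uid=None, extra_args=()):
    """Build the docker run argument list for one AQUAMIS analysis."""
    if not os.path.isabs(host_path):
        # HOST_PATH has to be the absolute host path of the mapped volume.
        raise ValueError("host_path must be absolute")
    uid = os.getuid() if uid is None else uid
    return [
        "docker", "run", "--rm",
        "-v", f"{host_path}:{CONTAINER_DIR}",
        "-e", f"HOST_PATH={host_path}",
        "-e", f"LOCAL_USER_ID={uid}",
        image_id,
        "--condaprefix", "/opt/conda/envs",
        "--sample_list", f"{CONTAINER_DIR}/samples.tsv",
        "--working_directory", CONTAINER_DIR,
        *extra_args,
    ]


# Hypothetical host path and image ID; print the command for inspection.
print(shlex.join(docker_command("/data/run42", "0123456789ab", uid=1000,
                                extra_args=["--dryrun"])))
```

The resulting list can be passed to `subprocess.run(..., check=True)` once the printed command looks right.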

Usage

Execution

To run AQUAMIS, activate the conda environment aquamis and call the wrapper script:

conda activate aquamis
python3 aquamis.py --help
usage: aquamis.py [-h, --help] [-V, --version]
                  -l SAMPLE_LIST -d WORKING_DIRECTORY
                  [-s, --snakefile SNAKEFILE]
                  [-r, --run_name RUN_NAME] [--docker DOCKER]
                  [--qc_thresholds QC_THRESHOLDS] [--json_schema JSON_SCHEMA]
                  [--json_filter JSON_FILTER] [-m, --mashdb MASHDB]
                  [--mash_kmersize MASH_KMERSIZE]
                  [--mash_sketchsize MASH_SKETCHSIZE] [--kraken2db KRAKEN2DB]
                  [--taxlevel_qc TAXLEVEL_QC] [--read_length READ_LENGTH]
                  [--taxonkit_db TAXONKIT_DB]
                  [--confindr_database CONFINDR_DATABASE]
                  [--min_trimmed_length MIN_TRIMMED_LENGTH]
                  [--assembler ASSEMBLER]
                  [--shovill_output_options SHOVILL_OUTPUT_OPTIONS]
                  [--shovill_depth SHOVILL_DEPTH] [--shovill_ram SHOVILL_RAM]
                  [--shovill_tmpdir SHOVILL_TMPDIR]
                  [--shovill_extraopts SHOVILL_EXTRAOPTS]
                  [--shovill_modules SHOVILL_MODULES]
                  [--mlst_scheme MLST_SCHEME] [-t, --threads THREADS]
                  [--threads_sample THREADS_SAMPLE] [-c, --condaprefix CONDAPREFIX]
                  [-n, --dryrun] [-f, --force RULE] 
                  [--forceall] [--fix_fails] [--unlock]
                  [--no_assembly] [--ephemeral] [--remove_temp] [--use_conda]
                  [--conda_frontend]

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         Print program version.
  -l SAMPLE_LIST, --sample_list SAMPLE_LIST
                        List of samples to assemble, format as defined by ...
  -d WORKING_DIRECTORY, --working_directory WORKING_DIRECTORY
                        Working directory
  -s SNAKEFILE, --snakefile SNAKEFILE
                        Path to Snakefile of bakcharak pipeline; default: path
                        to Snakefile in same directory
  -r RUN_NAME, --run_name RUN_NAME
                        Name of the sequencing run for all samples in the
                        sample list, e.g.
                        "210401_M02387_0709_000000000-HBXX6", or a self-chosen
                        analysis title
  --docker DOCKER       Mapped volume path of the host system
  --qc_thresholds QC_THRESHOLDS
                        Definition of thresholds in JSON file; default: 
                        <path_to_aquamis>/thresholds/AQUAMIS_thresholds.json
  --json_schema JSON_SCHEMA
                        JSON schema used for validation; default:
                        <path_to_aquamis>/resources/AQUAMIS_schema_v20210226.json
  --json_filter JSON_FILTER
                        Definition of thresholds in JSON file; default:
                        <path_to_aquamis>/thresholds/AQUAMIS_schema_filter_v20210226.json
  -m MASHDB, --mashdb MASHDB
                        Path to reference mash database; default:
                        <path_to_aquamis>/reference_db/mash/mashDB.msh
  --mash_kmersize MASH_KMERSIZE
                        kmer size for mash, must match size of database;
                        default: 21
  --mash_sketchsize MASH_SKETCHSIZE
                        sketch size for mash, must match size of database;
                        default: 1000
  --kraken2db KRAKEN2DB
                        Path to kraken2 database; default:
                        <path_to_aquamis>/reference_db/kraken2
  --taxlevel_qc TAXLEVEL_QC
                        Taxonomic level for kraken2 classification quality
                        control. Choose S for species or G for genus; default:
                        "G"
  --read_length READ_LENGTH
                        Read length to be used in bracken abundance
                        estimation; default: 150
  --taxonkit_db TAXONKIT_DB
                        Path to taxonkit_db; default:
                        <path_to_aquamis>/reference_db/taxonkit
  --confindr_database CONFINDR_DATABASE
                        Path to confindr databases; default:
                        <path_to_aquamis>/reference_db/confindr
  --min_trimmed_length MIN_TRIMMED_LENGTH
                        Minimum length of a read to keep; default: 15
  --assembler ASSEMBLER
                        Assembler to use in shovill, choose from megahit
                        velvet skesa spades; default: "spades"
  --shovill_output_options SHOVILL_OUTPUT_OPTIONS
                        Extra output options for shovill; default: ""
  --shovill_depth SHOVILL_DEPTH
                        Sub-sample --R1/--R2 to this depth. Disable with
                        --depth 0; default: 100
  --shovill_ram SHOVILL_RAM
                        Limit amount of RAM provided to shovill; default: 16
  --shovill_tmpdir SHOVILL_TMPDIR
                        Fast temporary directory; default: "/tmp/shovill"
  --shovill_extraopts SHOVILL_EXTRAOPTS
                        Extra options for shovill; default: ""
  --shovill_modules SHOVILL_MODULES
                        Module options for shovill; choose from --noreadcorr
                        --trim --nostitch --nocorr --noreadcorr; default: "--
                        noreadcorr"
  --mlst_scheme MLST_SCHEME
                        Extra options for MLST; default: ""
  -t THREADS, --threads THREADS
                        Number of Threads to use. Ideally multiple of 10;
                        default: 10
  --threads_sample THREADS_SAMPLE
                        Number of Threads to use per sample; default: 1
  -c CONDAPREFIX, --condaprefix CONDAPREFIX
                        Path of default conda environment, enables recycling
                        built environments; default: "<path_to_aquamis>/conda_env"
  -n, --dryrun          Snakemake dryrun. Only calculate graph without
                        executing anything
  --forceall            Snakemake force. Force recalculation of all steps
  -f RULE, --force RULE Snakemake force. Force recalculation of output (rule
                        or file) speciefied here
  --fix_fails           Re-run snakemake after failure removing failed samples
  --unlock              Unlock a snakemake execution folder if it had been
                        interrupted
  --no_assembly         Only trimming and kraken analysis
  --ephemeral           Snakemake All-Rule: Remove all temporary data except
                        result JSONs and Reports
  --remove_temp         Remove large temporary files. May lead to slower re-
                        runs but saves disk space.
  --use_conda           Utilize the Snakemake "--useconda" option, i.e. Smk
                        rules require execution with a specific conda env
  --conda_frontend      Do not use mamba but conda as frontend to create
                        individual conda environments.

For example:

<path_to_aquamis>/aquamis.py -l <path_to_data>/samples.tsv -s <path_to_aquamis>/Snakefile -c <path_to_envs> -m <path_to_databases>/mash/mashDB.msh -d <path_to_data>

You can also run snakemake directly:

snakemake -p --conda-prefix <path_to_envs> --keep-going --configfile <path_to_data>/config.yaml --snakefile <path_to_aquamis>/Snakefile --use-conda

Configuration

AQUAMIS is built to be used routinely. To ensure maximum comparability of the results, a default config.yaml file is generated when calling the aquamis.py wrapper script. The wrapper itself only allows configuring basic functionality. The config.yaml can be initialized by starting AQUAMIS with the dry-run flag -n. You can then edit it to configure AQUAMIS in more detail.
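The two-step workflow (dry run first, edit config.yaml, then run) can be scripted. The Python sketch below only builds and runs the wrapper call with flags listed in the help text; the function names are hypothetical, and the location of config.yaml in the working directory is an assumption based on the direct snakemake call shown earlier. It has to be run inside the activated aquamis environment.

```python
import subprocess
from pathlib import Path


def aquamis_argv(aquamis_dir, data_dir, dryrun=False, extra_args=()):
    """Build the wrapper call; only flags from the help text are used."""
    data_dir = Path(data_dir)
    argv = [
        "python3", str(Path(aquamis_dir) / "aquamis.py"),
        "--sample_list", str(data_dir / "samples.tsv"),
        "--working_directory", str(data_dir),
    ]
    if dryrun:
        argv.append("--dryrun")
    return argv + list(extra_args)


def initialize_config(aquamis_dir, data_dir):
    """Run a dry run so that config.yaml is generated, then return its path."""
    subprocess.run(aquamis_argv(aquamis_dir, data_dir, dryrun=True), check=True)
    # Assumption: config.yaml is written to the working directory.
    return Path(data_dir) / "config.yaml"
```

After editing the returned file, the analysis itself can be started with `subprocess.run(aquamis_argv(aquamis_dir, data_dir), check=True)` or with the direct snakemake call.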

Results

AQUAMIS provides an interactive, browser-based report that shows the most important measures of your data at first sight. All tables in the report can be sorted and filtered. The Short Summary Table shows the key values for a quick estimate of the success of your sequencing experiment and the assembly. The Detailed Assembly Table gives many additional measures. Thresholds is a copy of the applied threshold definition file. In addition to the tables, many measures are provided as plots: Plots per Run cover one complete sequencing experiment, whereas Plots per Sample show measures for one specific dataset.

JSON output

In addition, all results are stored in JSON format in the subfolders /json/pre_assembly, /json/post_assembly and /json/post_qc of your current working directory <path_to_data>. The content of /json/pre_assembly files is a subset of /json/post_assembly and combines trimming, contamination assessment and read-based taxonomic classification results prior to the assembly stage. /json/pre_assembly represents the final digest when assembly is omitted by enforcing the Snakemake rule all_trimming_only, whereas /json/post_qc represents the final digest of the full pipeline including the quality assessment. Each JSON file is named after its corresponding sample and has the following high-level structure:

.
├── sample/
│   ├── analysis
│   ├── summary
│   └── qc_assessment
└── pipelines/
    ├── fastp
    ├── confindr
    ├── kraken2/
    │   ├── read_based
    │   └── contig_based
    ├── shovill
    ├── samstats
    ├── mlst
    ├── mash
    ├── quast
    ├── busco
    └── aquamis

The node…

* sample/analysis holds metadata on the sample fastq data paths, times of analyses, version info, database hashes and analysis parameters of each performed AQUAMIS call.
* sample/summary combines selected results of all modules, representing the Detailed Assembly Table and is also available as a single line per sample in the <path_to_data>/reports/summary_report.tsv.
* sample/qc_assessment holds QC evaluations based on the thresholds defined in the file AQUAMIS_thresholds.json. For reference, a copy of the latter definition file is available for queries in the tab Thresholds of the assembly report.
* pipelines/ stores the detailed results of each bioinformatic module/tool in a full take approach.

For easy data mining of multiple sample JSON files in R, please follow the methods used in the markdown cells Import Sample JSONs and Deserialize and read_data of <path_to_aquamis>/scripts/write_report.Rmd using the R packages jsonlite, rrapply and purrr.
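If you prefer Python over R, a comparable starting point is sketched here. It is not part of AQUAMIS: it assumes that the per-sample files carry the extension .json and that sample/summary is a flat key-value mapping, and the function name `load_summaries` is hypothetical. No specific summary field names are assumed.

```python
import json
from pathlib import Path


def load_summaries(working_dir, stage="post_qc"):
    """Collect the sample/summary node of every JSON file of one stage."""
    rows = []
    json_dir = Path(working_dir, "json", stage)
    for json_file in sorted(json_dir.glob("*.json")):
        with json_file.open() as handle:
            data = json.load(handle)
        # Each file is named after its sample, so keep the stem as identifier.
        row = {"sample_file": json_file.stem}
        # Missing nodes yield an empty summary instead of a KeyError.
        row.update(data.get("sample", {}).get("summary", {}))
        rows.append(row)
    return rows
```

The returned list of dictionaries can be handed to `csv.DictWriter` or, if available, to `pandas.DataFrame` for further filtering.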

Database Setup

ConFindr database

The ConFindr installation already provides databases for Listeria, Salmonella and E. coli. Additional databases for Campylobacter, Bacillus, Brucella and Staphylococcus can be found here:

cd <path_to_databases>   # free to choose
wget --output-document confindr_db.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/confindr_db.tar.gz
tar -xzvf confindr_db.tar.gz -C <path_to_databases>

Specify the path <path_to_databases>/confindr in the --confindr_database flag.

You may also consider using the species agnostic rMLST database described here.

Kraken2 and bracken database

We propose using the latest minikraken2 and associated bracken database, see here for details. Alternatively, you can download a legacy version:

cd <path_to_databases>   # free to choose
wget --output-document minikraken2.tgz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/minikraken2.tar.gz
tar -zxvf minikraken2.tgz

Specify the path <path_to_databases>/minikraken2 in the --kraken2db flag.

For later identification of the database used in an analysis, we calculated SHA256 hashes of various published TAR archives and the k-mer database within (hash.k2d). These can be reviewed in the JSON <path_to_aquamis>/resources/kraken2_db_hashes.json.
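To compare your own database against these hashes, you can compute the SHA256 digest of hash.k2d yourself. The Python sketch below reads the file in chunks so that large k-mer databases do not have to fit into memory; the function name is hypothetical, and looking up the digest in kraken2_db_hashes.json is left to the reader because the layout of that file is not described here.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal SHA256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops as soon as read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `sha256_of_file("<path_to_databases>/minikraken2/hash.k2d")` returns the digest as a 64-character string.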

Taxonomy database

cd <path_to_databases> && mkdir <path_to_databases>/taxonkit   # free to choose
wget --output-document taxdump.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/taxdump.tar.gz  #  54MB or ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xzvf taxdump.tar.gz -C <path_to_databases>/taxonkit/

Specify the path <path_to_databases>/taxonkit in the --taxonkit_db flag.

mash database

cd <path_to_databases> && mkdir <path_to_databases>/mash  # free to choose
wget --output-document mashDB.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/mashDB.tar.gz
tar -xzvf mashDB.tar.gz -C <path_to_databases>/mash/

Specify the path to the mash sketch file in <path_to_databases>/mash (named mashDB.msh in the default setup) in the --mashdb flag.

Quast module: BUSCO

cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs/busco/   # exact path depends on conda installation
wget --output-document bacteria.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/bacteria.tar.gz
tar -xzvf bacteria.tar.gz

To detect the path of your Quast environment and associated Python library path, you may type:

find <path_to_envs>/aquamis -name quast
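The same lookup can be done in Python, for example inside a setup script. This sketch searches for the quast_libs folder named in the commands above rather than for quast itself, and the function names are hypothetical.

```python
from pathlib import Path


def find_quast_libs(env_dir):
    """Return all quast_libs directories below env_dir, sorted by path."""
    return sorted(p for p in Path(env_dir).rglob("quast_libs") if p.is_dir())


def busco_dir(env_dir):
    """Return the busco folder inside the first quast_libs hit, or None."""
    hits = find_quast_libs(env_dir)
    return hits[0] / "busco" if hits else None
```

If several Python versions are installed in the environment, `find_quast_libs` returns more than one hit, and you should check which one belongs to the Quast installation in use.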

Quast module: Augustus

Augustus is an additional dependency of Quast v5 that should be downloaded and installed automatically. In case of a network issue, please install it manually by typing:

cd <path_to_envs>/aquamis/lib/python3.7/site-packages/quast_libs   # exact path depends on conda installation
wget -O augustus.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/augustus.tar.gz
tar -xzvf augustus.tar.gz

Test data

Test data can be downloaded as the following tarball:

wget --output-document test_data.tar.gz https://gitlab.bfr.berlin/bfr_bioinformatics/aquamis_databases/-/raw/main/test_data.tar.gz
tar -xzvf test_data.tar.gz -C <path_to_data>/
cd <path_to_data>
<path_to_aquamis>/scripts/create_sampleSheet.sh --help

You can find the AQUAMIS report for this test data set here.

A report with demonstration data for failed QC samples can be found here.

Contact

Please consult the AQUAMIS project website for questions.

If this does not help, please feel free to consult: