Metadata-Version: 2.4
Name: antismash
Version: 0.1.0
Summary: metasmash: scalable metagenome-scale BGC mining, a fork of antiSMASH.
Home-page: https://github.com/canerbagci/metasmash
Author: Caner Bağcı
Author-email: caner.bagci@uni-tuebingen.de
License: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: GNU Affero General Public License v3 or later (AGPLv3+)
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE.txt
Requires-Dist: brawn
Requires-Dist: numpy
Requires-Dist: biopython==1.81
Requires-Dist: helperlibs
Requires-Dist: jinja2
Requires-Dist: joblib
Requires-Dist: jsonschema==4.25.1
Requires-Dist: markupsafe>=2.0
Requires-Dist: nrpys>=0.1.1
Requires-Dist: bcbio-gff==0.7.1
Requires-Dist: libsass>=0.22
Requires-Dist: matplotlib
Requires-Dist: orjson
Requires-Dist: scipy
Requires-Dist: scikit-learn>=0.19.0
Requires-Dist: MOODS-python
Provides-Extra: testing
Requires-Dist: pytest<8,>=7.2.0; extra == "testing"
Requires-Dist: coverage; extra == "testing"
Requires-Dist: pylint==3.0.2; extra == "testing"
Requires-Dist: mypy==1.17.1; extra == "testing"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: license
Dynamic: license-file
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# MetaSMASH

A scalable fork of [antiSMASH](https://github.com/antismash/antismash) for metagenome-scale biosynthetic gene cluster (BGC) mining.

## What's different

MetaSMASH adds a streaming pipeline on top of antiSMASH that scales to millions of input records with bounded memory usage:

- **Streaming pipeline** (`--streaming auto|on|off`) — two-phase processing (detection, then analysis) that processes records one at a time instead of loading all into memory. Auto-enabled for inputs with >10 records.
- **Incremental output** — JSON and GenBank files are written per-record as they complete, so results are available even if a run is interrupted.
- **Lazy parallel execution** — worker results are yielded one at a time instead of collected, enabling constant-memory parallelism.
- **Cache preloading with fork CoW** — PFAM and analysis databases are loaded before forking so worker processes share read-only memory via copy-on-write.
- **Metagenome dashboard** — summary view across all records in the HTML output.
- **Record filtering** — `--output-skip-records-without-regions` (on by default) omits records without detected BGC regions from the output.

The classic (non-streaming) antiSMASH pipeline is fully preserved and can be used with `--streaming off`.
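
The lazy parallel execution idea above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not MetaSMASH's actual implementation: `lazy_map` and `max_in_flight` are hypothetical names, and a thread pool stands in for the worker processes to keep the example self-contained.

```python
from concurrent.futures import FIRST_COMPLETED, Executor, ThreadPoolExecutor, wait
from itertools import islice
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def lazy_map(
    executor: Executor,
    func: Callable[[T], R],
    items: Iterable[T],
    max_in_flight: int,
) -> Iterator[R]:
    """Yield results as they complete, with at most `max_in_flight` tasks submitted.

    Hypothetical helper: inputs are pulled from `items` only when a slot
    frees up, so neither inputs nor results accumulate in memory.
    Results arrive in completion order, not input order.
    """
    if max_in_flight < 1:
        raise ValueError("max_in_flight must be at least 1")
    iterator = iter(items)
    # Prime the pool with the first few tasks only.
    pending = {executor.submit(func, item) for item in islice(iterator, max_in_flight)}
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for future in done:
            yield future.result()
            # Refill the freed slot with at most one new task.
            for item in islice(iterator, 1):
                pending.add(executor.submit(func, item))


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        for length in lazy_map(pool, len, ["ACGT", "ACGTACGT", "AC"], max_in_flight=2):
            print(length)
```

Because the caller consumes one result at a time, each result can be written out and discarded before the next is requested, which is what keeps memory roughly constant regardless of input size.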

## Installation

Requires Python >= 3.11. Not on PyPI — install from source.

```bash
# Create a conda environment with required external tools
conda create -n metasmash -c bioconda -c conda-forge \
    "python>=3.11" blast diamond hmmer hmmer2 prodigal

conda activate metasmash

# Clone and install MetaSMASH
git clone https://github.com/canerbagci/metasmash.git
cd metasmash
pip install .

# Download databases (same as upstream antiSMASH)
download-antismash-databases
```

## Usage

```bash
# Basic usage (streaming auto-enabled for large inputs)
metasmash input.fasta

# Force streaming mode
metasmash --streaming on input.fasta

# Disable streaming (classic antiSMASH behaviour)
metasmash --streaming off input.fasta
```
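
The three `--streaming` modes reduce to a small decision rule. The sketch below only restates the documented behaviour (streaming for more than 10 records in `auto` mode); the function name `use_streaming` and the `threshold` parameter are hypothetical and not part of the MetaSMASH code base.

```python
def use_streaming(mode: str, record_count: int, threshold: int = 10) -> bool:
    """Return True if the streaming pipeline would be selected.

    Illustrative only: `threshold` mirrors the ">10 records" rule
    from this README and is an assumption about how it is applied.
    """
    if mode == "on":
        return True
    if mode == "off":
        return False
    if mode == "auto":
        # Strictly greater than: exactly `threshold` records stays classic.
        return record_count > threshold
    raise ValueError(f"unknown streaming mode: {mode!r}")
```

Under this reading, an input with exactly 10 records still runs through the classic pipeline unless `--streaming on` is given.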

### New options

| Option | Default | Description |
|--------|---------|-------------|
| `--streaming {auto,on,off}` | `auto` | `auto` enables streaming for >10 records, `on` forces it, `off` uses classic pipeline |
| `--streaming-phase1-batch-size N` | `0` (auto) | Override records-per-batch in streaming Phase 1 (detection). `0` keeps the automatic heuristic. |
| `--streaming-phase2-window-size N` | `0` (auto) | Override the Phase 2 analysis window size. `0` keeps the automatic heuristic. |
| `--workers W` | same as `--cpus` | Number of parallel worker processes. Each worker gets `cpus/workers` threads. Lower values reduce peak memory. |
| `--output-skip-records-without-regions` / `--no-output-skip-records-without-regions` | on | Omit records without detected BGC regions from JSON/GBK output |
| `--html-taxonomy PATH` | *(none)* | Two-column TSV mapping contig IDs to taxonomy lineages; rendered in the dashboard overview table. |

All standard antiSMASH options (e.g. `--cpus`, `--genefinding-tool`, `--cb-knownclusters`, `--minimal`) work as usual. Run `metasmash --help` for the full list, or see the [upstream documentation](https://docs.antismash.secondarymetabolites.org/).
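
For `--html-taxonomy`, it can help to sanity-check the TSV before a long run. The following is a rough sketch of such a check, assuming no header row, one contig per line, and a free-form lineage string in the second column; none of these details are confirmed by this README, and `read_taxonomy_tsv` is a hypothetical helper, not a MetaSMASH function.

```python
import csv
import io
from typing import TextIO


def read_taxonomy_tsv(handle: TextIO) -> dict[str, str]:
    """Parse a two-column TSV into a {contig_id: lineage} mapping.

    Assumptions (not taken from MetaSMASH itself): no header row,
    blank lines are ignored, and later duplicates overwrite earlier ones.
    """
    mapping: dict[str, str] = {}
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for line_number, row in enumerate(reader, start=1):
        if not row or all(not cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(
                f"line {line_number}: expected 2 columns, found {len(row)}"
            )
        contig_id, lineage = (cell.strip() for cell in row)
        mapping[contig_id] = lineage
    return mapping


if __name__ == "__main__":
    # Example content; the contig IDs and lineages are made up.
    example = io.StringIO(
        "contig_1\tBacteria;Actinomycetota;Streptomyces\n"
        "contig_2\tBacteria;Pseudomonadota\n"
    )
    print(read_taxonomy_tsv(example))
```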

### Parallelism and memory tuning

`--cpus` and `--workers` together control the trade-off between throughput and memory usage:

- **`--cpus N`** (default: all cores) — total CPU budget for the run.
- **`--workers W`** (default: same as `--cpus`) — number of parallel worker processes. Each worker gets `cpus / workers` threads for internal tools (hmmsearch, diamond, blastp, etc.).

In streaming mode, the two phases use workers differently:

| Phase | Workers | Threads per worker | Purpose |
|-------|---------|-------------------|---------|
| Phase 1 (detection) | `cpus` | 1 | Many lightweight workers scanning records in parallel |
| Phase 2 (analysis) | `workers` | `cpus / workers` | Fewer workers with more threads for heavier analysis modules |

**Tuning guidance:**

- **Fewer workers = less memory.** Each worker loads one record into memory, so fewer workers means fewer records resident simultaneously. Each worker compensates by giving its internal tools (hmmsearch, diamond, blastp, etc.) more threads.
- **More workers = higher throughput, higher peak memory.** More records are processed in parallel, but each worker holds its own copy of per-record data structures.
- **Example:** `metasmash --cpus 32 --workers 8 input.fasta` runs 8 parallel records, each using 4 threads internally — a good balance for large metagenomes on a 32-core machine.
- **Default (workers = cpus):** maximises record-level parallelism with single-threaded tools. Fine for small-to-medium inputs; consider lowering `--workers` for very large metagenomes to keep memory bounded.
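
The arithmetic behind the phase table and the tuning guidance can be written out as a short sketch. `PhasePlan` and `plan_phases` are hypothetical names used for illustration, and the use of integer division with a floor of one thread is an assumption: this README only states `cpus / workers`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PhasePlan:
    workers: int
    threads_per_worker: int


def plan_phases(cpus: int, workers: Optional[int] = None) -> dict[str, PhasePlan]:
    """Compute worker and thread counts for both streaming phases.

    Illustrative only. Assumes floor division and at least one
    thread per worker; the real rounding behaviour may differ.
    """
    if workers is None:
        workers = cpus  # documented default: same as --cpus
    if cpus < 1 or workers < 1:
        raise ValueError("cpus and workers must both be at least 1")
    return {
        # Phase 1 (detection): one single-threaded worker per CPU.
        "phase1": PhasePlan(workers=cpus, threads_per_worker=1),
        # Phase 2 (analysis): fewer workers, each with a share of the CPUs.
        "phase2": PhasePlan(
            workers=workers,
            threads_per_worker=max(1, cpus // workers),
        ),
    }


if __name__ == "__main__":
    # Mirrors `metasmash --cpus 32 --workers 8 input.fasta`.
    print(plan_phases(cpus=32, workers=8))
```

With `cpus=32` and `workers=8`, Phase 2 gets 8 workers with 4 threads each, matching the example in the list above.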

## Upstream antiSMASH

For general antiSMASH usage, module documentation, citation information, and the web server, see the upstream project:

- **Repository**: <https://github.com/antismash/antismash>
- **Documentation**: <https://docs.antismash.secondarymetabolites.org/>
- **Citations**: <http://antismash.secondarymetabolites.org/#!/about>

## License

MetaSMASH is available under the GNU Affero General Public License v3.0 or later, same as upstream antiSMASH. See [`LICENSE.txt`](LICENSE.txt) for details.
