Metadata-Version: 2.4
Name: baktfold
Version: 0.2.0
Summary: Rapid and standardized annotation of bacterial genomes, MAGs and plasmids using protein structural information
Author-email: George Bouras <george.bouras@adelaide.edu.au>, Oliver Schwengers <oliver.schwengers@cb.jlug.de>
Project-URL: Homepage, https://github.com/gbouras13/baktfold
Project-URL: Documentation, https://baktfold.readthedocs.io/en/latest
Keywords: keyword,are,cool
Classifier: Development Status :: 4 - Beta
Classifier: Environment :: Console
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: English
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Requires-Python: <4,>=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: click>=8.0.0
Requires-Dist: loguru>=0.5.3
Requires-Dist: pyyaml>=6.0
Requires-Dist: pandas>=1.4.2
Requires-Dist: biopython>=1.80
Requires-Dist: datasets>=2.15
Requires-Dist: requests>=2.25
Requires-Dist: sentencepiece>=0.1.99
Requires-Dist: transformers>=4.34
Requires-Dist: torch>=2.1.2
Requires-Dist: pyarrow>=14.0.0
Requires-Dist: alive-progress>=3.0.1
Requires-Dist: numpy<2,>=1.20
Requires-Dist: h5py>=3.5
Requires-Dist: tqdm>=4.35.0
Requires-Dist: xopen>=1.5.0
Requires-Dist: protobuf>=3.20
Requires-Dist: tiktoken>=0.5.0
Provides-Extra: lint
Requires-Dist: isort; extra == "lint"
Requires-Dist: black; extra == "lint"
Provides-Extra: test
Requires-Dist: pytest>=6.2.5; extra == "test"
Requires-Dist: pytest-cov>=3.0.0; extra == "test"
Provides-Extra: exe
Requires-Dist: setuptools; extra == "exe"
Requires-Dist: wheel; extra == "exe"
Requires-Dist: build; extra == "exe"
Dynamic: license-file

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gbouras13/baktfold/blob/main/run_baktfold.ipynb)

[![Anaconda-Server Badge](https://anaconda.org/bioconda/baktfold/badges/version.svg)](https://anaconda.org/bioconda/baktfold)
[![Bioconda Downloads](https://img.shields.io/conda/dn/bioconda/baktfold)](https://img.shields.io/conda/dn/bioconda/baktfold)
[![PyPI version](https://badge.fury.io/py/baktfold.svg)](https://badge.fury.io/py/baktfold)
[![Downloads](https://static.pepy.tech/badge/baktfold)](https://pepy.tech/project/baktfold)

# Baktfold
Rapid & standardized protein annotation using structural information

**To learn more about `baktfold`, please [read our preprint](https://www.biorxiv.org/content/10.64898/2026.03.31.715528v1)**  

> George Bouras, Sung won Lim, Lindsay Durr, Sarah Vreugde, Alexander Goesmann, Robert A. Edwards, Oliver Schwengers
> Baktfold: Sensitive protein functional annotation across the microbial tree of life using structural information
> bioRxiv 2026.03.31.715528; doi: https://doi.org/10.64898/2026.03.31.715528

Baktfold is a sensitive tool for protein annotation using structural homology. Although it was designed with bacterial genomes in mind, to work in conjunction with Bakta (hence the name!), Baktfold also works well on archaea, plasmids and even eukaryotes.

Baktfold is similar to [Phold](https://github.com/gbouras13/phold) but goes beyond phages. 

Baktfold takes all _hypothetical proteins_ from [Bakta's](https://github.com/oschwengers/bakta) output and uses the [ProstT5](https://github.com/mheinzinger/ProstT5) protein language model to rapidly translate protein amino acid sequences to the 3Di token alphabet used by [Foldseek](https://github.com/steineggerlab/foldseek). Foldseek is then used to search these against a series of databases (SwissProt, AlphaFold Database non-singleton clusters, PDB and CATH).

Along with Bakta, Baktfold also supports [Prokka](https://github.com/tseemann/prokka) and generic GenBank format files as input.

Additionally, instead of using ProstT5, you can specify protein structures that you have pre-computed for your hypothetical proteins.

You can also specify custom databases to search against using `--custom-db`.

**Baktfold is currently under active development. We welcome any and all feedback (especially bug reports) via Issues.**

# Google Colab Notebook

If you don't want to install Baktfold locally, you can run it without writing any code using the [Google Colab notebook](https://colab.research.google.com/github/gbouras13/baktfold/blob/main/run_baktfold.ipynb).

# Webserver

Baktfold is available to run via Bakta's web server - the provisional link is [here](https://staging.bakta.computational.bio/).

# Table of Contents

- [Baktfold](#baktfold)
- [Google Colab Notebook](#google-colab-notebook)
- [Webserver](#webserver)
- [Table of Contents](#table-of-contents)
  - [Install](#install)
    - [Conda (recommended)](#conda-recommended)
    - [Pip](#pip)
    - [Source](#source)
    - [Database Installation](#database-installation)
  - [Example - Bacteria](#example---bacteria)
  - [Conversion wrapper commands](#conversion-wrapper-commands)
  - [Usage](#usage)
  - [Output](#output)
    - [Conceptual terms](#conceptual-terms)
  - [Citations](#citations)

## Install

### Conda (recommended)

The best way to install Baktfold is with conda, as this will install Foldseek (the only non-Python dependency) along with the Python dependencies.
We highly recommend installing conda via [Miniforge](https://github.com/conda-forge/miniforge):

```bash
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold 
```

To use Baktfold with a GPU, a GPU-compatible version of PyTorch must be installed (by default, the CPU-only version is installed):

```bash
# the version spec is quoted so that the shell does not try to expand the `*` characters
conda create -n baktfoldENV -c conda-forge -c bioconda baktfold "pytorch=*=cuda*"
```

If you have a Mac with M-series Apple Silicon, you may need to install a particular version of PyTorch to use GPU acceleration. The same is true if you use a non-NVIDIA GPU (e.g. AMD). See [this link](https://phold.readthedocs.io/en/latest/install/) for more detail and further links.

### Pip

You can also install Baktfold using Pip:

```bash
pip install baktfold
```

You will need to have Foldseek (ideally v10.941cd33) installed and available in the $PATH.
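
Before running a pip-installed Baktfold, it can help to confirm that Foldseek is actually visible on your `$PATH`. The following is a minimal sketch; `have_tool` is an illustrative helper name and is not part of Baktfold.

```bash
# Illustrative helper (not part of Baktfold): succeeds if a program is on $PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool foldseek; then
    echo "foldseek found at: $(command -v foldseek)"
else
    echo "foldseek not found on \$PATH; install it before running baktfold" >&2
fi
```

The check only reports whether the executable can be found; it does not verify which Foldseek version is installed.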

### Source

You can install the latest development version of Baktfold (with potentially untested and unreleased changes) into a conda environment as follows:

```bash
conda create -n baktfoldENV foldseek
conda activate baktfoldENV
git clone https://github.com/gbouras13/baktfold.git
cd baktfold
pip install .
baktfold --help
```

### Database Installation

To download and install Baktfold's databases (use as many threads with `-t` as you can to speed up downloading):

```bash
baktfold install -d baktfold_db -t 8
```

If you have an NVIDIA GPU, you will need to format the database to allow it to use Foldseek-GPU with `--foldseek-gpu`.

Note: you can do this after downloading the database with the above command. It won't redownload the database; it will only do the relevant Foldseek database padding.

```bash
baktfold install -d baktfold_db --foldseek-gpu
```

## Example - Bacteria

First, you need to run [Bakta](https://github.com/oschwengers/bakta) and use the resulting `.json` file as input for Baktfold. For bacteria or plasmids, we always recommend Bakta.

Running Baktfold on Bakta results (using a dummy test example JSON `assembly.json` file):

```bash
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db
# with NVIDIA GPU
baktfold run -i tests/test_data/assembly_bakta_output/assembly.json -o baktfold_output -f -t 8 -d baktfold_db --foldseek-gpu
```

Running Baktfold on protein sequences (using a dummy test example FASTA `.faa` file):
  * Note that this can be any `.faa` file. It does not have to be the output of Bakta.

```bash
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db
# with NVIDIA GPU
baktfold proteins -i tests/test_data/assembly.hypotheticals.faa -o baktfold_proteins_output -f -t 8 -d baktfold_db --foldseek-gpu
```

As of v0.2.0, you can also run `baktfold proteins` on a Bakta proteins JSON output file:
  * Note that this will only annotate the hypothetical proteins in the Bakta JSON output.
  * Note also that `baktfold proteins` will automatically detect the input format (`.faa` or Bakta JSON).

```bash
# default (CPU-only or non-NVIDIA GPU e.g. Mac or AMD)
baktfold proteins -i tests/test_data/assembly_bakta_proteins_output_all/assembly.json -o baktfold_proteins_output_from_json -f -t 8 -d baktfold_db
# with NVIDIA GPU
baktfold proteins -i tests/test_data/assembly_bakta_proteins_output_all/assembly.json -o baktfold_proteins_output_from_json -f -t 8 -d baktfold_db --foldseek-gpu
```

## Conversion wrapper commands

If you have not used Bakta to annotate your genome before running Baktfold, you have two choices: (1) annotate proteins only with `baktfold proteins`, or (2) if you have a GenBank format file, convert it to the Bakta `.json` format.

To do this, you have 3 options:

1. `baktfold convert-prokka`

* If you have used Prokka to annotate your genome, `baktfold` has a subcommand that will do the conversion for you
* e.g.
```bash
baktfold convert-prokka -i prokka.gbk -o prokka.json
```

2. `baktfold convert-euk`

* This is an experimental feature for eukaryotes (protists, fungi etc) - you can try converting these with a subcommand
* You will then need to pass `--euk` to `baktfold run` as well to make sure it can handle the different genomic features of eukaryotes

* e.g.
```bash
baktfold convert-euk -i euk.gbk -o euk.json
```

3. `genbank_to`

* If neither of those works for you, you can try the [genbank_to](https://github.com/linsalrob/genbank_to) package, which can convert a GenBank file into the Bakta format JSON
* You will need to install it separately (`pip install genbank_to`)

* e.g.
```bash
genbank_to -g test.gbk --bakta-json test.json
```

## Usage

The two most useful commands are `baktfold run` and `baktfold proteins`.

- `baktfold run` accepts a __Bakta JSON file__ as input and, by default, annotates all hypothetical CDS and returns a variety of Bakta-like, format-compliant outputs. All other annotations are inherited from the Bakta output.
- `baktfold proteins` accepts a protein FASTA `.faa` format file as input. It annotates all protein sequences and returns a variety of `bakta_proteins`-like output formats.
- `baktfold predict` and `baktfold compare` split `baktfold run` into the ProstT5 and Foldseek modules, while `baktfold proteins-predict` and `baktfold proteins-compare` do the same for `baktfold proteins` (useful if you have non-NVIDIA GPUs).

It is recommended that you run Baktfold with a GPU if you can. If you do not have a GPU, Baktfold will still run, but the ProstT5 step will be fairly slow. If you have an NVIDIA GPU, you can also use the `--foldseek-gpu` parameter to accelerate Foldseek further.

```bash
Usage: baktfold [OPTIONS] COMMAND [ARGS]...

  Main command line interface for baktfold.

  Returns:   None

  Examples:   >>> main_cli()   None

Options:
  -h, --help     Show this message and exit.
  -V, --version  Show the version and exit.

Commands:
  autotune          Determines optimal batch size for 3Di prediction with...
  citation          Print the citation(s) for this tool
  compare           Runs Foldseek vs baktfold db
  convert-euk       (Experimental) Converts eukaryotic GenBank to Bakta...
  convert-prokka    Converts Prokka GenBank to Bakta format json
  createdb          Creates foldseek DB from AA FASTA and 3Di FASTA input...
  install           Installs ProstT5 model and baktfold database
  predict           Uses ProstT5 to predict 3Di tokens - GPU recommended
  proteins          baktfold proteins-predict then comapare all in one -...
  proteins-compare  Runs Foldseek vs baktfold db on proteins input
  proteins-predict  Runs ProstT5 on a multiFASTA input - GPU recommended
  run               baktfold predict then comapare all in one - GPU...
```

```bash
Usage: baktfold run [OPTIONS]

  baktfold predict then comapare all in one - GPU recommended

Options:
  -h, --help                     Show this message and exit.
  -V, --version                  Show the version and exit.
  -i, --input PATH               Path to input file in Bakta Genbank format or
                                 Bakta JSON format  [required]
  -o, --output PATH              Output directory   [default: output_baktfold]
  -t, --threads INTEGER          Number of threads  [default: 1]
  -p, --prefix TEXT              Prefix for output files  [default: baktfold]
  -d, --database TEXT            Specific path to installed baktfold database
  -f, --force                    Force overwrites the output directory
  --autotune                     Run autotuning to detect and automatically
                                 use best batch size for your hardware.
                                 Recommended only if you have a large dataset
                                 (e.g. thousands of proteins), or else
                                 autotuning will add rather than save runtime.
  --batch-size INTEGER           batch size for ProstT5. 1 is usually fastest.
                                 [default: 1]
  --cpu                          Use cpus only.
  --omit-probs                   Do not output per residue 3Di probabilities
                                 from ProstT5. Mean per protein 3Di
                                 probabilities will always be output.
  --save-per-residue-embeddings  Save the ProstT5 embeddings per resuide in a
                                 h5 file
  --save-per-protein-embeddings  Save the ProstT5 embeddings as means per
                                 protein in a h5 file
  --mask-threshold FLOAT         Masks 3Di residues below this value of
                                 ProstT5 confidence for Foldseek searches
                                 [default: 25]
  -e, --evalue FLOAT             Evalue threshold for Foldseek  [default:
                                 1e-3]
  -s, --sensitivity FLOAT        Sensitivity parameter for foldseek  [default:
                                 9.5]
  --keep-tmp-files               Keep temporary intermediate files,
                                 particularly the large foldseek_results.tsv
                                 of all Foldseek hits
  --max-seqs INTEGER             Maximum results per query sequence allowed to
                                 pass the prefilter. You may want to reduce
                                 this to save disk space for enormous datasets
                                 [default: 1000]
  --ultra-sensitive              Runs baktfold with maximum sensitivity by
                                 skipping Foldseek prefilter. Not recommended
                                 for large datasets.
  --extra-foldseek-params TEXT   Extra foldseek search params
  --custom-db TEXT               Path to custom database
  --foldseek-gpu                 Use this to enable compatibility with
                                 Foldseek-GPU search acceleration
  --custom-annotations PATH      Custom Foldseek DB annotations, 2 column tsv.
                                 Column 1 matches the Foldseek headers, column
                                 2 is the description.
  --euk                          Eukaryotic input genome.
  --fast                         Skips Foldseek search against AFDB Clusters.
  -a, --all-proteins             annotate all proteins (not just
                                 hypotheticals)
```
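
The help text above describes the `--custom-annotations` file as a two-column TSV (Foldseek header, then description). As a rough sketch, a file in that shape could be sanity-checked before use as shown below. The helper name `check_annotations`, the example IDs and the example descriptions are all hypothetical, and the sketch assumes the file has no header row, which the help text does not specify.

```bash
# Illustrative helper (not part of Baktfold): succeeds only if every line has
# exactly two tab-separated, non-empty columns.
check_annotations() {
    awk -F'\t' 'NF != 2 || $1 == "" || $2 == "" { bad = 1 } END { exit bad + 0 }' "$1"
}

# Hypothetical annotations file; the IDs and descriptions are made up.
annotations=$(mktemp)
printf 'my_protein_1\tputative transporter\n' > "$annotations"
printf 'my_protein_2\tputative kinase\n' >> "$annotations"

if check_annotations "$annotations"; then
    echo "annotations file looks well-formed"
else
    echo "annotations file has malformed lines" >&2
fi
```

A file that passes this check could then be supplied to `baktfold run` via `--custom-annotations`.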

## Output

The majority of outputs match [Bakta](https://github.com/oschwengers/bakta?tab=readme-ov-file#input-and-output). Specifically, all the format compliant outputs match Bakta's.

The differences are:

- `<prefix>.inference.tsv` differs from Bakta's. In Baktfold, this file gives a quick overview of the Baktfold databases in which the query protein has a hit (if any).

For example:

```bash
ID	Length	Product	Swissprot	AFDBClusters	PDB	CATH
MEGJMNBEGN_27	162	HTH-type quorum-sensing regulator RhlR	swissprot_P54292	afdbclusters_A0A9E1VSB0	pdb_5l09	cath_3sztB01
MEGJMNBEGN_30	68	hypothetical protein				
MEGJMNBEGN_70	94	hypothetical protein		afdbclusters_A0A1I3V7E0		
```
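
As a quick way to summarise this file, the sketch below counts how many proteins have a hit in at least one database. It assumes the seven-column layout shown in the example above (hits in columns 4 to 7); `count_annotated` is an illustrative helper name, not a Baktfold command.

```bash
# Illustrative helper (not part of Baktfold): count proteins with >= 1 database hit.
# Assumes the seven-column layout shown above, with hits in columns 4-7.
count_annotated() {
    awk -F'\t' '
        NR > 1 {                                  # skip the header row
            for (i = 4; i <= 7; i++)
                if ($i != "") { n++; break }      # count each protein once
        }
        END { print n + 0 }                       # print 0 if nothing matched
    ' "$1"
}

# Recreate the three example rows above in a temporary file.
example=$(mktemp)
printf 'ID\tLength\tProduct\tSwissprot\tAFDBClusters\tPDB\tCATH\n' > "$example"
printf 'MEGJMNBEGN_27\t162\tHTH-type quorum-sensing regulator RhlR\tswissprot_P54292\tafdbclusters_A0A9E1VSB0\tpdb_5l09\tcath_3sztB01\n' >> "$example"
printf 'MEGJMNBEGN_30\t68\thypothetical protein\t\t\t\t\n' >> "$example"
printf 'MEGJMNBEGN_70\t94\thypothetical protein\t\tafdbclusters_A0A1I3V7E0\t\t\n' >> "$example"

count_annotated "$example"
```

For the three example rows this prints `2`, since only `MEGJMNBEGN_30` has no hit in any database.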

- `<prefix>_<database>_tophit.tsv` files give the detailed Foldseek alignment information for each tophit found for each database.

For example:

```bash
query	target	bitscore	fident	evalue	qStart	qEnd	qLen	qCov	tStart	tEnd	tLen	tCov
MEGJMN_070	AF-A0A1I3V7E0-F1-model_v6	292	0.41	2.619e-06	1	91	93	0.97	1	95	99	0.95
```
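
Because the tophit files are plain TSVs, they can be filtered further with standard tools. The sketch below keeps only hits under an E-value cutoff and above a query-coverage cutoff. It assumes the column order shown in the header above (`evalue` in column 5, `qCov` in column 9); `filter_tophits` and the thresholds used are illustrative choices, not Baktfold defaults.

```bash
# Illustrative helper (not part of Baktfold): keep the header plus any tophit
# with evalue <= $2 and qCov >= $3. Column positions follow the header above.
filter_tophits() {
    awk -F'\t' -v max_e="$2" -v min_qcov="$3" '
        NR == 1 { print; next }    # always keep the header row
        ($5 + 0) <= (max_e + 0) && ($9 + 0) >= (min_qcov + 0)
    ' "$1"
}

# Recreate the example row above in a temporary file.
tophits=$(mktemp)
printf 'query\ttarget\tbitscore\tfident\tevalue\tqStart\tqEnd\tqLen\tqCov\ttStart\ttEnd\ttLen\ttCov\n' > "$tophits"
printf 'MEGJMN_070\tAF-A0A1I3V7E0-F1-model_v6\t292\t0.41\t2.619e-06\t1\t91\t93\t0.97\t1\t95\t99\t0.95\n' >> "$tophits"

# Keep hits with evalue <= 1e-5 and query coverage >= 0.9.
filter_tophits "$tophits" 1e-5 0.9
```

With these cutoffs the example hit is retained; tightening the E-value cutoff to `1e-8` would leave only the header.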

- The full Foldseek search outputs are not kept by default (only tophits). You can keep the full Foldseek search TSVs using `--keep-tmp-files`. They will be called `foldseek_results_<database>.tsv`.
- `baktfold_3di.fasta` gives the 3Di tokens for each input CDS.
- `baktfold_prostT5_3di_mean_probabilities.csv` and `baktfold_prostT5_3di_all_probabilities.json` report the confidence ProstT5 has in its predictions. You can disable the per-residue output with `--omit-probs` (mean per-protein probabilities are always output).
- Baktfold does not have plotting functionality like Bakta (yet).

### Conceptual terms

Baktfold inherits annotations and related conceptual terms from Bakta, so we kindly refer you to Bakta's [readme](https://github.com/oschwengers/bakta?tab=readme-ov-file#annotation-workflow).
In addition, Baktfold introduces one conceptual term:

- `PSTC`: protein structure clusters. These comprise structure-based annotations against any of Baktfold's databases.

## Citations

Please cite our preprint:

> George Bouras, Sung won Lim, Lindsay Durr, Sarah Vreugde, Alexander Goesmann, Robert A. Edwards, Oliver Schwengers
> Baktfold: Sensitive protein functional annotation across the microbial tree of life using structural information
> bioRxiv 2026.03.31.715528; doi: https://doi.org/10.64898/2026.03.31.715528

Please also be sure to cite the following core dependencies. Citing all the bioinformatics tools that you use helps us, which in turn helps you get better bioinformatics tools:

- Foldseek - (https://github.com/steineggerlab/foldseek) van Kempen M, Kim S, Tumescheit C, Mirdita M, Lee J, Gilchrist C, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology (2023), [doi:10.1038/s41587-023-01773-0](https://www.nature.com/articles/s41587-023-01773-0)
- ProstT5 - (https://github.com/mheinzinger/ProstT5) Michael Heinzinger, Konstantin Weissenow, Joaquin Gomez Sanchez, Adrian Henkel, Martin Steinegger, Burkhard Rost. ProstT5: Bilingual language model for protein sequence and structure. NAR Genomics and Bioinformatics (2024) [doi:10.1093/nargab/lqae150](https://doi.org/10.1093/nargab/lqae150)

Please also consider citing these databases where relevant:

- AFDB/SwissProt - Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, Sreenath Nair, Milot Mirdita, Jingi Yeo, Oleg Kovalevskiy, Kathryn Tunyasuvunakool, Agata Laydon, Augustin Žídek, Hamish Tomlinson, Dhavanthi Hariharan, Josh Abrahamson, Tim Green, John Jumper, Ewan Birney, Martin Steinegger, Demis Hassabis, Sameer Velankar, AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences, Nucleic Acids Research, Volume 52, Issue D1, 5 January 2024, Pages D368–D375, [https://doi.org/10.1093/nar/gkad1011](https://doi.org/10.1093/nar/gkad1011)
- CATH - Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM. CATH--a hierarchic classification of protein domain structures. Structure. 1997 Aug 15;5(8):1093-108. doi: 10.1016/s0969-2126(97)00260-8. PMID: 9309224.
- PDB - H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The Protein Data Bank (2000) Nucleic Acids Research 28: 235-242 [https://doi.org/10.1093/nar/28.1.235](https://doi.org/10.1093/nar/28.1.235)
