Metadata-Version: 2.4
Name: genome_uploader
Version: 2.5.2
Summary: Python script to upload bins and MAGs in fasta format to ENA (European Nucleotide Archive). This script generates xmls and manifests necessary for submission with webin-cli.
Author-email: MGnify team <metagenomics-help@ebi.ac.uk>
License: Apache Software License 2.0
Project-URL: Homepage, https://github.com/EBI-Metagenomics/genome_uploader
Project-URL: Issues, https://github.com/EBI-Metagenomics/genome_uploader/issues
Keywords: bioinformatics,tool,metagenomics
Classifier: Programming Language :: Python :: 3.11
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Operating System :: OS Independent
Requires-Python: >=3.11
Description-Content-Type: text/markdown
License-File: LICENSE.md
Requires-Dist: requests>=2.26
Requires-Dist: pandas>=1.4
Requires-Dist: numpy>=1.21
Requires-Dist: python-dotenv>=1.0
Requires-Dist: click==8.1.8
Requires-Dist: mgnify-pipelines-toolkit>=1.4.5
Provides-Extra: dev
Requires-Dist: pre-commit==3.3.3; extra == "dev"
Requires-Dist: black==23.7.0; extra == "dev"
Requires-Dist: ruff==v0.0.286; extra == "dev"
Requires-Dist: isort==5.12.0; extra == "dev"
Requires-Dist: bump-my-version==0.9.2; extra == "dev"
Provides-Extra: test
Requires-Dist: pytest==7.4.0; extra == "test"
Requires-Dist: pytest-md==0.2.0; extra == "test"
Requires-Dist: pytest-workflow==2.0.1; extra == "test"
Requires-Dist: pytest-cov==3.0.0; extra == "test"
Requires-Dist: responses==0.23.1; extra == "test"
Dynamic: license-file

# ENA public Bins and MAGs uploader
This repository lets you:

  * Generate xmls and manifests necessary for genome submission
  * Link the genomes you want to submit with the samples/runs used to generate them
  * Upload bins and MAGs in fasta format to ENA (European Nucleotide Archive) with webin-cli 

## How it works

When you submit genomes to the [ENA](https://www.ebi.ac.uk/ena/browser/home), you need to register a sample for every genome, containing all the relevant metadata describing the genome and the sample of origin. The `genome_uploader` acts as the main linker to preserve sample metadata as much as possible. For every genome to register, you need an [INSDC](https://www.insdc.org/) run or assembly accession associated with the genome so that the script can inherit its relevant metadata. On top of those metadata, the script adds user-specified metadata that are specific to the genome, like taxonomy, statistics, or the tools used to generate it. The metadata that ENA requires are described in the checklists for [MAGs](<https://www.ebi.ac.uk/ena/browser/view/ERC000047>) and for [bins](<https://www.ebi.ac.uk/ena/browser/view/ERC000050>), respectively.

### Prepare Input TSV
The `genome_uploader` takes as input one tsv (tab-separated values) table in the following format:

| genome_name | genome_path | accessions | assembly_software | binning_software | binning_parameters | stats_generation_software | completeness | contamination | genome_coverage | metagenome | co-assembly | broad_environment | local_environment | environmental_medium | rRNA_presence | NCBI_lineage |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ERR4647712_crispatus | path/to/ERR4647712.fa.gz | ERR4647712 | megahit_v1.2.9 | MGnify-genomes-generation-pipeline_v1.0.0 | default | CheckM2_v1.0.1 | 100 | 0.38 | 14.2 | chicken gut metagenome | False | chicken | gut | mucosa | True | d__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__Lactobacillus crispatus |

With columns indicating:
  * _genome_name_: genome id (unique string identifier)
  * _accessions_: run(s) or assembly(ies) the genome was generated from (DRR/ERR/SRRxxxxxx for runs, DRZ/ERZ/SRZxxxxxx for assemblies). If the genome was generated by a co-assembly of multiple runs, separate them with a comma.
  * _assembly_software_: assemblerName_vX.X
  * _binning_software_: binnerName_vX.X
  * _binning_parameters_: binning parameters
  * _stats_generation_software_: software_vX.X
  * _completeness_: `float`
  * _contamination_: `float`
  * _rRNA_presence_: `True/False`. `True` if the 5S, 16S, and 23S rRNA genes have all been detected in the genome, together with at least 18 tRNA genes
  * _NCBI_lineage_: full NCBI lineage - format: `x;y;z;...`. The same organism can be described in two different ways:  either in tax ids (`integers`) or `strings`. For example, the lineage for _E. coli_ can be:
    * `Bacteria;Pseudomonadati;Pseudomonadota;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia`
    * `2;1224;1236;91347;543;561;562`
  * _metagenome_: needs to be listed in the taxonomy tree [here](<https://www.ebi.ac.uk/ena/browser/view/408169?show=tax-tree>) (you might need to press "Tax tree - Show" in the right most section of the page)
  * _co-assembly_: `True/False`, whether the genome was generated from a co-assembly. N.B. the script only supports co-assemblies generated from the same project.
  * _genome_coverage_ : genome coverage against raw reads
  * _genome_path_: path to genome to upload (already compressed)
  * _broad_environment_: `string` (explanation following)
  * _local_environment_: `string` (explanation following)
  * _environmental_medium_: `string` (explanation following)

According to ENA checklist's guidelines, `broad_environment` describes the broad ecological context of a sample - desert, taiga, coral reef, ... `local_environment` is more local - lake, harbour, cliff, ... `environmental_medium` is either the material displaced by the sample, or the one in which the sample was embedded prior to the sampling event - air, soil, water, ...
For host-associated metagenomic samples, the three variables can be defined similarly to the following example for the chicken gut metagenome: "chicken digestive system", "digestive tube", "caecum". More information can be found at [ERC000050](<https://www.ebi.ac.uk/ena/browser/view/ERC000050>) for bins and [ERC000047](<https://www.ebi.ac.uk/ena/browser/view/ERC000047>) for MAGs, under the field names "broad-scale environmental context", "local environmental context", and "environmental medium".

An example of an input tsv table can be found [here](examples/input_example.tsv).
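Before launching the uploader, it can help to sanity-check the table. The following is a minimal sketch, not part of `genome_uploader`: the function name `check_table` and the specific checks (all columns present, unique `genome_name`, accession prefixes as listed above, numeric `completeness`/`contamination` when given) are assumptions drawn from the column descriptions, and the tool itself may validate differently.

```python
import csv
import io
import re

# Column names copied from the example table above.
REQUIRED_COLUMNS = [
    "genome_name", "genome_path", "accessions", "assembly_software",
    "binning_software", "binning_parameters", "stats_generation_software",
    "completeness", "contamination", "genome_coverage", "metagenome",
    "co-assembly", "broad_environment", "local_environment",
    "environmental_medium", "rRNA_presence", "NCBI_lineage",
]

# DRR/ERR/SRR (runs) or DRZ/ERZ/SRZ (assemblies) followed by digits.
ACCESSION_RE = re.compile(r"[DES]R[RZ]\d+")


def check_table(handle):
    """Return a list of human-readable problems found in a TSV handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    seen_names = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = row["genome_name"]
        if name in seen_names:
            problems.append(f"line {line_no}: duplicate genome_name {name!r}")
        seen_names.add(name)

        # Co-assemblies list several accessions separated by commas.
        for accession in (row["accessions"] or "").split(","):
            if not ACCESSION_RE.fullmatch(accession.strip()):
                problems.append(f"line {line_no}: unexpected accession {accession!r}")

        # Optional for bins, so only checked when a value is present.
        for column in ("completeness", "contamination"):
            value = row[column]
            if value:
                try:
                    float(value)
                except ValueError:
                    problems.append(f"line {line_no}: {column} is not a number: {value!r}")
    return problems


# In-memory demo; for a real file use open("genomes.tsv", newline="").
demo = io.StringIO("genome_name\taccessions\n")
print(check_table(demo))  # reports the columns missing from this toy header
```

An empty list means none of these particular checks failed; it does not guarantee that ENA will accept the submission.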

## _Warnings_

### Mandatory vs Optional Fields
All fields above are mandatory for MAG submission (see ENA's MAGs checklist [here](<https://www.ebi.ac.uk/ena/browser/view/ERC000047>)). However, if you are registering bins, you can decide whether to omit the following fields: `completeness`, `contamination` and `rRNA_presence` (see ENA's bins checklist [here](<https://www.ebi.ac.uk/ena/browser/view/ERC000050>)). These values are used together to determine MAG quality according to MIMAG criteria (described [here](https://www.nature.com/articles/nbt.3893/tables/1)).

If you already generated these for your bins, our recommendation is to include them for shareability and to describe your sample more accurately.
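As a rough illustration of how these three values combine, the sketch below classifies a genome using the draft-quality thresholds from the MIMAG table linked above (high: >90% completeness, <5% contamination, and the rRNA/tRNA genes present; medium: at least 50% completeness and <10% contamination; low: <50% completeness and <10% contamination). The function name and the returned labels are hypothetical, and this is not necessarily how `genome_uploader` implements the check.

```python
def mimag_quality(completeness, contamination, rrna_presence):
    """Classify a genome with MIMAG draft-quality thresholds.

    completeness and contamination are percentages; rrna_presence is the
    boolean from the rRNA_presence column. Returns None when contamination
    is 10% or more, which these thresholds do not cover.
    """
    if contamination >= 10:
        return None
    if completeness > 90 and contamination < 5 and rrna_presence:
        return "high"
    if completeness >= 50:
        return "medium"
    return "low"


# Values from the example row in the input table above.
print(mimag_quality(100, 0.38, True))
```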

### Existing accessions in the INSDC
Raw-read runs or assemblies from which genomes were generated should already be available on the INSDC (ENA by EBI, GenBank by NCBI, or DDBJ) for this script to work. Therefore, at least a DRR|ERR|SRR accession (for runs) or an ERZ|SRZ|DRZ accession (for assemblies) should be available.

If you are working with your own private data on ENA, you will need to add the `--private` flag to access private metadata through the ENA API. If you are working on public data, you can omit the flag. However, if you are handling both private and public data, you will need to submit them as two separate batches.

### TPA generation and upload
If uploading TPA (Third PArty) genomes, you will need to contact [ENA support](<https://www.ebi.ac.uk/ena/browser/support>) before using the script. They will provide instructions on how to correctly register a TPA project to which you can submit your genomes. If both TPA and non-TPA genomes need to be uploaded, please divide them into two batches and use the `--tpa` flag only with TPA genomes.

### Compress your fasta files
Files to be uploaded must already be compressed (e.g. in .gz format).

### Split your input tables
No more than 5000 genomes can be submitted at the same time. If you have more than 5000, split your table into smaller ones and launch the `genome_uploader` for each table.
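One way to split a large table is sketched below. It is a minimal example using only the standard library; the function name `split_table` and the `_partN.tsv` output naming are hypothetical choices, not something `genome_uploader` provides.

```python
import csv
from pathlib import Path

MAX_GENOMES = 5000  # submission limit stated above


def split_table(tsv_path, out_dir, chunk_size=MAX_GENOMES):
    """Write header-preserving chunks of at most chunk_size rows each."""
    tsv_path = Path(tsv_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    with tsv_path.open(newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        rows = list(reader)

    written = []
    for start in range(0, len(rows), chunk_size):
        part = out_dir / f"{tsv_path.stem}_part{start // chunk_size + 1}.tsv"
        with part.open("w", newline="") as out:
            writer = csv.writer(out, delimiter="\t", lineterminator="\n")
            writer.writerow(header)  # every chunk keeps the column names
            writer.writerows(rows[start:start + chunk_size])
        written.append(part)
    return written
```

Each returned path can then be passed to a separate `genome_upload` run through `--genome_info`.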

## Installation

### Installation with conda (recommended)

The following command installs all necessary dependencies into the `genomeuploader` environment:
```bash
conda install bioconda::genome-uploader
```

### Installation with pip 
Install `genome_uploader` with:

```bash
pip install genome_uploader
```
Additionally, you need to download [the webin-cli.jar](https://github.com/enasequence/webin-cli) from the [latest release](https://github.com/enasequence/webin-cli/releases). 


## Setting ENA Credentials

This tool requires your ENA Webin credentials to function. You can provide these by setting environment variables or using an environment file.

### Using an environment file

Create a file named `.env` in your home directory (`~/.env`), your current working directory (`./.env`), or specify a custom file (default is `.env`).

Add the following lines with your credentials:

```env
ENA_WEBIN=your_username_here
ENA_WEBIN_PASSWORD=your_password_here
```

### Alternatively, set the environment variables directly in your shell

```bash
export ENA_WEBIN=your_username_here
export ENA_WEBIN_PASSWORD=your_password_here
```
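If you wrap the uploader in your own script, a quick pre-flight check can catch missing credentials early. This is a minimal sketch: the helper name `missing_credentials` is hypothetical, and it only inspects a mapping such as `os.environ`; it does not read `.env` files.

```python
import os

# Variable names as documented above.
REQUIRED_VARS = ("ENA_WEBIN", "ENA_WEBIN_PASSWORD")


def missing_credentials(environ=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


# Demo with an explicit mapping instead of the real environment.
print(missing_credentials({"ENA_WEBIN": "your_username_here"}))
```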

## Run

### Generate files for upload
Run `genome_uploader` with input TSV:

```bash
genome_upload \
  -u UPLOAD_STUDY \
  --genome_info METADATA_FILE \
  (--mags | --bins) \
  --centre_name CENTRE_NAME \
  [--out] [--force] [--live] [--tpa] [--private]
```

where
  * `-u UPLOAD_STUDY`: study accession for genomes upload to ENA (in format ERPxxxxxx or PRJEBxxxxxx)
  * `--genome_info METADATA_FILE` : genomes metadata file in tsv format
  * `-m, --mags, -b, --bins`: select either of these for bin **or** MAG upload. If in doubt, check [which definition fits best according to ENA](<https://ena-docs.readthedocs.io/en/latest/submit/assembly/metagenome.html>)
  * `--out`: output folder (default: working directory)
  * `--force`: forces reset of sample xmls generation. This is useful if you changed something in your tsv table, or if ENA metadata haven't been downloaded correctly (you can check this in `ENA_backup.json`).
  * `--live`: registers genomes on ENA's live server. Omitting this option lets you validate samples on the test server beforehand (the upload command will then need the `--test` option for the test submission to work)
  * `--centre_name CENTRE_NAME`: name of the centre generating and uploading genomes
  * `--tpa`: if uploading TPA (Third PArty) generated genomes
  * `--private`: if data is private

> [!NOTE]
> It is recommended to **validate** your genomes in test mode (i.e. without the `--live` argument in the registration step) before attempting the final upload. 
> The test run will proceed on ENA's TEST server. Launching the registration in test mode adds a timestamp to the genome name to allow multiple executions of the test process.
> If no errors occur, then re-run the command **with** the `--live` argument for a live registration to ENA's REAL server.

Sample xmls won't be regenerated automatically if a previous xml already exists. If any metadata or value in the tsv table changes, `--force` will allow xml regeneration.

#### Produced files:
The script produces the following files and folders:
```bash
bin_upload/MAG_upload
├── manifests
│    └── ...
├── manifests_test                  # folder generated for validation in test mode
│    └── ...
├── ENA_backup.json                 # backup file to prevent re-download of metadata from ENA. Regeneration can be forced with --force
├── genome_samples.xml              # xml generated to register samples on ENA before the upload
├── registered_bins/MAGs.tsv        # list of genomes registered on ENA in live mode - needed for manifest generation
├── registered_bins/MAGs_test.tsv   # list of genomes registered on ENA in test mode - needed for manifest generation
└── submission.xml                  # xml used for genome registration on ENA
```
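For scripting around these outputs, the layout above can be expressed as a small helper. This is a sketch under assumptions: the function name `expected_paths` is hypothetical, and it reads `registered_bins/MAGs.tsv` in the tree as either `registered_bins.tsv` or `registered_MAGs.tsv`, with `_test` added in test mode. Check the actual output folder before relying on these names.

```python
from pathlib import Path


def expected_paths(out_dir, mags, live):
    """Map the documented output layout to concrete paths."""
    root = Path(out_dir) / ("MAG_upload" if mags else "bin_upload")
    kind = "MAGs" if mags else "bins"
    suffix = "" if live else "_test"  # test mode uses *_test names
    return {
        "manifests": root / f"manifests{suffix}",
        "registered": root / f"registered_{kind}{suffix}.tsv",
        "backup": root / "ENA_backup.json",
        "samples_xml": root / "genome_samples.xml",
        "submission_xml": root / "submission.xml",
    }


for label, path in expected_paths(".", mags=True, live=False).items():
    print(f"{label}: {path} (exists: {path.exists()})")
```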

An example of output files and folder structure submitted in test mode can be found under the `examples` folder.

### Upload genomes
Once manifest files are generated, it is necessary to use ENA's [webin-cli](https://github.com/enasequence/webin-cli) resource to upload genomes.

More information about ENA's webin-cli can be found [in the ENA docs](<https://ena-docs.readthedocs.io/en/latest/submit/general-guide/webin-cli.html>).

We recommend using the [**webin_cli_handler**](https://github.com/EBI-Metagenomics/mgnify-pipelines-toolkit/blob/dev/mgnify_pipelines_toolkit/ena/webin_cli_handler.py) script, which comes pre-installed with the `mgnify-pipelines-toolkit` dependency.

> [!NOTE]
>
> First, validate your submission with `--mode validate`. \
> Second, upload to ENA's TEST server using the `--test` flag and `--mode submit` (make sure you have already validated your run in the _Generate files for upload_ step). \
> Finally, upload to ENA's REAL server using `--mode submit` without `--test`.

Run live execution:

```bash
webin_cli_handler \
  --manifest *.manifest \
  --context genome \
  --mode submit \
  [--test]
```
If you do not have **ena-webin-cli** installed, add the `--download-webin-cli` flag and the tool will be downloaded automatically. As noted in the [official repo](https://github.com/enasequence/webin-cli), webin-cli requires a recent Java version. \
If you want to use a local webin-cli `.jar` (for example, one downloaded after the pip installation), provide its path with `--webin-cli-jar`.
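When a folder contains many manifests, the calls can be scripted. The sketch below assumes one `webin_cli_handler` invocation per manifest file and uses only the flags shown in this section; the helper names `build_upload_command` and `upload_all` are hypothetical. By default it validates against the test server and only prints the commands.

```python
import subprocess
from pathlib import Path


def build_upload_command(manifest, mode="validate", test=True):
    """Build one webin_cli_handler call for a single manifest file."""
    if mode not in ("submit", "validate"):
        raise ValueError(f"mode must be 'submit' or 'validate', got {mode!r}")
    command = [
        "webin_cli_handler",
        "--manifest", str(manifest),
        "--context", "genome",
        "--mode", mode,
    ]
    if test:
        command.append("--test")
    return command


def upload_all(manifest_dir, mode="validate", test=True, dry_run=True):
    """Run (or, with dry_run, only print) one command per *.manifest file."""
    commands = [
        build_upload_command(path, mode=mode, test=test)
        for path in sorted(Path(manifest_dir).glob("*.manifest"))
    ]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)  # stops at the first failure
    return commands
```

Calling `upload_all("MAG_upload/manifests_test")` prints the validation commands; pass `dry_run=False` to execute them, and switch to `mode="submit"` and then `test=False` only once the earlier steps succeed.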

Other options:
```bash
webin_cli_handler 

  -h, --help            show this help message and exit
  -m, --manifest MANIFEST
                        Manifest text file containing file and metadata fields
  -c, --context {genome,transcriptome,sequence,polysample,reads,taxrefset}
                        Submission type: genome, transcriptome, sequence, polysample, reads, taxrefset
  --mode {submit,validate}
                        submit or validate
  --test                Specify to use test server instead of live
  --workdir WORKDIR     Path to working directory
  --download-webin-cli  Specify if you do not have ena-webin-cli installed
  --download-webin-cli-directory DOWNLOAD_WEBIN_CLI_DIRECTORY
                        Path to save webin-cli into
  --download-webin-cli-version DOWNLOAD_WEBIN_CLI_VERSION
                        Version of ena-webin-cli to download, default: latest
  --webin-cli-jar WEBIN_CLI_JAR
                        Path to pre-downloaded webin-cli.jar file to execute
  --retries RETRIES     Number of retry attempts (default: 3)
  --retry-delay RETRY_DELAY
                        Initial retry delay in seconds (default: 5)
  --java-heap-size-initial JAVA_HEAP_SIZE_INITIAL
                        Java initial heap size in GB (default: 10)
  --java-heap-size-max JAVA_HEAP_SIZE_MAX
                        Java maximum heap size in GB (default: 10)
```

## Devs section

### Testing submission in normal mode vs strict submission
ENA's test servers reset every day. This means that if you try to register the same set of samples more than once in a single day, the request will fail because the automatically generated aliases would already exist on ENA's servers. To prevent this issue, when you register samples in test mode, the `genome_uploader` appends a timestamp to each generated alias. This ensures that you can repeat your tests multiple times without running into duplicate-alias errors.

However, when debugging or checking the script’s behavior in development mode, you might want the aliases to remain consistent across runs, so that repeated submissions refer to the same sample. To allow this, you can use the `--test-suffix` flag when running `genome_upload`, which lets you define a custom suffix instead of the automatic timestamp. This gives you more control over how sample aliases are generated during testing.
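The alias behaviour described above can be pictured with the following sketch. It is illustrative only: the function name `sample_alias`, the underscore separator, and the timestamp format are assumptions, and the real implementation in `genome_uploader` may differ.

```python
from datetime import datetime, timezone


def sample_alias(genome_name, live=False, test_suffix=None, now=None):
    """Return the alias to register for a genome.

    Live mode keeps the name unchanged. Test mode appends either a custom
    suffix (repeatable across runs) or a timestamp (unique per run).
    """
    if live:
        return genome_name
    if test_suffix is not None:
        return f"{genome_name}_{test_suffix}"
    now = now or datetime.now(timezone.utc)
    return f"{genome_name}_{now:%Y%m%d%H%M%S}"


print(sample_alias("ERR4647712_crispatus", test_suffix="dev"))
print(sample_alias("ERR4647712_crispatus"))  # timestamp differs on every run
```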
