Metadata-Version: 2.4
Name: ecolityper
Version: 1.1.1
Summary: Comprehensive E. coli Typing Pipeline: MLST, Serotyping, CH Typing, Phylogrouping, AMR, and Virulence Analysis
Home-page: https://github.com/bbeckley-hub/EcoliTyper
Author: Brown Beckley
Author-email: brownbeckley94@gmail.com
Project-URL: Bug Reports, https://github.com/bbeckley-hub/EcoliTyper/issues
Project-URL: Source, https://github.com/bbeckley-hub/EcoliTyper
Project-URL: Documentation, https://github.com/bbeckley-hub/EcoliTyper
Keywords: bioinformatics,ecoli,typing,mlst,serotyping,amr,virulence,genomics
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: pandas>=1.5.0
Requires-Dist: biopython>=1.80
Requires-Dist: psutil>=5.9.0
Requires-Dist: requests>=2.28.0
Requires-Dist: tqdm>=4.64.0
Requires-Dist: click>=8.0.0
Requires-Dist: tabulate>=0.9.0
Requires-Dist: ezclermont>=0.7.0
Requires-Dist: cgecore>=1.5.6
Requires-Dist: beautifulsoup4>=4.11.0
Requires-Dist: lxml>=4.9.0
Requires-Dist: matplotlib>=3.5.0
Requires-Dist: seaborn>=0.12.0
Requires-Dist: scipy>=1.10.1
Provides-Extra: full
Requires-Dist: plotly>=5.10.0; extra == "full"
Requires-Dist: scipy>=1.9.0; extra == "full"
Provides-Extra: visualization
Requires-Dist: plotly>=5.10.0; extra == "visualization"
Requires-Dist: scipy>=1.9.0; extra == "visualization"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license-file
Dynamic: project-url
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

<p align="center">
  <img src="https://raw.githubusercontent.com/bbeckley-hub/ecoliTyper/main/ecolityper_banner.png" alt="ecoliTyper Banner" width="100%">
</p>

<div align="center">

### 🧬 A species-optimized computational pipeline for comprehensive genotyping and surveillance of **_Escherichia coli_**

**Complete *E. coli* genomic analysis in minutes — not hours**

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![Platform](https://img.shields.io/badge/Platform-Linux%20%7C%20macOS%20%7C%20Windows-lightgrey.svg)](https://github.com/bbeckley-hub/EcoliTyper)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.17761775.svg)](https://doi.org/10.5281/zenodo.17761775)
[![GitHub stars](https://img.shields.io/github/stars/bbeckley-hub/EcoliTyper)](https://github.com/bbeckley-hub/EcoliTyper/stargazers)

[![Docker Pulls](https://img.shields.io/docker/pulls/bbeckleyhub/ecolityper)](https://hub.docker.com/r/bbeckleyhub/ecolityper)
[![Docker Image Size](https://img.shields.io/docker/image-size/bbeckleyhub/ecolityper/latest)](https://hub.docker.com/r/bbeckleyhub/ecolityper)
[![Docker Version](https://img.shields.io/docker/v/bbeckleyhub/ecolityper?sort=semver)](https://hub.docker.com/r/bbeckleyhub/ecolityper)

[![Conda](https://img.shields.io/badge/conda-✓-green.svg)](https://docs.conda.io/en/latest/)
[![MIT License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
[![GitHub Issues](https://img.shields.io/github/issues/bbeckley-hub/EcoliTyper)](https://github.com/bbeckley-hub/EcoliTyper/issues)
![Latest Release Date](https://anaconda.org/bbeckley-hub/ecolityper/badges/latest_release_date.svg)


[![Contributions Welcome](https://img.shields.io/badge/contributions-welcome-brightgreen.svg)](#)
[![LinkedIn](https://img.shields.io/badge/LinkedIn-Profile-0A66C2?style=flat&logo=linkedin&logoColor=white)](https://www.linkedin.com/in/brown-beckley-190315319)
[![Stage](https://img.shields.io/badge/status-active-brightgreen)](#)


[![Sample Report](https://img.shields.io/badge/📊-View_Sample_Report-blue)](https://htmlpreview.github.io/?https://bbeckley-hub.github.io/EcoliTyper/#summary)
![Profile Views](https://komarev.com/ghpvc/?username=bbeckley-hub&label=Profile%20Views&color=0e75b6&style=flat)
[![Google Scholar](https://img.shields.io/badge/Google%20Scholar-Profile-4285F4?style=flat&logo=googlescholar&logoColor=white)](https://scholar.google.com/citations?user=CYNOsqIAAAAJ&hl=en)



![GitHub stats](https://github-readme-stats.vercel.app/api?username=bbeckley-hub&show_icons=true&theme=radical)
![Top Langs](https://github-readme-stats.vercel.app/api/top-langs/?username=bbeckley-hub&layout=compact&theme=radical)
[![GitHub Streak](https://streak-stats.demolab.com?user=bbeckley-hub&theme=radical&date_format=j%20M%5B%20Y%5D)](https://git.io/streak-stats)



**Perfect for clinical microbiology, outbreak investigations, and genomic research.**

</div>

---

## 📋 **Table of Contents**

- [🌟 Overview](#-overview)
- [✨ Core Features](#-core-features)
- [🛠️ Installation](#️-installation)
- [🎯 Usage Examples](#-usage-examples)
- [📊 Output Structure](#-output-structure)
- [🎨 Interactive Report Features](#-interactive-report-features)
- [🔗 Integrated External Tools & Dependencies](#-integrated-external-tools--dependencies)
- [🤖 AI Integration Guide](#-ai-integration-guide)
- [🌍 EcoliDB Lineage Database](#-ecolidb-lineage-database-ecolityper)
- [⚡ Performance Benchmarks](#-performance-benchmarks)
- [🆚 Competitive Comparison](#-competitive-comparison)
- [📚 Citation](#-citation)
- [❓ Frequently Asked Questions](#-frequently-asked-questions)
- [🤝 Contributing](#-contributing)
- [🐛 Issue Reporting](#-issue-reporting)
- [⚠️ Limitations & Considerations](#️-limitations--considerations)
- [📜 License & Third-Party Components](#-license--third-party-components)
- [👥 Authors & Affiliations](#-authors--affiliations)
- [🙏 Acknowledgements](#-acknowledgements)
- [🔮 Future Development Roadmap](#-future-development-roadmap)
- [📞 Support & Community](#-support--community)

---

## 🌟 **Overview**

**EcoliTyper** is a revolutionary bioinformatics pipeline designed to eliminate workflow fragmentation in *E. coli* genomic surveillance. By integrating **seven core genotyping analyses** into a single automated workflow, EcoliTyper transforms disconnected genomic data into coherent biological narratives with actionable public health intelligence.

> *"From fragmented analysis to integrated insight in one command"*

### 🚀 **The EcoliTyper Advantage**

| Traditional Workflow 😫 | EcoliTyper Solution 🎉 |
|------------------------|-----------------------|
| 7+ independent tools required | **Single unified pipeline** |
| Manual data integration & synthesis | **Automated cross-genome pattern discovery** |
| Hours of manual curation | **Intelligent risk assessment & alerting** |
| Disconnected epidemiological context | **Integrated lineage database of high-risk clones** |
| Multiple output formats to reconcile | **Consolidated HTML report + structured data (TSV/JSON)** |
| Complex installation & dependencies | **Self-contained Conda package** |

**Key Achievement:** Processes 30 *E. coli* genomes in **~41 minutes** on 16 CPU cores with **perfect concordance** against reference tools.

---

## ✨ **Core Features**

### 🧩 **Comprehensive *E. coli* Typing Suite**
- **🧬 Multi-Locus Sequence Typing (MLST)** – Achtman scheme with PubMLST database
- **🔍 In silico Serotyping** – O and H antigen determination via SerotypeFinder (≥90% coverage/identity)
- **🎯 CH Typing** – High-resolution *fumC/fimH* typing for fine-scale discrimination
- **🌳 Clermont Phylogrouping** – Evolutionary context with 2013 scheme (8 phylogroups)
- **💊 Antimicrobial Resistance Profiling** – Dual screening via ABRicate (9 databases) & NCBI-AMRFinderPlus
- **🦠 Virulence Factor Detection** – Comprehensive pathogenicity assessment
- **📊 Plasmid Replicon Typing** – Mobile genetic element characterization
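
The ≥90% coverage/identity cut-off quoted for serotyping can be pictured as a simple filter over candidate hits. The sketch below is illustrative only: the `Hit` class, its field names, and the example values are assumptions, not the data model used by SerotypeFinder or EcoliTyper.

```python
from dataclasses import dataclass
from typing import List

MIN_IDENTITY = 90.0  # percent identity threshold quoted above
MIN_COVERAGE = 90.0  # percent coverage threshold quoted above


@dataclass
class Hit:
    """Hypothetical container for one antigen-gene match."""
    gene: str
    identity: float  # percent
    coverage: float  # percent


def filter_hits(hits: List[Hit]) -> List[Hit]:
    """Keep only hits that meet both cut-offs."""
    return [
        h for h in hits
        if h.identity >= MIN_IDENTITY and h.coverage >= MIN_COVERAGE
    ]


# Made-up example values for illustration.
hits = [Hit("wzx", 99.1, 100.0), Hit("fliC", 88.5, 97.0)]
kept = filter_hits(hits)
print([h.gene for h in kept])
```

Here the second hit is dropped because its identity falls below the threshold even though its coverage is high; both conditions must hold.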

### 🧠 **Intelligent Analytics Layer**
- **🔬 Cross-genome pattern discovery** – Automated gene frequency analysis & distribution mapping
- **⚠️ Rule-based clinical risk assessment** – Hierarchical alerting (CARBAPENEMASE > ESBL > COLISTIN-RES)
- **🌍 Integrated lineage database** – Manually curated reference of high-risk clones (ST131, ST1193, etc.)
- **📈 Population-level insights** – Immediate epidemiological overview of resistance cassettes & virulence profiles
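
The hierarchical alerting order listed above (CARBAPENEMASE > ESBL > COLISTIN-RES) can be sketched as an ordered rule list in which the first matching tier wins. This is a minimal illustration, not EcoliTyper's rule set: the gene-name prefixes, `ALERT_RULES`, and `top_alert` are assumptions chosen for the example, and a real classifier would need a curated allele list.

```python
from typing import Iterable, Optional

# Illustrative marker prefixes only; not EcoliTyper's actual rules.
# Order encodes priority: earlier tiers outrank later ones.
ALERT_RULES = [
    ("CARBAPENEMASE", ("blaKPC", "blaNDM", "blaOXA-48", "blaVIM", "blaIMP")),
    ("ESBL", ("blaCTX-M",)),
    ("COLISTIN-RES", ("mcr-",)),
]


def top_alert(genes: Iterable[str]) -> Optional[str]:
    """Return the highest-priority alert triggered by any gene, or None."""
    gene_list = list(genes)
    for label, prefixes in ALERT_RULES:
        # str.startswith accepts a tuple of prefixes.
        if any(gene.startswith(prefixes) for gene in gene_list):
            return label
    return None


print(top_alert(["blaCTX-M-15", "blaNDM-5"]))
print(top_alert(["mcr-1.1"]))
print(top_alert(["tet(A)"]))
```

A genome carrying both an ESBL gene and a carbapenemase gene is reported under the carbapenemase tier only, which is the point of a hierarchy.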

### ⚡ **Performance Optimized Architecture**
- **🚀 Hybrid parallel execution** – Inter-module & intra-module parallelization
- **🎛️ Dynamic resource allocation** – Automatic scaling with genome complexity
- **⚖️ Memory-aware processing** – Strategic sequential execution for resource-intensive operations
- **🔄 Robust error handling** – Graceful recovery with checkpointing & automated cleanup

---

## 🛠️ **Installation**

### Quick Install (Recommended)
```bash
# Create and activate environment
conda create -n ecolityper -c conda-forge -c bioconda -c bbeckley-hub ecolityper -y
conda activate ecolityper
```

### 🐳 Docker Installation (Alternative)

If you prefer a containerized environment or cannot install Conda, use the Docker image. It bundles all dependencies and pre‑configured databases, so the complete *E. coli* typing pipeline runs with Docker as the only prerequisite.

---

## 🚀 Quick Start

### Pull the image

```bash
docker pull bbeckleyhub/ecolityper:latest
```

### Run on a single FASTA file

```bash
docker run --rm -v $(pwd):/data bbeckleyhub/ecolityper:latest -i "/data/genome.fna" -o /data/output
```

After the run, output files are owned by `root` on your host. To reclaim ownership:

```bash
sudo chown -R $USER:$USER ./output
```

### Run on all FASTA files in the current directory

```bash
docker run --rm -v $(pwd):/data bbeckleyhub/ecolityper:latest -i "/data/*.fna" -o /data/output
```

---

## 📖 Detailed Usage

### Basic syntax

```bash
docker run --rm -v $(pwd):/data bbeckleyhub/ecolityper:latest [ECOLITYPER_OPTIONS]
```

- `--rm` : remove container after exit
- `-v $(pwd):/data` : mount current directory to `/data` inside container
- Input files must be under `/data` (e.g., `/data/*.fna`)
- Output directory must also be under `/data` (e.g., `/data/output`)

### All EcoliTyper options work

```bash
docker run --rm -v $(pwd):/data bbeckleyhub/ecolityper:latest \
  -i "/data/*.fna" -o /data/output \
  --threads 8 --skip-visualization
```

### Using custom threads

```bash
docker run --rm -v $(pwd):/data bbeckleyhub/ecolityper:latest \
  -i "/data/*.fna" -o /data/output -t 16
```

---

## 🔧 Handling File Permissions (The “Padlock” Issue)

By default, Docker runs as `root` inside the container. Any files written to your mounted directory will be owned by `root:root`.  
You have three options:

### 1. Change ownership after the run (easiest)

```bash
sudo chown -R $USER:$USER ./output
```

### 2. Run with your host user ID (requires a small code fix – coming soon)

Currently not fully supported because EcoliTyper needs to write to its own installation directory. A future update will fix this.

### 3. Use Singularity (recommended for HPC, no `sudo` needed)

See the [Singularity section](#️-singularity-for-hpc-no-sudo-correct-ownership) below.

---

## 🧪 Testing Your Docker Setup

### Check help message

```bash
docker run --rm bbeckleyhub/ecolityper:latest -h
```

### Verify ABRicate databases are installed

```bash
docker run --rm --entrypoint /bin/bash bbeckleyhub/ecolityper:latest -c "abricate --list | head -5"
```

Expected output: list of databases (ncbi, card, vfdb, etc.)

---

## 🖥️ Singularity for HPC (no `sudo`, correct ownership)

On HPC clusters that support [Singularity/Apptainer](https://sylabs.io/singularity/), you can run EcoliTyper **without `sudo`** and output files will be owned by your user automatically.

> **Important:** EcoliTyper writes temporary files inside its own installation directory (e.g., `/opt/ecolityper/...`). Singularity mounts containers as read‑only by default, so you **must** add the `--writable-tmpfs` flag to allow these writes. The flag creates an ephemeral, writable overlay in memory – no permanent changes are made to the container.

### Option A: Direct pull (if network allows)

```bash
singularity pull ecolityper.sif docker://bbeckleyhub/ecolityper:latest
singularity run --writable-tmpfs -B $(pwd):/data ecolityper.sif -i "/data/*.fna" -o /data/output
```

### Option B: Convert from a local Docker image (when `singularity pull` fails)

If you encounter TLS timeouts or other network errors (common on some HPCs), convert an existing Docker image to a Singularity SIF file on a machine with Docker, then transfer the `.sif` file to the HPC.

**Step 1 – on a machine with Docker (e.g., your laptop):**

```bash
docker pull bbeckleyhub/ecolityper:latest
docker save bbeckleyhub/ecolityper:latest -o ecolityper.tar
singularity build ecolityper.sif docker-archive://ecolityper.tar
```

Now copy `ecolityper.sif` to your HPC home or project directory (e.g., using `scp`).

**Step 2 – on the HPC (no sudo needed):**

```bash
singularity run --writable-tmpfs -B $(pwd):/data ecolityper.sif -i "/data/*.fna" -o /data/output
```

### Explanation of flags

| Flag | Purpose |
|------|---------|
| `--writable-tmpfs` | Creates a temporary writable overlay – **required** for EcoliTyper to write intermediate files to `/opt/...` |
| `-B $(pwd):/data` | Binds your current directory to `/data` inside the container (input files are read from here, output is written here) |
| `-i "/data/*.fna"` | Input pattern – use quotes to prevent shell expansion on the host |
| `-o /data/output` | Output directory (will appear as `./output` on your host) |

### Additional options

You can use any EcoliTyper flag, e.g.:

```bash
singularity run --writable-tmpfs -B $(pwd):/data ecolityper.sif \
    -i "/data/*.fna" -o /data/output --threads 8 --skip-visualization
```

### Verify it works

After a successful run, you will see output indicating each module completed. All result files in `./output` will be owned by **your HPC user** – no `sudo chown` needed.


#### Docker Hub Repository

All releases are available at:  
[https://hub.docker.com/r/bbeckleyhub/ecolityper](https://hub.docker.com/r/bbeckleyhub/ecolityper)

---

### From Source
```bash
git clone https://github.com/bbeckley-hub/EcoliTyper.git
cd EcoliTyper
conda env create -f environment.yml
conda activate ecolityper
pip install -e .
```

### System Requirements
- **Minimum:** 2 CPU cores, 8 GB RAM
- **Recommended:** 8+ CPU cores, 16+ GB RAM for batch processing
- **OS:** Linux, macOS, or Windows (WSL2 recommended for Windows)

---

## 🎯 **Usage Examples**

### Basic Single Genome Analysis
```bash
ecolityper -i genome.fasta -o results_directory/
```

### High-Throughput Batch Processing
```bash
# Process all FASTA files in current directory
ecolityper -i "*.fasta" -o batch_results --threads 8

# Process specific pattern
ecolityper -i "GCF_*.fna" -o surveillance_run --threads 16
```
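
For scripted batch runs, the same command line can be assembled from Python. The sketch below only builds the argument list from flags documented in this README (`-i`, `-o`, `--threads`, `--skip-*`); the `build_command` helper and `SKIPPABLE` set are hypothetical names, and the actual launch is left as a comment because it requires EcoliTyper to be installed.

```python
import shlex
from typing import List, Sequence

# Module names taken from the --skip-* flags listed in this README.
SKIPPABLE = {
    "amrfinder", "abricate", "mlst", "serotyping", "chtyper",
    "phylogrouping", "lineage", "summary", "visualization",
}


def build_command(input_pattern: str, output_dir: str,
                  threads: int = 2, skip: Sequence[str] = ()) -> List[str]:
    """Assemble an ecolityper argument list without running it."""
    unknown = set(skip) - SKIPPABLE
    if unknown:
        raise ValueError(f"Unknown module(s): {sorted(unknown)}")
    cmd = ["ecolityper", "-i", input_pattern, "-o", output_dir,
           "--threads", str(threads)]
    cmd += [f"--skip-{module}" for module in skip]
    return cmd


cmd = build_command("*.fasta", "batch_results", threads=8,
                    skip=["visualization"])
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To launch the run: import subprocess; subprocess.run(cmd, check=True)
```

Passing the pattern as a single list element means it reaches EcoliTyper unexpanded, which mirrors quoting the glob on the shell command line.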

### Customized Analysis Workflows
```bash
# Skip specific modules for faster processing
ecolityper -i "isolates/*.fasta" -o quick_typing --skip-amrfinder --skip-visualization

# Minimal typing only
ecolityper -i sample.fna -o basic_results --skip-lineage --skip-summary
```

### Complete Command Reference
```
usage: ecolityper [-h] -i INPUT -o OUTPUT [-t THREADS] [--skip-amrfinder]
                  [--skip-abricate] [--skip-mlst] [--skip-serotyping]
                  [--skip-chtyper] [--skip-phylogrouping] [--skip-lineage]
                  [--skip-summary] [--skip-visualization]

EcoliTyper: Complete E. coli Typing Pipeline

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input FASTA file(s) - can use glob patterns like "*.fna" or "*.fasta"
  -o OUTPUT, --output OUTPUT
                        Output directory for all results
  -t THREADS, --threads THREADS
                        Number of threads (default: 2)
  --skip-amrfinder      Skip AMRfinderPlus analysis
  --skip-abricate       Skip ABRicate analysis
  --skip-mlst           Skip MLST analysis
  --skip-serotyping     Skip serotyping analysis
  --skip-chtyper        Skip CH typing analysis
  --skip-phylogrouping  Skip phylogrouping analysis
  --skip-lineage        Skip lineage reference generation
  --skip-summary        Skip summary report generation
  --skip-visualization  Skip visualization generation

Examples:
  ecolityper -i genome.fna -o results/
  ecolityper -i "*.fna" -o batch_results --threads 8
  ecolityper -i "*.fasta" -o analysis --threads 16 --skip-lineage
  ecolityper -i "genome*.fa" -o results/ --threads 4

Supported FASTA formats: .fna, .fasta, .fa, .fsa

Analysis Modules:
  • MLST (Multi-Locus Sequence Typing)
  • Serotyping (O and H antigen determination)
  • CH Typing (FumC and FimH typing)
  • Phylogrouping (Clermont algorithm)
  • ABRicate (Resistance/Virulence/Plasmid screening)
  • AMRfinderPlus (NCBI AMR gene detection)
  • Lineage reference database
  • Summary Reports (HTML summary reports)
  • Visualizations (Charts and visualizations)

Output: Comprehensive results for all analyses in organized directories
```
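
Before a large run it can help to check that every input file carries one of the supported extensions listed in the help text. The following pre-flight sketch is an assumption about how one might do that; `supported_fastas` is a hypothetical helper, and matching the suffix case-sensitively is a conservative choice rather than documented EcoliTyper behaviour.

```python
from pathlib import Path
from typing import Iterable, List

# Extensions listed under "Supported FASTA formats" in the help text.
SUPPORTED_SUFFIXES = {".fna", ".fasta", ".fa", ".fsa"}


def supported_fastas(paths: Iterable[str]) -> List[str]:
    """Return the paths whose final suffix is a supported FASTA extension."""
    return [p for p in paths if Path(p).suffix in SUPPORTED_SUFFIXES]


candidates = ["a.fna", "b.gbk", "c.fasta.gz", "d.fa"]
print(supported_fastas(candidates))
```

Compressed files such as `c.fasta.gz` are excluded here because their final suffix is `.gz`, which is not in the supported list.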

---

## 📊 **Output Structure**

```
results_directory/
├── 📄 abricate_results/              # Multi-database screening (CARD, ResFinder, VFDB, etc.)
│   ├── ecoli_*_summary.json         # Consolidated JSON summaries
│   ├── ecoli_*_summary_report.html  # Interactive HTML reports
│   └── per_sample/                  # Individual genome results
├── 🔬 amrfinder_results/             # NCBI AMRFinderPlus outputs
│   ├── ecoli_amrfinder_summary.tsv
│   ├── ecoli_amrfinder_summary_report.html
│   └── per_sample/
├── 🎯 chtyper_results/               # High-resolution CH typing
│   ├── chtyper_results.tsv
│   ├── chtyper_results.html
│   └── per_sample/
├── 🧬 mlst_results/                  # Multi-Locus Sequence Typing
│   ├── mlst_summary.tsv
│   ├── mlst_summary.html
│   └── per_sample/
├── 🌳 phylogrouping_results/         # Clermont phylogrouping
│   ├── phylogrouping_results.tsv
│   ├── phylogrouping_results.html
│   └── per_sample/
├── 🔍 serotyping_results/            # O:H antigen typing
│   ├── serotype_analysis_report.tsv
│   ├── serotype_analysis_report.html
│   └── per_sample/
├── 🌍 lineage_results/               # Epidemiological context
│   └── ecoli_comprehensive_reference.html
├── 📈 summary_results/               # Consolidated reports
│   └── GENIUS_ULTIMATE_REPORTS/
│       ├── genius_ultimate_report.html     # Main interactive report
│       ├── genius_ultimate_report.json
│       ├── amr_genes.csv
│       ├── virulence_genes.csv
│       └── pattern_discovery.csv
└── 🎨 visualization_results/         # Publication-ready figures
    └── ECOLI_VISUALIZATIONS/
        ├── PDF/     # Vector graphics
        ├── PNG/     # Raster images
        ├── SVG/     # Scalable vector graphics
        └── DATA/    # Source data for figures
```
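
Because the summary tables sit at fixed relative paths, downstream scripts can locate and load them directly. The sketch below uses the file names from the tree above; the `find_summary_tables` and `read_tsv` helpers are hypothetical, and it assumes each TSV starts with a header row. The demo writes a throwaway file with invented column names, since the real column layout is not documented here.

```python
import csv
import tempfile
from pathlib import Path
from typing import Dict, List

# Relative paths copied from the output tree above.
SUMMARY_TABLES = {
    "mlst": "mlst_results/mlst_summary.tsv",
    "serotyping": "serotyping_results/serotype_analysis_report.tsv",
    "chtyper": "chtyper_results/chtyper_results.tsv",
    "phylogrouping": "phylogrouping_results/phylogrouping_results.tsv",
    "amrfinder": "amrfinder_results/ecoli_amrfinder_summary.tsv",
}


def find_summary_tables(results_dir: str) -> Dict[str, Path]:
    """Return the summary tables that actually exist under results_dir."""
    root = Path(results_dir)
    return {name: root / rel for name, rel in SUMMARY_TABLES.items()
            if (root / rel).is_file()}


def read_tsv(path: Path) -> List[Dict[str, str]]:
    """Read a tab-separated file with a header row into row dictionaries."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# Demo on a temporary directory; the column names are invented.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "mlst_results"
    demo_dir.mkdir()
    (demo_dir / "mlst_summary.tsv").write_text("sample\tst\nisolate1\t131\n")
    found = find_summary_tables(tmp)
    found_names = sorted(found)
    rows = read_tsv(found["mlst"])

print(found_names, rows)
```

Modules skipped with a `--skip-*` flag simply do not appear in the returned dictionary, so callers can iterate over whatever is present.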

## 📊 Sample Output

See a complete interactive report generated by EcoliTyper:

[![Sample Report](https://img.shields.io/badge/📊-View_Sample_Report-red)](https://htmlpreview.github.io/?https://bbeckley-hub.github.io/EcoliTyper/#summary)

*The report includes AMR and virulence gene tables, filter buttons, combination tables, and FASTA QC metrics.*

---

## 🎨 **Interactive Report Features**

### **Main Dashboard**
- **Sample Overview**: Quick glance at typing results across all genomes
- **Risk Alert Panel**: Automatic flagging of high-priority resistance markers
- **Epidemiological Context**: Lineage information for identified clones

### **Cross-Genome Analysis**
- **Gene Frequency Tables**: Prevalence of AMR/virulence genes across population
- **Pattern Discovery**: Identification of common resistance cassettes
- **Distribution Maps**: Visual representation of gene carriage
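
The idea behind a gene frequency table can be shown in a few lines: count how many genomes carry each gene and divide by the number of genomes. This is a sketch under stated assumptions, not the report's implementation; `gene_frequencies` and the carriage data are made up for illustration.

```python
from collections import Counter
from typing import Dict, Set


def gene_frequencies(carriage: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of genomes carrying each gene."""
    if not carriage:
        return {}
    counts = Counter(gene for genes in carriage.values() for gene in genes)
    total = len(carriage)
    return {gene: count / total for gene, count in counts.items()}


# Made-up carriage data for four genomes.
carriage = {
    "isolate1": {"blaCTX-M-15", "tet(A)"},
    "isolate2": {"blaCTX-M-15"},
    "isolate3": {"tet(A)", "sul1"},
    "isolate4": {"blaCTX-M-15", "sul1"},
}
freqs = gene_frequencies(carriage)
for gene, freq in sorted(freqs.items(), key=lambda item: (-item[1], item[0])):
    print(f"{gene}\t{freq:.0%}")
```

Using a set per genome means a gene found on several contigs of the same genome is still counted once.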

### **Visualization Gallery**
- **Stacked Bar Charts**: MLST, serotype, and phylogroup distributions
- **Violin Plots**: Quantitative metrics distribution
- **Pie Charts**: Phylogroup and serotype proportions
- **Heatmaps**: Gene presence/absence patterns

---

## 🔗 **Integrated External Tools & Dependencies**

EcoliTyper integrates several powerful open-source tools and databases. These are **not bundled directly in this repository**. Instead, they are automatically installed as **dependencies via Conda** (as defined in `environment.yml`). The MIT license that applies to the EcoliTyper pipeline code does not cover these external tools. Each tool is used under the terms of its own license, and we gratefully acknowledge their authors.

| Tool/Database | Purpose | Source | License |
|---------------|---------|--------|---------|
| **MLST** | Multi-locus sequence typing | [tseemann/mlst](https://github.com/tseemann/mlst) | GPL v2 |
| **ABRicate** | Mass screening for resistance/virulence | [tseemann/abricate](https://github.com/tseemann/abricate) | GPL v2 |
| **AMRFinderPlus** | AMR gene detection | [ncbi/amr](https://github.com/ncbi/amr) | Public Domain |
| **SerotypeFinder** | O:H antigen typing | [CGE](https://bitbucket.org/genomicepidemiology/serotypefinder_db/) | Apache 2.0 |
| **CHTyper DB** | *fumC/fimH* typing | [CGE](https://bitbucket.org/genomicepidemiology/chtyper_db/) | Free for research |
| **ezClermont** | Phylogrouping | [https://github.com/nickp60/ezClermont](https://github.com/nickp60/ezClermont) | MIT |

### **AMR & Virulence Databases (via ABRicate)**
| Database | Purpose | License |
|----------|---------|---------|
| **CARD** | Comprehensive antibiotic resistance | Free for research |
| **ResFinder** | Acquired antimicrobial resistance | Free for research |
| **NCBI** | NCBI bacterial AMR reference | Public Domain |
| **ARG-ANNOT** | Antibiotic resistance gene annotation | Free for research |
| **MEGARES** | Comprehensive resistance database | Free for research |
| **VFDB** | Virulence factors | Free for research |
| **EcOH** | *E. coli* O- and H-antigen serotyping genes | Free for research |
| **Ecoli_VF** | *E. coli* virulence factors | Free for research |
| **PlasmidFinder** | Plasmid replicons | Free for research |

---
## 🤖 **AI Integration Guide**

EcoliTyper generates comprehensive HTML reports that are **perfect for AI analysis**. Here's how to use AI tools to get more from your data.

### 🚀 Quick Start
1. **Install any AI browser extension** (ChatGPT, Claude, Gemini)
2. **Open your report**: `genius_ultimate_report.html`
3. **Select text** in any section (AMR Genes, MLST Analysis, Serotype Analysis, CH Type, etc.)
4. **Right-click → Ask AI** with your question

### 💡 Example Questions

**For MLST Analysis:**
- "What is the clinical significance of ST21 vs ST10?"
- "Which ST-serotype-phylogroup-CH type combinations are hypervirulent?"

**For AMR Genes:**
- "Explain the OmpA gene and its importance"
- "Which samples have multiple resistance genes?"
- "What treatment implications do these genes have?"

**For Virulence Factors:**
- "Which samples carry espK?"
- "Are there any high-risk virulence combinations?"

**For Pattern Discovery:**
- "Are there correlations between ST and specific genes?"
- "Identify any concerning patterns in this dataset"

**For Publication & Manuscript Summary:**
- Select the sample overview section and ask AI "Summarize the population overview for my E. coli results"

**Tip for ChatGPT and Claude users:**
- Upload the `genius_ultimate_report.html` report and ask questions about any section.
- For example: "Summarize the Sample Overview section as the first results paragraph for my manuscript."

### 📊 Pro Tips
- **Provide context**: "I'm analyzing *E. coli* genomics data..."
- **Be specific**: Instead of "tell me about this", ask "what does the ST21-O26:H11-B1-fumC4:fimH440 combination indicate?"
- **Ask for interpretations**: "What are the clinical implications of these findings?"
- **Request summaries**: "Summarize the resistance profile of sample XYZ"

### ⚡ Why This Works
EcoliTyper reports are structured with clear tables and organized data that AI can easily understand. Each gene is shown with all genomes that contain it, making pattern analysis straightforward.

> *"AI provides powerful insights but always verify critical findings with domain experts."*

---

## 🌍 **EcoliDB Lineage Database** (EcoliTyper)

### **Overview**
EcoliTyper includes **EcoliDB**, a manually curated comprehensive reference database for rapid *E. coli* lineage contextualization. This database associates sequence types with clinical pathotypes, serotypes, and risk profiles to inform public health analysis.

### **Database Statistics**
- **12 Sequence Types** with detailed epidemiological profiles
- **13 Pathotypes** categorized (Diarrheagenic, Extraintestinal, Hybrid, Animal, Mucosal)
- **13 Serotypes** with clinical associations
- **8 Phylogroups** according to Clermont scheme
- **4 Carbapenemase Types** for resistance profiling
- **79 Scientific References** supporting the data

### **Included High-Risk Clones**
| Sequence Type | Risk Level | Primary Pathotype | Key Features |
|--------------|------------|-------------------|--------------|
| **ST131** | VERY HIGH | UPEC/ExPEC | Global MDR pandemic clone, CTX-M-15, fluoroquinolone resistance |
| **ST1193** | HIGH | UPEC/ExPEC | Emerging fluoroquinolone-resistant, community-associated UTIs |
| **ST95** | VERY HIGH | NMEC/ExPEC | Neonatal meningitis, high virulence, O18:H7 serotype |
| **ST405** | VERY HIGH | ExPEC | Global MDR, carbapenemase producers (OXA-48, NDM) |
| **ST410** | VERY HIGH | ExPEC | Emerging MDR, OXA-181/NDM-5 carbapenemases |
| **ST648** | VERY HIGH | Zoonotic MDR | Pan-drug resistance emerging, significant One Health concern |
| **ST11** | VERY HIGH | EHEC | O157:H7, hemorrhagic colitis, HUS risk |
| **ST10** | LOW-MODERATE | Commensal/Pathogenic | Diverse genetic background for horizontal gene transfer |
| **ST117** | MODERATE | APEC | Avian pathogenic, poultry industry concern |
| **ST69** | HIGH | Hybrid UPEC/EAEC | Uropathogenic/diarrheagenic hybrid |
| **ST73** | HIGH | Classic UPEC | Community-associated UTIs, high virulence |
| **ST88** | HIGH | NMEC/ExPEC | Meningitis-associated, less common than ST95 |

### **Accessing the Lineage Database**
The lineage database is automatically generated during analysis and can be found at:
```
lineage_results/ecoli_comprehensive_reference.html
```

This interactive HTML file provides:
- **Search functionality** by sequence type, serotype, or resistance profile
- **Risk categorization** (VERY HIGH, HIGH, MODERATE, LOW-MODERATE)
- **Geographical distribution** maps
- **Treatment recommendations** based on resistance profiles
- **Key references** for each lineage

---

## ⚡ **Performance Benchmarks**

| Scenario | Genomes | Time | Hardware | Time per Genome |
|----------|---------|------|----------|-----------------|
| Standard Workstation | 30 | 80-150 min | 2 CPU cores, 8 GB RAM | ~2.7-5 min |
| High-Performance Server | 30 | **41 min** | 16 CPU cores, 16 GB RAM | **~1.4 min** |
| Single Genome | 1 | 1-6 min | Variable | - |

### **Validation Accuracy**
- **100% concordance** with standalone reference tools (mlst, SerotypeFinder, ezClermont)
- **Perfect typing** of reference strains (K-12 MG1655, O157:H7, O18ac:H7)
- **Robust performance** across diverse clinical and reference isolates

---

## 🆚 **Competitive Comparison**

| Feature | EcoliTyper | ECTyper | Bactopia | Mykrobe |
|---------|------------|---------|----------|---------|
| **Primary Focus** | *E. coli* integrated genotyping | *E. coli* serotyping | Multi-species generalist | AMR prediction |
| **MLST** | ✅ Achtman scheme | ❌ | ✅ | ❌ |
| **Serotyping** | ✅ O:H (SerotypeFinder) | ✅ | Limited | ❌ |
| **CH Typing** | ✅ *fumC/fimH* | ❌ | ❌ | ❌ |
| **Clermont Phylogrouping** | ✅ 2013 scheme | ❌ | ✅ | ❌ |
| **AMR Profiling** | ✅ ABRicate + AMRFinderPlus | Limited | ✅ AMRFinder | ✅ Core function |
| **Virulence Screening** | ✅ 9 databases | Shiga toxins only | Limited | ❌ |
| **Cross-genome Analysis** | ✅ Automated pattern discovery | ❌ | ❌ | ❌ |
| **Lineage Database** | ✅ Curated high-risk clones | ❌ | ❌ | ❌ |
| **Output Formats** | HTML, TSV, JSON, text | Various | Various | Various |
| **Installation** | ⚡ Single Conda package | Moderate | Complex (Nextflow) | Simple |
| **Typing Speed (30 genomes)** | **41 minutes** | N/A | ~120 minutes | N/A |

**Reference Tools:**
- **Mykrobe:** [https://github.com/Mykrobe-tools/mykrobe](https://github.com/Mykrobe-tools/mykrobe)
- **Bactopia:** [https://github.com/bactopia/bactopia](https://github.com/bactopia/bactopia)
- **ECTyper:** [https://github.com/phac-nml/ecoli_serotyping](https://github.com/phac-nml/ecoli_serotyping)

---

## 📚 **Citation**

If you use EcoliTyper in your research, please cite:

```bibtex
@software{beckley2025ecolityper,
  title = {EcoliTyper: A species-optimized computational pipeline for comprehensive genotyping and surveillance of Escherichia coli},
  author = {Beckley, B. and Amarh, V.},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/bbeckley-hub/EcoliTyper}},
  doi = {10.5281/zenodo.17761775}
}
```

### **Third-Party Tool Citations**
EcoliTyper integrates several third-party tools. Please cite them when using corresponding modules:

**Serotyping & CH Typing**
```bibtex
@article{joensen2015rapid,
  author = {Joensen, K. G. et al.},
  title = {Rapid and easy in silico serotyping of Escherichia coli using whole genome sequencing data},
  journal = {Journal of Clinical Microbiology},
  year = {2015}
}

@article{roer2018chtyper,
  author = {Roer, L. et al.},
  title = {CHTyper, a web tool for subtyping of extraintestinal pathogenic Escherichia coli},
  journal = {Journal of Clinical Microbiology},
  year = {2018}
}
```
**ABRicate & MLST (Torsten Seemann)**
```bibtex
@software{seemann_abricate_2018,
  author = {Seemann, T.},
  title = {ABRicate: Mass screening of contigs for antimicrobial resistance and virulence genes},
  year = {2018},
  publisher = {GitHub},
  url = {https://github.com/tseemann/abricate}
}
@software{seemann_mlst_2018,
  author = {Seemann, T.},
  title = {MLST: Scan contig files against traditional PubMLST typing schemes},
  year = {2018},
  publisher = {GitHub},
  url = {https://github.com/tseemann/mlst}
}

```
**AMR (NCBI)**
```bibtex
@article{feldgarden2019validating,
  author = {Feldgarden, M. et al.},
  title = {Validating the AMRFinder Tool and Resistance Gene Database},
  journal = {Antimicrobial Agents and Chemotherapy},
  year = {2019}
}
```

**Phylogrouping**
```bibtex
@article{waters2020easy,
  author = {Waters, N. R. et al.},
  title = {Easy phylotyping of Escherichia coli via the EzClermont web app},
  journal = {Access Microbiology},
  year = {2020}
}
```

---

## ❓ **Frequently Asked Questions**

### **General Questions**

**Q: What makes EcoliTyper different from other typing tools?**
A: EcoliTyper is specifically optimized for *E. coli* and integrates 7 complementary typing methods into a single pipeline with automated cross-genome pattern discovery and a curated lineage database for epidemiological context.

**Q: Can I use EcoliTyper for other bacterial species?**
A: No, EcoliTyper is specifically optimized for *Escherichia coli*. The algorithms, thresholds, and databases are tailored for this species.

### **Installation & Setup**

**Q: How much disk space is required?**
A: Approximately 5-10 GB for the Conda environment and databases. Additional space is needed for input genomes and output files.

### **Analysis & Results**

**Q: How accurate is EcoliTyper compared to standalone tools?**
A: EcoliTyper shows **100% concordance** with standalone reference tools (mlst, SerotypeFinder, ezClermont) for standard typing methods on validated reference strains.

**Q: What should I do if I find a novel sequence type not in the database?**
A: Please report it as a GitHub issue with supporting references. We actively maintain and expand the lineage database.

---

## 🤝 **Contributing**

We welcome contributions from the community! Here's how you can help:

1. 🍴 Fork the repository
2. 🌿 Create a feature branch (`git checkout -b feature/amazing-feature`)
3. 💾 Commit your changes (`git commit -m 'Add amazing feature'`)
4. 🚀 Push to the branch (`git push origin feature/amazing-feature`)
5. 🔔 Open a Pull Request

**Areas for Contribution:**
- Database expansion and curation
- Additional typing schemes
- Performance optimizations
- Visualization enhancements
- Documentation improvements

---

## 📜 **License & Third-Party Components**

### **EcoliTyper Core Code**
The EcoliTyper pipeline code (the workflow engine, report generation, HTML templates, and Python modules written by the authors) is licensed under the **MIT License** – see the [LICENSE](LICENSE) file for details.

### **Third-Party Tool Licenses**
EcoliTyper executes several external bioinformatics tools, which are installed as Conda dependencies. Each tool is the property of its respective developers and is used under its own license:

| Tool | License |
|------|---------|
| **MLST** (Torsten Seemann) | GPL v2 |
| **ABRicate** (Torsten Seemann) | GPL v2 |
| **AMRFinderPlus** (NCBI) | Public Domain |
| **SerotypeFinder** (CGE) | Apache 2.0 |
| **CH Typing databases** (CGE) | Free for research |
| **ezClermont** | MIT |

By using EcoliTyper, you agree to comply with the licenses of these third-party tools and databases.

---

## 👥 **Authors & Affiliations**

### **Primary Authors**
- **Brown Beckley** – *Creator & Lead Developer*
  Department of Medical Biochemistry, University of Ghana Medical School, Accra, Ghana
  Department of Biochemistry and Biotechnology, Kwame Nkrumah University of Science and Technology, Kumasi, Ghana
  📧 [brownbeckley94@gmail.com](mailto:brownbeckley94@gmail.com)

- **Dr. Vincent Amarh** – *Supervisor & Advisor*
  Department of Medical Biochemistry, University of Ghana Medical School, Accra, Ghana

---

## 🔮 **Future Development Roadmap**

### **Short-term Goals (2025)**
- Regular database updates
- Enhanced visualization capabilities
- Improved documentation and tutorials

### **Medium-term Goals (2026)**
- Integration with raw read analysis pipelines
- Real-time database update mechanisms
- Cloud deployment options (Docker, Singularity)

### **Long-term Vision**
- AI/ML models for predictive analytics
- Web interface for non-command-line users
- Expanded lineage database with global collaborations
- Integration with public health surveillance systems

---

<div align="center">

## **⭐ Star us on GitHub if you find EcoliTyper useful!**

*Transforming fragmented genomic surveillance into integrated public health intelligence* 🧬✨

**"From sequences to surveillance in one command"**

---

**Join the Fight Against Antimicrobial Resistance**

Antimicrobial resistance (AMR) represents one of the most significant global health threats of our time. We invite researchers, clinicians, and public health professionals to collaborate with us in expanding and validating our *E. coli* database, sharing regional epidemiological data, and advancing AMR surveillance.

**Together, we can enhance global AMR monitoring and develop more effective treatment strategies.**

</div>
