Metadata-Version: 2.4
Name: cellitac
Version: 1.0.6
Summary: Cell type Identification using Transcription factor Analysis and Chromatin accessibility
Author-email: "Rana H. Abu-Zeid" <ranahamed2111@gmail.com>, Olaitan Awe <laitanawe@gmail.com>, Syrus Semawule <semawulesyrus@gmail.com>, Emmanuel Aroma <emmatitusaroma@gmail.com>, Toheeb Jumah <jumahtoheeb@gmail.com>, Derek Reiman <dreiman@ttic.edu>
License: MIT
Project-URL: Homepage, https://github.com/omicscodeathon/cellitac/
Keywords: single-cell,scATAC-seq,scRNA-seq,multiome,cell-type-identification,transcription-factor,chromatin-accessibility,machine-learning
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: numpy>=1.24
Requires-Dist: pandas>=2.0
Requires-Dist: openpyxl>=3.1
Requires-Dist: rpy2>=3.5
Requires-Dist: scikit-learn>=1.3
Requires-Dist: xgboost>=2.0
Requires-Dist: imbalanced-learn>=0.11
Requires-Dist: sklearn-compat>=0.1.5
Requires-Dist: matplotlib>=3.7
Requires-Dist: seaborn>=0.12
Requires-Dist: plotly>=5.18
Requires-Dist: networkx>=3.1
Provides-Extra: dev
Requires-Dist: pytest>=7.0; extra == "dev"
Requires-Dist: pytest-cov; extra == "dev"
Requires-Dist: black; extra == "dev"
Requires-Dist: ruff; extra == "dev"
Dynamic: license-file

# cellitac

**Cell type Identification using Transcription factor Analysis and Chromatin accessibility**

A pipeline for processing single-cell ATAC + RNA multiome data and classifying cell types with machine learning.

---

## What It Does

| Stage | Steps | Tools |
|-------|-------|-------|
| **Preprocessing** | RNA QC → normalization → cell-type annotation | Seurat + SingleR (R via rpy2) |
| **Preprocessing** | ATAC QC → TF-IDF → LSI | Signac (R via rpy2) |
| **Preprocessing** | RNA + ATAC integration → ML-ready CSVs | Pure Python |
| **ML** | Imbalance analysis → SMOTE → feature selection | scikit-learn, imbalanced-learn |
| **ML** | RF + XGBoost + SVM training & evaluation | scikit-learn, xgboost |
| **ML** | 19 plots + JSON report + XLSX | matplotlib, seaborn, networkx |

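The SMOTE step in the table balances rare cell types before training. The table attributes that stage to scikit-learn and imbalanced-learn; the pure-Python sketch below is not cellitac's code and only illustrates the idea behind SMOTE: each synthetic sample is an interpolation between a minority-class sample and one of its nearest same-class neighbours. The function name `smote_like_oversample` and its parameters are hypothetical.

```python
import math
import random
from collections import Counter


def smote_like_oversample(X, y, k=3, seed=0):
    """Oversample every minority class up to the majority-class size.

    Illustrative only: each synthetic point is a random interpolation between
    a minority sample and one of its k nearest same-class neighbours.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)

    for label, n in counts.items():
        members = [x for x, lab in zip(X, y) if lab == label]
        if len(members) < 2:
            continue  # interpolation needs at least two samples of the class
        for _ in range(target - n):
            base = rng.choice(members)
            # k nearest same-class neighbours of `base`, excluding itself
            neighbours = sorted(
                (m for m in members if m is not base),
                key=lambda m: math.dist(base, m),
            )[:k]
            other = rng.choice(neighbours)
            t = rng.random()  # position along the segment base -> other
            X_out.append([b + t * (o - b) for b, o in zip(base, other)])
            y_out.append(label)
    return X_out, y_out


# Toy data: four cells of one type, two of another (labels are made up)
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0], [5.2, 5.1]]
y = ["T cell"] * 4 + ["B cell"] * 2
X_bal, y_bal = smote_like_oversample(X, y)
print(Counter(y_bal))
```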
---

> ⚠️ **Note:** cellitac was developed and tested on peripheral blood mononuclear cell (PBMC) multiome data. Performance on other cell types or tissues may vary.

## Requirements

Before installing cellitac, you need:

- Linux or macOS (Ubuntu 20.04+ recommended)
- Python 3.9 – 3.12 (not 3.13+)
- Conda / Miniconda ([download here](https://docs.conda.io/en/latest/miniconda.html))
- ~5 GB free disk space

---

## Installation

### Step 1 — Create a Conda environment

```bash
conda create -n cellitac python=3.11 -y
conda activate cellitac
```

### Step 2 — Install R and core R libraries via conda

```bash
conda install -c conda-forge r-base=4.4.3 -y

conda install -c conda-forge -c bioconda \
  r-matrix r-hdf5r rpy2 \
  bioconductor-summarizedexperiment \
  bioconductor-singlecellexperiment \
  bioconductor-genomicranges \
  bioconductor-delayedarray \
  bioconductor-biocsingular \
  bioconductor-biocneighbors \
  bioconductor-genomicalignments \
  bioconductor-genomicfeatures \
  bioconductor-rtracklayer \
  r-seurat \
  bioconductor-celldex \
  bioconductor-biovizbase -y
```

### Step 3 — Install remaining R packages (takes 10–30 min)

```bash
Rscript -e "install.packages('BiocManager', repos='https://cran.r-project.org')"

Rscript -e "BiocManager::install(c(
  'Seurat', 'Signac', 'SingleR', 'celldex',
  'EnsDb.Hsapiens.v75', 'biovizBase', 'data.table'
), ask=FALSE)"
```

### Step 4 — Install cellitac

```bash
pip install cellitac
```

### Step 5 — Verify installation

```bash
cellitac --help
```

If you see the help message, you are ready to go ✅

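If `cellitac --help` fails, a quick environment check can narrow down the cause. The following is a minimal sketch using only the standard library; the helper names `python_version_supported` and `preflight` are hypothetical, and the version range is copied from the Requirements section. It checks only the Python version and whether `Rscript` and `cellitac` are on `PATH`, not the R packages.

```python
import shutil
import sys

# Range taken from the Requirements section: Python 3.9 to 3.12.
MIN_PY, MAX_PY = (3, 9), (3, 12)


def python_version_supported(version=None):
    """Return True if (major, minor) falls inside the documented range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_PY <= major_minor <= MAX_PY


def preflight():
    """Collect readable problems; an empty list means nothing obvious is wrong."""
    problems = []
    if not python_version_supported():
        problems.append("Python %d.%d is outside 3.9-3.12" % sys.version_info[:2])
    for tool in ("Rscript", "cellitac"):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")
    return problems


# Run inside the activated conda environment
for line in preflight() or ["Environment looks usable."]:
    print(line)
```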
---

## Quick Start

### Download test data (PBMC 3k cells, ~560 MB)

```bash
mkdir -p ~/data && cd ~/data

wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_atac_peaks.bed
wget https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k/pbmc_granulocyte_sorted_3k_per_barcode_metrics.csv
```
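`wget` is not always installed (a stock macOS, for example, does not include it). As an alternative, the same five files can be fetched with Python's standard library. This is a sketch under the assumption that the URLs above stay valid; `build_urls` and `download_all` are hypothetical helper names, not part of cellitac.

```python
import urllib.request
from pathlib import Path

BASE_URL = "https://cf.10xgenomics.com/samples/cell-arc/2.0.0/pbmc_granulocyte_sorted_3k"
PREFIX = "pbmc_granulocyte_sorted_3k"
SUFFIXES = [
    "filtered_feature_bc_matrix.h5",
    "atac_fragments.tsv.gz",
    "atac_fragments.tsv.gz.tbi",
    "atac_peaks.bed",
    "per_barcode_metrics.csv",
]


def build_urls(base_url=BASE_URL, prefix=PREFIX, suffixes=SUFFIXES):
    """Assemble the download URLs listed in the wget commands."""
    return [f"{base_url}/{prefix}_{suffix}" for suffix in suffixes]


def download_all(dest_dir, urls=None):
    """Download each URL into dest_dir, skipping files that already exist."""
    dest = Path(dest_dir).expanduser()
    dest.mkdir(parents=True, exist_ok=True)
    for url in urls or build_urls():
        target = dest / url.rsplit("/", 1)[-1]
        if target.exists():
            continue  # note: a partially downloaded file is also skipped
        urllib.request.urlretrieve(url, target)
    return dest


# Uncomment to fetch the test data (~560 MB) into ~/data:
# download_all("~/data")
```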

### Run the pipeline

```bash
conda activate cellitac
cellitac --input ~/data --output ~/results
```

---

## Full Dataset (PBMC 10k)

```bash
mkdir -p ~/data && cd ~/data

wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_filtered_feature_bc_matrix.h5
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_per_barcode_metrics.csv
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_fragments.tsv.gz
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_fragments.tsv.gz.tbi
wget https://cf.10xgenomics.com/samples/cell-arc/1.0.0/pbmc_unsorted_10k/pbmc_unsorted_10k_atac_peaks.bed
```

> **Note:** cellitac auto-detects file names — your files do not need to follow the 10x naming convention.

---

## Usage

### Command Line

```bash
# Full pipeline (preprocessing + ML)
cellitac --input ~/data --output my_results

# Preprocessing only
cellitac-preprocess --input ~/data --output my_results

# ML only (if preprocessing already done)
cellitac-model --data my_results/python_ready_data --output my_results/ml
```

### Python API

```python
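from pathlib import Path

from cellitac import run_full_pipeline, run_preprocessing, run_model

# Python does not expand "~" the way a shell does, so resolve it up front
data_dir = str(Path("~/data").expanduser())

# Full pipeline
run_full_pipeline(input_dir=data_dir, output_dir="my_results")

# Preprocessing only
run_preprocessing(input_dir=data_dir, output_dir_python="python_ready_data")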

# ML only
run_model(data_dir="python_ready_data", output_dir="ml_results")

# Use the ML class directly
from cellitac.mainModel import scATACMLPipeline
pipeline = scATACMLPipeline(data_dir="python_ready_data", output_dir="ml_results")
pipeline.run_complete_pipeline()
```

---

## Input Files

| File | Extension | Required |
|------|-----------|----------|
| Feature-barcode matrix | `.h5` | ✅ Yes |
| ATAC fragments | `.tsv.gz` | ✅ Yes |
| Fragments index | `.tsv.gz.tbi` | ✅ Yes |
| Peaks BED file | `.bed` | ✅ Yes |
| Per-barcode QC metrics | `.csv` | ⭕ Optional |

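Before starting a long run, it can help to confirm that the input directory holds one file per required extension. The sketch below mirrors the table above; it is not how cellitac itself detects files, and the names `classify`, `scan_inputs`, and `missing_required` are hypothetical.

```python
from pathlib import Path

# Extensions from the Input Files table; True marks a required file.
EXPECTED = {
    ".h5": True,
    ".tsv.gz": True,
    ".tsv.gz.tbi": True,
    ".bed": True,
    ".csv": False,
}


def classify(names):
    """Map each expected extension to the file names that end with it."""
    # str.endswith keeps ".tsv.gz" from also matching "x.tsv.gz.tbi"
    return {ext: sorted(n for n in names if n.endswith(ext)) for ext in EXPECTED}


def scan_inputs(input_dir):
    """Classify the regular files found directly inside input_dir."""
    folder = Path(input_dir).expanduser()
    return classify(p.name for p in folder.iterdir() if p.is_file())


def missing_required(found):
    """List required extensions for which no file was found."""
    return [ext for ext, required in EXPECTED.items() if required and not found[ext]]


# Example with made-up file names: the fragments index is absent
found = classify(["sample.h5", "frags.tsv.gz", "peaks.bed"])
print(missing_required(found))
```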
---

## Output Files

| File | Description |
|------|-------------|
| `ml_pipeline_report.json` | Full JSON report |
| `model_performance_summary.csv` | Accuracy / F1 / AUC per model |
| `detailed_model_results.xlsx` | Per-class metrics, CV results |
| `model_performance_comparison.png` | Bar chart comparison |
| `confusion_matrices.png` | Confusion matrices |
| `class_distribution_analysis.png` | Cell type distribution |
| `class_balancing_comparison.png` | Before/after SMOTE |
| `feature_importance.png` | RF + XGBoost top 20 features |
| `simple_feature_heatmap.png` | Feature importance heatmap |
| `overfitting_analysis.png` | CV train vs validation |
| `learning_curves.png` | Learning curves per model |
| `performance_radar.png` | Radar chart |
| `feature_distributions.png` | Violin plots |
| `class_separation_pca.png` | PCA scatter |
| `basic_tf_network.png` | Feature–cell-type network |

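The CSV and JSON outputs can be inspected without extra dependencies. The sketch below assumes only the two file names from the table; the column names and report keys are not documented here, so the code prints whatever it finds. `read_summary` and `load_results` are hypothetical helpers, and `output_dir` should point at whichever directory holds these files.

```python
import csv
import json
from pathlib import Path


def read_summary(lines):
    """Parse CSV rows from any iterable of lines (an open file works)."""
    return list(csv.DictReader(lines))


def load_results(output_dir):
    """Return (summary rows, report dict) read from the ML output directory."""
    out = Path(output_dir).expanduser()
    with open(out / "model_performance_summary.csv", newline="") as fh:
        summary = read_summary(fh)
    with open(out / "ml_pipeline_report.json") as fh:
        report = json.load(fh)
    return summary, report


def show(summary, report):
    """Print every summary row and the top-level keys of the report."""
    for row in summary:
        print(", ".join(f"{key}={value}" for key, value in row.items()))
    print("report keys:", sorted(report))


# Usage, once a run has finished (the path is an assumption):
# show(*load_results("my_results/ml"))
```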
---

## Package Structure

```
cellitac/
├── src/cellitac/
│   ├── __init__.py          # Public API
│   ├── config.py            # Parameters (paths, QC thresholds, ML hyperparams)
│   ├── pipeline.py          # run_preprocessing, run_model, run_full_pipeline
│   ├── preprocessing.py     # R preprocessing via rpy2
│   ├── mainModel.py         # scATACMLPipeline class (19-step ML pipeline)
│   ├── cli.py               # cellitac / cellitac-preprocess / cellitac-model
│   └── rscripts/
│       ├── team1_rna.R      # Seurat + SingleR
│       └── team2_atac.R     # Signac
├── tests/
│   └── test_model.py
├── pyproject.toml
└── README.md
```

---

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `conda activate cellitac` not working | Run `conda init` then restart terminal |
| R packages fail to install | Install the conda packages (Step 2) before running BiocManager (Step 3) |
| `hdf5r` error | Run `conda install -c conda-forge hdf5 r-hdf5r -y` |
| `peak_region_fragments not found` | Normal for some datasets — pipeline continues automatically |
| `slot` deprecated error | Make sure you have the latest cellitac version: `pip install --upgrade cellitac` |

---

## Tests

```bash
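# Run from a clone of the repository, which contains tests/
pip install "cellitac[dev]"   # quotes stop shells such as zsh from globbing the brackets
pytest tests/ -v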
```

---

## Contributors

1. 📧 **Rana H. Abu-Zeid**: ranahamed2111@gmail.com
2. 📧 **Syrus Semawule**: semawulesyrus@gmail.com
3. 📧 **Emmanuel Aroma**: emmatitusaroma@gmail.com
4. 📧 **Toheeb Jumah**: jumahtoheeb@gmail.com
5. 📧 **Derek Reiman, Ph.D.**: dreiman@ttic.edu
6. 📧 **Olaitan I. Awe, Ph.D.**: laitanawe@gmail.com

---

## License

MIT
