Metadata-Version: 2.4
Name: tecap
Version: 0.3.1
Summary: 3' terminal exon capture diagnostics for long-read scRNA-seq
Project-URL: Homepage, https://github.com/FullLengthFanatic/tecap
Project-URL: Issues, https://github.com/FullLengthFanatic/tecap/issues
Author-email: Simone Picelli <simone.picelli@iob.ch>
License: MIT License
        
        Copyright (c) 2026 Simone Picelli
        
        Permission is hereby granted, free of charge, to any person obtaining a copy
        of this software and associated documentation files (the "Software"), to deal
        in the Software without restriction, including without limitation the rights
        to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
        copies of the Software, and to permit persons to whom the Software is
        furnished to do so, subject to the following conditions:
        
        The above copyright notice and this permission notice shall be included in all
        copies or substantial portions of the Software.
        
        THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
        IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
        FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
        AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
        LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
        OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
        SOFTWARE.
License-File: LICENSE
Keywords: APA,ONT,PacBio,internal-priming,long-read,polyA,scRNA-seq
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: MacOS
Classifier: Operating System :: POSIX :: Linux
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Requires-Python: >=3.9
Requires-Dist: intervaltree>=3.1
Requires-Dist: matplotlib>=3.5
Requires-Dist: numpy>=1.22
Requires-Dist: pysam>=0.22
Provides-Extra: dev
Requires-Dist: pytest-cov>=4.0; extra == 'dev'
Requires-Dist: pytest>=7.0; extra == 'dev'
Requires-Dist: ruff>=0.4; extra == 'dev'
Description-Content-Type: text/markdown

# tecap

[![tests](https://github.com/FullLengthFanatic/tecap/actions/workflows/test.yml/badge.svg)](https://github.com/FullLengthFanatic/tecap/actions/workflows/test.yml)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.19762736.svg)](https://doi.org/10.5281/zenodo.19762736)

3' terminal exon capture diagnostics for long-read single-cell RNA-seq.

`tecap` classifies long-read alignments by where their 3' end lands relative to the terminal exon (TE), its UTR, and a polyA site atlas. It decomposes capture outcomes into nine mechanism buckets (successful capture, truncation at a real polyA site, internal priming in the UTR, internal priming in the CDS, 3' ends in the terminal exon of non-coding genes, alternative polyadenylation, upstream-exon mispriming, intronic mispriming, downstream readthrough). It also measures reference base composition downstream of each cleavage site to distinguish classic A-tract internal priming from the moderate-A priming characteristic of saturating-local-concentration oligo-dT chemistries (10x GEM droplets, BD Rhapsody capture beads).

Designed for PacBio Iso-Seq / Kinnex and Oxford Nanopore cDNA BAMs. Direct-RNA sequencing is explicitly unsupported (no RT, no priming artifact to diagnose).

## Mechanisms

Every classified read lands in exactly one of nine buckets, defined by where
its 3' end falls relative to the terminal exon (TE), the TE's UTR / CDS, and
the nearest annotated PolyASite cluster.

| Bucket | What it means | Why it matters |
|---|---|---|
| **Captured** | 3' end in the TE; read covers >=50% of it. | Successful full-length capture of the mRNA 3' end; the goal of any 3'-end protocol. |
| **MechA-correct** | 3' end in the TE 3' UTR within ±25 bp of an annotated polyA cluster, but read covers <50% of TE. | Truncated transcript that nonetheless terminates at a real polyA site; common with degraded input or short-fragment library prep. |
| **MechA-internalUTR** | 3' end in the TE 3' UTR but not at any annotated polyA cluster. | Internal oligo-dT priming on an A-rich stretch in the UTR; classic mispriming signature. |
| **IP-TE-CDS** | 3' end inside the terminal exon's CDS portion. | Internal priming on the coding portion of the TE; strong mispriming signal. |
| **MechA-noCDS** | 3' end inside the TE of a non-coding gene. | Reported separately so the coding-gene buckets stay clean. |
| **MechB-APA** | 3' end upstream of the TE at an annotated polyA cluster on an upstream exon. | Alternative polyadenylation isoform; biological, not a mispriming artifact. |
| **MechB-exon** | 3' end on an upstream exon, no nearby polyA cluster. | Internal priming on an upstream exon. |
| **MechB-aspecific** | 3' end upstream of the TE in an intron or gene flank. | Pre-mRNA priming or off-target alignment. |
| **MechC** | 3' end downstream of the TE end. | Read-through, unannotated 3' UTR extension, or alignment artifact. |

The basecomp subcommand also splits Captured / MechA / MechB-APA reads by
whether their cluster carries a canonical AAUAAA-like hexamer (PAS+/-).

Run `tecap explain` to print these definitions on the terminal, or
`tecap explain --mechanism MechA-correct --format json` for a single entry.
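
As a rough illustration of how the table above partitions reads, the decision order can be sketched in Python. This is not `tecap`'s implementation: the `ReadEnd` record, its field names, and the region labels are hypothetical, and the precedence of the checks (non-coding genes first, then the 50% coverage test) is an assumption, as is reusing the ±25 bp tolerance for upstream-exon polyA clusters.

```python
from dataclasses import dataclass
from typing import Optional

MIN_TE_COVERAGE = 0.50  # "read covers >=50% of the TE" (from the table)
POLYA_WINDOW_BP = 25    # +/-25 bp tolerance around a polyA cluster (from the table)


@dataclass
class ReadEnd:
    """Hypothetical pre-computed features of one read's 3' end."""
    # One of: "te_utr", "te_cds", "te_noncoding",
    # "upstream_exon", "intron_or_flank", "downstream"
    region: str
    te_coverage: float = 0.0             # fraction of the TE covered by the read
    dist_to_polya: Optional[int] = None  # signed bp to nearest cluster, None if absent


def near_polya(read: ReadEnd) -> bool:
    """True when the 3' end lies within the tolerance of an annotated cluster."""
    return read.dist_to_polya is not None and abs(read.dist_to_polya) <= POLYA_WINDOW_BP


def assign_bucket(read: ReadEnd) -> str:
    """Map one read end to exactly one of the nine buckets."""
    if read.region == "downstream":
        return "MechC"
    if read.region == "intron_or_flank":
        return "MechB-aspecific"
    if read.region == "upstream_exon":
        # Assumption: same tolerance as for clusters inside the TE.
        return "MechB-APA" if near_polya(read) else "MechB-exon"
    # From here on the 3' end is inside the terminal exon.
    if read.region == "te_noncoding":
        # Assumption: non-coding genes are set aside before the coverage test.
        return "MechA-noCDS"
    if read.te_coverage >= MIN_TE_COVERAGE:
        return "Captured"
    if read.region == "te_cds":
        return "IP-TE-CDS"
    if read.region == "te_utr":
        return "MechA-correct" if near_polya(read) else "MechA-internalUTR"
    raise ValueError(f"unknown region: {read.region!r}")
```

For example, `assign_bucket(ReadEnd("te_utr", te_coverage=0.2, dist_to_polya=-10))` falls into `MechA-correct`, while the same read with `te_coverage=0.8` is counted as `Captured`.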

## Reading the plots

- `{sample}_terminal_exon.png` — three panels: bucket fractions, read-length
  density (Captured vs MechA-correct), and rates by 3' UTR length bin.
  Mispriming bias concentrates in the long-UTR bins.
- `{sample}_mecha_scatter.png` — read length vs TE coverage for
  MechA-correct reads only; reads above the dashed coverage threshold get
  promoted to Captured.
- `{sample}_basecomp.png` — eight panels, one per bucket, showing %A in
  the reference window downstream of cleavage. **Grey band (30-50% A):**
  moderate-A priming. **Dashed line (>=60% A):** classical A-tract priming.
  Mispriming buckets enriched in the grey band but not past the dashed
  line are characteristic of saturating-local-concentration oligo-dT
  chemistries (10x GEM droplets, BD Rhapsody capture beads); free
  oligo-dT at standard concentrations (bulk Iso-Seq) mis-primes
  preferentially past the dashed line on classical A-tracts.
- `comparison_*.png` — same panels, multiple samples grouped on the same
  axes. Generated by `tecap compare` or `tecap report` (multi-sample mode).
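
To make the %A thresholds concrete, the following sketch computes the A content of the window downstream of a cleavage site and labels it against the grey band and the dashed line. It is an illustration under assumptions, not `tecap`'s code: the function names are hypothetical, the cleavage coordinate is taken to be the 0-based index of the last transcribed base, and the 20 nt window size is borrowed from the `basecomp` description in this README.

```python
# Complement table for reading minus-strand windows in transcript orientation.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def downstream_window(chrom_seq: str, cleavage_pos: int, strand: str, size: int = 20) -> str:
    """Return the `size` reference bases downstream of the cleavage site,
    oriented 5'->3' relative to the transcript."""
    if strand == "+":
        return chrom_seq[cleavage_pos + 1 : cleavage_pos + 1 + size].upper()
    # Minus strand: "downstream" is at lower reference coordinates,
    # and the window must be reverse-complemented.
    start = max(0, cleavage_pos - size)
    window = chrom_seq[start:cleavage_pos].upper()
    return window.translate(COMPLEMENT)[::-1]


def percent_a(window: str) -> float:
    """Percentage of A bases in the window (0.0 for an empty window)."""
    if not window:
        return 0.0
    return 100.0 * window.count("A") / len(window)


def priming_class(pct_a: float) -> str:
    """Label a window using the thresholds drawn on the basecomp plot."""
    if pct_a >= 60.0:
        return "a_tract"     # at or past the dashed line
    if 30.0 <= pct_a <= 50.0:
        return "moderate_a"  # inside the grey band
    return "other"
```

A read whose window is 40% A would be labelled `moderate_a`, the pattern the plots associate with 10x and BD Rhapsody chemistries; a window that is 60% A or more would be labelled `a_tract`.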

## Install

```bash
pip install git+https://github.com/FullLengthFanatic/tecap@v0.3.1
```

Development install:

```bash
git clone https://github.com/FullLengthFanatic/tecap
cd tecap
pip install -e ".[dev]"
pytest
```

## Quick start

```bash
# Classify reads. References are auto-fetched on first run and cached
# under ~/.cache/tecap/GRCh38/.
tecap classify \
    --bam sample.bam \
    --genome GRCh38 \
    --gtf-version 45 \
    --sample S1 \
    --out-dir results/ \
    --threads 8 \
    --platform cdna-pacbio \
    --verbose

# Or pass references explicitly (no auto-download):
tecap classify \
    --bam sample.bam \
    --gtf gencode.v45.annotation.gtf.gz \
    --polya-sites atlas.clusters.3.0.GRCh38.GENCODE_42.bed.gz \
    --sample S1 --out-dir results/ --threads 8

# Measure base composition in the 20 nt window downstream of each cleavage site
tecap basecomp \
    --bam sample.bam \
    --genome GRCh38 \
    --gtf-version 45 \
    --fasta GRCh38.primary_assembly.genome.fa.gz \
    --sample S1 \
    --out-dir results/ \
    --threads 8 \
    --verbose

# Render a self-contained HTML report (per-sample)
tecap report \
    --classify-json results/S1_terminal_exon.json \
    --basecomp-json results/S1_basecomp.json \
    --out-html results/S1_report.html

# Cross-sample HTML report (space-separated paths)
tecap report \
    --classify-json results/A_terminal_exon.json results/B_terminal_exon.json \
    --basecomp-json results/A_basecomp.json results/B_basecomp.json \
    --out-html results/compare.html

# Print the mechanism glossary
tecap explain
tecap explain --mechanism MechA-correct --format json

# Cross-sample comparison plots only (no HTML)
tecap compare \
    --mode classify \
    --inputs results/A_terminal_exon.json,results/B_terminal_exon.json \
    --out-dir results/

# Fetch references explicitly (otherwise --genome handles this)
tecap download-atlas \
    --genome GRCh38 \
    --gtf-version 45 \
    --out-dir ref/
```
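
When many samples have been classified, the comma-separated `--inputs` argument of `tecap compare` can be assembled programmatically. The helper below is a hypothetical convenience, not part of `tecap`; it relies only on the `compare` flags shown above and on the `{sample}_terminal_exon.json` naming from the Outputs section.

```python
from pathlib import Path


def build_compare_command(results_dir: str, out_dir: str) -> list[str]:
    """Build a `tecap compare --mode classify` argument list from every
    *_terminal_exon.json file found in `results_dir`."""
    jsons = sorted(Path(results_dir).glob("*_terminal_exon.json"))
    if len(jsons) < 2:
        raise ValueError("need at least two classify JSON files to compare")
    # --inputs takes one comma-separated string, so paths must not contain commas.
    if any("," in str(p) for p in jsons):
        raise ValueError("input paths must not contain commas")
    return [
        "tecap", "compare",
        "--mode", "classify",
        "--inputs", ",".join(str(p) for p in jsons),
        "--out-dir", str(out_dir),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues.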

## Outputs

Per sample (classify):
- `{sample}_terminal_exon.json` — bucket counts, fractions, PAS split, UTR-length stratification, orientation sanity check, read-length medians.
- `{sample}_terminal_exon.png` — 3-panel summary plot.
- `{sample}_mecha_scatter.png` — read length vs TE coverage for MechA-correct reads.
- `{sample}_tecap_mqc.json` — MultiQC custom-content table (auto-detected by the `_mqc.json` suffix).
- `{sample}_per_gene.tsv` (optional, with `--per-gene-table`) — per-gene bucket counts.

Per sample (basecomp):
- `{sample}_basecomp.json` — %A histograms per bucket, medians, >=60% and 30-50% fractions.
- `{sample}_basecomp.png` — 8-panel histogram grid.

Cross-sample:
- `comparison_terminal_exon.png` — grouped bars across samples.
- `comparison_basecomp.png` — per-bucket histogram overlays.

HTML report (`tecap report`):
- Self-contained `.html` per sample (and per comparison) with embedded
  PNGs, executive summary tiles, mechanism legend, per-bucket tables,
  PAS split, and UTR-length stratification. Single file, no JS.

### Example plots

Outputs from a four-sample `tecap compare` run: 10x Kinnex
(`10x_FL_v02_full`), BD Rhapsody Kinnex (`BD46_FS_SEQ`), PacBio Kinnex
bulk cerebellum, and PacBio Kinnex bulk heart. All samples are human
(GRCh38) and were sequenced as FL Kinnex / MAS-ISO / PacBio HiFi.

![Terminal-exon bucket fractions and UTR-bin MechA-correct rates across the four samples.](docs/example_plots/comparison_terminal_exon.png)

![%A in the 20 nt reference window downstream of each cleavage site, per bucket, for the four samples. Grey band: 30-50% A (moderate-A priming). Dashed line: 60% A (classical A-tract priming). MechB_aspecific shows the chemistry split: 10x and BD Rhapsody enriched in the grey band, bulk Iso-Seq enriched past the dashed line.](docs/example_plots/comparison_basecomp.png)

## Citation

If you use `tecap`, please cite the Zenodo DOI of the release, [10.5281/zenodo.19762736](https://doi.org/10.5281/zenodo.19762736) (see `CITATION.cff`).

## License

MIT
