Metadata-Version: 2.4
Name: peek-bio
Version: 0.1.0
Summary: Instant file previews for genomics data
Author: Patrick Wilson
License: MIT
Project-URL: Homepage, https://github.com/pwilson97/peek-bio
Project-URL: Repository, https://github.com/pwilson97/peek-bio
Project-URL: Issues, https://github.com/pwilson97/peek-bio/issues
Keywords: bioinformatics,genomics,bam,vcf,fastq,fasta,bed,gtf,csv,preview,cli
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Operating System :: OS Independent
Classifier: Environment :: Console
Requires-Python: >=3.8
Description-Content-Type: text/markdown
License-File: LICENSE
Provides-Extra: excel
Requires-Dist: openpyxl>=3.0; extra == "excel"
Provides-Extra: h5ad
Requires-Dist: anndata>=0.8; extra == "h5ad"
Provides-Extra: bigwig
Requires-Dist: pyBigWig>=0.3; extra == "bigwig"
Provides-Extra: bam
Requires-Dist: pysam>=0.20; extra == "bam"
Provides-Extra: all
Requires-Dist: openpyxl>=3.0; extra == "all"
Requires-Dist: anndata>=0.8; extra == "all"
Requires-Dist: pyBigWig>=0.3; extra == "all"
Requires-Dist: pysam>=0.20; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=6.0; extra == "dev"
Dynamic: license-file

# peek-bio

Instant file previews for genomics data. One command, any format.

```
pip install peek-bio
```

![peek demo](docs/demo.gif)

## What it does

Point `peek` at a file and get a structured summary: row counts, column types,
quality scores, variant stats, mapping rates, QC warnings. No scripts, no
notebooks, no googling command flags.

```
$ peek deseq2_results.csv

 deseq2_results.csv — >10,553 x 7 (CSV, comma-separated)
 ────────────────────────────────────────────────────────────────────
 Columns:
                   str    0610005C13Rik, 0610009B22Rik, ...  (1,000 unique)
   baseMean        float  3.92 … 1983.92  (median: 25.32, mean: 59.32)  ⡇⡀⡀⡀⡀⡀⡀⡀⡀⡀
   log2FoldChange  float  -3.29 … 3.60  (median: -0.02, mean: -0.04)    ⡀⡀⡀⡀⡀⡇⡄⡀⡀⡀⡀⡀
   lfcSE           float  0.11 … 1.23  (median: 0.35, mean: 0.40)       ⡄⡇⡆⡄⡄⡄⡀⡀⡀⡀⡀⡀
   stat            float  -5.94 … 8.10  (median: -0.06, mean: -0.11)    ⡀⡀⡀⡀⡇⡆⡄⡀⡀⡀⡀
   pvalue          float  5.46e-16 … 1.00  (median: 0.37, mean: 0.45)   ⡇⡄⡄⡄⡄⡄⡄⡀⡄⡄⡄⡄
   padj            float  3.42e-13 … 1.00  (median: 0.95, mean: 0.81)   ⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡇

 Missing:  pvalue (1)
```

```
$ peek NA12878.bam

 NA12878.bam — 61,614 reads (BAM, indexed)
 ────────────────────────────────────────────────────────────────────
 Reference:  3366 sequences, 3.2 Gb  [GRCh38 (with alts)]
 Reads:  60,749 mapped (98.6%), 865 unmapped
 Flags:  0.1% duplicates, 1.5% supplementary
 Paired:  yes (2×250 bp)
 Insert size:  mean 449  median 428  range 100–999  ⡀⡀⡀⡀⡆⡇⡄⡄⡄⡀⡀⡀
 Read groups:  3  (NA12878, NA12878, NA12878)
 Sort order:  coordinate
 Programs:  bwamem, MarkDuplicates, GATK ApplyBQSR
 MAPQ:  mean 55.3  median 60  ⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡇
```

```
$ peek ERR188273_chrX_1.fq.gz

 ERR188273_chrX_1.fq.gz — 30,531 reads, 2.3 Mb (FASTQ, Phred+33)
 ────────────────────────────────────────────────────────────────────
 Read length:  all 75 bp
 Quality:  mean Q36.7  median Q38  range Q2–Q41  ⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡀⡆⡇
 GC content:  48.9%
```

```
$ peek clinvar.vcf.gz

 clinvar.vcf.gz — 4,403,650 variants (VCF)
 ────────────────────────────────────────────────────────────────────
 Variants:  4,103,565 snps, 93,659 insertions, 194,377 deletions, 12,049 complexes
 Ts/Tv:  1.69
 FILTER:  4,403,650 PASS
 Chroms:  32 total — top: 1 (398,195), 2 (384,641), 17 (265,676)
```

```
$ peek filtered_feature_bc_matrix/matrix.mtx.gz

 matrix.mtx.gz (12.3 MB) — 8,421 cells x 33,538 genes (Matrix Market, coordinate, integer)
 ────────────────────────────────────────────────────────────────────
 Non-zero:  17,438,362 entries (93.8% sparse)
 Cells:  8,421
 Features:  33,538
 Mean nnz/cell:  2,071
 Feature types:  33,538 Gene Expression
 Companions:  barcodes.tsv.gz, features.tsv.gz
```
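
The sparsity and per-cell figures in the Matrix Market summary above follow from the file's size line. The snippet below is a minimal sketch of that arithmetic, not `peek`'s actual code; the function name `mtx_summary` and the assumption that rows are features and columns are cells (the usual 10x layout) are illustrative.

```python
def mtx_summary(lines):
    """Summarize a Matrix Market coordinate file from its header lines.

    Assumes (hypothetically) that rows are features and columns are cells.
    """
    it = iter(lines)
    banner = next(it)
    if not banner.startswith("%%MatrixMarket"):
        raise ValueError("not a Matrix Market file")

    # The first non-comment, non-blank line holds: rows, columns, non-zero count.
    for line in it:
        if line.startswith("%") or not line.strip():
            continue
        n_rows, n_cols, nnz = (int(x) for x in line.split())
        break
    else:
        raise ValueError("no size line found")

    total = n_rows * n_cols
    return {
        "features": n_rows,
        "cells": n_cols,
        "nnz": nnz,
        # Share of matrix entries that are zero.
        "sparsity_percent": 100 * (1 - nnz / total) if total else 0.0,
        "mean_nnz_per_cell": nnz / n_cols if n_cols else 0.0,
    }
```

Feeding it the dimensions from the example (33,538 features, 8,421 cells, 17,438,362 non-zero entries) reproduces the 93.8% sparsity and roughly 2,071 non-zero entries per cell shown above.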

## Supported formats

**Core** (no extra dependencies):

| Format | Extensions |
|--------|-----------|
| CSV/TSV | `.csv`, `.tsv`, `.txt` |
| BED | `.bed`, `.narrowPeak`, `.broadPeak`, `.bedGraph` |
| FASTA | `.fa`, `.fasta` |
| FASTQ | `.fq`, `.fastq` |
| VCF | `.vcf`, `.vcf.gz` |
| MTX | `.mtx` |
| GTF/GFF | `.gtf`, `.gff`, `.gff3` |

**Optional** (install what you need):

| Format | Extensions | Install |
|--------|-----------|---------|
| SAM/BAM/CRAM | `.sam`, `.bam`, `.cram` | `pip install peek-bio[bam]` |
| Excel | `.xlsx`, `.xls` | `pip install peek-bio[excel]` |
| BigWig | `.bw`, `.bigwig` | `pip install peek-bio[bigwig]` |
| H5AD | `.h5ad` | `pip install peek-bio[h5ad]` |

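Or install everything: `pip install peek-bio[all]`. In shells that treat square brackets as glob patterns (zsh, for example), quote the argument: `pip install 'peek-bio[all]'`.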

Files with non-standard extensions (or no extension at all) are detected automatically from their content.
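
Content-based detection of this kind usually relies on magic bytes and the first few lines of a file. The sketch below illustrates the general idea only; the function names and the specific rules are assumptions for illustration and are not taken from `peek`'s source.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def read_head(path, n_bytes=4096):
    """Return the first n_bytes of a file, decompressing gzip/BGZF if needed."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    opener = gzip.open if magic == GZIP_MAGIC else open
    with opener(path, "rb") as fh:
        return fh.read(n_bytes)


def sniff_format(head):
    """Guess a format from leading bytes (hypothetical, simplified rules)."""
    # Decompressed BAM data starts with the 4-byte magic "BAM\1".
    if head.startswith(b"BAM\x01"):
        return "BAM"

    lines = head.decode("utf-8", errors="replace").splitlines()
    if not lines:
        return "unknown"
    first = lines[0]

    if first.startswith("##fileformat=VCF"):
        return "VCF"
    if first.startswith("%%MatrixMarket"):
        return "MTX"
    if first.startswith(">"):
        return "FASTA"
    # FASTQ records: '@' header, sequence, '+' separator, qualities.
    if first.startswith("@") and len(lines) >= 3 and lines[2].startswith("+"):
        return "FASTQ"
    if "\t" in first:
        return "TSV"
    if "," in first:
        return "CSV"
    return "unknown"
```

A real detector would need more rules (SAM headers, BED column checks, GTF attribute fields), so treat this as a starting point rather than a complete classifier.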

## Directory scan

Point `peek` at a folder to get an instant inventory of all genomics files:

```
$ peek data/

 data/ — 30 genomics files, 3.1 GB
 ────────────────────────────────────────────────────────────────────
   1 FASTA  all_ref_sva.fa
   11 BAM/SAM/CRAM  (3 indexed)
   1 VCF  candidate_EOPC_variants.vcf.gz
   4 BED  CEBPG.bed, ENCFF363RKC.bed, ...
   3 GTF/GFF  fimo_HP.gff, fimo_cobound.gff, gencode.v38.basic.annotation.gtf.gz
   1 BigWig  k562_MNase.bw
   2 H5AD  neurips_bmmc.h5ad, pbmc68k.h5ad
   2 Excel  Oct4_RS-matrix_Rep1-Apr-2021.xlsx, nature_genetics_supp.xlsx
   5 CSV/TSV
```

The scan detects FASTQ pairs (R1/R2) and indexed BAMs, and skips hidden files.
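
Pair detection of this kind can be done from file names alone. The following is a simplified sketch under an assumed naming convention (`_R1`/`_R2` or `_1`/`_2` directly before the extension); the regular expression and the function name `find_fastq_pairs` are hypothetical, and `peek`'s own matching rules may differ.

```python
import re
from pathlib import Path

# Assumed convention: <stem>_R1.fq.gz / <stem>_R2.fq.gz, or <stem>_1.fastq / <stem>_2.fastq.
MATE_RE = re.compile(
    r"^(?P<stem>.+?)[._]R?(?P<mate>[12])(?P<ext>\.(?:fq|fastq)(?:\.gz)?)$"
)


def find_fastq_pairs(directory):
    """Return (R1, R2) path tuples for FASTQ files that share a stem."""
    mates = {}
    for path in sorted(Path(directory).iterdir()):
        # Skip hidden files and anything that is not a regular file.
        if path.name.startswith(".") or not path.is_file():
            continue
        match = MATE_RE.match(path.name)
        if match:
            key = (match.group("stem"), match.group("ext"))
            mates.setdefault(key, {})[match.group("mate")] = path
    # Keep only stems where both mates are present.
    return [(m["1"], m["2"]) for m in mates.values() if "1" in m and "2" in m]
```

Names with extra suffixes after the mate number (such as Illumina's `_R1_001.fastq.gz`) are not matched by this pattern and would need an extended expression.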

## Paired FASTQ comparison

Give `peek` two FASTQ files and it automatically compares them side by side:

```
$ peek sample_R1.fq.gz sample_R2.fq.gz

 Paired FASTQ Comparison
 ────────────────────────────────────────────────────────────────────
                 sample_R1.fq.gz         sample_R2.fq.gz
          Reads  1,204,881               1,204,881               ✓
       Total bp  90.4 Mb                 90.4 Mb
    Read length  75 bp                   75 bp                   ✓
   Mean quality  Q36.7                   Q34.2                   ✓
     GC content  48.9%                   49.1%                   ✓
       Encoding  Phred+33                Phred+33                ✓
```

Mismatched read counts (broken pairing) are flagged with a QC warning.
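
The read-count check can be reproduced independently when you want to verify a pair in a script. This is a minimal sketch that assumes standard four-line FASTQ records; the function names are illustrative and not part of `peek`.

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly four lines per record."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as fh:
        n_lines = sum(1 for _ in fh)
    return n_lines // 4


def check_pairing(r1_path, r2_path):
    """Return a warning string if the mates differ in read count, else None."""
    n1 = count_fastq_reads(r1_path)
    n2 = count_fastq_reads(r2_path)
    if n1 != n2:
        return f"WARNING: read counts differ (R1={n1:,}, R2={n2:,}); pairing may be broken"
    return None
```

Equal counts do not prove the mates are in the same order, so a stricter check would also compare read names record by record.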

## Explain mode

New to bioinformatics? Add `--explain` for plain-English annotations of every metric:

```
$ peek --explain variants.vcf

 Ts/Tv:  1.85
   ↳ Transition/transversion ratio. Transitions (A<>G, C<>T) are chemically
     favored over transversions (all other changes). Whole-genome ~2.0, exome ~2.8.
     A low ratio can indicate sequencing artifacts or contamination.

 MAPQ:  mean 55.3  median 60
   ↳ Mapping quality: confidence that each read is aligned to the correct position.
     60 = near-certain. 0 = equally likely at multiple locations. Below 20 = ambiguous.
```
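
For readers who want to see how the Ts/Tv metric is derived, the sketch below computes it from (REF, ALT) pairs. It is an illustration of the definition given in the annotation, not `peek`'s implementation; the function name `ts_tv_ratio` is hypothetical, and multi-allelic sites are assumed to have been split beforehand.

```python
# Transitions swap purine with purine (A<>G) or pyrimidine with pyrimidine (C<>T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv_ratio(snps):
    """Compute the transition/transversion ratio from (ref, alt) pairs.

    Non-SNP records (indels, symbolic alleles, N bases) are ignored.
    Returns NaN when there are no transversions.
    """
    ts = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if ref not in "ACGT" or alt not in "ACGT":
            continue
        if (ref, alt) in TRANSITIONS:
            ts += 1
        else:
            tv += 1
    return ts / tv if tv else float("nan")
```

For example, two transitions and one transversion give a ratio of 2.0.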

## QC warnings

`peek` flags common issues automatically:

- Unusual GC content (outside 25-65%)
- High N content in assemblies (>20%)
- Low mean base quality in FASTQ (<Q20)
- Adapter contamination in FASTQ (>5%)
- Low mapping rate in BAM/SAM (<80%)
- Low MAPQ scores (<20 mean)
- High duplicate rate (>30%)
- Ts/Tv ratio out of range in VCF
- Low genotype rate in multi-sample VCF (<90%)
- No gene features or missing gene_id in GTF
- Single-chromosome GTF (possible subset)
- Columns with >50% missing data
- Mixed-type columns (numbers and strings mixed together)
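
Several of these checks are simple threshold comparisons. The sketch below shows how a handful of them could be expressed as a rule table, using the thresholds from the list above; the metric key names and the function `qc_warnings` are assumptions made for illustration and do not reflect `peek`'s internal API.

```python
# Each rule: (hypothetical metric key, predicate that is True when bad, message).
QC_RULES = [
    ("gc_percent", lambda v: v < 25 or v > 65, "Unusual GC content (outside 25-65%)"),
    ("mean_base_quality", lambda v: v < 20, "Low mean base quality (<Q20)"),
    ("mapping_rate_percent", lambda v: v < 80, "Low mapping rate (<80%)"),
    ("mean_mapq", lambda v: v < 20, "Low MAPQ scores (<20 mean)"),
    ("duplicate_percent", lambda v: v > 30, "High duplicate rate (>30%)"),
]


def qc_warnings(metrics):
    """Return the messages of all rules violated by the given metrics.

    Metrics that are absent from the dict are skipped rather than flagged.
    """
    return [
        message
        for key, is_bad, message in QC_RULES
        if key in metrics and is_bad(metrics[key])
    ]
```

Keeping the rules in a table makes it straightforward to add a check or adjust a threshold without touching the evaluation logic.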

## Usage

```
peek FILE [FILE ...]          # preview one or more files
peek DIRECTORY                # scan a folder for genomics files
peek -r DIRECTORY             # scan recursively (include subdirectories)
peek R1.fq R2.fq              # compare paired FASTQ files
peek https://example.com/f.vcf  # preview a file from a URL
peek --explain FILE           # add plain-English explanations
peek --head 20 FILE           # show 20 preview rows instead of 5
peek --no-color FILE          # plain text output (no ANSI colors)
peek --formats                # list all supported formats + install status
peek --version                # print version
```

Compressed files (`.gz`) are handled transparently. Files given as URLs are downloaded to a temporary file, which is deleted automatically afterwards.
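
The download-then-clean-up pattern can be sketched with the standard library alone. The context manager below is an illustration of the approach, not `peek`'s actual code; the name `fetched` is hypothetical.

```python
import contextlib
import os
import shutil
import tempfile
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse


@contextlib.contextmanager
def fetched(url):
    """Download a URL to a temporary file and delete it on exit."""
    # Preserve the extension(s), e.g. ".vcf.gz", so format detection by name still works.
    suffix = "".join(PurePosixPath(urlparse(url).path).suffixes)
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as resp:
            shutil.copyfileobj(resp, out)
        yield tmp_path
    finally:
        # Remove the temporary file even if the caller raised an exception.
        with contextlib.suppress(FileNotFoundError):
            os.remove(tmp_path)
```

Used as `with fetched(url) as path: ...`, the temporary file exists only inside the `with` block.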

## License

MIT
