Metadata-Version: 2.4
Name: TEsmall
Version: 2.0.9
Summary: A pipeline for profiling TE-associated small RNAs
Home-page: https://www.mghlab.org/software/tesmall
Author: Wen-Wei Liao, Kat O'Neill, Molly Hammell
Author-email: mghcompbio@gmail.com
License: GPLv3
Keywords: TE transposable element small RNA
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Intended Audience :: Science/Research
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Natural Language :: English
Classifier: Operating System :: MacOS :: MacOS X
Classifier: Operating System :: POSIX :: Linux
Classifier: Operating System :: Unix
Classifier: Programming Language :: Python :: 2.7
Classifier: Programming Language :: Python :: 3
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
License-File: LICENSE
Requires-Dist: numpy>=1.11.3
Requires-Dist: scipy>=0.18.0
Requires-Dist: pandas>=0.18.1
Requires-Dist: matplotlib>=1.5.1
Requires-Dist: seaborn>=0.7.1
Requires-Dist: bokeh>=1.0.0
Requires-Dist: pysam>=0.9.1
Requires-Dist: pybedtools>=0.7.8
Requires-Dist: cutadapt>=1.10
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: requires-dist
Dynamic: summary

## TEsmall

Version 2.0.9

A pipeline for profiling TE-derived small RNAs.

Created by Wen-Wei Liao, Kat O'Neill & Molly Gale Hammell, March 2017

Contact: mghcompbio@gmail.com

### Install Miniconda 3 (Linux)

```
$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ bash Miniconda3-latest-Linux-x86_64.sh
```

### Set up channels

```
$ conda config --add channels conda-forge
$ conda config --add channels bioconda
```

### Install TEsmall

```
$ git clone https://github.com/mhammell-laboratory/TEsmall.git
$ cd TEsmall
$ conda env create -f environment.yaml -n TEsmall
$ conda activate TEsmall
$ python setup.py install
```

### How to run TEsmall

1. Before executing TEsmall, make sure you have activated the environment:

	```
	$ conda activate TEsmall
	```

2. For example, to run TEsmall on two FASTQ files, `Parental_1.fastq.gz` and `DroKO_1.fastq.gz`:

	```
	$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -l Parental DroKO
	```

3. When it's done, deactivate the environment:

	```
	$ conda deactivate
	```

4. To change the directory where TEsmall downloads and reads the genome
   files it uses for annotation, set it at runtime with the `--dbfolder`
   parameter:
	
	```
	$ TEsmall -f Parental_1.fastq.gz DroKO_1.fastq.gz -g hg19 \
		-l Parental DroKO --dbfolder /path/to/another/folder/
	```
	The files used by TEsmall will be downloaded to and read from the
	`genomes` folder inside `/path/to/another/folder/`.
	
	The default location is `$HOME/TEsmall_db/`
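
When many samples are processed, it can help to assemble the command programmatically. The following is a minimal sketch, not part of TEsmall itself: `build_tesmall_command` is a hypothetical helper that uses only the flags shown in this README (`-f`, `-l`, `-g`, `-p`, `--dbfolder`) and assumes that each FASTQ file is paired with exactly one unique label.

```python
import shlex


def build_tesmall_command(fastqs, labels, genome=None, dbfolder=None, threads=None):
    """Assemble a TEsmall argument list (hypothetical helper, not a TEsmall API)."""
    # Assumption: one label per FASTQ file, and labels must be unique.
    if len(fastqs) != len(labels):
        raise ValueError("each FASTQ file needs exactly one label")
    if len(set(labels)) != len(labels):
        raise ValueError("labels must be unique")

    cmd = ["TEsmall", "-f", *fastqs, "-l", *labels]
    if genome is not None:
        cmd += ["-g", genome]
    if dbfolder is not None:
        cmd += ["--dbfolder", dbfolder]
    if threads is not None:
        cmd += ["-p", str(threads)]
    return cmd


cmd = build_tesmall_command(
    ["Parental_1.fastq.gz", "DroKO_1.fastq.gz"],
    ["Parental", "DroKO"],
    genome="hg19",
    dbfolder="/path/to/another/folder/",
)
# Print the command as a single shell-quoted line for inspection.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from within the activated `TEsmall` environment.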

### For more information

```
$ TEsmall -h
usage: TEsmall [-h] [-a STR] [-m INT] [-M INT] [-g STR] [--maxaln INT]
               [--mismatch INT] [-o STR [STR ...]] [-p INT] [-f STR [STR ...]]
               [-l STR [STR ...]] [--dbfolder STR] [--verbose INT] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -a STR, --adapter STR
                        Sequence of an adapter that was ligated to the 3' end.
                        The adapter itself and anything that follows is
                        trimmed. (default: TGGAATTCTCGGGTGCCAAGG)
  -m INT, --minlen INT  Discard trimmed reads that are shorter than INT. Reads
                        that are too short even before adapter removal are
                        also discarded. (default: 16)
  -M INT, --maxlen INT  Discard trimmed reads that are longer than INT. Reads
                        that are too long even before adapter removal are also
                        discarded. (default: 36)
  -g STR, --genome STR  Version of reference genome (default: hg38)
  --maxaln INT          Suppress all alignments for a particular read if more
                        than INT reportable alignments exist for it. (default:
                        100)
  --mismatch INT        Report alignments with at most INT mismatches.
                        (default: 0)
  -o STR [STR ...], --order STR [STR ...]
                        Annotation priority. (default: structural_RNA miRNA
                        hairpin exon TE intron piRNA_cluster)
  -p INT, --parallel INT
                        Parallel execute by INT CPUs. (default: 1)
  -f STR [STR ...], --fastq STR [STR ...]
                        Input in FASTQ format. Compressed input is supported
                        and auto-detected from the filename extension (.gz).
  -l STR [STR ...], --label STR [STR ...]
                        Unique label for each sample.
  --dbfolder STR        Custom location of TEsmall database folder (containing
                        the "genomes" folder). DEFAULT: $HOME/TEsmall_db/
  --verbose INT         Set verbose level.
                        0: only show critical message
                        1: show additional warning message
                        2: show process information
                        3: show debug messages.
                        DEFAULT: 2
  -v, --version         show program's version number and exit
```
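
The boundary behaviour of `-m`, `-M` and `--maxaln` follows from the wording of the help text: reads shorter than the minimum or longer than the maximum are discarded, and a read is suppressed only when it has more than `--maxaln` alignments. The sketch below merely illustrates those inclusive bounds with the default values. `keep_read` and `keep_alignments` are hypothetical names, and TEsmall presumably delegates the actual filtering to its trimming and alignment steps.

```python
def keep_read(sequence, minlen=16, maxlen=36):
    """Return True if the read length lies within [minlen, maxlen] (inclusive)."""
    return minlen <= len(sequence) <= maxlen


def keep_alignments(n_alignments, maxaln=100):
    """Return True unless the read has more than maxaln reportable alignments."""
    return n_alignments <= maxaln


lengths = [15, 16, 36, 37]
kept_lengths = [n for n in lengths if keep_read("N" * n)]
print(kept_lengths)                                   # [16, 36]
print(keep_alignments(100), keep_alignments(101))     # True False
```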

### Output files

Brief explanations of the output files generated by TEsmall:

#### Final output
```
count_summary.txt    -    This is the file containing the combined count table
                          of all libraries processed by TEsmall. This is typically
                          the file you want to use for differential analysis.
report.html          -    HTML report of QC and annotation statistics
```

The following files are generated for each library and are prefixed with the label
supplied through the `-l, --label` parameter.

#### Preprocessing output
```
[label].trimmed1.fastq    -   FASTQ file after 3' adapter trimming
[label].cutadapt1.log     -   Cutadapt log from 3' adapter trimming
[label].trimmed2.fastq    -   FASTQ file after 3' & 5' adapter trimming
[label].cutadapt2.log     -   Cutadapt log from 5' adapter trimming
[label].bam               -   BAM output for reads that aligned to rRNA (in older versions)
[label].rRNA.bam          -   BAM output for reads that aligned to rRNA
[label].rRNA.log          -   Bowtie log for rRNA mapping
[label].rm_rRNA.fastq     -   FASTQ file depleted for rRNA reads
                              Used for subsequent analysis
```

#### Genome alignment output
```
[label].log               -   Bowtie log for genome alignment (in older versions)
[label].genome.log        -   Bowtie log for genome alignment
[label].unaligned.fastq   -   FASTQ containing reads that failed to align to genome
[label].exceeded.fastq    -   FASTQ containing reads that aligned too many times to genome
[label].rinfo             -   Length & alignment counts for each aligned read (in older versions)
[label].aligned.rinfo     -   Length & alignment counts for each aligned read
[label].multi.bam         -   BAM output for reads aligned to genome (in older versions)
[label].genome.bam        -   BAM output for reads aligned to genome
```

#### Identifying tRNA fragments (tRFs)
[Schorn et al. 2017](https://www.cell.com/cell/fulltext/S0092-8674(17)30696-7)
```
[label].cca.fa                    -   FASTA file containing aligned reads terminating with CCA, with CCA tail cleaved
[label].tRNA.bam                  -   BAM output for CCA-trimmed reads that aligned to tRNA
[label].3trf.log                  -   Bowtie log for CCA-trimmed reads aligning to tRNA (in older versions)
[label].tRNA.log                  -   Bowtie log for CCA-trimmed reads aligning to tRNA
[label].unaligned.cca.fa          -   FASTA file containing CCA-trimmed reads that failed to align
[label].trna_for_intersect.bam    -   BAM file of CCA-trimmed reads that aligned to tRNA, converted to genomic coordinates
[label].3trf_free.bam             -   BAM file of reads aligned to genome that are not tRF
[label].3trf.bam                  -   BAM file of reads aligned to genome that are tRF
```

#### Annotation output
```
[label].anno                      -   Annotation of aligned reads that are not tRF
[label].3trf.struc.mapper.anno    -   tRFs annotated to structural RNA (e.g. tRNA)
[label].3trf.TE.mapper.anno       -   tRFs annotated to TEs
[label].comp                      -   Length distribution of reads based on annotation (in older versions)
[label].anno.rlen.info            -   Length distribution of reads based on annotation
[label].bedgraph                  -   BEDgraph of annotated reads weighted by EM
```
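
After a run, the file names listed above can be used to check that the expected outputs are present. This is a minimal sketch under two assumptions that are not confirmed here: that all outputs are written to a single directory, and that every intermediate file is kept. `missing_outputs` is a hypothetical helper, and only the current (not the older) file names are checked.

```python
from pathlib import Path

# Per-library suffixes copied from the tables above (current naming only).
PER_LABEL_SUFFIXES = [
    # Preprocessing
    "trimmed1.fastq", "cutadapt1.log", "trimmed2.fastq", "cutadapt2.log",
    "rRNA.bam", "rRNA.log", "rm_rRNA.fastq",
    # Genome alignment
    "genome.log", "unaligned.fastq", "exceeded.fastq",
    "aligned.rinfo", "genome.bam",
    # tRF identification
    "cca.fa", "tRNA.bam", "tRNA.log", "unaligned.cca.fa",
    "trna_for_intersect.bam", "3trf_free.bam", "3trf.bam",
    # Annotation
    "anno", "3trf.struc.mapper.anno", "3trf.TE.mapper.anno",
    "anno.rlen.info", "bedgraph",
]


def missing_outputs(outdir, labels):
    """Return the expected output paths that do not exist in outdir."""
    outdir = Path(outdir)
    expected = [outdir / name for name in ("count_summary.txt", "report.html")]
    expected += [
        outdir / f"{label}.{suffix}"
        for label in labels
        for suffix in PER_LABEL_SUFFIXES
    ]
    return [path for path in expected if not path.exists()]


# Example: check the current directory for the two libraries used earlier.
for path in missing_outputs(".", ["Parental", "DroKO"]):
    print("missing:", path)
```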

### Copying & distribution

TEsmall is part of the [TEToolkit suite](https://www.mghlab.org/software).

TEsmall is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but *WITHOUT ANY WARRANTY*; without even the implied warranty of
*MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE*.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with TEsmall.  If not, see [this website](http://www.gnu.org/licenses/).

### Citation

If you use this software in a publication, please cite the [following paper](https://pubmed.ncbi.nlm.nih.gov/30349559/):

O'Neill K, Liao WW, Patel A, Hammell MG. (2018) TEsmall Identifies Small RNAs Associated With Targeted Inhibitor Resistance in Melanoma. Front Genet. Oct 5;9:461.
