Approach
- Split the input bam file into smaller chunks using GNU parallel
- Process each chunk to extract Hi-C non-chimeric reads (i.e. non-conforming Hi-C pairs)
- Generate a fastq file to map to the TE consensus assembly
- If the remap option is specified, the reads that do not map to the consensus are further mapped to the polymorphic sequences of the TE families
- Clusters of mates are identified such that one part of the mate maps to the reference sequence while the other part clusters on the TE consensus assembly
- Once the clusters are obtained, all the reads supporting a TE insertion at the locus are modelled using a negative binomial model
- The putative insertions are further filtered so that only the high-confidence instances are reported
- A report file (*.candidate.insertion.bed) is generated in .bed format
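The negative binomial step above can be illustrated with a minimal sketch. This is not HiTEA's actual implementation: the parameterization (mean `mu`, dispersion `size`) and the function names are assumptions made for illustration only. The sketch computes the probability of observing at least `k` supporting reads at a locus given a background expectation.

```python
import math

def nbinom_pmf(k: int, mu: float, size: float) -> float:
    """Negative binomial PMF with mean mu > 0 and dispersion size > 0.

    Assumed parameterization: variance = mu + mu**2 / size.
    """
    p = size / (size + mu)
    log_pmf = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)

def nbinom_upper_tail(k: int, mu: float, size: float) -> float:
    """P(X >= k): chance of seeing k or more supporting reads under the background."""
    if k <= 0:
        return 1.0
    lower = sum(nbinom_pmf(i, mu, size) for i in range(k))
    return max(0.0, 1.0 - lower)
```

A small tail probability would suggest that the read support at a cluster is unlikely to arise from the background alone; how HiTEA estimates the background and which cutoff it applies are not specified here.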
Summary of the read pairs and their flags
- Read types are classified using pairtools.
- The internal read classification uses the following nomenclature:
- DE = Clipped reads where the clipped segments are at least as long as the ‘-clip’ parameter
- IE = Reads that do not define the cluster locus but have their mates mapping to the TE consensus. In other words, these are Repeat Anchored Mates (RAM, as defined by ref-1)
- TP = Reads that do NOT carry RE ligation motifs
- FP = Reads that carry RE ligation motif
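The nomenclature above can be sketched as a small classifier. This is an illustration under stated assumptions, not HiTEA's code: the function name, the default clip threshold of 20, the `GATCGATC` ligation motif (the junction expected for MboI/DpnII libraries) and the plain substring search are all hypothetical choices.

```python
from typing import Optional, Tuple

def classify_read(
    clip_len: int,
    mate_on_te: bool,
    sequence: str,
    min_clip: int = 20,                 # assumed stand-in for the '-clip' parameter
    ligation_motif: str = "GATCGATC",   # assumed RE ligation motif
) -> Optional[Tuple[str, str]]:
    """Return (DE or IE, TP or FP), or None if the read supports no insertion."""
    if clip_len >= min_clip:
        evidence = "DE"   # clipped read that can define a breakpoint
    elif mate_on_te:
        evidence = "IE"   # repeat anchored mate (RAM)
    else:
        return None
    # FP = read carries the ligation motif, TP = it does not
    motif_flag = "FP" if ligation_motif in sequence.upper() else "TP"
    return evidence, motif_flag
```

For example, `classify_read(25, False, "ACGT" * 10)` yields `("DE", "TP")` under these assumptions.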
## ##-----------------------------------------------------------------------------------------------------
## Total read-pairs in the file 598214830
## Total "cis" read-pairs in the file 420474901 (70.29%)
## Total "trans" read-pairs in the file 177374422 (29.65%)
## Linearly Mapped and rescued read-pairs (UU/UR/RU) 463030428 (77.40%)
## Uninformative read-pairs 33115654 (5.54%)
## Low mapping quality read-pairs (MAPQ < 28) 5989176 (1.00%)
## Read-pairs with at least 2 Ns in both the read-sequences 431757 (0.0722%)
## Read-pairs mapping to non-selected chromosomes 2180041 (0.364%)
## Reads with only single mate mapped 105 (0.00%)
## WGS non-chimeric reads 61357746 (10.26%)
## Non-HiC chimeric read pairs 77904671 (13.02%)
## Pairs with 2 RE junctions 258035 (0.043%)
##
## # Read/Flag-Class (Inp.) (Report)
## # MM 9038343 454748 (5.03%)
## # MR 17342463 16278877 (93.87%)
## # MU 37843471 32089694 (84.80%)
## # NM 835188 101920 (12.20%)
## # NR 26070824 6654262 (25.52%)
## # NU 11120295 10037730 (90.26%)
## # RU 40445331 0 (0.00%)
## # UR 40865170 0 (0.00%)
## # UU 382556152 0 (0.00%)
## # WW 31732086 26860361 (84.65%)
##
## # Flag-Class (R1,Inp.) (R2,Inp.) (R1,Report) (R2,Report)
## # DE,FP 20953931 20785312 11952718 (57.04%) 11387378 (54.79%)
## # DE,TP 6838240 12367067 6623026 (96.85%) 12066691 (97.57%)
## # IE,FP 55167304 49228441 0 (0.00%) 0 (0.00%)
## # IE,TP 51859420 52438075 31295209 (60.35%) 31843827 (60.73%)
## Total pairs used in the analyses: 92477592 (15.46%)
## Total %flags retained for reporting: 105168849 (39.00%) (out of 269637790)
## ##-----------------------------------------------------------------------------------------------------
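The retention figure in the last summary line can be reproduced from the flag-class table. The sketch below copies the counts from the summary above; the variable and function names are illustrative and not part of HiTEA.

```python
# (R1 input, R2 input, R1 reported, R2 reported), copied from the summary table
flag_counts = {
    "DE,FP": (20953931, 20785312, 11952718, 11387378),
    "DE,TP": (6838240, 12367067, 6623026, 12066691),
    "IE,FP": (55167304, 49228441, 0, 0),
    "IE,TP": (51859420, 52438075, 31295209, 31843827),
}

def retention(counts):
    """Sum input and reported flags over both mates and all classes."""
    total_in = sum(r1_in + r2_in for r1_in, r2_in, _, _ in counts.values())
    total_out = sum(r1_out + r2_out for _, _, r1_out, r2_out in counts.values())
    return total_out, total_in, 100.0 * total_out / total_in

kept, seen, pct = retention(flag_counts)
print(f"{kept} of {seen} flags retained ({pct:.2f}%)")
```

The totals agree with the summary line: 105168849 reported flags out of 269637790, i.e. 39.00%.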
GNU run logs
- GNU parallel was used for splitting and piping the input file on all available cores
- The submitted job logs are displayed below
Read-pair orientation QC
- Characteristic plot depicting mate-pair orientations versus the distance between their mapped positions
- For more details, refer to [https://data.4dnucleome.org/]

Insertion call summary
- HiTEA calls are summarized in the table below
- Status 3: High-confidence calls with both right- and left-hand side clip information
- Status 2: High-confidence calls with clip information available for only one side
- Status 1: Calls overlap with known genomic copies of a given transposable element
- Status 0: Poor quality calls likely to be false positive (omit them from the analyses)
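As a sketch of how the status codes above might be used downstream, the snippet below filters a list of calls by status. The record layout (a dict with `chrom`, `start`, `end`, `te` and `status` keys) and the example records are hypothetical; the actual column order of the HiTEA .bed report is not described here and should be checked before adapting this.

```python
# Short labels for the status codes listed above
STATUS_LABELS = {
    3: "high confidence, clip information on both sides",
    2: "high confidence, clip information on one side",
    1: "overlaps a known genomic copy of the TE",
    0: "poor quality, likely false positive",
}

def keep_calls(calls, min_status=1):
    """Keep calls whose status is at least min_status (status 0 is dropped by default)."""
    return [call for call in calls if call["status"] >= min_status]

# Hypothetical example records, not real HiTEA output
example_calls = [
    {"chrom": "chr1", "start": 1000, "end": 1001, "te": "Alu", "status": 3},
    {"chrom": "chr2", "start": 5000, "end": 5001, "te": "L1", "status": 0},
    {"chrom": "chr3", "start": 7000, "end": 7001, "te": "SVA", "status": 2},
]

for call in keep_calls(example_calls):
    print(call["chrom"], call["start"], call["te"], STATUS_LABELS[call["status"]])
```

Raising `min_status` to 2 would restrict the output to the two high-confidence classes.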
Coverage summary along the HiTEA calls
- Reads supporting the insertion are of two types
- Clipped reads that define insertion breakpoint
- Repeat Anchored Mates (RAM)
- For each high-confidence HiTEA call (i.e. status >= 1), the read coverage for the above read types is displayed
- The coverages are grouped by TE
- The color scale represents the total number of reads on a log2 scale (after adding a pseudocount of 1)
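The color-scale transform described above amounts to log2(count + 1). A minimal sketch follows; the function name is illustrative.

```python
import math

def color_value(read_count: int) -> float:
    """Map a read count to the plotted scale: log2 after adding a pseudocount of 1."""
    return math.log2(read_count + 1)
```

The pseudocount keeps loci with zero reads defined on the scale, since `color_value(0)` is 0 rather than an undefined log2(0).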

References:
- Lee, E., Iskow, R., Yang, L., Gokcumen, O., Haseley, P., Luquette, L. J., … Park, P. J. (2012). Landscape of somatic retrotransposition in human cancers. Science. http://doi.org/10.1126/science.1222077