1 Approach

  • Split the input bam file into smaller chunks using GNU parallel
  • Process each chunk to extract Hi-C non-chimeric reads (i.e. non-conforming Hi-C pairs)
  • Generate a fastq file to map to the TE consensus assembly
  • If the remap option is specified, the reads that do not map to the consensus are further mapped to the polymorphic sequences of the TE families
  • Clusters of mates are identified such that one part of a mate maps to the reference sequence while the other part clusters on the TE consensus assembly
  • Once the clusters are obtained, all the reads supporting TE insertion at the locus are modelled using a negative binomial model
  • The putative insertions are further filtered so that only high-confidence instances are reported
  • A report file (*.candidate.insertion.bed) is generated in .bed format

2 Input parameters and processing steps

  • HiTEA was run using the following parameters

I/O and paramters:
 -i                   --inputs           /n/data1/hms/dbmi/park/Dhawal/4DN/Data/GM12878.20x.bam
 -o                   --outprefix        gm12878_20xsb
 -w                   --workdir          /n/data1/hms/dbmi/park/Dhawal/4DN/Data/gm12878_20x/sandbox
 -m                   --enzyme           MboI
 -sites               --sites            /n/data1/hms/dbmi/park/Dhawal/Genomes/Hg38/hg38_DpnII.txt
 -index               --index            hg38/indexC/Hg_TEConsensus.fa
 -repbase             --repbase          hg38/repbase/HumanRepbase.fasta 
 -x                   --wgs              F
 -remap               --remap            T 
 -bgAnnotations       --bgAnnotations    hg38/bgAnnotations
 -chr_sizes           --chrsize          
 -anchor_mapq         --anchor_mapq      28
 -clip                --clip             20
 -gap                 gap                2
 -chunksize           chunk size         1995M
 -ncores (available)  --ncores           28
 -ncores (used)       28
 -threads (available) 392
 -threads (used)      8

command: hitea -p 28 -m MboI -q 28 -o gm12878_20xsb -s 20 -g hg38 -w /n/data1/hms/dbmi/park/Dhawal/4DN/Data/gm12878_20x/sandbox -i /n/data1/hms/dbmi/park/Dhawal/4DN/Data/GM12878.20x.bam


Start-Time 00:00:00
# Step1 : checking dependancies ... 
great!! all dependancies are present!

# Step2 : Splitting input file and generating fastq records ... 
Completed parsing input file
Time-elapsed 00:00:01
Completed merging back the input parts
Time-elapsed 00:00:01

# Step3 : Extracting clusters  ... 
Completed extracting cluster locations
Time-elapsed 00:00:01

# Step4 : Sorting/merging and cleaning up ...
Completed sorting/merging
Time-elapsed 00:00:01

# Step5 : Merging breakpoints and generating report...
Error in parsing step 

3 Summary of the read pairs and their flags

  • Read types are classified using pairtools.
  • The internal read classification uses the following nomenclature
    • DE = Clipped reads where the clipped segments are at least as long as the ‘-clip’ parameter
    • IE = Reads that do not define the cluster locus but have their mates mapping to the TE consensus. In other words, these are Repeat Anchored Mates (RAM, as defined by ref-1)
    • TP = Reads that do NOT carry RE ligation motifs
    • FP = Reads that carry RE ligation motif
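
The DE and TP/FP labels above can be illustrated with a minimal sketch. It is not HiTEA's own code: it assumes that clipping is read from soft-clip (S) operations in the CIGAR string, that the MboI ligation junction is GATCGATC, and that the motif may occur anywhere in the read. The function names are hypothetical. IE is not assigned here because it depends on where the mate maps.

```python
import re

LIGATION_MOTIF = "GATCGATC"  # assumption: MboI/DpnII ligation junction
MIN_CLIP = 20                # matches the '-clip 20' setting of this run


def max_clip_length(cigar):
    """Return the longest soft-clipped (S) segment in a CIGAR string."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return max((int(n) for n, op in ops if op == "S"), default=0)


def classify(cigar, seq, min_clip=MIN_CLIP, motif=LIGATION_MOTIF):
    """Return (clip_label, motif_label) for one read.

    clip_label is "DE" when a clipped segment is at least min_clip long,
    otherwise None (IE needs mate information, which is not modelled here).
    motif_label is "FP" when the ligation motif is present, else "TP".
    """
    clip_label = "DE" if max_clip_length(cigar) >= min_clip else None
    motif_label = "FP" if motif in seq else "TP"
    return clip_label, motif_label


print(classify("80M21S", "A" * 101))
print(classify("30S58M", "ACGT" * 10 + "GATCGATC" + "ACGT" * 10))
```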
## ##-----------------------------------------------------------------------------------------------------
## Total read-pairs in the file 598214830
## Total "cis" read-pairs in the file   420474901 (70.29%)
## Total "trans" read-pairs in the file 177374422 (29.65%)
## Linearly Mapped and rescued read-pairs (UU/UR/RU)    463030428 (77.40%)
## Uninformative read-pairs 33115654 (5.54%)
## Low mapping quality read-pairs (MAPQ < 28)   5989176 (1.00%)
## Read-pairs with at least 2 Ns in both the read-sequences 431757 (0.0722%)
## Read-pairs mapping to non-selected chromosomes   2180041 (0.364%)
## Reads with only single mate mapped   105 (0.00%)
## WGS non-chimeric reads   61357746 (10.26%)
## Non-HiC chimeric read pairs  77904671 (13.02%)
## Pairs with 2 RE junctions    258035 (0.043%)
## 
## # Read/Flag-Class    (Inp.)      (Report)    
## # MM 9038343     454748 (5.03%)  
## # MR 17342463        16278877 (93.87%)   
## # MU 37843471        32089694 (84.80%)   
## # NM 835188      101920 (12.20%) 
## # NR 26070824        6654262 (25.52%)    
## # NU 11120295        10037730 (90.26%)   
## # RU 40445331        0 (0.00%)   
## # UR 40865170        0 (0.00%)   
## # UU 382556152       0 (0.00%)   
## # WW 31732086        26860361 (84.65%)   
## 
## # Flag-Class (R1,Inp.)   (R2,Inp.)   (R1,Report) (R2,Report)
## # DE,FP  20953931    20785312    11952718 (57.04%)   11387378 (54.79%)
## # DE,TP  6838240 12367067    6623026 (96.85%)    12066691 (97.57%)
## # IE,FP  55167304    49228441    0 (0.00%)   0 (0.00%)
## # IE,TP  51859420    52438075    31295209 (60.35%)   31843827 (60.73%)
## Total pairs used in the analyses:     92477592 (15.46%)
## Total %flags retained for reporting: 105168849   (39.00%) (out of 269637790)
## ##-----------------------------------------------------------------------------------------------------
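
The two summary percentages at the end of the table can be re-derived from the numbers printed above it. The sketch below only repeats that arithmetic; the variable names are hypothetical and the values are copied from the table.

```python
# Flag-class rows: (R1 input, R2 input, R1 reported, R2 reported)
flag_classes = {
    "DE,FP": (20953931, 20785312, 11952718, 11387378),
    "DE,TP": (6838240, 12367067, 6623026, 12066691),
    "IE,FP": (55167304, 49228441, 0, 0),
    "IE,TP": (51859420, 52438075, 31295209, 31843827),
}

# Sum input and reported flags over both mates of every class
total_input = sum(r1 + r2 for r1, r2, _, _ in flag_classes.values())
total_reported = sum(p1 + p2 for _, _, p1, p2 in flag_classes.values())
retained_pct = 100 * total_reported / total_input

# Share of all read pairs that entered the analysis
total_pairs = 598214830
used_pairs = 92477592
used_pct = 100 * used_pairs / total_pairs

print(total_input, total_reported, f"{retained_pct:.2f}%")  # 269637790 105168849 39.00%
print(f"{used_pct:.2f}%")                                   # 15.46%
```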

4 GNU run logs

  • GNU parallel was used for splitting and piping the input file on all available cores
  • The submitted job logs are displayed below

5 Read-pair orientation QC

  • Characteristic plot depicting mate-pair orientations versus distance between their mapping
  • For more details, refer to [https://data.4dnucleome.org/]
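
As a rough illustration of what such a plot tabulates, the sketch below derives an orientation label and a mapping distance for a cis read pair. It assumes 5' mapping positions and strand signs are already available for both mates. The function name and the example pairs are hypothetical and do not come from this run.

```python
def orientation_and_distance(pos1, strand1, pos2, strand2):
    """Return (orientation, distance) for a cis read pair.

    The mate with the smaller coordinate is listed first, so the
    orientation is one of "++", "+-", "-+" or "--".
    """
    if pos1 <= pos2:
        left, right = strand1, strand2
    else:
        left, right = strand2, strand1
    return left + right, abs(pos2 - pos1)


# Hypothetical pairs: (pos1, strand1, pos2, strand2)
example_pairs = [
    (1000, "+", 1500, "-"),
    (9000, "+", 2000, "-"),
    (500, "+", 80500, "+"),
]
for pair in example_pairs:
    print(orientation_and_distance(*pair))
```

Binning the distances on a log scale and counting each orientation per bin would give the kind of curve described above.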

6 Background modeling information

  • Reads supporting an insertion (i.e. clipped reads at the insertion site and non-clipped reads in the 2kb window) are counted for all putative insertions
  • A set of 100,000 insertion points is randomly selected in the genome and the above exercise is repeated.
  • Based on sequencing depth, the reads supporting insertion at the randomly selected loci are modeled using a negative binomial function
  • Enrichment of reads supporting the insertion is sought for the putative candidates
  • The following figures display:
    • correlation between sequencing depth and reads supporting insertion
    • mean vs standard deviation relationship between the randomly selected loci
    • p-value distribution from the model
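
The enrichment test described above can be sketched as follows. This is a simplified illustration, not HiTEA's implementation: it ignores sequencing depth, fits the negative binomial by the method of moments, and uses a small made-up list of background counts in place of the 100,000 random loci. All names are hypothetical.

```python
import math
from statistics import mean, pvariance


def fit_negative_binomial(counts):
    """Method-of-moments fit; returns (r, p) with mean = r * (1 - p) / p."""
    m = mean(counts)
    v = pvariance(counts)
    if v <= m:
        raise ValueError("variance must exceed the mean for this fit")
    return m * m / (v - m), m / v


def nb_pmf(k, r, p):
    """P(X = k) for a negative binomial with size r and probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log1p(-p))
    return math.exp(log_pmf)


def nb_upper_tail(k, r, p):
    """P(X >= k): chance of seeing k or more supporting reads by chance."""
    return max(0.0, 1.0 - sum(nb_pmf(i, r, p) for i in range(k)))


# Made-up supporting-read counts at randomly selected loci (illustration only)
background = [0, 1, 0, 2, 5, 0, 1, 3, 0, 0, 1, 7, 2, 0, 1, 0, 4, 1, 0, 2]
r, p = fit_negative_binomial(background)

# p-value for a hypothetical candidate supported by 15 reads
p_value = nb_upper_tail(15, r, p)
print(f"r={r:.3f} p={p:.3f} p_value={p_value:.2e}")
```

A candidate whose p-value is small relative to the background would count as enriched. The actual threshold and the depth adjustment are not specified in this report.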

7 Insertion call Summary

  • HiTEA calls are summarized in the table below
  • Status 3: High confidence calls with clip information on both the right- and left-hand sides
  • Status 2: High confidence calls with clip information available on only one side
  • Status 1: Calls overlap with known genomic copies of a given transposable element
  • Status 0: Poor quality calls likely to be false positives (omit them from the analyses)

7.1 High confidence calls

7.2 Omitted clusters

8 Coverage summary along the HiTEA calls

  • Reads supporting the insertion are of two types
    • Clipped reads that define insertion breakpoint
    • Repeat Anchored Mates (RAM)
  • For each high confidence HiTEA call (i.e. status>=1), read coverage for the above read types is displayed.
  • The coverages are grouped by TE
  • The color scale represents the total number of reads on a log2 scale (with a pseudocount of 1 added)
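
The color-scale transform amounts to log2(count + 1). A minimal sketch, with a hypothetical function name:

```python
import math


def log2_coverage(read_count):
    """Map a raw read count to the log2 scale with a pseudocount of 1."""
    return math.log2(read_count + 1)


for n in (0, 1, 7, 1023):
    print(n, log2_coverage(n))
```

The pseudocount keeps positions with zero reads defined, since they map to 0 rather than to negative infinity.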

9 References:

  1. Lee, E., Iskow, R., Yang, L., Gokcumen, O., Haseley, P., Luquette, L. J., … Park, P. J. (2012). Landscape of somatic retrotransposition in human cancers. Science. http://doi.org/10.1126/science.1222077