MicroHapulator Report

Report generated at {{date}},
using MicroHapulator version {{mhpl8rversion}}.

Table of Contents

  1. Read QA/QC
  {% if "r1readlen" in plots %}
  2. Read Merging
  {% endif %}
  3. Read Mapping
  4. Haplotype Calling
  5. Genotype Calling

All statistics presented in this report are aggregated in a single table available at analysis/summary.tsv in the working directory. Full-resolution graphics for each figure are also available in each analysis/{samplename}/ subdirectory.

Read QA/QC

QC reports for the input reads are generated using FastQC. Links to reports for each sample are provided below.

NOTE: FastQC was designed for QC of whole-genome shotgun NGS reads prior to genome assembly. A QC warning or failure for some modules (such as per-base sequence content or sequence duplication levels) is not necessarily a concern with MH reads. Interpret results with care!
{% if "r1readlen" in plots %}
Sample  R1 Report  R2 Report
{% for sample in samples %}
{{sample}}  Click here to open in a new tab  Click here to open in a new tab
{% endfor %}
{% else %}
Sample  Report
{% for sample in samples %}
{{sample}}  Click here to open in a new tab
{% endfor %}
{% endif %} {% if read_length_table is none %} {% if "r1readlen" in plots %}

The following histograms show the distribution of R1 and R2 read lengths for each sample.

{% for r1plot, r2plot in zip(plots["r1readlen"], plots["r2readlen"]) %} {% endfor %} {% else %}

The following histograms show read length distributions for each sample.

{% for rlplot in plots["readlen"] %} {% endfor %} {% endif %} {% else %}

Read lengths, uniform for all samples, are shown below.

{% if "r1readlen" in plots %}
Sample  Length R1  Length R2
{% for i, row in read_length_table.iterrows() %}
{{ row.Sample }}  {{ row.LengthR1 }}  {{ row.LengthR2 }}
{% endfor %}
{% else %}
Sample  Length
{% for i, row in read_length_table.iterrows() %}
{{ row.Sample }}  {{ row.Length }}
{% endfor %}
{% endif %} {% endif %} {% if "r1readlen" in plots %}

Read Merging

Paired-end reads are merged using FLASh.

Sample  Total Reads  Merged Reads  Merge Rate
{% for i, row in summary.iterrows() %}
{{row.Sample}}  {{ "{:,}".format(row.TotalReads) }}  {{ "{:,}".format(row.Merged) }}  {{ "{:.2f}".format(row.MergeRate * 100) }}%
{% endfor %}

The following histograms show the distribution of merged read lengths for each sample.

{% for plot in plots["mergedreadlen"] %} {% endfor %} {% endif %}

Read Mapping

Merged reads are aligned to marker reference sequences using BWA MEM and formatted/sorted using SAMtools. The reads are also aligned to the full human reference genome (entire chromosomes) to help distinguish off-target sequences from other sequences such as contaminants: reads that align to the chromosomes but not to the marker sequences represent off-target sequences, while reads that align to neither are likely contaminants. The reported chi-square statistic measures read coverage imbalance between markers and can be compared among samples sequenced using the same panel: the minimum value of 0 represents perfectly uniform coverage across markers, while the maximum value of D occurs when all reads map to a single marker (D is the degrees of freedom, that is, the number of markers minus 1).
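The stated bounds (0 for uniform coverage, D when every read maps to one marker) are consistent with a Pearson chi-square statistic computed against a uniform expectation and divided by the total read count. The sketch below illustrates that reading. The function name and the normalization are assumptions made for illustration and are not taken from the MicroHapulator source.

```python
def interlocus_chi_square(read_counts):
    """Normalized chi-square for per-marker read counts (illustrative sketch).

    Assumption: expected coverage is uniform across markers, and the raw
    Pearson statistic is divided by the total read count so that the result
    ranges from 0 to D = (number of markers - 1).
    """
    total = sum(read_counts)
    if total == 0 or len(read_counts) < 2:
        raise ValueError("need at least two markers and one mapped read")
    expected = total / len(read_counts)
    chisq = sum((observed - expected) ** 2 / expected for observed in read_counts)
    return chisq / total


print(interlocus_chi_square([25, 25, 25, 25]))  # uniform coverage: 0.0
print(interlocus_chi_square([100, 0, 0, 0]))    # one marker only: 3.0 (D = 4 - 1)
```

Dividing by the total read count is what makes values comparable between samples with different sequencing depth, provided they share the same marker panel.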

Sample  {% if "r1readlen" in plots %}Merged Reads{% else %}Total Reads{% endif %}  Mapped  Mapping Rate  Mapped (Chrom)  Mapping Rate (Chrom)  Chi-square
{% for i, row in summary.iterrows() %}
{{row.Sample}}  {% if "r1readlen" in plots %}{{ "{:,}".format(row.Merged) }}{% else %}{{ "{:,}".format(row.TotalReads) }}{% endif %}  {{ "{:,}".format(row.Mapped) }}  {{ "{:.2f}".format(row.MappingRate * 100) }}%  {{ "{:,}".format(row.MappedFullRefr) }}  {{ "{:.2f}".format(row.MappingRateFullRefr * 100) }}%  {{ "{:.2f}".format(row.InterlocChiSq)}}
{% endfor %}

The following table shows the total number of reads mapped to each marker for each sample. Also shown is the number of off-target reads, determined by selecting the reads aligned to the marker, mapping them to the entire GRCh38 human reference sequence, and counting the reads that map preferentially to a locus outside the target marker locus.

Interpretation Hints

The "Mapping Rate (Chrom)" column in the table above indicates the proportion of the reads in the sample that map anywhere in the human genome. The percentage not mapped represents non-human material (typically contamination) in the sample.

The difference between the "Mapping Rate" column and the "Mapping Rate (Chrom)" column indicates the proportion of reads in the sample that are off-target with respect to the marker panel: their origin is elsewhere in the genome, and their presence is an artifact of an imperfect enrichment strategy.
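The two hints above amount to simple arithmetic on the two mapping rates. A minimal sketch follows, assuming both rates are expressed as fractions between 0 and 1; the function name and the example values are hypothetical.

```python
def mapping_breakdown(mapping_rate, mapping_rate_chrom):
    """Split reads into on-target, off-target, and non-human fractions.

    mapping_rate:       fraction of reads aligned to the marker sequences
    mapping_rate_chrom: fraction of reads aligned anywhere in the genome
    """
    non_human = 1.0 - mapping_rate_chrom
    off_target = mapping_rate_chrom - mapping_rate
    return dict(on_target=mapping_rate, off_target=off_target, non_human=non_human)


# Hypothetical sample: 90% of reads map to markers, 97% map to the genome.
fractions = mapping_breakdown(0.90, 0.97)
for label, value in fractions.items():
    print(label, round(value * 100, 2))
```

For this hypothetical sample, roughly 7% of reads are off-target and roughly 3% are likely non-human.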

In the table below, the "Off Target" columns refer to reads that can and do align to a marker reference sequence, but preferentially map elsewhere when aligned to the entire genome. These reads indicate repetitive content in the corresponding marker, which therefore requires extra scrutiny to ensure that reads from alternate loci are not jeopardizing the haplotype calling process. N/A values in the "Off Target" column indicate that the marker definition file did not include GRCh38 coordinates for all SNPs, which are required for off-target analysis.

Marker  {% for sample in typing_rates.keys() %}{{sample}}  {% endfor %}
{% for i in range(len(mapping_rates)) %}Reads  Off Target  {% endfor %}
{% for markername in markernames %}
{{markername}}  {% for sample_data in mapping_rates.values() %}{{"{:,d}".format(sample_data.loc[markername, 'ReadCount'])}}  {% if isna(sample_data.loc[markername, 'OffTargetReads']) %}N/A{% else %}{{"{:,d}".format(sample_data.loc[markername, 'OffTargetReads'])}}{% endif %}  {% endfor %}
{% endfor %}

The following histograms show the interlocus balance for each sample.

{% for plot in plots["locbalance"] %} {% endfor %}

Haplotype Calling

Haplotypes are called empirically on a per-read basis using mhpl8r type. Reads that span all SNPs of interest in the corresponding marker are examined; all other reads are discarded. The haplotype tallies represent a typing result for each sample.
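The per-read logic described above can be sketched as follows. This is an illustration only and not the implementation of mhpl8r type: it assumes ungapped alignments represented as hypothetical (start, sequence) pairs with 0-based coordinates, and it ignores base qualities, indels, and soft clipping.

```python
from collections import Counter


def tally_haplotypes(reads, snp_positions):
    """Count haplotypes observed in reads that span every SNP of a marker.

    reads:         iterable of (start, sequence) pairs (hypothetical format)
    snp_positions: 0-based SNP offsets within the marker reference
    """
    tallies = Counter()
    first, last = min(snp_positions), max(snp_positions)
    for start, sequence in reads:
        end = start + len(sequence)  # exclusive end coordinate
        if start > first or end <= last:
            continue  # read does not span all SNPs, so it is discarded
        alleles = [sequence[pos - start] for pos in snp_positions]
        tallies[",".join(alleles)] += 1
    return dict(tallies)


reads = [
    (0, "ACGTACGT"),  # spans both SNPs
    (0, "ACGTACGT"),  # spans both SNPs
    (2, "GTACGT"),    # starts after the first SNP: discarded
    (0, "ACGTA"),     # ends before the last SNP: discarded
    (0, "ATGTACGA"),  # spans both SNPs, different alleles
]
print(tally_haplotypes(reads, [1, 7]))
```

Each key in the result is one observed haplotype (the comma-separated alleles at the SNPs of interest), and each value is the number of reads supporting it.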

Sample  Mapped Reads  Typed Reads  Typing Success Rate
{% for i, row in summary.iterrows() %}
{{row.Sample}}  {{ "{:,}".format(row.Mapped) }}  {{ "{:,}".format(row.Typed) }}  {{ "{:.2f}".format(row.TypingRate * 100) }}%
{% endfor %}

The table above shows the aggregate typing rate across all markers. The table below shows the typing rate per marker.

Marker  {% for sample in typing_rates.keys() %}{{sample}}  {% endfor %}
{% for markername in markernames %}
{{markername}}  {% for sample_data in typing_rates.values() %}{{"{:,.2f}".format(sample_data.loc[markername, 'TypingRate']*100)}}%  {% endfor %}
{% endfor %}

Genotype Calling

Two types of thresholds are applied to each typing result using mhpl8r filter to discriminate between true MH alleles (haplotypes) and false alleles resulting from sequencing error or other artifacts. A static detection threshold, based on a fixed number of reads, is used to filter out low-level noise. A dynamic analytical threshold, based on a percentage of the total reads at the locus (after removing alleles that fail the detection threshold), accounts for fluctuations in the depth of coverage between loci, samples, and runs, and can filter out higher-level noise in most cases. The haplotype tallies, after all filters have been applied, represent the genotype call for that sample.
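The two-stage filtering described above can be illustrated with a short sketch. It is not the code behind mhpl8r filter: the function name, the default threshold values, and the choice of inclusive comparisons are all assumptions made for illustration.

```python
def apply_thresholds(tallies, static=5, dynamic=0.02):
    """Filter the haplotype tallies of one marker (illustrative sketch).

    tallies: mapping of haplotype -> read count
    static:  detection threshold, a fixed read count (hypothetical default)
    dynamic: analytical threshold, a fraction of the reads that remain at
             the locus after the detection filter (hypothetical default)
    """
    # Stage 1: drop haplotypes below the fixed detection threshold.
    detected = {hap: n for hap, n in tallies.items() if n >= static}
    # Stage 2: the analytical cutoff scales with the remaining coverage.
    cutoff = dynamic * sum(detected.values())
    return {hap: n for hap, n in detected.items() if n >= cutoff}


tallies = {"A,C": 500, "G,T": 480, "A,T": 12, "G,C": 3}
print(apply_thresholds(tallies, static=5, dynamic=0.02))
# prints {'A,C': 500, 'G,T': 480}
```

In this example the haplotype with 3 reads fails the detection threshold, and the haplotype with 12 reads fails the analytical threshold (2% of the remaining 992 reads is 19.84), leaving a heterozygous call.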

Thresholds were applied to each marker as shown in the table below. Click on marker names in the table to see plots of haplotype tallies and thresholds in the marker detail report.

Marker  Detection  Analytical
{% for marker, row in thresholds.iterrows() %}
{{marker}}  {{row.Static}}  {{"{:.1f}".format(row.Dynamic * 100)}}%
{% endfor %}

For single-source samples, we expect the two alleles at heterozygous loci to have roughly even abundance. The following plots show the relative abundance of the major and minor allele for each marker with a heterozygous genotype (markers are sorted by absolute combined abundance, which is printed above each pair of allele counts). For known DNA mixtures, these plots can be safely ignored. But for suspected single-source samples, if there is substantial imbalance between major and minor allele counts at numerous loci, the sample should be examined more closely for the presence of a minor DNA contributor.
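The quantity plotted for each heterozygous marker can be summarized as a minor-to-major allele ratio. The sketch below assumes filtered genotype calls stored as a hypothetical mapping of marker name to haplotype counts, and it assumes markers are listed from highest to lowest combined abundance; neither detail is taken from the MicroHapulator source.

```python
def het_balance(genotypes):
    """Summarize allele balance at heterozygous markers (illustrative sketch).

    genotypes: mapping of marker name -> mapping of haplotype -> read count
    Returns tuples (marker, major, minor, combined, ratio), sorted by
    combined read count in descending order (the order is an assumption).
    """
    rows = []
    for marker, tallies in genotypes.items():
        if len(tallies) != 2:
            continue  # only heterozygous genotypes with two alleles qualify
        minor, major = sorted(tallies.values())
        rows.append((marker, major, minor, major + minor, minor / major))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


genotypes = dict(
    mh01=dict(hapA=500, hapB=480),   # well balanced
    mh02=dict(hapA=900),             # homozygous: skipped
    mh03=dict(hapA=1200, hapB=300),  # imbalanced
)
for marker, major, minor, combined, ratio in het_balance(genotypes):
    print(marker, major, minor, combined, round(ratio, 2))
```

A ratio near 1 indicates balanced alleles. Low ratios at numerous markers in a suspected single-source sample are the pattern that warrants a closer look for a minor contributor.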

{% for plot in plots["hetbalance"] %} {% endfor %}