Referee perfomance tests
J. sinuosa transcriptome
The J. sinuosa reads for vegetative tissues were downloaded with fastq-dump from NCBI SRA SRX2676125. These reads were assembled with Trinity using default parameters:
Trinity --seqType fq --max_memory 100G --left SRR5380902_1.fastq --right SRR5380902_2.fastq --CPU 8
The reads were mapped back to the resulting transcriptome assembly:
bwa mem Trinity.fasta SRR5380902_1.fastq SRR5380902_2.fastq > SRR5380902.sam
A few additional steps were performed: 1) create the sequence dictionary to add to the SAM header because this output lacked the @HD line needed later on, 2) Add the header from the dict, 3) Convert to BAM, 4) Sort BAM.
Next, genotype likelihoods were calculated with ANGSD:
angsd -GL 2 -i SRR5380902-sorted.bam -ref Trinity.fasta -minQ 0 -doGlf 4 -out jsinuosa-angsd
And reference quality scores calculated with Referee:
referee.py -gl jsinuosa-angsd.glf.gz -ref Trinity.fasta --fastq --correct -o jsinuosa-referee
Additionally, two pileup files were created from the BAM file to test Referee’s genotype likelihood calculations. One with mapping quality and one without (the -s flag controls mapping quality output):
samtools mpileup -d 999999999 -f <reference.fa> -Q 0 -s -o jsinuosa-mq.pileup SRR5380902-sorted.bam
The pileup was used by Referee to calculate genotype likelihoods and quality scores:
referee.py -gl jsinuosa-mq.pileup -ref Trinity.fasta --pileup --fastq --correct -o jsinuosa-referee-glcalcs-mq
Here are some statistics from this process:
Description | Value |
---|---|
File size of assembled transcriptome | 53.96 MB |
# transcripts | 55,567 |
# bases | 50,283,993 |
Size of sorted BAM file | 1.01 GB |
Average read depth | 19.53 |
Time to run ANGSD | 7.98 minutes |
Sites with ANGSD calcs | 48,791,504 |
Time to run Referee (1 proc) | 18.69 minutes |
Size of Referee output (tab) | 1.67 GB |
Size of Referee output (fastq) | 124.97 MB |
Referee run time on J. sinuosa data
Figure 1: Referee runtime on J. sinuosa data with 3 different genotype likelihood methods

Referee memory usage on J. sinuosa data
Figure 2: Referee max memory use on J. sinuosa data with 3 different genotype likelihood methods

Comparing scores from different genotype likelihood methods on J. sinuosa data
Calculation method | \(L_{mismatch}\) = 0 (\(R_Q\) = 91) | Reference is N (\(R_Q\) = -1) | No reads mapped (\(R_Q\) = -2) |
---|---|---|---|
ANGSD | 1646845 | 0 | 1487227 |
Pileup without MQ | 1648828 | 0 | 1478554 |
Pileup with MQ | 4113594 | 0 | 1478554 |
Referee score distributions and read depth correlations
Figure 3: Referee score 1 distribution on J. sinuosa

Figure 4: Referee score 1 vs. read depth

Figure 5: ANGSD score 1 vs. Pileup score 1

Figure 5: ANGSD score 1 vs. Pileup score 2

Figure 6: Pileup score 1 vs. Pileup score 2

Using mapping quality
Figure 7: Referee score 1 distribution on J. sinuosa with mapping quality

Figure 8: Referee score 1 with mapping quality vs. read depth

Figure 9: Pileup score 1 NO mapping quality vs. Pileup score 1 WITH mapping quality

Owl monkey genome
The owl monkey reference genome was obtained from NCBI GCA_000952055.2_Anan_2.0 and the reads were mapped back to this assembly by colleagues at Baylor for the Owl monkey pedigree project. The BAM file was split by scaffold for parallelizaiton of ANGSD calcs. This splitting shouldn’t effect the Referee performance, though it means I don’t have genome-wide score distributions.
Here are some statistics from this genome:
Description | Value |
---|---|
File size of assembled genome | 2.90 GB |
# scaffolds | 22,922 |
# bases | 2,861,668,348 |
Size of sorted BAM file | 561.67 GB |
Average read depth | 152.988 |
Time to run ANGSD | 54.38 minutes |
Sites with ANGSD calcs | 2,664,619,299 |
Time to run Referee (1 proc) | 17.33 hours |
Size of Referee output (tab) | 7.6 GB |
Size of Referee output (fastq) | 608 MB |
\(L_{mismatch}\) = 0 (\(R_Q\) = 91) | 2,385,493,538 |
Reference is N (\(R_Q\) = -1) | 173,395 |
No reads mapped (\(R_Q\) | 197,049,049 |
Referee run time on Owl monkey and J. sinuosa data
Figure 12: Referee runtime on J. sinuosa and owl monkey data with 3 different genotype likelihood methods

Referee memory usage on Owl monkey and J. sinuosa data
Figure 13: Referee max memory use on J. sinuosa and owl monkey data with 3 different genotype likelihood methods
