Abstract:
Background: High-throughput whole-genome sequencing facilitates investigation of
minority virus sub-populations from virus-positive samples. Minority variants are useful
in understanding within- and between-host diversity and population dynamics, and can
potentially assist in elucidating person-to-person transmission pathways. Several minority
variant callers have been developed to describe low-frequency sub-populations from
whole-genome sequence data. These callers differ in the bioinformatics and statistical
methods used to discriminate sequencing errors from low-frequency variants. Methods:
We evaluated the diagnostic performance of, and concordance between, published minority
variant callers used to identify minority variants from whole-genome sequence data
from virus samples. We used the ART-Illumina read simulation tool to generate three
artificial short-read datasets of varying coverage and error profiles from a respiratory
syncytial virus (RSV) reference genome. The datasets were spiked with nucleotide variants
at predetermined positions and frequencies. Variants were called using FreeBayes, LoFreq,
VarDict, and VarScan2. The variant callers' agreement in identifying known variants was
quantified using two measures: concordance accuracy and inter-caller concordance. Results:
The variant callers differed in the minority variants they identified from the datasets.
Concordance accuracy and inter-caller concordance were positively correlated with
sample coverage. FreeBayes identified the majority of variants, although it was
characterised by variable sensitivity and precision and by a false positive rate that was
high relative to the other minority variant callers and varied with sample coverage.
LoFreq was the most conservative caller. Conclusions: We conducted a performance and
concordance evaluation of four minority variant calling tools used to identify and
quantify low-frequency variants. Inconsistency in the quality of sequenced samples
affects the sensitivity and accuracy of minority variant callers. Our study suggests that
combining at least three tools is useful in filtering errors when calling low-frequency
variants.
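
The idea of combining callers can be illustrated with a minimal sketch. It assumes each caller's output has already been reduced to a set of (position, alternate base) pairs and that the spiked-in variants form the truth set. The function names, the toy data, and the use of plain sensitivity and precision are illustrative assumptions; they are not the study's pipeline, its results, or its exact definitions of concordance accuracy and inter-caller concordance.

```python
from collections import Counter


def consensus_calls(calls_by_caller, min_callers=3):
    """Return the variants reported by at least `min_callers` callers."""
    counts = Counter(
        variant for calls in calls_by_caller.values() for variant in set(calls)
    )
    return {variant for variant, n in counts.items() if n >= min_callers}


def sensitivity_precision(called, truth):
    """Compare a call set with the known (spiked-in) variants."""
    true_positives = len(called & truth)
    sensitivity = true_positives / len(truth) if truth else 0.0
    precision = true_positives / len(called) if called else 0.0
    return sensitivity, precision


# Hypothetical toy data, NOT results from the study.
# Each variant is keyed as (genome position, alternate base).
truth = {(100, "A"), (250, "T"), (900, "G")}
calls_by_caller = {
    "FreeBayes": {(100, "A"), (250, "T"), (900, "G"), (40, "C"), (512, "T")},
    "LoFreq": {(100, "A"), (250, "T")},
    "VarDict": {(100, "A"), (250, "T"), (900, "G"), (40, "C")},
    "VarScan2": {(100, "A"), (900, "G"), (512, "T")},
}

# Keep only variants supported by at least three of the four callers.
consensus = consensus_calls(calls_by_caller, min_callers=3)

for name, calls in calls_by_caller.items():
    print(name, sensitivity_precision(calls, truth))
print("consensus", sensitivity_precision(consensus, truth))
```

In this toy example the two false positives are each supported by only two callers, so the three-caller threshold removes them while retaining every spiked-in variant. Real data need not behave this cleanly, and a stricter threshold can also discard true variants that conservative callers miss.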