This thesis investigates the accuracy bounds imposed on alignment-based variant calling workflows by the inherent uncertainties that sequencing platforms introduce. We use simulated data to empirically quantify the maximum alignment and variant detection accuracy that can be expected of such a workflow.
Short-read sequencers are inherently incapable of producing reads that can be uniquely mapped to every position of the human reference genome, so some errors are inevitable. We analyze the repetitive content of several organisms and estimate the maximum attainable alignment accuracy as a function of read length. Additionally, we show that paired-end sequencing with large insert sizes (also referred to as "mate-pair" sequencing) is capable of mapping >99% of the human genome.
We have developed a set of tools, NEAT (Next-generation Error Analysis Toolkit), which we use to create fault-injected genomic datasets. Our experiments use these datasets to show how the behavior of BWA and GATK workflows changes as a function of read length, error rate, quality scores, error types, and mutation types. We use these results to quantify the performance gains that can be expected from altering these properties of an NGS dataset. Our results highlight the sensitivity of alignment software to read lengths and error rates, and the sensitivity of variant callers to quality scores and structural variation.
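
The dependence of mappability on read length can be illustrated with a minimal sketch. It is not the method used in the thesis or in NEAT. It assumes error-free, forward-strand reads and counts the fraction of start positions whose read-length substring occurs exactly once in a sequence. The function name `unique_fraction` and the toy sequence are hypothetical, chosen only for illustration.

```python
from collections import Counter


def unique_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose read_length-mer occurs exactly once.

    Assumption: reads are error-free and only the forward strand is
    considered, so this is an optimistic upper bound on unique mappability.
    """
    n_positions = len(genome) - read_length + 1
    if read_length <= 0 or n_positions <= 0:
        raise ValueError("read_length must be between 1 and len(genome)")

    # Count how many times each substring of the given length appears.
    counts = Counter(genome[i:i + read_length] for i in range(n_positions))

    # A substring seen once corresponds to exactly one mappable position.
    unique_positions = sum(1 for c in counts.values() if c == 1)
    return unique_positions / n_positions


if __name__ == "__main__":
    toy_genome = "ACGTACGTTTGACGTACCA"  # hypothetical toy sequence
    for k in (2, 4, 6, 8):
        print(k, unique_fraction(toy_genome, k))
```

On a repetitive sequence the fraction rises as the read length grows past the length of the repeats, which is the qualitative effect described above. A real analysis would also need to account for reverse complements, sequencing errors, and memory limits at genome scale.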