Empirical accuracy bounds for next-generation sequencing variant calling workflows

Stephens, Zachary Daniel

Permalink

https://hdl.handle.net/2142/78801

Description

Title

Empirical accuracy bounds for next-generation sequencing variant calling workflows

Author(s)

Stephens, Zachary Daniel

Issue Date

2015-05-01

Department of Study

Electrical and Computer Engineering

Discipline

Electrical and Computer Engineering

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Date of Ingest

2015-07-22T22:46:09Z

Keyword(s)

Next-Generation Sequencing (NGS) Accuracy Benchmarking
Next-Generation Error Analysis Toolkit (NEAT)
Next-Generation Sequencing (NGS) Accuracy Bounds

Abstract

This thesis investigates the accuracy bounds imposed on alignment-based variant calling workflows due to inherent uncertainties introduced by sequencing platforms. In this work we will use simulated data to empirically quantify the maximum performance that can be expected for alignment and variant detection accuracy in a workflow. Short read sequencers are inherently incapable of producing reads that can be uniquely mapped to every position of the human reference genome, so errors are inevitable. We will analyze the repetitive content of several organisms, and estimate the maximum attainable alignment accuracy as a function of read length. Additionally, we will show that paired-end sequencing with large insert sizes (also referred to as "mate-pair" sequencing) is capable of mapping >99% of the human genome. We have developed a set of tools, NEAT (Next-generation Error Analysis Toolkit), which we use to create fault-injected genomic datasets. Our experiments utilize these datasets to showcase how the behavior of BWA and GATK workflows changes as a function of read lengths, error rates, quality scores, error types, and mutation types. We utilize these results to quantify the performance gains that can be expected by altering these properties of an NGS dataset. Our results highlight the sensitivity of alignment software to read lengths and error rates, and the sensitivity of variant callers to quality scores and structural variation.

Graduation Semester

2015-5

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering
