Algorithms for genomic variant identification and interpretation

Zhang, Chuanyi

Permalink

https://hdl.handle.net/2142/120362

Description

Title

Algorithms for genomic variant identification and interpretation

Author(s)

Zhang, Chuanyi

Issue Date

2023-04-14

Director of Research (if dissertation) or Advisor (if thesis)

El-Kebir, Mohammed
Ochoa, Idoia

Doctoral Committee Chair(s)

El-Kebir, Mohammed
Ochoa, Idoia

Committee Member(s)

Milenkovic, Olgica
Koyejo, Oluwasanmi
Shomorony, Ilan
Chia, Nicholas

Department of Study

Electrical and Computer Engineering

Discipline

Electrical and Computer Engineering

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

High-throughput Sequencing
Variant Identification
Intra-tumor Heterogeneity
Single Nucleotide Variants
Variant Filtering
Copy Number Aberrations
Graphical Model
Machine Learning
Pre-trained Models
Language Model

Language

eng

Abstract

High-throughput sequencing has revolutionized biology and medicine through its ability to generate enormous amounts of genomic sequencing data in a short time. However, the abundance of sequencing data also makes it challenging to identify genomic variants accurately, and variant identification is the basis for downstream analyses that interpret the impact of detected variants. In this thesis, we pose four problems in genomic variant identification and interpretation and develop four methods to solve them.

A major challenge for accurate variant identification in tumors is that each tumor consists of various groups (or clones) of cells, each group sharing a distinct set of variants. This heterogeneity makes it hard to detect low-frequency variants in sequencing data and to distinguish them from sequencing artifacts. Despite the increasing availability of multi-sample tumor DNA sequencing data, which holds the potential for more accurate variant calling, high-sensitivity multi-sample callers that can use these data are lacking. The first method we propose, Moss, identifies low-frequency single nucleotide variants (SNVs) that recur in multiple sequencing samples from the same tumor. Moss is a lightweight tool that can be used to extend existing single-sample SNV callers to support multiple samples. We show that Moss identifies new variants that were missed by previous methods, hence improving recall, while maintaining high precision on a simulated dataset and three real tumor datasets. This improved sensitivity enables more accurate downstream cancer genomics analyses.

After variant calling, it is standard to perform a filtering step, as the output of current variant identification pipelines still contains many incorrectly called variants. Previous variant filtering methods rely on user-dependent filtering criteria or suffer from long running times. The second method we introduce, VEF, is a variant filtering tool that uses decision tree ensemble methods to identify and remove incorrectly called variants in genomic analysis pipelines. VEF treats filtering as a supervised learning problem by training on Genome in a Bottle (GIAB) curated reference variant call sets. Our results show that VEF consistently outperforms previous methods on whole genome sequencing datasets and is robust to missing features and to differences in coverage and sequencing pipelines.

A crucial interpretation task is understanding the evolutionary process of tumors. The third method we present, Phertilizer, focuses on reconstructing tumor evolutionary history. Ultra-low coverage single-cell DNA sequencing (scDNA-seq) makes it possible to study copy number aberrations (CNAs) at high resolution and is ideal for inferring evolutionary history, since it captures the variants in each cell instead of a mixture of different genotypes. However, the sparsity of coverage makes it unsuitable for SNVs, and no current methods use both SNVs and CNAs to infer the clonal tree of a tumor. Phertilizer employs a probabilistic model that recursively partitions the data by identifying key evolutionary events in the history of the tumor. We demonstrate on simulations and two real datasets that Phertilizer uncovers clonal structure and genotypes more accurately than previous methods.

Interpreting the effect of identified variants and their impact on disease is another important downstream analysis. We propose the fourth method, MutBERT, a machine-learning method that enables accurate effect prediction. Large-scale pre-trained models (PTMs) are deep neural networks that have been trained on vast amounts of data to learn the underlying structure of the data source, such as natural language or images. PTMs have achieved close to optimal performance on natural language understanding tasks, and their advantages also extend to biological sequences such as proteins and DNA. Although PTMs for DNA sequences are effective for predicting genomic features, they are limited in encoding genomic variants. MutBERT is a method dedicated to genomic variants that incorporates Siamese networks and multi-task learning. MutBERT provides variant-aware embeddings, and we demonstrate improved accuracy on classification tasks from two real variant databases.
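
The abstract's point that a low-frequency variant is hard to tell apart from sequencing artifacts in one sample, yet becomes more convincing when it recurs across samples of the same tumor, can be illustrated with a small calculation. The sketch below is not Moss's model, which the abstract does not specify. It assumes a simple error-only binomial model with a single hypothetical per-base error rate shared by all samples, and the function names and read counts are made up for illustration.

```python
from math import comb


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def artifact_tail_probs(samples, error_rate=0.01):
    """Tail probabilities of the observed alt-read counts under an error-only model.

    samples: list of (alt_reads, depth) pairs, one per sequencing sample.
    error_rate: assumed per-read error rate (hypothetical, shared by all samples).
    Returns (per_sample, pooled): one tail probability per sample, and one for
    the reads of all samples pooled together.
    """
    per_sample = [binom_tail(alt, depth, error_rate) for alt, depth in samples]
    total_alt = sum(alt for alt, _ in samples)
    total_depth = sum(depth for _, depth in samples)
    pooled = binom_tail(total_alt, total_depth, error_rate)
    return per_sample, pooled


# Hypothetical site: 3 alt reads out of 100 in each of three samples.
samples = [(3, 100), (3, 100), (3, 100)]
per_sample, pooled = artifact_tail_probs(samples)

for index, prob in enumerate(per_sample, start=1):
    print(f"sample {index}: P(>= observed alt reads | errors only) = {prob:.4f}")
print(f"pooled:   P(>= observed alt reads | errors only) = {pooled:.4f}")
```

Under these assumptions, each sample on its own is only weak evidence against the error-only explanation, while the pooled count is much less likely to arise from errors alone. Pooling treats reads as independent with a common error rate, which real data may violate (for example, systematic artifacts that recur at the same site), so this is a motivation for multi-sample calling and not a substitute for a calibrated caller.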

Graduation Semester

2023-05

Type of Resource

Thesis

Handle URL

https://hdl.handle.net/2142/120362

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Electrical and Computer Engineering

Dissertations and Theses in Electrical and Computer Engineering
