Algorithms for genomic variant identification and interpretation
Zhang, Chuanyi
This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with their NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.
Permalink
https://hdl.handle.net/2142/120362
Description
- Title
- Algorithms for genomic variant identification and interpretation
- Author(s)
- Zhang, Chuanyi
- Issue Date
- 2023-04-14
- Director of Research (if dissertation) or Advisor (if thesis)
- El-Kebir, Mohammed
- Ochoa, Idoia
- Doctoral Committee Chair(s)
- El-Kebir, Mohammed
- Ochoa, Idoia
- Committee Member(s)
- Milenkovic, Olgica
- Koyejo, Oluwasanmi
- Shomorony, Ilan
- Chia, Nicholas
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- High-throughput sequencing
- Variant identification
- Intra-tumor heterogeneity
- Single nucleotide variants
- Variant filtering
- Copy number aberrations
- Graphical model
- Machine learning
- Pre-trained models
- Language model
- Abstract
- High-throughput sequencing has revolutionized biology and medicine through its ability to generate enormous amounts of genomic sequencing data in a short time. However, this abundance of data also makes it challenging to identify genomic variants accurately, a step that underlies the downstream analyses that interpret the impact of detected variants. In this thesis, we address four problems in genomic variant identification and interpretation and develop a method for each.
A major challenge for accurate variant identification in tumors is that each tumor consists of multiple groups (or clones) of cells, with the cells in each group sharing a distinct set of variants. This heterogeneity makes it hard to detect low-frequency variants in sequencing data and to distinguish them from sequencing artifacts. Despite the increasing availability of multi-sample tumor DNA sequencing data, which holds the potential for more accurate variant calling, there is a lack of high-sensitivity multi-sample callers that can utilize these data. The first method we propose, Moss, identifies low-frequency single nucleotide variants (SNVs) that recur in multiple sequencing samples from the same tumor. Moss is a lightweight tool that can be used to extend existing single-sample SNV callers to support multiple samples. We show that Moss identifies new variants that were missed by previous methods, hence improving recall, while maintaining high precision on a simulated dataset and three real tumor datasets. This improved sensitivity enables more accurate downstream cancer genomics analyses.
After variant calling, it is standard to perform a filtering step, as the output of current variant identification pipelines still contains many incorrectly called variants. Previous variant filtering methods rely on user-dependent filtering criteria or suffer from long running times. The second method we introduce, VEF, is a variant filtering tool that uses decision tree ensemble methods to identify and remove incorrectly called variants in genomic analysis pipelines. VEF treats filtering as a supervised learning problem by training on Genome in a Bottle (GIAB) curated reference variant call sets. Our results show that VEF consistently outperforms previous methods on whole genome sequencing datasets and is robust to missing features and to differences in coverage and sequencing pipelines.
A crucial interpretation task is understanding the evolutionary process of tumors. The third method we present, Phertilizer, focuses on reconstructing tumor evolutionary history. Ultra-low coverage single-cell DNA sequencing (scDNA-seq) makes it possible to study copy number aberrations (CNAs) at high resolution and is ideal for inferring evolutionary history, since it captures the variants in each cell rather than a mixture of different genotypes. However, the sparsity of coverage makes it unsuitable for SNVs, and no current methods use both SNVs and CNAs to infer the clonal tree of a tumor. Phertilizer employs a probabilistic model that recursively partitions the data by identifying key evolutionary events in the history of the tumor. We demonstrate on simulations and two real datasets that Phertilizer uncovers clonal structure and genotypes more accurately than previous methods.
Interpreting the effect of identified variants and their impact on disease is another important downstream analysis. The fourth method we propose, MutBERT, is a machine learning method that enables accurate effect prediction. Large-scale pre-trained models (PTMs) are deep neural networks that have been trained on vast amounts of data to learn the underlying structure of the data source, such as natural language or images. PTMs have achieved close to optimal performance on natural language understanding tasks, and their advantage also extends to biological sequences such as proteins and DNA. Although PTMs for DNA sequences are effective for predicting genomic features, they are limited in encoding genomic variants. MutBERT is a method dedicated to genomic variants that incorporates Siamese networks and multi-task learning. MutBERT provides variant-aware embeddings, and we demonstrate improved accuracy on classification tasks from two real variant databases.
- Graduation Semester
- 2023-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Chuanyi Zhang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois