Improving neural networks for biological sequence analysis through problem-specific customizations

Ramachandran, Anand

Permalink

https://hdl.handle.net/2142/124132

Description

Title

Improving neural networks for biological sequence analysis through problem-specific customizations

Author(s)

Ramachandran, Anand

Issue Date

2024-01-22

Director of Research (if dissertation) or Advisor (if thesis)

Chen, Deming

Doctoral Committee Chair(s)

Chen, Deming

Committee Member(s)

Lumetta, Steven S
Iyer, Ravishankar
Hasegawa-Johnson, Mark

Department of Study

Electrical & Computer Eng

Discipline

Electrical & Computer Engr

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Deep Learning
Large Language Models
Biological Sequence Analysis
Protein
DNA

Abstract

The growth of sequencing technologies has made it possible to obtain large numbers of biological sequences, such as proteins, RNA, and DNA, at the resolution of individual amino acids and nucleotides. Concurrently, computer science has seen the swift development of Deep Learning systems that can ingest large quantities of data, resulting in large models capable of complex analytics that were previously not possible. Deep Learning and Deep Neural Networks have consequently been applied to the analysis of biological sequences, including problems such as protein binding prediction, DNA variant calling, and generation of protein sequences. Frequently, however, these Deep Learning solutions are based on neural network architectures and training methods that were first developed elsewhere, and the techniques are applied to bioinformatics without any problem- or data-specific customizations.
This dissertation introduces bioinformatics-specific customizations for deep neural networks and develops solutions for common biological sequence analysis problems based on these customizations. The customizations span three areas: (i) probabilistic graphical modeling using neural networks, (ii) the neural network architecture, and (iii) training algorithms that align the neural network's objectives with its operational settings. These techniques are applied to three problems, namely modeling of DNA-protein binding, DNA variant calling, and advance forecasting of viral protein sequences in the midst of a viral pandemic.
Traditional methods for DNA-protein binding carefully convert a description of the biological or chemical interactions involved in protein binding into mathematical formulations. These methods are outperformed by black-box Deep Neural Network models, which take a sequence as input and directly produce the binding preference as output. This dissertation examines how to leverage the strengths of both approaches to model binding, which results in a new probabilistic graphical model termed the Long Short-term Graphical Model (LSGM). The model can learn long-term structure in the data, thanks to Deep Neural Networks, and at the same time can explicitly model the dynamics of protein binding, thanks to its links to the traditional approaches. The LSGM outperforms previous leading Deep Learning methods for four of the five proteins that were empirically evaluated.
Deep Neural Networks have also been applied to DNA variant calling. Arguably, the best among these approaches is DeepVariant, which repurposes an image-processing Deep Neural Network for variant calling by converting sequencing data to images. This dissertation introduces HELLO, a novel approach to variant calling. HELLO uses a Deep Neural Network architecture that is specifically tailored to the characteristics of genome sequencing data, rather than converting the data to another format. The new architecture results in a much smaller Deep Neural Network that outperforms DeepVariant when trained on the same data. For example, HELLO reduces the number of indel call errors by up to 18%, 55%, and 65% for Illumina, PacBio, and hybrid Illumina-PacBio variant calling respectively, compared to a similarly trained DeepVariant pipeline. In these cases, the HELLO models are between 7 and 14 times smaller.
Forecasting future viral protein sequences during a viral pandemic is a useful task that supports advance preparation. Protein generation models based on Deep Learning exist in the literature. These models learn the distribution of known data, whereas forecasting requires learning the distribution of future sequences, which differs in some respects from that of known sequences. This dissertation proposes a novel training approach called PandoGen, which introduces a finetuning step for protein generation models that does not depend exclusively on available training data and teaches the models to forecast unknown, highly potent sequences. PandoGen is applied to the problem of modeling the SARS-CoV-2 Spike protein sequence. PandoGen forecasts 2x as many future sequences which are 5x as infectious compared to a model that is 30x larger. PandoGen forecasts tens of novel lineages, whereas competing methods forecast almost none. PandoGen also forecasts important variants of the virus up to a month ahead of time.
Based on these applications, two guidelines for designing Deep Neural Networks for biological sequence analysis are suggested. First, when a traditional, principled approach to a problem already exists, it is good practice to retain the structure of that solution and build a neural network around it. Second, when the solution being pursued has no traditional counterpart and is primarily enabled by the advanced representation capabilities of large-parameter neural networks, the model architecture itself may be retained, but it is beneficial to examine ways to align the training objectives with the true goals of the modeling problem, rather than depend on the standard practice of training the model on a predefined training set. It is hoped that the solutions presented in this dissertation contribute to building more capable biological sequence models and add to the toolkit that bioinformaticians have at their fingertips.

Graduation Semester

2024-05

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
