Entropy-based machine learning algorithms applied to genomics and pattern recognition

Moon, Wooyoung

Content Files

MOON-DISSERTATION-2019.pdf

Permalink

https://hdl.handle.net/2142/104838

Description

Title

Entropy-based machine learning algorithms applied to genomics and pattern recognition

Author(s)

Moon, Wooyoung

Issue Date

2019-04-16

Director of Research (if dissertation) or Advisor (if thesis)

Song, Jun S.

Doctoral Committee Chair(s)

Dahmen, Karin

Committee Member(s)

Kuehn, Seppe
Draper, Patrick

Department of Study

Physics

Discipline

Physics

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2019-08-23T19:51:55Z

Keyword(s)

Machine Learning, Decision Trees, Convolutional Filters, Genomics, Cancer, Entropy

Abstract

Transcription factors (TFs) are proteins that interact with DNA to regulate the transcription of DNA to RNA and play key roles in both healthy and cancerous cells. Thus, gaining a deeper understanding of the biological factors underlying TF binding specificity is important for understanding the mechanism of oncogenesis. As large biological datasets become more readily available, machine learning (ML) algorithms have proven to be an important and useful set of tools for cancer researchers. However, there remain many areas for potential improvement in these ML models, including a higher degree of model interpretability and overall accuracy. In this thesis, we present decision tree (DT) methods applied to DNA sequence analysis that result in highly interpretable and accurate predictions. We propose a boosted decision tree (BDT) model using the binary counts of important DNA motifs to predict the binding specificity of TFs belonging to the same protein family and binding similar DNA sequences. We then introduce a novel application of Convolutional Decision Trees (CDT) and demonstrate that this approach has distinct advantages over the BDT model while still accurately predicting the binding specificity of TFs. The CDT models are trained using the Cross Entropy (CE) optimization method, a Monte Carlo optimization method based on concepts from information theory related to statistical mechanics. We then further study the CDT model as a general pattern recognition and transfer learning technique and demonstrate that this approach can learn translationally invariant patterns that lead to high classification accuracy while remaining more interpretable and learning higher quality convolutional filters compared to convolutional neural networks (CNNs).

Graduation Semester

2019-05

Type of Resource

text

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Physics

Dissertations in Physics
