Analysis and applications of k-mer based methods in bioinformatics

Greenberg, Grant C

Analysis and applications of k-mer based methods in bioinformatics

Greenberg, Grant C

Permalink

https://hdl.handle.net/2142/121999

Description

Title

Analysis and applications of k-mer based methods in bioinformatics

Author(s)

Greenberg, Grant C

Issue Date

2023-11-28

Director of Research (if dissertation) or Advisor (if thesis)

Shomorony, Ilan

Doctoral Committee Chair(s)

Shomorony, Ilan

Committee Member(s)

Milenkovic, Olgica
Veeravalli, Venugopal V
El-Kebir, Mohammed

Department of Study

Electrical & Computer Eng

Discipline

Electrical & Computer Engr

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Bioinformatics
Metagenomics
Genome Assembly
Sequence Alignment
Information Theory

Abstract

Sequence data is ubiquitous in various fields of biology including genomics, transcriptomics, and proteomics. Breaking down sequences into length-k substrings (k-mers) is often useful in many analyses. In this proposal, we discuss several applications of k-mer based methods in bioinformatics, including 1) metagenomics, the study of the genomic composition of microbial communities, 2) genome assembly, the process of reconstructing a full genome from DNA sequencing reads, and 3) sequence similarity estimation. Methods that use a small k-value often consider only the overall k-mer composition of a sequence, since all possible k-mers are likely to be well represented. Interestingly, researchers have empirically observed that genomes of different bacterial species can be distinguished based on their distribution of 4-mers, or tetranucleotide frequency (TNF). Based on this phenomenon, we present an application of information theory to metagenomics binning, in which we aim to cluster many sequences (called “contigs”) from a metagenomic sample according to their genome of origin. By modeling each species as a distinct Markov process, we show that, to achieve perfect binning, the length of the contigs must scale with the inverse of the Chernoff divergence rate between the two most similar species. Furthermore, our result implies that contigs should be binned using the KL divergence rate, as opposed to the Euclidean distance often used in practice. The metagenomic binning model turns out to be useful for a new application improving bacterial genome assembly. The complexity of genome assembly is largely due to large repeats. To detect misassemblies due to reverse-complemented repeats, we propose a statistical test based on the Markov model, that predicts whether two segments come from the same strand on genome. In most cases, the test neatly partitions the genome into two segments of roughly equal length, corresponding to the segments between the DNA replication origin and terminus. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. Our method identifies 32 potential misassemblies in NCBI’s GenBank database, several of which are further supported by a reassembly of the read data. Methods that use a larger k-value treat each k-mer as a distinct entity, for instance in sequence similarity estimation, where exact matching k-mers are used to comparing sequences. However, the choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments but can lead to many missing alignments, particularly in the presence of significant noise. In our third application of k-mers, we introduce a new similarity estimation method that is effectively independent of the choice of k, thereby obviating the tradeoff. When estimating the similarity between two reads, instead of simply checking whether k-mers match, one checks how “lexicographically similar” the LexicHash hashes are. In our experiments on 40 PacBio datasets, the area under the Precision-Recall Curves obtained by LexicHash had an average improvement of 20.9% over MinHash, a popular method used in many sequence alignment tools.

Graduation Semester

2023-12

Type of Resource

Thesis

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Analysis and applications of k-mer based methods in bioinformatics

Greenberg, Grant C

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Log In