Withdraw
Loading…
Analysis and applications of k-mer based methods in bioinformatics
Greenberg, Grant C
Loading…
Permalink
https://hdl.handle.net/2142/121999
Description
- Title
- Analysis and applications of k-mer based methods in bioinformatics
- Author(s)
- Greenberg, Grant C
- Issue Date
- 2023-11-28
- Director of Research (if dissertation) or Advisor (if thesis)
- Shomorony, Ilan
- Doctoral Committee Chair(s)
- Shomorony, Ilan
- Committee Member(s)
- Milenkovic, Olgica
- Veeravalli, Venugopal V
- El-Kebir, Mohammed
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Bioinformatics
- Metagenomics
- Genome Assembly
- Sequence Alignment
- Information Theory
- Abstract
- Sequence data is ubiquitous in various fields of biology including genomics, transcriptomics, and proteomics. Breaking down sequences into length-k substrings (k-mers) is often useful in many analyses. In this proposal, we discuss several applications of k-mer based methods in bioinformatics, including 1) metagenomics, the study of the genomic composition of microbial communities, 2) genome assembly, the process of reconstructing a full genome from DNA sequencing reads, and 3) sequence similarity estimation. Methods that use a small k-value often consider only the overall k-mer composition of a sequence, since all possible k-mers are likely to be well represented. Interestingly, researchers have empirically observed that genomes of different bacterial species can be distinguished based on their distribution of 4-mers, or tetranucleotide frequency (TNF). Based on this phenomenon, we present an application of information theory to metagenomics binning, in which we aim to cluster many sequences (called “contigs”) from a metagenomic sample according to their genome of origin. By modeling each species as a distinct Markov process, we show that, to achieve perfect binning, the length of the contigs must scale with the inverse of the Chernoff divergence rate between the two most similar species. Furthermore, our result implies that contigs should be binned using the KL divergence rate, as opposed to the Euclidean distance often used in practice. The metagenomic binning model turns out to be useful for a new application improving bacterial genome assembly. The complexity of genome assembly is largely due to large repeats. To detect misassemblies due to reverse-complemented repeats, we propose a statistical test based on the Markov model, that predicts whether two segments come from the same strand on genome. In most cases, the test neatly partitions the genome into two segments of roughly equal length, corresponding to the segments between the DNA replication origin and terminus. We show that, in several cases where this balanced partition is not observed, the test identifies a potential inverted misassembly, which is validated by the presence of a reverse-complemented repeat at the boundaries of the inversion. Our method identifies 32 potential misassemblies in NCBI’s GenBank database, several of which are further supported by a reassembly of the read data. Methods that use a larger k-value treat each k-mer as a distinct entity, for instance in sequence similarity estimation, where exact matching k-mers are used to comparing sequences. However, the choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments but can lead to many missing alignments, particularly in the presence of significant noise. In our third application of k-mers, we introduce a new similarity estimation method that is effectively independent of the choice of k, thereby obviating the tradeoff. When estimating the similarity between two reads, instead of simply checking whether k-mers match, one checks how “lexicographically similar” the LexicHash hashes are. In our experiments on 40 PacBio datasets, the area under the Precision-Recall Curves obtained by LexicHash had an average improvement of 20.9% over MinHash, a popular method used in many sequence alignment tools.
- Graduation Semester
- 2023-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Grant Greenberg
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…