The influence of optical character recognition quality on the robustness of semantic encoding
Jiang, Ming
Permalink
https://hdl.handle.net/2142/117739
Description
- Title
- The influence of optical character recognition quality on the robustness of semantic encoding
- Author(s)
- Jiang, Ming
- Issue Date
- 2022-10-27
- Director of Research (if dissertation) or Advisor (if thesis)
- Downie, J. Stephen
- Doctoral Committee Chair(s)
- Downie, J. Stephen
- Committee Member(s)
- Renear, Allen
- Underwood, Ted
- Kilicoglu, Halil
- LeBlanc, Zoe
- Department of Study
- Illinois Informatics Institute
- Discipline
- Informatics
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Optical Character Recognition
- Word Embeddings
- Semantic Encoding
- Large Language Models
- Robustness
- HathiTrust
- Digital Humanities
- Digital Libraries
- Data Curation
- Abstract
- Historical textual collections, digitized by machine scanning and optical character recognition (OCR), offer unique opportunities for exploring and disseminating heritage knowledge. Research innovations in this field, including recent advances in natural language processing (NLP), have been widely promoted as promising new tools for supporting research on these collections. Unfortunately, the inevitable OCR noise in these digitized materials challenges the performance of advanced NLP techniques, which are generally built for born-digital corpora. Moreover, the black-box nature of these NLP techniques makes it hard to understand how OCR errors affect them. This dissertation addresses this problem, with a specific focus on the robustness of word embedding techniques, such as word2vec and BERT, for the semantic encoding of OCR'd texts. We explore the problem through three interrelated studies. The first two compare various word embedding technologies to capture their latent characteristics on texts with OCR quality issues: Part I examines document-level encoding, and Part II investigates sentence- and word-level encoding. The final part analyzes the effect of different levels of OCR noise on a specific word embedding methodology. Experimental results show that: (1) fine-tuned BERT outperforms pre-trained BERT when encoding OCR'd texts; (2) BERT-based dynamic embeddings are more sensitive to OCR errors than static embeddings in encoding words and sentences; (3) coarse-grained encoding (e.g., document-level) mitigates OCR noise interference on word embeddings, while fine-grained encoding (e.g., word-level) reduces the robustness of word embeddings to OCR noise; (4) OCR noise in unseen testing data can reduce embedding performance and downstream outcomes, while noise in the training corpus can benefit embedding robustness; and (5) OCR noise does matter in scientific relation classification.
Following our results, we recommend that scholars analyze their data with regard to both text granularity and data quality in training and testing corpora, in order to select the appropriate embedding tool for their analyses.
- Graduation Semester
- 2022-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Ming Jiang
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois