The influence of optical character recognition quality on the robustness of semantic encoding

Jiang, Ming

Permalink

https://hdl.handle.net/2142/117739

Description

Title

The influence of optical character recognition quality on the robustness of semantic encoding

Author(s)

Jiang, Ming

Issue Date

2022-10-27

Director of Research (if dissertation) or Advisor (if thesis)

Downie, J. Stephen

Doctoral Committee Chair(s)

Downie, J. Stephen

Committee Member(s)

Renear, Allen
Underwood, Ted
Kilicoglu, Halil
LeBlanc, Zoe

Department of Study

Illinois Informatics Institute

Discipline

Informatics

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Optical Character Recognition
Word Embeddings
Semantic Encoding
Large Language Models
Robustness
HathiTrust
Digital Humanities
Digital Libraries
Data Curation

Abstract

Historical textual collections, digitized by machine scanning and optical character recognition (OCR), offer unique opportunities for exploring and disseminating heritage knowledge. Research innovations in this field, including recent advances in natural language processing (NLP), have been widely promoted as promising new tools for supporting research on these collections. Unfortunately, the inevitable OCR noise in these digitized materials challenges the performance of advanced NLP techniques, which are generally built for born-digital corpora. Moreover, the black-box nature of these techniques makes it hard to understand how OCR errors affect them. This dissertation addresses this problem, with a specific focus on the robustness of word embedding techniques such as word2vec and BERT for the semantic encoding of OCR'd texts. We explore the problem through three interrelated studies. The first two compare various word embedding technologies to capture their latent characteristics on texts with OCR quality issues: Part I examines document-level encoding, and Part II investigates sentence- and word-level encoding. The last part analyzes the effect of different levels of OCR noise on a specific word embedding methodology. Experimental results show that: (1) fine-tuned BERT outperforms pre-trained BERT when encoding OCR'd texts; (2) BERT-based dynamic embeddings are more sensitive to OCR errors than static embeddings in encoding words and sentences; (3) coarse-grained encoding (e.g., document-level) mitigates OCR noise interference on word embeddings, while fine-grained encoding (e.g., word-level) reduces the robustness of word embeddings to OCR noise; (4) OCR noise in unseen testing data can reduce embedding performance and downstream outcomes, while noise in the training corpus can benefit embedding robustness; and (5) OCR noise does matter in scientific relation classification.
Following our results, we recommend that scholars analyze their data with regard to both text granularity and data quality in training and testing corpora, in order to select the appropriate embedding tool for their analyses.
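
As a rough illustration of what measuring the robustness of an encoding to OCR noise can look like, the sketch below corrupts a clean string with OCR-like character substitutions and compares the encodings of the clean and corrupted versions by cosine similarity. It is not the dissertation's method: the confusion table, the function names, and the character-trigram encoder (a toy stand-in for word2vec or BERT) are all assumptions made for this example.

```python
import math
import random
from collections import Counter

# Hypothetical confusion table: a handful of OCR-like substitutions.
# Illustrative only; it is not taken from the dissertation.
OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "i": "l", "m": "rn", "s": "5"}


def inject_ocr_noise(text, rate, seed=0):
    """Replace each confusable character with probability `rate`."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    out = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < rate:
            out.append(OCR_CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)


def trigram_vector(text):
    """Toy encoder: character-trigram counts (assumed stand-in for an embedding)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def robustness(text, rate, seed=0):
    """Similarity between the encodings of the clean and the noised text."""
    noisy = inject_ocr_noise(text, rate, seed)
    return cosine(trigram_vector(text), trigram_vector(noisy))


clean = "historical collections digitized by optical character recognition"
for rate in (0.0, 0.1, 0.3):
    print(rate, round(robustness(clean, rate), 3))
```

Swapping `trigram_vector` for a real embedding model, and applying `robustness` separately to single words, sentences, and whole documents, would be one way to probe the granularity effects the abstract describes. The numbers this toy encoder produces say nothing about the dissertation's actual results.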

Graduation Semester

2022-12

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois (primary)

Graduate Theses and Dissertations at Illinois
