The Gutenberg-HathiTrust Parallel Corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts
Jiang, Ming; Hu, Yuerong; Worthey, Glen; Dubnicek, Ryan C.; Capitanu, Boris; Kudeki, Deren; Downie, J. Stephen
Permalink
https://hdl.handle.net/2142/109695
Description
Title
The Gutenberg-HathiTrust Parallel Corpus: A Real-World Dataset for Noise Investigation in Uncorrected OCR Texts
Author(s)
Jiang, Ming
Hu, Yuerong
Worthey, Glen
Dubnicek, Ryan C.
Capitanu, Boris
Kudeki, Deren
Downie, J. Stephen
Contributor(s)
HathiTrust Research Center
Issue Date
2021-03-17
Keyword(s)
Parallel Text Dataset
Optical Character Recognition
Digital Library
Digital Humanities
Data Curation
Abstract
This paper proposes large-scale parallel corpora of English-language publications for exploring how optical character recognition (OCR) errors in the scanned text of digitized library collections affect corpus-based research. We collected data from (1) Project Gutenberg (Gutenberg) for a human-proofread clean corpus and (2) HathiTrust Digital Library (HathiTrust) for an uncorrected OCR-impacted corpus. The two corpora are parallel in content. To our knowledge, this is the first large-scale benchmark dataset intended to evaluate the effects of text noise in digital libraries. In total, we collected and aligned 19,049 pairs of uncorrected OCR-impacted and human-proofread books in six domains published from 1780 to 1993.
Publisher
iSchools
Type of Resource
text
Language
eng
Sponsor(s)/Grant Number(s)
HathiTrust and its member community
Copyright and License Information
Copyright 2021 is held by Ming Jiang, Yuerong Hu, Glen Worthey, Ryan C. Dubnicek, Boris Capitanu, Deren Kudeki, and J. Stephen Downie. Copyright permissions, when appropriate, must be obtained directly from the authors.