Label Annotation through Biodiversity Enhanced Learning
Heidorn, P. Bryan; Zhang, Qianjin
Permalink
https://hdl.handle.net/2142/42056
Description
- Title
- Label Annotation through Biodiversity Enhanced Learning
- Author(s)
- Heidorn, P. Bryan
- Zhang, Qianjin
- Contributor(s)
- Chong, Steven
- Issue Date
- 2013-02
- Keyword(s)
- OCR
- parsing
- semantic markup
- digital curation and preservation
- information retrieval
- machine learning
- Abstract
- LABELX (Label Annotation through Biodiversity Enhanced Learning) is an extension of the HERBIS NLP system reported previously (Heidorn & Wei, 2008). The objective of the system is to formally structure the output of Optical Character Recognition (OCR) applied to the highly variable labels of natural history museum specimens. Errors are common in the OCR output, and genus and species names are particularly prone to them. Records are preprocessed with a fuzzy-match algorithm that finds genus and species names, including those containing OCR errors, and replaces them with a constant token. Integers and strings that begin with alphabetic characters and end with numbers are also replaced with tokens. LABELX generates structured XML and RDF data and corrects OCR errors in some fields. The main algorithm is a Hidden Markov Model (HMM). This poster reports an enhancement to the previous system with a larger data set.
- Publisher
- iSchools
- Type of Resource
- text
- Language
- en
- Permalink
- http://hdl.handle.net/2142/42056
- DOI
- https://doi.org/10.9776/13450
- Copyright and License Information
- Copyright © 2013 is held by the authors. Copyright permissions, when appropriate, must be obtained directly from the authors.
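Illustrative sketch of the preprocessing step described in the abstract: the following Python example shows one way the token replacement could work. It is not the LABELX implementation. The abstract does not name the fuzzy-match algorithm, so the standard library's `difflib` is used here as a stand-in, and the name list, the token strings (`<TAXON>`, `<INT>`, `<ALNUM>`), the similarity cutoff, and the function names are all hypothetical.

```python
import difflib
import re

# Hypothetical name list; a real system would load a full taxonomic vocabulary.
KNOWN_NAMES = ["Quercus", "alba", "Acer", "rubrum"]

# Hypothetical constant tokens; the abstract does not specify the actual ones.
TAXON_TOKEN = "<TAXON>"
INTEGER_TOKEN = "<INT>"
ALPHANUM_TOKEN = "<ALNUM>"


def normalize_word(word, names=KNOWN_NAMES, cutoff=0.8):
    """Map one OCR word to a constant token, or return it unchanged."""
    # Pure integers become a single token.
    if re.fullmatch(r"\d+", word):
        return INTEGER_TOKEN
    # Strings that begin with a letter and end with a digit become a token.
    if re.fullmatch(r"[A-Za-z]\S*\d", word):
        return ALPHANUM_TOKEN
    # Fuzzy match against the name list so that misread names still match.
    # difflib is an assumed stand-in for the unspecified fuzzy matcher.
    if difflib.get_close_matches(word, names, n=1, cutoff=cutoff):
        return TAXON_TOKEN
    return word


def preprocess(line):
    """Tokenize on whitespace and normalize each word."""
    return [normalize_word(w) for w in line.split()]


if __name__ == "__main__":
    # "Quercns" imitates an OCR misreading of "Quercus".
    print(preprocess("Quercns alba No. 1234 AB123"))
```

Collapsing names and numbers into constant tokens shrinks the vocabulary that a sequence model such as an HMM has to handle, which is presumably why the step precedes the main algorithm.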