Automatic Metadata Extraction from Museum Specimen Labels
Heidorn, P. Bryan
Permalink
https://hdl.handle.net/2142/9138
Description
- Title
- Automatic Metadata Extraction from Museum Specimen Labels
- Author(s)
- Heidorn, P. Bryan
- Contributor(s)
- Wei, Qin
- Issue Date
- 2008-09-24
- Keyword(s)
- Metadata
- Text Processing
- biological informatics
- machine learning
- Darwin Core
- Abstract
- This paper describes the information properties of museum specimen labels and the machine learning tools used to automatically extract Darwin Core (DwC) and other metadata from these labels after they have been processed through Optical Character Recognition (OCR). The DwC is a metadata profile describing the core set of access points for search and retrieval of natural history collections and observation databases. Using the HERBIS Learning System (HLS), we extract 74 independent elements from these labels. The automated text extraction tools are provided as a web service so that users can reference digital images of specimens and receive back an extended Darwin Core XML representation of the content of the label. This automated extraction task is made more difficult by the high variability of museum label formats, OCR errors, and the open-class nature of some elements. In this paper we introduce our overall system architecture and variability-robust solutions, including the application of Hidden Markov and Naïve Bayes machine learning models, data cleaning, the use of field element identifiers, and specialist learning models. The techniques developed here could be adapted to any metadata extraction situation with noisy text and weakly ordered elements.
- Publisher
- Published by the Dublin Core Metadata Initiative and Universitätsverlag Göttingen 2008
- ISSN
- 1939-1358
- Type of Resource
- text
- Language
- en
- Permalink
- http://hdl.handle.net/2142/9138
- Copyright and License Information
- This work is protected by German Intellectual Property Right Law. It is also available as an Open Access version through the publisher’s homepage and the Online Catalogue of the State and University Library of Goettingen (http://www.sub.uni-goettingen.de). Users of the free online version are invited to read, download and distribute it. Users may also print a small number for educational or private use. However, they may not sell print versions of the online book.
Owning Collections
Faculty and Staff Research and Scholarship - Information Sciences PRIMARY
Articles, papers, and other research and scholarship from iSchool faculty and staff