Utilizing multiple entities from collection of unstructured documents in constructing attribute-value pairs

Cho, Hyun Duk

Content Files

Cho_Hyun Duk.pdf

Permalink

https://hdl.handle.net/2142/34506

Description

Title

Utilizing multiple entities from collection of unstructured documents in constructing attribute-value pairs

Author(s)

Cho, Hyun Duk

Issue Date

2012-09-18T21:20:36Z

Director of Research (if dissertation) or Advisor (if thesis)

Zhai, ChengXiang

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Date of Ingest

2012-09-18T21:20:36Z

Keyword(s)

attribute extraction
(attribute-value pair) nvp
value extraction
evaluation

Abstract

Attribute-value pair (NVP) extraction is defined as extracting words that express characteristics of an entity and associating those words with the words or phrases that best describe the attributes. Applications of NVP range from closely related areas, such as sentiment analysis and populating and error-checking relational databases, to broader text information areas such as QA systems, search, and review modeling. We propose an unsupervised method to identify the properties of entities, represented as NVP, from unstructured documents. Other approaches that extract NVP usually apply supervised or semi-supervised methods to structured or semi-structured documents. The benefit of such approaches is that they tend to have higher accuracy than unsupervised approaches on unstructured documents. Furthermore, supervised approaches are better suited to distinguishing attribute words from value words than unsupervised approaches on unstructured documents. The biggest drawback of these methods, however, is that training data may not always be available and not all documents can be treated as structured. In this thesis, we first propose an approach to extracting and distinguishing attribute words and value words from unstructured documents. Since entities of the same class share similar attributes, we propose that the identification of relevant attributes should be done across entities belonging to the same class, and demonstrate that this can lead to a significant performance gain in attribute extraction, even when only documents describing a modest number of entities per class are available. We then propose a way to evaluate the accuracy of attribute-value pairs automatically, allowing for quantitative comparison between different systems that is more consistent and cost-effective than manual evaluation. Similar automatic evaluation techniques have been used to evaluate summarization and to compare ontologies, but they have not previously been applied to evaluating NVP.
Both the automated and manual evaluations show that our system outperforms a comparison system.

Graduation Semester

2012-08

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
