Joint classification and information extraction framework

Rameshkumar, Revanth

Permalink

https://hdl.handle.net/2142/91561

Description

Title

Joint classification and information extraction framework

Author(s)

Rameshkumar, Revanth

Contributor(s)

Nahrstedt, Klara

Issue Date

2016-05

Keyword(s)

natural language processing
machine learning models
joint models
text classification
information extraction

Abstract

This thesis proposes a joint information extraction and classification model for document analysis in domain-specific text. Existing information extraction (IE) systems typically extract key-value pairs or target phrases either by learning from user-provided examples or by relying on a strong named-entity tagger, as in the Snowball information extraction system. Others, while not depending on user-provided IE patterns, depend on part-of-speech, syntactic, or semantic tagged data to extract target phrases, or on heavily annotated text to build a learning dictionary. The disadvantage of these approaches is that building a usable training dataset takes many man-hours, which is especially costly when assigning a domain expert to tasks like tagging and annotating is too expensive to be practical. This thesis describes a prototype system, RICE (Rev’s Iterative Classifier Extractor), that extracts information from domain-specific text using only a set of documents labeled as domain relevant or domain irrelevant. From this training data alone, the system outputs a set of relevant phrases, an IE pattern ranker model, and a usable document classifier. It uses an iterative approach in which extracted noun phrases are used simultaneously to train a classifier and to build a ranked IE pattern list. The results show that the joint classification and IE approach works, performing sufficiently better than chance to merit further pursuit, and that it has the potential to be used in production systems. However, considerable work remains to eliminate noise and increase precision. The thesis concludes with a discussion of next steps, improvements, applications, and future work.
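The iterative idea in the abstract can be illustrated with a toy sketch. This is not RICE itself: the abstract does not give the algorithm's details, so everything below is an assumption made for illustration. Word bigrams stand in for extracted noun phrases, phrases are ranked by a smoothed log-odds score, and a greedy loop adds ranked phrases to a simple additive classifier only while training accuracy improves. All function names are hypothetical, and IE patterns (the contexts around phrases) are not modeled.

```python
import math
from collections import Counter

def candidate_phrases(text):
    """Assumed stand-in for noun-phrase extraction: lowercase word bigrams."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}

def score_phrases(docs, labels):
    """Smoothed log-odds that a phrase occurs in relevant vs. irrelevant documents."""
    rel, irr = Counter(), Counter()
    for text, label in zip(docs, labels):
        (rel if label else irr).update(candidate_phrases(text))
    n_rel = sum(labels)
    n_irr = len(labels) - n_rel
    scores = {}
    for phrase in set(rel) | set(irr):
        p_rel = (rel[phrase] + 1) / (n_rel + 2)  # add-one smoothing
        p_irr = (irr[phrase] + 1) / (n_irr + 2)
        scores[phrase] = math.log(p_rel / p_irr)
    return scores

def classify(text, weights):
    """Predict 'relevant' when the summed weights of matched phrases are positive."""
    return sum(weights.get(p, 0.0) for p in candidate_phrases(text)) > 0

def accuracy(docs, labels, weights):
    hits = sum(classify(t, weights) == bool(l) for t, l in zip(docs, labels))
    return hits / len(docs)

def train(docs, labels, max_phrases=10):
    """Build a ranked phrase list and, jointly, a classifier over those phrases."""
    scores = score_phrases(docs, labels)
    # Strongest evidence first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda p: (-abs(scores[p]), p))
    weights, best_acc = {}, -1.0
    for phrase in ranked[:max_phrases]:
        trial = {**weights, phrase: scores[phrase]}
        acc = accuracy(docs, labels, trial)
        if acc > best_acc:  # keep a phrase only if it improves training accuracy
            weights, best_acc = trial, acc
    return ranked, weights

# Made-up example documents: 1 = domain relevant, 0 = domain irrelevant.
docs = [
    "the patient reported chest pain after surgery",
    "chest pain and shortness of breath noted",
    "the team won the football game last night",
    "football game tickets sold out quickly",
]
labels = [1, 1, 0, 0]
ranked, weights = train(docs, labels)
print(ranked[:2])
print(classify("severe chest pain reported", weights))
```

The sketch needs only labeled documents as input, which mirrors the training setup the abstract describes. A real system would replace the bigram heuristic with a noun-phrase extractor and would be evaluated on held-out documents rather than on the training set.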

Type of Resource

text

Owning Collections

Senior Theses - Electrical and Computer Engineering PRIMARY

The best of ECE undergraduate research
