Exploiting knowledge in NLP
Ratinov, Lev
Description
- Title
- Exploiting knowledge in NLP
- Author(s)
- Ratinov, Lev
- Issue Date
- 2012-05-22
- Director of Research (if dissertation) or Advisor (if thesis)
- Roth, Dan
- Doctoral Committee Chair(s)
- Roth, Dan
- Committee Member(s)
- Han, Jiawei
- Zhai, ChengXiang
- Mihalcea, Rada
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Machine Learning
- Natural language processing (NLP)
- Text Classification
- Co-reference Resolution
- Concept Disambiguation
- Information Extraction
- Named Entity Recognition
- Semi-Supervised Learning
- Abstract
- "In recent decades, the society depends more and more on computers for a large number of tasks. The first steps in NLP applications involve identification of topics, entities, concepts, and relations in text. Traditionally, statistical models have been successfully deployed for the aforementioned problems. However, the major trend so far has been: “scaling up by dumbing down”- that is, applying sophisticated statistical algorithms operating on very simple or low-level features of the text. This trend is also exemplified, by expressions such as ""we present a knowledge-lean approach"", which have been traditionally viewed as a positive statement, one that will help papers get into top conferences. This thesis suggests that it is essential to use knowledge in NLP, proposes several ways of doing it, and provides case studies on several fundamental NLP problems. It is clear that humans use a lot of knowledge when understanding text. Let us consider the following text ""Carnahan campaigned with Al Gore whenever the vice president was in Missouri."" and ask two questions: (1) who is the vice president? (2) is this sentence about politics or sports? A knowledge-lean NLP approach will have a great difficulty answering the first question, and will require a lot of training data to answer the second one. On the other hand, people can answer both questions effortlessly. We are not the first to suggest that NLP requires knowledge. One of the first such large-scale efforts, CYC, has started in 1984, and by 1995 has consumed a person-century of effort collecting 100000 concepts and 1000000 commonsense axioms, including ""You can usually see peoples noses, but not their hearts"". Unfortunately, such an effort has several problems. (a) The set of facts we can deduce is significantly larger than 1M . For example, in the above example ""heart"" can be replaced by any internal organ or tissue, as well as by a bank account, thoughts etc., leading to thousands of axioms. (b) The axioms often do not hold. For example, if the person is standing with their back to you, can cannot see their nose. And during an open heart surgery, you can see someone's heart. (c) Matching the concepts to natural-language expressions is challenging. For example, ""Al Gore"" can be referred to as ""Democrat"", ""environmentalist"", ""vice president"", ""Nobel prize laureate"" among other things. The idea of ""buying a used car"" can be also expressed as ""purchasing a pre-owned automobile"". Lexical variability in text makes using knowledge challenging. Instead of focusing on obtaining a large set of logic axioms, we are focusing on using knowledge-rich features in NLP solutions. We have used three sources of knowledge: a large corpus of unlabeled text, encyclopedic knowledge derived from Wikipedia and first-order-logic-like constraints within a machine learning framework. Namely, we have developed a Named Entity Recognition system which uses word representations induced from unlabeled text and gazetteers extracted from Wikipedia to achieve new state of the art performance. We have investigated the implications of augmenting text representation with a set of Wikipedia concepts. The concepts can either be directly mentioned in the documents, or not explicitly mentioned but closely related. We have shown that such document representation allows more efficient search and categorization than the traditional lexical representations. Our next step is using the knowledge injected from Wikipedia for co-reference resolution. 
While the majority of the knowledge in this thesis is encyclopedic, we have also investigated how knowledge about the structure of the problem in the form of constraints can allow leveraging unlabeled data in semi-supervised settings. This thesis shows how to use knowledge to improve state-of-the-art for four fundamental problems in NLP: text categorization, information extraction, concept disambiguation and coreference resolution, four tasks which have been considered the bedrock of NLP since its inception."
- Graduation Semester
- 2012-05
- Permalink
- http://hdl.handle.net/2142/31198
- Copyright and License Information
- Copyright 2012 Lev Ratinov
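The abstract above describes augmenting a document's lexical representation with related Wikipedia concepts, so that texts that share few words can still share features. The sketch below is a minimal Python illustration of that idea, assuming a toy phrase-to-concept index; the names CONCEPT_INDEX, bag_of_words, and augment_with_concepts are hypothetical stand-ins, not the thesis's actual resources or code.

# A minimal, hypothetical sketch of concept-augmented document
# representation: lexical features plus Wikipedia-concept features,
# so that "buying a used car" and "purchasing a pre-owned automobile"
# can share features. CONCEPT_INDEX is a toy stand-in for a real
# Wikipedia-derived resource, not the thesis's implementation.

from collections import Counter

# Toy "Wikipedia" concept index: surface phrases -> related concepts.
CONCEPT_INDEX = {
    "al gore": ["Al_Gore", "Vice_President_of_the_United_States", "Politics"],
    "vice president": ["Vice_President_of_the_United_States", "Politics"],
    "used car": ["Used_car", "Automobile"],
    "pre-owned automobile": ["Used_car", "Automobile"],
}

def bag_of_words(text: str) -> Counter:
    """Plain lexical representation: lowercase token counts."""
    return Counter(text.lower().split())

def augment_with_concepts(text: str) -> Counter:
    """Lexical features plus a concept feature for every indexed
    phrase that appears in the text."""
    features = bag_of_words(text)
    lowered = text.lower()
    for phrase, concepts in CONCEPT_INDEX.items():
        if phrase in lowered:
            for concept in concepts:
                features[f"CONCEPT:{concept}"] += 1
    return features

if __name__ == "__main__":
    doc_a = "Carnahan campaigned with Al Gore in Missouri."
    doc_b = "The vice president visited Missouri."
    # With concept features the two documents now share features
    # even though they have almost no words in common.
    overlap = set(augment_with_concepts(doc_a)) & set(augment_with_concepts(doc_b))
    print(sorted(f for f in overlap if f.startswith("CONCEPT:")))

With the concept features added, the two example sentences share CONCEPT:Politics and CONCEPT:Vice_President_of_the_United_States despite minimal word overlap, which is the effect the abstract attributes to concept-augmented representations for search and categorization.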
Owning Collections
- Graduate Dissertations and Theses at Illinois (PRIMARY): Graduate Theses and Dissertations at Illinois
- Dissertations and Theses - Computer Science: Dissertations and Theses from the Dept. of Computer Science