Automatically identifying facet roles from comparative structures to support biomedical text summarization

Lucic, Ana

Permalink

https://hdl.handle.net/2142/98087

Description

Title

Automatically identifying facet roles from comparative structures to support biomedical text summarization

Author(s)

Lucic, Ana

Issue Date

2017-06-26

Director of Research (if dissertation) or Advisor (if thesis)

Blake, Catherine Lesley

Doctoral Committee Chair(s)

Blake, Catherine Lesley

Committee Member(s)

Girju, Corina Roxana
Efron, Miles
Renear, Allen H.
Downie, J. Stephen

Department of Study

Information Sciences

Discipline

Library & Information Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

Comparison sentences
Natural language processing
Text mining
Text summarization
Information extraction

Abstract

Within the context of biomedical scholarly articles, comparison sentences represent a rhetorical structure commonly used to communicate findings. More generally, comparison sentences are rich with information about how the properties of one or more entities relate to one another. So far, in the biomedical domain, the emphasis has been on recognizing comparative sentences in text. This dissertation goes beyond sentence-level recognition and aims to automate the identification of the integral parts of a comparison sentence, called comparative facets: the two compared entities, the basis (or endpoint) of comparison, and the result, that is, the relationship that binds the entities and the basis. Only sentences that contain all four facets are of interest in this thesis. For the first compared entity, the system achieves an average F1, on a random sample of sentences, of 0.65 on short sentences (11 to 21 words), 0.70 on medium sentences (22 to 28 words), 0.60 on long sentences (29 to 36 words), and 0.60 on very long sentences (more than 36 words). For the basis of comparison (the endpoint), the average F1 was 0.66 on short, 0.57 on medium, 0.56 on long, and 0.50 on very long sentences. For the second compared entity, the average F1 was 0.91 on short, 0.85 on medium, 0.81 on long, and 0.72 on very long sentences. Semantic relation identification was also sensitive to sentence length: the average F1 was 0.80 on short sentences and 0.71, 0.56, and 0.51 on medium, long, and very long sentences, respectively. Thus, the methods developed in this dissertation work better on shorter sentences (28 words or fewer) and on those that do not contain multiple claims or disjunctive conjunctions.
When applied to a previously unseen collection of breast cancer articles, the system identified the compared entities and the endpoint with performance comparable to that achieved on the collection used for building and testing the models. This result is promising for applying the model to other collections of scholarly articles in the biomedical sciences.

Graduation Semester

2017-08

Type of Resource

text

Owning Collections

Dissertations and Theses - Information Sciences

Dissertations and theses from the School of Information Sciences

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
