Consistent and efficient long document understanding
Zeng, Qi
Permalink
https://hdl.handle.net/2142/121969
Description
- Title
- Consistent and efficient long document understanding
- Author(s)
- Zeng, Qi
- Issue Date
- 2023-11-03
- Director of Research (if dissertation) or Advisor (if thesis)
- Ji, Heng
- Doctoral Committee Chair(s)
- Ji, Heng
- Committee Member(s)
- Tong, Hanghang
- Zhao, Han
- Wang, Lu
- Li, Lei
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Natural Language Processing
- Abstract
- In the age of information overload, people's information needs from long documents are growing rapidly, while their patience for careful reading and reasoning is vanishing. Although people are inundated with long textual documents covering domains such as news, healthcare, legal services, and finance, they struggle to gain quick, concise, and accurate insights from these long and tedious documents. Automatic document understanding systems promise to assist humans in gaining insights from such documents: they capture and analyze the information contained in collections of news and scientific reports in a concise, machine-understandable way; they parse unstructured text by identifying the relations between events and entities in long, complex documents to produce structured data; and they provide reliable digests by factually and consistently summarizing recent papers, reports, news, and reviews. However, automatically understanding long documents remains challenging because recent state-of-the-art document understanding systems are mostly built on transformer architectures and are largely motivated, designed, implemented, and evaluated under a short-input setting. To adapt these short-input systems to long sequences, documents must be truncated, chunked with a sliding window, or processed in parallel on multiple machines. These additional operations usually lose long-range interdependencies and introduce extra cost. This thesis therefore focuses on developing principled and scalable methods for more consistent and efficient long document understanding. In particular, we investigate four research problems from the perspectives of consistency and efficiency: 1) Consistent Meta-review Generation. Current work on opinion summarization extracts and selects representative opinions on aspects of interest under the assumption that input opinions are non-controversial. Opinions in the scientific domain can be divergent, leading to controversy or consensus among reviewers, while a scientific meta-review should be consistent with the opinions synthesized from individual reviews. We therefore benchmark scientific opinion summarization by collecting paper meta-reviews from OpenReview, proposing a Checklist-guided Iterative Introspection approach, and constructing a comprehensive evaluation framework. 2) Consistent Document Summarization. Current abstractive summarization models often generate inconsistent content, i.e., text that is not directly inferable from the source document, contradicts world knowledge, or is self-contradictory. To improve general consistency, we introduce EnergySum, which applies a Residual Energy-based Model: we design energy scorers that reflect each type of consistency and incorporate them into the sampling process. 3) Consistent Document-level Event Argument Extraction. Recent work on document-level event argument extraction models each event in isolation and therefore produces inconsistent arguments across events, which in turn causes discrepancies in downstream applications. To address this problem, we formulate event argument consistency as constraints derived from event-event relations at the document level and introduce the Event-Aware Argument Extraction (EA²E) model, which uses augmented context for training and inference. 4) Efficient Document Processing. Transformer-based models are inefficient at processing long sequences due to the quadratic space and time complexity of the self-attention modules. To address this limitation, we introduce two methods for self-attention acceleration: a modified Nyström method (Skyformer) that accelerates kernelized attention and stabilizes training, and a sketching-based method (Skeinformer) that applies sub-sampling sketching.
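The quadratic cost the abstract refers to comes from forming the full n×n attention matrix. As a rough illustration of the landmark-based Nyström idea behind approaches like Skyformer, the sketch below approximates softmax attention through m landmark rows, reducing the cost from O(n²) to O(nm). This is a minimal generic Nyström sketch, not the thesis's actual Skyformer or Skeinformer algorithm; the landmark choice (segment means) and all function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def exact_attention(Q, K, V):
    """Standard softmax attention: materializes the full n x n matrix."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[1])) @ V

def nystrom_attention(Q, K, V, m=8):
    """Landmark-based Nystrom approximation of softmax attention.

    Instead of the n x n kernel, we form three small kernels through
    m landmark rows (here: simple segment means of Q and K), so the
    cost scales with n*m rather than n^2.
    """
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    # Landmarks: mean of each contiguous segment of the sequence.
    segments = np.array_split(np.arange(n), m)
    Q_land = np.stack([Q[s].mean(axis=0) for s in segments])  # (m, d)
    K_land = np.stack([K[s].mean(axis=0) for s in segments])  # (m, d)
    F = softmax(Q @ K_land.T * scale)       # (n, m): queries vs. landmark keys
    A = softmax(Q_land @ K_land.T * scale)  # (m, m): landmark core kernel
    B = softmax(Q_land @ K.T * scale)       # (m, n): landmark queries vs. keys
    # Nystrom reconstruction: F @ pinv(A) @ B approximates the n x n kernel.
    return F @ np.linalg.pinv(A) @ (B @ V)
```

With m fixed and n growing, the three small kernels stay cheap, which is the essence of why landmark and sketching methods make long-sequence attention tractable.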
- Graduation Semester
- 2023-12
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Qi Zeng
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)