Data quality in the deep learning era: Active semi-supervised learning and text normalization for natural language understanding
Lourentzou, Ismini
Description
- Title
- Data quality in the deep learning era: Active semi-supervised learning and text normalization for natural language understanding
- Author(s)
- Lourentzou, Ismini
- Issue Date
- 2019-12-05
- Director of Research (if dissertation) or Advisor (if thesis)
- Zhai, ChengXiang
- Doctoral Committee Chair(s)
- Zhai, ChengXiang
- Committee Member(s)
- Hockenmaier, Julia
- Peng, Jian
- Gruhl, Daniel
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- deep learning
- machine learning
- active learning
- semi-supervised learning
- text normalization
- lexical normalization
- sequence to sequence
- relation extraction
- neural networks
- natural language processing
- data quality
- self-paced learning
- calibration
- Abstract
- Deep Learning, a growing sub-field of machine learning, has been applied with tremendous success in a variety of domains, opening opportunities for achieving human-level performance in many applications. However, Deep Learning methods depend on large quantities of data with millions of annotated instances. While well-curated academic datasets have helped advance supervised learning research, in the real world we are deluged daily by massive amounts of unstructured data that remain unusable for current supervised learning approaches, since only a small portion is labeled, cleaned, or structured. Volume, however, is not the only data dimension necessary for an effective machine learning model. Quality is equally important and has proven to be a critical factor for the success of industrial applications of machine learning: according to IBM, poor data quality can cost more than 3 trillion US dollars per year for the US market alone. Motivated by the need for methods that can efficiently address such bottlenecks, we develop machine learning techniques that improve data quality along both data-related dimensions: the input and the output space.

Having a set of labeled examples that captures the task characteristics is one of the most important prerequisites for successfully applying machine learning. We therefore first focus on minimizing the annotation effort for any arbitrary user-defined task by exploring active learning methods. We show that the best-performing active learning strategy depends on the task at hand, and we propose a combination of active learners that maximizes annotation performance early in the process. We demonstrate the viability of the approach on several relation extraction tasks.

Next, we observe that even though our method can speed up the collection of labeled training data, the rest of the data remains unlabeled and thus unexploited. Semi-supervised learning methods proposed in the literature can utilize this additional unlabeled data; however, they are typically compared on computer vision datasets such as CIFAR-10. Here, we perform a systematic exploration of several semi-supervised methods on three sequence labeling tasks and two classification tasks. Moreover, most methods rest on assumptions that are less suitable for realistic scenarios. For example, methods in the recent literature treat all unlabeled examples equally, yet in many cases we would like to filter out examples that are less useful or confusing, particularly in noisy settings where examples with low training loss or high confidence are more likely to be clean. In addition, most methods assume that the unlabeled data fall into the same classes as the labeled data, which ignores the very plausible scenario of out-of-class instances: our classifier may be distinguishing cats from dogs, while the unlabeled examples contain additional classes such as shells or butterflies. To mitigate these issues, we design a re-weighting mechanism that can be incorporated into any consistency-based regularizer (a minimal sketch follows below).

Both active and semi-supervised learning methods aim to reduce labeling effort by either automatically expanding the training set or selecting the most informative examples for human annotation. However, bootstrapping approaches often have negative effects on NLP tasks due to the addition of falsely labeled instances.
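As an illustration of the re-weighting idea, below is a minimal sketch of a confidence-weighted consistency regularizer in PyTorch. The names (weighted_consistency_loss, model, x_u) and the confidence-based weighting scheme are illustrative assumptions, not the exact mechanism developed in the dissertation; the perturbation shown is Gaussian input noise, which presumes continuous inputs such as embeddings.

    # Minimal sketch, assuming a PyTorch classifier `model` that maps a batch
    # of continuous inputs `x_u` (e.g., embeddings) to class logits.
    import torch
    import torch.nn.functional as F

    def weighted_consistency_loss(model, x_u, noise_std=0.1):
        # Teacher view: predictions on the clean unlabeled batch (no gradient).
        with torch.no_grad():
            p_clean = F.softmax(model(x_u), dim=-1)

        # Student view: predictions on a perturbed copy of the same batch.
        log_p_noisy = F.log_softmax(
            model(x_u + noise_std * torch.randn_like(x_u)), dim=-1
        )

        # Per-example weights: down-weight low-confidence examples (likely
        # noisy or out-of-class) instead of treating all unlabeled data equally.
        # This confidence heuristic is a stand-in for the dissertation's mechanism.
        weights = p_clean.max(dim=-1).values

        # Consistency term: KL divergence between the two views, per example.
        per_example_kl = F.kl_div(log_p_noisy, p_clean, reduction="none").sum(dim=-1)
        return (weights * per_example_kl).mean()

The key design point is that the per-example weight multiplies the consistency term, so unlabeled examples the model is unsure about contribute less to the gradient; the same weighting can be attached to other consistency-based regularizers.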
We address the challenge of producing good-quality proxy labels by leveraging the continuously growing stream of human annotations. We introduce a calibrated semi-supervised active learning approach in which the classifier's confidence is weighted by an auxiliary neural model that removes incorrectly labeled instances and dynamically adjusts the number of proxy labels included in each iteration. Experimental results show that our strategy outperforms baselines that combine traditional active learning with self-training.

The methods above improve the output space of examples, but the input representation is equally important. Particularly for social media (the most abundant source of raw data nowadays), informal writing causes several bottlenecks. For example, most Information Extraction (IE) tools rely on an accurate understanding of text and struggle with the noisy, informal nature of social media due to high out-of-vocabulary (OOV) word rates. In this work, we design a hybrid word-character attention-based encoder-decoder model for social media text normalization that can serve as a pre-processing step, allowing any off-the-shelf NLP tool to adapt to noisy social media text (a character-level sketch follows below). Our model surpasses baseline neural models designed for text normalization and achieves performance comparable to state-of-the-art related work.

Although we evaluate on NLP tasks, all methods developed are fairly general and can be applied to other supervised machine learning tasks that need techniques for creating meaningful data representations while reducing the burden and cost of human annotation.
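For concreteness, here is a minimal character-level attention encoder-decoder sketch in PyTorch of the kind used for lexical normalization. All names and sizes (Seq2SeqNormalizer, emb, hid) are illustrative assumptions; the dissertation's model is a hybrid word-character variant, which this sketch does not reproduce.

    # Minimal sketch of a character-level seq2seq normalizer with dot-product
    # attention; `src` and `tgt` are batches of character-index sequences.
    import torch
    import torch.nn as nn

    class Seq2SeqNormalizer(nn.Module):
        def __init__(self, vocab_size, emb=64, hid=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb)
            self.encoder = nn.GRU(emb, hid, batch_first=True)
            self.decoder = nn.GRU(emb + hid, hid, batch_first=True)
            self.attn = nn.Linear(hid, hid)
            self.out = nn.Linear(hid, vocab_size)

        def forward(self, src, tgt):
            enc_out, h = self.encoder(self.embed(src))       # (B, S, H), (1, B, H)
            dec_emb = self.embed(tgt)                        # (B, T, E)
            outputs = []
            for t in range(dec_emb.size(1)):
                # Dot-product attention over encoder states, queried by the
                # current decoder hidden state.
                query = self.attn(h[-1]).unsqueeze(2)        # (B, H, 1)
                scores = torch.bmm(enc_out, query)           # (B, S, 1)
                context = (enc_out * scores.softmax(dim=1)).sum(1)  # (B, H)
                # Feed the gold character plus attention context (teacher forcing).
                step_in = torch.cat([dec_emb[:, t], context], -1).unsqueeze(1)
                dec_out, h = self.decoder(step_in, h)
                outputs.append(self.out(dec_out.squeeze(1)))
            return torch.stack(outputs, dim=1)               # (B, T, V)

At inference time one would decode greedily (or with beam search) over the character vocabulary, feeding back the previous prediction instead of the gold target.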
- Graduation Semester
- 2019-12
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/106375
- Copyright and License Information
- Copyright 2019 Ismini Lourentzou
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY): Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science: Dissertations and Theses from the Dept. of Computer Science