Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study

Zhong, Zexuan

Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study

Zhong, Zexuan

Permalink

https://hdl.handle.net/2142/105101

Description

Title

Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study

Author(s)

Zhong, Zexuan

Issue Date

2019-04-26

Director of Research (if dissertation) or Advisor (if thesis)

Xie, Tao

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

M.S.

Degree Level

Thesis

Date of Ingest

2019-08-23T20:36:12Z

Keyword(s)

Regular Expression
Natural Language Processing

Abstract

Translating natural language descriptions into executable programs is a fundamental problem for computational linguistics. Recent research proposes neural-network-based approaches to address the problem. These approaches typically train a sequence-to-sequence learning model using a syntax-based objective: maximum likelihood estimation (MLE). Such syntax-based approaches do not effectively address the goal of generating semantically correct programs, because these approaches fail to handle Program Aliasing, i.e., semantically equivalent programs may have many syntactically different forms. In this thesis, we focus on generating regular expressions from natural language, an important task of the program-synthesis problem. In particular, we study the task in two aspects. First, we address the issue of Program Aliasing, and propose a semantics-based approach named SemRegex. Different from the existing syntax-based approaches, SemRegex trains the model by maximizing the expected semantic correctness of the generated regular expressions. The semantic correctness is measured using the DFA-equivalence oracle, random test cases, and distinguishing test cases. The experiments on three public datasets demonstrate the superiority of SemRegex over the existing state-of-the-art approaches. Second, given that existing approaches use only synthetic data in both training datasets and validation/test datasets, we raise a question: are these approaches effective to address various real-world situations? To explore this question, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world dataset collected from the Internet, and conduct an experimental study on applying an existing state-of-the-art approach on the real-world dataset. Our study results suggest the existence of distinct characteristics between the synthetic datasets and the real-world dataset, and the existing state-of-the-art approach achieves extremely low effectiveness when evaluated on real-world data. We also provide initial analysis on some of those challenging cases and discuss future directions.

Graduation Semester

2019-05

Type of Resource

text

Permalink

http://hdl.handle.net/2142/105101

Copyright and License Information

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science

Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study

Zhong, Zexuan

Permalink

Description

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Dissertations and Theses - Computer Science

Log In