Withdraw
Loading…
Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study
Zhong, Zexuan
Loading…
Permalink
https://hdl.handle.net/2142/105101
Description
- Title
- Generating regular expressions from natural language specifications: A semantics-based approach and an empirical study
- Author(s)
- Zhong, Zexuan
- Issue Date
- 2019-04-26
- Director of Research (if dissertation) or Advisor (if thesis)
- Xie, Tao
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Regular Expression
- Natural Language Processing
- Abstract
- Translating natural language descriptions into executable programs is a fundamental problem for computational linguistics. Recent research proposes neural-network-based approaches to address the problem. These approaches typically train a sequence-to-sequence learning model using a syntax-based objective: maximum likelihood estimation (MLE). Such syntax-based approaches do not effectively address the goal of generating semantically correct programs, because these approaches fail to handle Program Aliasing, i.e., semantically equivalent programs may have many syntactically different forms. In this thesis, we focus on generating regular expressions from natural language, an important task of the program-synthesis problem. In particular, we study the task in two aspects. First, we address the issue of Program Aliasing, and propose a semantics-based approach named SemRegex. Different from the existing syntax-based approaches, SemRegex trains the model by maximizing the expected semantic correctness of the generated regular expressions. The semantic correctness is measured using the DFA-equivalence oracle, random test cases, and distinguishing test cases. The experiments on three public datasets demonstrate the superiority of SemRegex over the existing state-of-the-art approaches. Second, given that existing approaches use only synthetic data in both training datasets and validation/test datasets, we raise a question: are these approaches effective to address various real-world situations? To explore this question, we conduct a characteristic study on comparing two synthetic datasets used by the recent research and a real-world dataset collected from the Internet, and conduct an experimental study on applying an existing state-of-the-art approach on the real-world dataset. Our study results suggest the existence of distinct characteristics between the synthetic datasets and the real-world dataset, and the existing state-of-the-art approach achieves extremely low effectiveness when evaluated on real-world data. We also provide initial analysis on some of those challenging cases and discuss future directions.
- Graduation Semester
- 2019-05
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/105101
- Copyright and License Information
- Copyright 2019 Zexuan Zhong
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisDissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer ScienceManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…