Compiling contextualized lists of frequent vocabulary from user-supplied corpora using natural language processing techniques

Abdar, Omid

Permalink

https://hdl.handle.net/2142/92955

Description

Title: Compiling contextualized lists of frequent vocabulary from user-supplied corpora using natural language processing techniques
Author(s): Abdar, Omid
Issue Date: 2016-07-15
Director of Research (if dissertation) or Advisor (if thesis): Sadler, Randall
Committee Member(s): Schwartz, Lane
Department of Study: Linguistics
Discipline: Teaching of English as a Second Language
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree Name: M.A.
Degree Level: Thesis
Keyword(s): English for Specific Purposes, Vocabulary, Wordlists, Natural Language Processing
Abstract: Since there are thousands of words to learn in a new language, a common challenge for language learners and teachers is deciding which vocabulary items to prioritize and, more generally, setting vocabulary-learning goals. Within vocabulary teaching research, one approach has been to focus on lists of the most common vocabulary. West (1953) proposed a list of the 2000 most frequent word families in English that, it was argued, were the most important for learners to master. Along the same lines, Coxhead (2000) offered a list of the most common words in academic English, known as the Academic Word List (AWL). Arguing that the AWL did not adequately reflect learners’ specialized vocabulary needs, however, corpus linguists began to develop wordlists for specialized subject areas from an English for Specific Purposes (ESP) perspective, aimed at students in fields such as business, engineering, medicine, and law. A central theme in almost all previous efforts to develop better wordlists has been the notion of 'representativeness': the extent to which a wordlist 'represents' the language needs of learners. This study proposes an alternative way to maximize representativeness: enable users to compile a wordlist from any text or corpus that is of interest to them, and provide the means of doing so. Using the Natural Language Toolkit (NLTK), this study shows how a few Natural Language Processing (NLP) techniques may be used to compile a list of the most common words in the Europarl corpus and to retrieve example sentences from the corpus for each word. This approach has potential applications both for language learners and for the preparation of instructional materials in an ESP setting.
Graduation Semester: 2016-08
Type of Resource: text
Permalink: http://hdl.handle.net/2142/92955
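
Illustrative sketch (not taken from the thesis): the abstract describes compiling a list of the most frequent words in a corpus and retrieving example sentences for each word. The Python below is a minimal sketch of that general idea under stated assumptions. It uses only the standard library, with a naive regular-expression sentence splitter and tokenizer standing in for the NLTK tools the thesis names, so that it runs without any downloads. The function names, the tiny stopword list, and the sample text are all hypothetical and are not drawn from the thesis or from Europarl.

```python
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for", "that", "this"}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Return lowercase alphabetic tokens (an apostrophe is kept inside a word)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())


def contextualized_wordlist(text, top_n=10, max_examples=2):
    """Return [(word, count, [example sentences])] for the top_n most frequent words."""
    counts = Counter()
    examples = defaultdict(list)
    for sentence in split_sentences(text):
        seen = set()  # words already credited with this sentence as an example
        for token in tokenize(sentence):
            if token in STOPWORDS:
                continue
            counts[token] += 1
            if token not in seen and len(examples[token]) < max_examples:
                examples[token].append(sentence)
            seen.add(token)
    return [(word, count, examples[word]) for word, count in counts.most_common(top_n)]


if __name__ == "__main__":
    # Hypothetical sample text standing in for a user-supplied corpus.
    sample = (
        "The committee approved the budget. "
        "The budget covers regional development. "
        "Members of the committee discussed the budget again."
    )
    for word, count, sentences in contextualized_wordlist(sample, top_n=2):
        print(word, count, sentences)
```

A fuller version would presumably swap in a proper tokenizer, a complete stopword list, and grouping by lemma or word family. Those choices are left open here because the abstract does not specify them.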

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
