emrQA: A large corpus for question answering on electronic medical records

Permalink

https://hdl.handle.net/2142/102500

Title: emrQA: A large corpus for question answering on electronic medical records
Author(s): Pampari, Anusri
Issue Date: 2018-12-11
Director of Research (if dissertation) or Advisor (if thesis): Peng, Jian
Department of Study: Computer Science
Discipline: Computer Science
Degree Granting Institution: University of Illinois at Urbana-Champaign
Degree Name: M.S.
Degree Level: Thesis
Keyword(s): Electronic Medical Records, Question Answering, Logical Forms, Semantic Parsing, Dataset Generation, Closed Domain, i2b2
Abstract: We propose a novel methodology for generating large-scale, domain-specific question answering (QA) datasets by repurposing existing annotations created for other NLP tasks. We demonstrate an instance of this methodology by generating a large-scale QA dataset for electronic medical records, leveraging existing expert annotations on clinical notes for various NLP tasks from the community-shared i2b2 datasets. The resulting corpus (emrQA) has 1 million question-logical form pairs and more than 400,000 question-answer evidence pairs. We characterize the dataset and explore its learning potential by training baseline models for question-to-logical-form and question-to-answer mapping.
Graduation Semester: 2018-12
Type of Resource: text
Copyright and License Information: Accepted at the Conference on Empirical Methods in Natural Language Processing (EMNLP) 2018