Predicting controlled vocabulary based on text and citations: Case studies in medical subject headings in MEDLINE and patents
Kehoe, Adam K.
Permalink
https://hdl.handle.net/2142/105645
Description
Title
Predicting controlled vocabulary based on text and citations: Case studies in medical subject headings in MEDLINE and patents
Author(s)
Kehoe, Adam K.
Issue Date
2019-07-09
Director of Research (if dissertation) or Advisor (if thesis)
Torvik, Vetle I
Doctoral Committee Chair(s)
Torvik, Vetle I
Committee Member(s)
Smalheiser, Neil R
Dubin, David S
Ludäscher, Bertram
Downie, John S
Department of Study
Information Sciences
Discipline
Library & Information Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
Ph.D.
Degree Level
Dissertation
Keyword(s)
Controlled vocabulary
Medical Subject Headings
Controlled Vocabulary Prediction
Abstract
This dissertation makes three contributions in the area of controlled vocabulary prediction of Medical Subject Headings. The first contribution is a new partial matching measure based on distributional semantics. The second contribution is a probabilistic model based on text similarity and citations. The third contribution is a case study of cross-domain vocabulary prediction in US patents. Medical Subject Headings (MeSH) are an important life sciences controlled vocabulary. They are an ideal ground to study controlled vocabulary prediction due to their complexity, hierarchical nature, and practical significance. The dissertation begins with an updated analysis of human indexing consistency in MEDLINE. This study demonstrates the need for partial matching measures to account for indexing variability. Here, I develop four measures combining the MeSH hierarchy and contextual similarity. These measures provide several new tools for evaluating and diagnosing controlled vocabulary models. Next, a generalized predictive model is introduced. This model uses citations and abstract similarity as inputs to a hybrid KNN classifier. Citations and abstracts are found to be complementary in that they reliably produce unique and relevant candidate terms. Finally, the predictive model is applied to a corpus of approximately 65,000 biomedical US patents. This case study explores differences in the vocabulary of MEDLINE and patents, as well as the prospects for MeSH prediction to open new scholarly opportunities in economics and health policy research.