The impact of author name disambiguation on knowledge discovery from large-scale scholarly data
Kim, Jinseok
Permalink
https://hdl.handle.net/2142/98269
Description
- Title
- The impact of author name disambiguation on knowledge discovery from large-scale scholarly data
- Author(s)
- Kim, Jinseok
- Issue Date
- 2017-07-11
- Director of Research (if dissertation) or Advisor (if thesis)
- Diesner, Jana
- Doctoral Committee Chair(s)
- Diesner, Jana
- Committee Member(s)
- Blake, Catherine L.
- Torvik, Vetle I.
- Shumate, Michelle
- Lee, Seok-Hyoung
- Department of Study
- Information Sciences
- Discipline
- Library & Information Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Name disambiguation
- Coauthorship networks
- Data quality
- Bibliometrics
- Big data
- Abstract
- In this study, I demonstrate that the choice of disambiguation methods for resolving author name ambiguity can adversely affect our understanding of scholarly collaboration patterns and coauthorship network structures extracted from large-scale scholarly data. By utilizing large-scale bibliometric data, scholars in many fields have gleaned knowledge for use in scholarly evaluation, collaborator recommendations, research policy evaluation, and network-evolution modeling. A common challenge has been that author names in bibliometric data are not properly disambiguated. Different authors may share the same name and be misrepresented as a single author, which can lead to a “merging of identities.” Conversely, one author may use name variations and be represented as two or more different authors, which can lead to a “splitting of identities.” When faced with these challenges, most scholars have pre-processed bibliometric data using simple heuristics (e.g., if two author names share the same surname and given name initials, they are presumed to represent the same author identity) and assumed that their findings are robust to errors due to author name ambiguity. I test this long-held assumption in bibliometrics by measuring the impact of author name ambiguity on network properties. I accomplish this under varying conditions, including network size and cumulative time window (from 1991 to 2009), using four large-scale bibliometric datasets that cover biomedicine, computer science, psychology and neuroscience, and one nation’s entire domestic publication output. For this task, I compare the statistical properties of coauthorship networks constructed from algorithmically disambiguated data (i.e., close to clean data) against those of the same networks when they are compromised by authors misidentified via first-initial and all-initials disambiguation methods.
In addition, I simulate the levels of merging and splitting incrementally using those empirical datasets. My findings show that initial-based name disambiguation methods can severely distort our understanding of given networks and that such distortion worsens over time. Moreover, the distortion sometimes leads to biased or false knowledge of coauthorship network formation and evolution mechanisms, such as preferential attachment generating the power-law distribution of vertex degree, and to false validation of theories about the choice of collaborators in scientific research. This may result in ill-informed decisions about research policy and resource allocation. Besides measuring the impact of name ambiguity on network properties, I also test how name ambiguity can be estimated using simple heuristics such as dataset size and how merged author identities can be detected via an author’s ego-network properties, in order to provide practical guidance for corrective measures. My research calls for further study of the effects of author name ambiguity on coauthorship network properties and is expected to help scholars establish better practices for knowledge discovery from large-scale scholarly data.
- Graduation Semester
- 2017-08
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/98269
- Copyright and License Information
- Copyright 2017 Jinseok Kim
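Illustration (not part of the dissertation record): the first-initial and all-initials heuristics named in the abstract can be sketched roughly as below. This is a minimal sketch under stated assumptions. The function names, the "Surname, Given" input format, the key format, and the sample names are all hypothetical and are not taken from the dissertation; the actual matching rules used in the study may differ.

```python
from collections import defaultdict

# Hypothetical name strings in "Surname, Given names" form (not from the dissertation).
NAMES = ["Kim, Jinseok", "Kim, Jin S.", "Kim, J.", "Kim, Jihoon", "Kim, Jin Soo"]


def first_initial_key(name):
    """Surname plus the first given-name initial, e.g. 'Kim, Jin S.' -> 'kim, j'."""
    surname, _, given = name.partition(",")
    given = given.strip()
    initial = given[0].lower() if given else ""
    return f"{surname.strip().lower()}, {initial}"


def all_initials_key(name):
    """Surname plus every given-name initial, e.g. 'Kim, Jin S.' -> 'kim, js'."""
    surname, _, given = name.partition(",")
    # Treat periods as separators so 'J.S.' and 'J. S.' yield the same initials.
    parts = given.replace(".", " ").split()
    initials = "".join(part[0].lower() for part in parts)
    return f"{surname.strip().lower()}, {initials}"


def group_by_key(names, key_fn):
    """Group name strings that a heuristic would treat as one author identity."""
    groups = defaultdict(set)
    for name in names:
        groups[key_fn(name)].add(name)
    return dict(groups)


if __name__ == "__main__":
    for label, fn in [("first-initial", first_initial_key),
                      ("all-initials", all_initials_key)]:
        print(label)
        for key, members in sorted(group_by_key(NAMES, fn).items()):
            print("  ", key, "->", sorted(members))
```

On these sample names, the first-initial key places all five strings in one group, which would be a merging error if they belong to different people. The all-initials key yields two groups, which would be a splitting error if, say, "Kim, J." and "Kim, Jin S." are the same person.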
Owning Collections
Dissertations and Theses - Information Sciences
Dissertations and theses from the School of Information Sciences
Graduate Dissertations and Theses at Illinois (PRIMARY)
Graduate Theses and Dissertations at Illinois