Information extraction from digital social trace data with applications to social media and scholarly communication data
Mishra, Shubhanshu
Permalink
https://hdl.handle.net/2142/107965
Description
- Title
- Information extraction from digital social trace data with applications to social media and scholarly communication data
- Author(s)
- Mishra, Shubhanshu
- Issue Date
- 2020-05-03
- Director of Research (if dissertation) or Advisor (if thesis)
- Diesner, Jana
- Doctoral Committee Chair(s)
- Diesner, Jana
- Committee Member(s)
- Torvik, Vetle I
- Karahalios, Karrie G
- Brunner, Robert J
- Department of Study
- Information Sciences
- Discipline
- Library & Information Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Social Media Analysis
- Machine Learning
- Data Mining
- Scholarly Data Analysis
- Digital Libraries
- Visualization
- Computer Science
- Information Science
- Open Source
- Multi task learning
- Deep Learning
- Active Learning
- Natural Language Processing
- Big Data Analysis
- Abstract
- Information extraction (IE) aims at extracting structured data from unstructured or semi-structured data. The thesis starts by identifying social media data and scholarly communication data as special cases of digital social trace data (DSTD). This identification allows us to utilize the graph structure of the data (e.g., user connected to a tweet, author connected to a paper, author connected to authors) for developing new information extraction tasks. The thesis focuses on information extraction from DSTD, first using only the text data from tweets and scholarly paper abstracts, and then using the full graph structure of Twitter and scholarly communications datasets. This thesis makes three major contributions.
First, new IE tasks based on the DSTD representation of the data are introduced. For scholarly communication data, methods are developed to identify article- and author-level novelty and expertise. Furthermore, interfaces for examining the extracted information are introduced. A social communication temporal graph (SCTG) is introduced for comparing different communication data like tweets tagged with sentiment, tweets about a search query, and Facebook group posts. For social media, new text classification categories are introduced, with the aim of identifying enthusiastic and supportive users via their tweets. Additionally, the correlation between sentiment classes and Twitter meta-data in public corpora is analyzed, leading to the development of a better model for sentiment classification.
Second, methods are introduced for extracting information from social media and scholarly data. For scholarly data, a semi-automatic method is introduced for the construction of a large-scale taxonomy of computer science concepts. The method relies on the Wikipedia category tree. The constructed taxonomy is used for identifying key computer science phrases in scholarly papers and tracking their evolution over time. Similarly, for social media data, machine learning models based on human-in-the-loop learning, semi-supervised learning, and multi-task learning are introduced for identifying sentiment, named entities, part-of-speech tags, phrase chunks, and super-sense tags. The machine learning models are developed with a focus on leveraging all available data. The multi-task models presented here achieve competitive performance against other methods for most of the tasks, while reducing inference-time computational costs.
Finally, this thesis has resulted in the creation of multiple open source tools and public data sets, which can be utilized by the research community. The thesis aims to act as a bridge between research questions and techniques used in DSTD from different domains. The methods and tools presented here can help advance work in the areas of social media and scholarly data analysis. All resources related to this thesis are available at https://shubhanshu.com/phd_thesis/
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/107965
- Copyright and License Information
- Copyright 2020 by Shubhanshu Mishra. All rights reserved.
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Information Sciences
Dissertations and theses from the School of Information Sciences