Classifying GitHub repositories with minimal human efforts
Zhang, Yu
Permalink
https://hdl.handle.net/2142/104942
Description
- Title
- Classifying GitHub repositories with minimal human efforts
- Author(s)
- Zhang, Yu
- Issue Date
- 2019-04-26
- Director of Research (if dissertation) or Advisor (if thesis)
- Han, Jiawei
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Date of Ingest
- 2019-08-23T20:05:22Z
- Keyword(s)
- GitHub
- Classification
- Weak Supervision
- Abstract
- GitHub is a great platform for sharing software code, data, and other resources. To improve search and analysis of the vast spectrum of resources on GitHub, it is necessary to conduct automatic, flexible, and user-guided classification of GitHub repositories. In this paper, we study how to build a customized repository classifier with minimal human annotation. Previous document classification methods cannot be directly applied to our task due to three unique challenges: (1) multi-modal signals: besides text, signals in other formats need to be explored to uncover the topic of a repository; (2) low data quality: GitHub README files, which usually contain code and commands, are noisier than typical text data such as scientific papers and news; and (3) limited ground truth: users cannot afford to label many repositories for training a good classifier. To deal with these challenges, we propose GitClass, a framework to classify GitHub repositories under weak supervision. Three key modules (heterogeneous network construction and embedding, keyword extraction and topic modeling, and pseudo document generation) are used to tackle the three challenges, respectively. We conduct extensive experiments on three large-scale GitHub repository datasets and observe an evident performance boost over state-of-the-art embedding and classification algorithms.
- Graduation Semester
- 2019-05
- Type of Resource
- text
- Copyright and License Information
- Copyright 2019 Yu Zhang
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Siebel School of Computer Science