Classifying GitHub repositories with minimal human efforts
Zhang, Yu
Permalink
https://hdl.handle.net/2142/104942
Description
Title
Classifying GitHub repositories with minimal human efforts
Author(s)
Zhang, Yu
Issue Date
2019-04-26
Director of Research (if dissertation) or Advisor (if thesis)
Han, Jiawei
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
GitHub
Classification
Weak Supervision
Abstract
GitHub is a widely used platform for sharing software code, data, and other resources. To improve search and analysis across the vast spectrum of resources on GitHub, it is necessary to conduct automatic, flexible, and user-guided classification of GitHub repositories. In this paper, we study how to build a customized repository classifier with minimal human annotation. Previous document classification methods cannot be directly applied to our task because of three unique challenges: (1) multi-modal signals: besides text, signals in other formats need to be explored to uncover the topic of a repository; (2) low data quality: GitHub README files, which usually contain code and commands, are noisier than typical text data such as scientific papers and news; and (3) limited ground truth: users cannot afford to label enough repositories to train a good classifier. To address these challenges, we propose GitClass, a framework for classifying GitHub repositories under weak supervision. Its three key modules (heterogeneous network construction and embedding, keyword extraction and topic modeling, and pseudo-document generation) tackle the three challenges, respectively. We conduct extensive experiments on three large-scale GitHub repository datasets and observe a clear performance improvement over state-of-the-art embedding and classification algorithms.