Withdraw
Loading…
Query K-means clustering for crowdsourcing
Pan, Chao
Loading…
Permalink
https://hdl.handle.net/2142/106229
Description
- Title
- Query K-means clustering for crowdsourcing
- Author(s)
- Pan, Chao
- Issue Date
- 2019-12-02
- Director of Research (if dissertation) or Advisor (if thesis)
- Milenkovic, Olgica
- Department of Study
- Electrical & Computer Eng
- Discipline
- Electrical & Computer Engr
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- K-means clustering
- active learning
- semi-supervised learning
- coupon collector's problem
- crowdsourcing
- Abstract
- This thesis focuses on solving the $K$-means clustering problem approximately with side information provided by crowdsourcing. Both binary same-cluster oracle and general crowdsourcing framework are considered. It can be shown that, under some mild assumptions on the smallest cluster size, one can obtain a $(1+\epsilon)$-approximation for the optimal potential with probability at least $1-\delta$, where $\epsilon>0$ and $\delta\in(0,1)$, using an expected number of $O(\frac{K^3}{\epsilon \delta})$ noiseless same-cluster queries and comparison-based clustering of complexity $O(ndK + \frac{K^3}{\epsilon \delta})$; here, $n$ denotes the number of points and $d$ the dimension of space. Compared to a handful of other known approaches that perform importance sampling to account for small cluster sizes, the proposed query technique reduces the number of queries by a factor of roughly $O(\frac{K^6}{\epsilon^3})$, at the cost of possibly missing very small clusters. This setting is extended to the case where some queries to the oracle produce erroneous information, and where certain points, termed outliers, do not belong to any clusters. Incorporating state-of-the-art results in crowdsourcing can further improve the performance of the algorithm. Note that the proof techniques used in this thesis differ from previous methods used for $K$-means clustering analysis, as they rely on estimating the sizes of the clusters and the number of points needed for accurate centroid estimation and subsequent nontrivial generalizations of the double Dixie cup problem. The performances of proposed algorithms are illustrated on both synthetic and real datasets, including MNIST and CIFAR $10$.
- Graduation Semester
- 2019-12
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/106229
- Copyright and License Information
- Copyright 2019 Chao Pan
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at IllinoisDissertations and Theses - Electrical and Computer Engineering
Dissertations and Theses in Electrical and Computer EngineeringManage Files
Loading…
Edit Collection Membership
Loading…
Edit Metadata
Loading…
Edit Properties
Loading…
Embargoes
Loading…