Attacks on medical data and centralized social platforms: PU learning and graphical inference
Su, Du
Description
- Title
- Attacks on medical data and centralized social platforms: PU learning and graphical inference
- Author(s)
- Su, Du
- Issue Date
- 2021-06-15
- Director of Research (if dissertation) or Advisor (if thesis)
- Lu, Yi
- Doctoral Committee Chair(s)
- Lu, Yi
- Committee Member(s)
- Hajek, Bruce
- Srikant, Rayadurgam
- Viswanath, Pramod
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- privacy
- social network
- data-mining
- density evolution
- neural networks
- Abstract
- Sensitive information in user data, such as health status, financial history, and personal preferences, faces a high risk of being exposed to adversarial entities, as well as being abused by the service provider. One natural proposal for protecting this sensitive information is to prohibit individual-level operations on user records: only aggregate statistics of the sensitive data may be stored or released. It is widely believed that aggregation preserves the typical characteristics of users for data-mining tasks while preventing the infringement of individual privacy, since aggregated statistics carry little information about any individual user. However, these privacy-preserving techniques may fail to eliminate the risk of sensitive data leakage. Aggregation cannot fully hide the features of a dataset, and adversarial entities can exploit the features of aggregate data to recover the sensitive information in user records. Moreover, highly centralized social platforms, such as TikTok, can design special aggregation algorithms from which individual information can be recovered completely, allowing them to exploit user preferences. In this thesis, we study these two scenarios in which aggregation fails to protect individual users' sensitive data. First, we show the hazard of re-identification of sensitive class labels caused by revealing a noisy sample mean of one class. With a novel formulation of the re-identification attack as a generalized positive-unlabeled (PU) learning problem, we prove that the risk function of the re-identification problem is closely related to that of learning with complete data. We demonstrate that, given a one-sided noisy sample mean, an effective re-identification attack can be devised with existing PU learning algorithms.
We then propose a novel algorithm, growPU, that exploits the unique properties of the sample mean and consistently outperforms existing PU learning algorithms on the re-identification task. GrowPU achieves re-identification accuracy of $93.6\%$ on the MNIST dataset and $88.1\%$ on an online behavioral dataset with a noiseless sample mean. With noise that guarantees $0.01$-differential privacy, growPU achieves $91.9\%$ on the MNIST dataset and $84.6\%$ on the online behavioral dataset. In the second part of the thesis, we present a randomized article-push algorithm and a message-passing reconstruction algorithm with which social media platforms can infer user preferences from only the publicly available aggregate counts of article reads, without storing any individual user's actions. The reconstruction algorithm's $O(n)$ complexity allows it to scale to the large populations typical of social media platforms. Moreover, the feasibility of the privacy attack depends on the algorithm using as few articles as possible, so we determine the minimum number of articles needed for inference with high probability. Given the proportion $0<\epsilon<1$ of users who prefer a given topic, the push algorithm and reconstruction algorithm achieve an article-to-user ratio $\beta=\sqrt{\epsilon(1-\epsilon)}$, at which a phase transition occurs. By formulating the inference problem as a compressed sensing problem, we show that our phase-transition threshold $\sqrt{\epsilon(1-\epsilon)}$ is extremely close to that of compressed sensing, even though the latter algorithm has worst-case $O(n^3)$ complexity and uses a dense Gaussian measurement matrix.
- Graduation Semester
- 2021-08
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/113122
- Copyright and License Information
- Copyright 2021 Du Su
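The abstract's first attack hinges on a simple observation: releasing a noisy sample mean of one class, together with public aggregates and the class proportion, is enough to separate users by their sensitive label. The sketch below is a minimal, hypothetical illustration of that idea on synthetic Gaussian data — it uses a nearest-class-mean rule rather than the thesis's growPU algorithm, and all names and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "user records": two 2-D Gaussian classes; the label is the
# sensitive attribute the attacker wants to re-identify.
n, eps = 1000, 0.3                      # eps = known fraction of positive users
y = (rng.random(n) < eps).astype(int)
X = rng.normal(0.0, 1.0, (n, 2)) + np.where(y[:, None] == 1, 2.0, -1.0)

# The released aggregate: a noisy sample mean of ONE class only
# (the "one-sided noisy sample mean" of the abstract).
mu_pos = X[y == 1].mean(axis=0) + rng.normal(0.0, 0.05, 2)
mu_all = X.mean(axis=0)                  # public overall mean

# Since mu_all ~= eps*mu_pos + (1-eps)*mu_neg, the attacker can solve
# for the unreleased class mean.
mu_neg = (mu_all - eps * mu_pos) / (1 - eps)

# Re-identification by nearest class mean.
pred = (np.linalg.norm(X - mu_pos, axis=1)
        < np.linalg.norm(X - mu_neg, axis=1)).astype(int)
acc = (pred == y).mean()
print(f"re-identification accuracy: {acc:.2f}")
```

On well-separated synthetic classes this toy rule already re-identifies most labels, which is the hazard the thesis quantifies with a proper PU-learning formulation.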
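For the second attack, the abstract gives an explicit article-to-user ratio $\beta=\sqrt{\epsilon(1-\epsilon)}$ at which the phase transition occurs. A small sketch, assuming the minimum article count is simply $\lceil \beta n \rceil$ for $n$ users (the function name and rounding choice are this example's, not the thesis's):

```python
import math

def min_articles(n_users: int, eps: float) -> int:
    """Articles needed at the phase-transition ratio beta = sqrt(eps*(1-eps))."""
    beta = math.sqrt(eps * (1 - eps))
    return math.ceil(beta * n_users)

# The ratio is largest at eps = 0.5 (beta = 0.5) and shrinks as the
# topic preference becomes more lopsided.
for eps in (0.5, 0.1, 0.01):
    print(eps, min_articles(1_000_000, eps))
```

Note the symmetry $\beta(\epsilon)=\beta(1-\epsilon)$: a topic preferred by 10% of users needs the same number of articles as one preferred by 90%.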
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Dissertations and Theses - Electrical and Computer Engineering