Attacks on medical data and centralized social platforms: PU learning and graphical inference
Su, Du
Description
- Title
- Attacks on medical data and centralized social platforms: PU learning and graphical inference
- Author(s)
- Su, Du
- Issue Date
- 2021-06-15
- Director of Research (if dissertation) or Advisor (if thesis)
- Lu, Yi
- Doctoral Committee Chair(s)
- Lu, Yi
- Committee Member(s)
- Hajek, Bruce
- Srikant, Rayadurgam
- Viswanath, Pramod
- Department of Study
- Electrical & Computer Engineering
- Discipline
- Electrical & Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- privacy
- social network
- data-mining
- density evolution
- neural networks
- Abstract
- Sensitive information in user data, such as health status, financial history, and personal preferences, faces a high risk of being exposed to adversarial entities, as well as being abused by the service provider. One natural proposal for protecting this sensitive information is to prohibit individual-level operations on user records: only aggregate statistics of the sensitive data may be stored or released. It is widely believed that aggregation preserves the typical characteristics of users for data-mining tasks while preventing the infringement of individual privacy, since aggregated statistics carry little information about any individual user. However, these privacy-preserving techniques may fail to eliminate the risk of sensitive data leakage. Aggregation cannot fully hide the features of a dataset, and adversarial entities can exploit the features of aggregate data to recover the sensitive information in user records. Moreover, highly centralized social platforms, such as TikTok, can design special aggregation algorithms from which individual information can be recovered completely, allowing them to exploit user preferences. In this thesis, we study these two scenarios in which aggregation fails to protect individual users' sensitive data. First, we show the hazard of re-identification of sensitive class labels caused by revealing a noisy sample mean of one class. With a novel formulation of the re-identification attack as a generalized positive-unlabeled (PU) learning problem, we prove that the risk function of the re-identification problem is closely related to that of learning with complete data. We demonstrate that, given a one-sided noisy sample mean, an effective re-identification attack can be devised with existing PU learning algorithms.
We then propose a novel algorithm, growPU, that exploits the unique properties of the sample mean and consistently outperforms existing PU learning algorithms on the re-identification task. GrowPU achieves re-identification accuracy of $93.6\%$ on the MNIST dataset and $88.1\%$ on an online behavioral dataset with a noiseless sample mean. With noise that guarantees $0.01$-differential privacy, growPU achieves $91.9\%$ on the MNIST dataset and $84.6\%$ on the online behavioral dataset. In the second part of the thesis, we present a randomized article-push algorithm and a message-passing reconstruction algorithm with which social media platforms can infer user preferences from only the publicly available aggregate counts of article reads, without storing any individual user's actions. The reconstruction algorithm's $O(n)$ complexity allows it to scale to the large populations typical of social media platforms. Moreover, the feasibility of the privacy attack depends on the algorithm using as few articles as possible, so we determine the minimum number of articles needed for inference with high probability. Given the proportion $0<\epsilon<1$ of users who prefer a given topic, the push algorithm and reconstruction algorithm achieve an article-to-user ratio $\beta=\sqrt{\epsilon(1-\epsilon)}$, at which a phase transition occurs. By formulating the inference problem as a compressed sensing problem, we show that our phase-transition threshold $\sqrt{\epsilon(1-\epsilon)}$ is extremely close to that of compressed sensing, even though the latter algorithm has worst-case $O(n^3)$ complexity and uses a dense Gaussian measurement matrix.
- Graduation Semester
- 2021-08
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/113122
- Copyright and License Information
- Copyright 2021 Du Su
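The abstract's first attack hinges on a simple observation: releasing a noisy sample mean of one class, together with public aggregates and the class proportion, is enough to separate users by their sensitive label. The sketch below is a minimal, hypothetical illustration of that idea on synthetic Gaussian data — it uses a nearest-class-mean rule rather than the thesis's growPU algorithm, and all names and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "user records": two 2-D Gaussian classes; the label is the
# sensitive attribute the attacker wants to re-identify.
n, eps = 1000, 0.3                      # eps = known fraction of positive users
y = (rng.random(n) < eps).astype(int)
X = rng.normal(0.0, 1.0, (n, 2)) + np.where(y[:, None] == 1, 2.0, -1.0)

# The released aggregate: a noisy sample mean of ONE class only
# (the "one-sided noisy sample mean" of the abstract).
mu_pos = X[y == 1].mean(axis=0) + rng.normal(0.0, 0.05, 2)
mu_all = X.mean(axis=0)                  # public overall mean

# Since mu_all ~= eps*mu_pos + (1-eps)*mu_neg, the attacker can solve
# for the unreleased class mean.
mu_neg = (mu_all - eps * mu_pos) / (1 - eps)

# Re-identification by nearest class mean.
pred = (np.linalg.norm(X - mu_pos, axis=1)
        < np.linalg.norm(X - mu_neg, axis=1)).astype(int)
acc = (pred == y).mean()
print(f"re-identification accuracy: {acc:.2f}")
```

On well-separated synthetic classes this toy rule already re-identifies most labels, which is the hazard the thesis quantifies with a proper PU-learning formulation.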
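For the second attack, the abstract gives an explicit article-to-user ratio $\beta=\sqrt{\epsilon(1-\epsilon)}$ at which the phase transition occurs. A small sketch, assuming the minimum article count is simply $\lceil \beta n \rceil$ for $n$ users (the function name and rounding choice are this example's, not the thesis's):

```python
import math

def min_articles(n_users: int, eps: float) -> int:
    """Articles needed at the phase-transition ratio beta = sqrt(eps*(1-eps))."""
    beta = math.sqrt(eps * (1 - eps))
    return math.ceil(beta * n_users)

# The ratio is largest at eps = 0.5 (beta = 0.5) and shrinks as the
# topic preference becomes more lopsided.
for eps in (0.5, 0.1, 0.01):
    print(eps, min_articles(1_000_000, eps))
```

Note the symmetry $\beta(\epsilon)=\beta(1-\epsilon)$: a topic preferred by 10% of users needs the same number of articles as one preferred by 90%.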
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Dissertations and Theses - Electrical and Computer Engineering