Understanding and mitigating privacy risk in machine learning systems
Long, Yunhui
Permalink
https://hdl.handle.net/2142/107972
Description
- Title
- Understanding and mitigating privacy risk in machine learning systems
- Author(s)
- Long, Yunhui
- Issue Date
- 2020-05-04
- Director of Research (if dissertation) or Advisor (if thesis)
- Gunter, Carl A
- Doctoral Committee Chair(s)
- Gunter, Carl A
- Committee Member(s)
- Zhai, ChengXiang
- Li, Bo
- Shokri, Reza
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- privacy
- machine learning
- Abstract
- Recent years have witnessed rapid development of machine learning systems and a widespread increase in machine learning applications. With this widespread adoption, however, privacy issues have emerged. This thesis studies the privacy risk in modern machine learning systems in two ways. First, we improve the understanding of machine learning privacy through attacks and measurements. Because of the increasing complexity and lack of transparency of state-of-the-art machine learning models, it is challenging to understand what information a model learns from its training data and whether that information could be leaked through the model's predictions. We therefore design various attacks to infer different kinds of information from machine learning models trained on sensitive data. By analyzing the performance of these attacks, we gain a better understanding of the privacy risk of sharing these models. Second, we propose protection mechanisms at different levels to balance privacy and data utility. We divide the use of sensitive data in a modern machine learning system into three levels based on the trade-off between data utility and privacy protection. At the first level, we consider data with high utility requirements and relatively low privacy protection, such as system logs containing heterogeneous, high-dimensional data. This type of data is very sensitive to noise injection, making it challenging to achieve a strong privacy guarantee without incurring a great loss in data utility. To address this problem, we propose empirical protections based on hypothesis tests. Our approach uses various hypothesis tests to identify potential information leakage from the data and adds the minimum amount of noise sufficient to mitigate the identified risks. Although this approach does not provide a strong theoretical guarantee, it allows users to share their data with higher confidence and minimal utility loss. At the second level, we consider sensitive data that need to be shared for general purposes. For example, datasets containing personal photos can be used in a wide range of applications, including face recognition, human pose extraction, and mood detection. However, these photos are also extremely sensitive because they contain a great deal of private information. For this type of data, it is important to maintain a proper balance between privacy and data utility. On the one hand, due to the sensitive nature of the data, it is necessary to apply rigorous privacy protections such as differential privacy. On the other hand, to allow multiple applications to use the released data, the privacy protection mechanisms need to preserve the original data distribution to the greatest extent possible. Based on these requirements, we design G-PATE, a novel approach for training a scalable differentially private data generator, which can produce synthetic datasets with strong privacy guarantees while preserving high data utility. At the third level, we consider sensitive data that are useful for specific applications. For this type of data, it is often unnecessary to share the original dataset. Instead, data owners can share differentially private machine learning models tailored to the needs of the applications. By sharing only the models, we limit the use of the sensitive data to the approved applications while improving model utility under the same privacy guarantee. As an example, this thesis proposes the first differentially private graph convolutional network (DP-GCN). By guaranteeing edge-differential privacy, DP-GCN allows users to analyze graph-structured data without leaking sensitive connection information, such as private real-life connections in social networks.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/107972
- Copyright and License Information
- Copyright 2020 Yunhui Long
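Note on edge-differential privacy (illustrative, not from the dissertation): the abstract above states that DP-GCN guarantees edge-differential privacy for graph-structured data. The dissertation's own mechanism is not reproduced in this record; the sketch below only illustrates the general notion using a standard randomized-response perturbation of a graph's adjacency matrix, in which each potential edge is independently retained or flipped. All names (randomized_response_adjacency, p_keep) are hypothetical.

import numpy as np

def randomized_response_adjacency(adj, epsilon, rng=None):
    # Perturb a symmetric 0/1 adjacency matrix with randomized response.
    # Each potential (undirected) edge is kept with probability
    # exp(epsilon) / (1 + exp(epsilon)) and flipped otherwise, which
    # satisfies epsilon-edge-differential privacy for the released matrix.
    rng = np.random.default_rng() if rng is None else rng
    p_keep = np.exp(epsilon) / (1.0 + np.exp(epsilon))
    n = adj.shape[0]
    noisy = adj.astype(np.int8).copy()
    iu = np.triu_indices(n, k=1)          # upper triangle, no self-loops
    flip = rng.random(iu[0].size) >= p_keep
    noisy[iu] = np.where(flip, 1 - noisy[iu], noisy[iu])
    noisy[(iu[1], iu[0])] = noisy[iu]     # mirror to keep the matrix symmetric
    return noisy

# Toy example: a 3-node path graph released with epsilon = 1.0.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
print(randomized_response_adjacency(adj, epsilon=1.0))

With epsilon = 1.0, each released edge bit matches the true bit with probability about 0.73, and any model trained on the perturbed graph inherits the edge-level guarantee by the post-processing property of differential privacy.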
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Computer Science (Dissertations and Theses from the Dept. of Computer Science)