Privacy-preserving seed-based data synthesis
Bindschadler, Vincent
Description
- Title
- Privacy-preserving seed-based data synthesis
- Author(s)
- Bindschadler, Vincent
- Issue Date
- 2018-07-02
- Director of Research (if dissertation) or Advisor (if thesis)
- Gunter, Carl A.
- Doctoral Committee Chair(s)
- Gunter, Carl A.
- Committee Member(s)
- Zhai, ChengXiang
- Borisov, Nikita
- Smith, Adam D
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Privacy
- Data Privacy
- Synthetic Data
- Abstract
- How can we share sensitive datasets in such a way as to maximize utility while simultaneously safeguarding privacy? This thesis proposes an answer to this question by re-framing the problem of sharing sensitive datasets as a data synthesis task. Specifically, we propose a framework to synthesize full data records in a privacy-preserving way so that they can be shared instead of the original sensitive data. At the core of the framework is a technique called seed-based data synthesis. Seed-based data synthesis produces data records by conditioning the output of a generative model on some input data record called the seed. This technique produces synthetic records that are similar to their seeds, which results in high-quality outputs. But it simultaneously introduces statistical dependence between synthetic records and their seeds, which may compromise privacy. As a countermeasure, we introduce a new class of techniques that can achieve strong privacy notions in this setting: privacy tests. Privacy tests are algorithms that probabilistically reject candidate synthetic records which are determined to leak sensitive information. Synthetic records that fail the test are simply discarded, whereas those that pass the test are deemed safe and included in the synthetic dataset to be shared. We design two privacy tests that provably yield differential privacy. We analyze the quality of synthetic datasets based on a cryptography-inspired definition of distinguishability: if synthetic data records are indistinguishable from real records, then they are (by definition) as useful as real data. On the theory front, we characterize the utility-privacy trade-off of seed-based data synthesis. On the experimental front, we design an efficient procedure to experimentally quantify distinguishability. We experimentally validate the seed-based data synthesis framework using five probabilistic generative models. Specifically, using real-world datasets as input, we produce synthetic data records for four different application scenarios and data types: location trajectories, census microdata, medical data, and facial images. We evaluate the quality of the produced synthetic records using both application-dependent utility metrics and distinguishability, and show that the framework is capable of producing highly realistic synthetic data records while providing differential privacy for conservative parameters. (An illustrative sketch of the synthesize-then-test loop appears after this record.)
- Graduation Semester
- 2018-08
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/101661
- Copyright and License Information
- Copyright 2018 Vincent Bindschadler
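
The abstract describes a synthesize-then-test loop: condition a generative model on a seed record, then release the candidate only if a privacy test accepts it. The Python sketch below is a minimal, illustrative rendering of that loop under stated assumptions, not the dissertation's implementation: the toy Gaussian perturbation model, the plausible-deniability-style test, and the parameters SIGMA, K, and LOG_GAMMA are all introduced here for exposition (the thesis itself uses five probabilistic generative models and two privacy tests that provably yield differential privacy).

import math
import random

SIGMA = 1.0       # noise scale of the toy generative model (assumption)
K = 3             # minimum number of plausible seeds required to release (assumption)
LOG_GAMMA = 1.0   # closeness factor between seed likelihoods, in log space (assumption)

def sample(seed):
    """Toy generative model: perturb each attribute of the seed with Gaussian noise."""
    return [x + random.gauss(0.0, SIGMA) for x in seed]

def log_prob(candidate, seed):
    """Log-probability that the toy model produces `candidate` when conditioned on `seed`."""
    return sum(
        -0.5 * ((c - s) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))
        for c, s in zip(candidate, seed)
    )

def privacy_test(candidate, seed, all_seeds):
    """Plausible-deniability-style test (illustrative stand-in for the thesis's tests):
    accept only if at least K seeds, including the true one, could have produced the
    candidate with probability within a factor exp(LOG_GAMMA) of the true seed's."""
    ref = log_prob(candidate, seed)
    plausible = sum(1 for s in all_seeds if abs(log_prob(candidate, s) - ref) <= LOG_GAMMA)
    return plausible >= K

def synthesize(seeds, max_attempts=20):
    """Seed-based synthesis loop: candidates that fail the privacy test are discarded."""
    synthetic = []
    for seed in seeds:
        for _ in range(max_attempts):
            candidate = sample(seed)
            if privacy_test(candidate, seed, seeds):
                synthetic.append(candidate)
                break
    return synthetic

if __name__ == "__main__":
    random.seed(0)
    # Clustered two-attribute records, so most candidates have plausible alternative seeds.
    seeds = [[random.gauss(5.0, 1.0), random.gauss(5.0, 1.0)] for _ in range(50)]
    released = synthesize(seeds)
    print(len(released), "of", len(seeds), "seeds yielded a releasable synthetic record")

In this toy setting the seeds are clustered, so most candidates find enough plausible alternative seeds to pass; with widely dispersed records the same test rejects nearly everything, which mirrors the utility-privacy trade-off the abstract refers to.
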
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)