Privacy-preserving seed-based data synthesis
Bindschadler, Vincent
Description
- Title
- Privacy-preserving seed-based data synthesis
- Author(s)
- Bindschadler, Vincent
- Issue Date
- 2018-07-02
- Director of Research (if dissertation) or Advisor (if thesis)
- Gunter, Carl A.
- Doctoral Committee Chair(s)
- Gunter, Carl A.
- Committee Member(s)
- Zhai, ChengXiang
- Borisov, Nikita
- Smith, Adam D
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Privacy
- Data Privacy
- Synthetic Data
- Abstract
- How can we share sensitive datasets in such a way as to maximize utility while simultaneously safeguarding privacy? This thesis proposes an answer to this question by re-framing the problem of sharing sensitive datasets as a data synthesis task. Specifically, we propose a framework to synthesize full data records in a privacy-preserving way so that they can be shared instead of the original sensitive data. At the core of the framework is a technique called seed-based data synthesis. Seed-based data synthesis produces data records by conditioning the output of a generative model on some input data record called the seed. This technique produces synthetic records that are similar to their seeds, which results in high-quality outputs. But it simultaneously introduces statistical dependence between synthetic records and their seeds, which may compromise privacy. As a countermeasure, we introduce a new class of techniques that can achieve strong privacy notions in this setting: privacy tests. Privacy tests are algorithms that probabilistically reject candidate synthetic records which are determined to leak sensitive information. Synthetic records that fail the test are simply discarded, whereas those that pass the test are deemed safe and included in the synthetic dataset to be shared. We design two privacy tests that provably yield differential privacy. We analyze the quality of synthetic datasets based on a cryptography-inspired definition of distinguishability: if synthetic data records are indistinguishable from real records, then they are (by definition) as useful as real data. On the theory front, we characterize the utility-privacy trade-off of seed-based data synthesis. On the experimental front, we design an efficient procedure to experimentally quantify distinguishability. We experimentally validate the seed-based data synthesis framework using five probabilistic generative models. Specifically, using real-world datasets as input, we produce synthetic data records for four different application scenarios and data types: location trajectories, census microdata, medical data, and facial images. We evaluate the quality of the produced synthetic records using both application-dependent utility metrics and distinguishability, and show that the framework is capable of producing highly realistic synthetic data records while providing differential privacy for conservative parameters. (An illustrative sketch of the synthesize-then-test loop appears after this record.)
- Graduation Semester
- 2018-08
- Type of Resource
- text
- Permalink
- http://hdl.handle.net/2142/101661
- Copyright and License Information
- Copyright 2018 Vincent Bindschadler
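
The abstract describes a synthesize-then-test loop: condition a generative model on a seed record, then release the candidate only if a privacy test accepts it. The Python sketch below is a minimal, illustrative rendering of that loop under stated assumptions, not the dissertation's implementation: the toy Gaussian perturbation model, the plausible-deniability-style test, and the parameters SIGMA, K, and LOG_GAMMA are all introduced here for exposition (the thesis itself uses five probabilistic generative models and two privacy tests that provably yield differential privacy).

import math
import random

SIGMA = 1.0       # noise scale of the toy generative model (assumption)
K = 3             # minimum number of plausible seeds required to release (assumption)
LOG_GAMMA = 1.0   # closeness factor between seed likelihoods, in log space (assumption)

def sample(seed):
    """Toy generative model: perturb each attribute of the seed with Gaussian noise."""
    return [x + random.gauss(0.0, SIGMA) for x in seed]

def log_prob(candidate, seed):
    """Log-probability that the toy model produces `candidate` when conditioned on `seed`."""
    return sum(
        -0.5 * ((c - s) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2.0 * math.pi))
        for c, s in zip(candidate, seed)
    )

def privacy_test(candidate, seed, all_seeds):
    """Plausible-deniability-style test (illustrative stand-in for the thesis's tests):
    accept only if at least K seeds, including the true one, could have produced the
    candidate with probability within a factor exp(LOG_GAMMA) of the true seed's."""
    ref = log_prob(candidate, seed)
    plausible = sum(1 for s in all_seeds if abs(log_prob(candidate, s) - ref) <= LOG_GAMMA)
    return plausible >= K

def synthesize(seeds, max_attempts=20):
    """Seed-based synthesis loop: candidates that fail the privacy test are discarded."""
    synthetic = []
    for seed in seeds:
        for _ in range(max_attempts):
            candidate = sample(seed)
            if privacy_test(candidate, seed, seeds):
                synthetic.append(candidate)
                break
    return synthetic

if __name__ == "__main__":
    random.seed(0)
    # Clustered two-attribute records, so most candidates have plausible alternative seeds.
    seeds = [[random.gauss(5.0, 1.0), random.gauss(5.0, 1.0)] for _ in range(50)]
    released = synthesize(seeds)
    print(len(released), "of", len(seeds), "seeds yielded a releasable synthetic record")

In this toy setting the seeds are clustered, so most candidates find enough plausible alternative seeds to pass; with widely dispersed records the same test rejects nearly everything, which mirrors the utility-privacy trade-off the abstract refers to.
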
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)