Leveraging distributional context for safe and interactive data science at scale
Macke, Stephen Thomas
Permalink
https://hdl.handle.net/2142/113024
Description
- Title
- Leveraging distributional context for safe and interactive data science at scale
- Author(s)
- Macke, Stephen Thomas
- Issue Date
- 2021-07-14
- Director of Research (if dissertation) or Advisor (if thesis)
- Parameswaran, Aditya
- Doctoral Committee Chair(s)
- Parameswaran, Aditya
- Committee Member(s)
- Sundaram, Hari
- Tong, Hanghang
- Beutel, Alex
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- data science
- visualization
- EDA
- computational notebooks
- approximate query processing
- Abstract
- Data science is an iterative, exploratory, and ad-hoc process performed by individuals and teams possessing increasingly varied backgrounds and skill-sets. As such, we need data science to be interactive, so that data scientists are not bottlenecked when trying out new hypotheses or confirming existing ones. Moreover, data science must be safe, ensuring that data scientists, especially those with limited programming or analysis experience, avoid making incorrect inferences. Safety and interactivity are typically at odds with one another, since various notions of safety often eschew "shortcuts" that make working with large-scale data tractable. In this dissertation, we aim to meet the dual objectives of interactivity and safety at scale by leveraging distributional context, specifically the distributions of the data and the operations performed by data scientists. We apply this "recipe" to five different key data science settings: (i) for machine learning development, we provide context-aware caching algorithms that allow model developers to benefit from interactive iteration times during model development, while not requiring error-prone manual tracking of reusable intermediates; (ii) for visualization search, we develop context-aware sampling algorithms that support interactive search for patterns in visualizations, while ensuring that the results meet rigorous quality guarantees; (iii) for browsing, we develop workload-aware learned Bloom filters optimized for multidimensional data that allow analysts to quickly identify records that have been examined before, all while guarding against false negatives; (iv) for report generation, we develop context-aware aggregate approximation algorithms that provide rigorous distribution-aware confidence intervals around aggregates, while ensuring that the intervals are "tighter", allowing analysts to make decisions sooner; and (v) finally, for error-prone interactions in computational notebooks, we demonstrate approximate lineage-capture techniques that warn data scientists of unsafe cell executions for many cases encountered in practice.
- Graduation Semester
- 2021-08
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/113024
- Copyright and License Information
- Copyright 2021 Stephen Thomas Macke
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science