Leveraging distributional context for safe and interactive data science at scale

Macke, Stephen Thomas

Content Files

MACKE-DISSERTATION-2021.pdf

Permalink

https://hdl.handle.net/2142/113024

Description

Title

Leveraging distributional context for safe and interactive data science at scale

Author(s)

Macke, Stephen Thomas

Issue Date

2021-07-14

Director of Research (if dissertation) or Advisor (if thesis)

Parameswaran, Aditya

Doctoral Committee Chair(s)

Parameswaran, Aditya

Committee Member(s)

Sundaram, Hari
Tong, Hanghang
Beutel, Alex

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Date of Ingest

2022-01-12T21:45:39Z

Keyword(s)

data science
visualization
EDA
computational notebooks
approximate query processing

Abstract

Data science is an iterative, exploratory, and ad-hoc process performed by individuals and teams possessing increasingly varied backgrounds and skill-sets. As such, we need data science to be interactive, so that data scientists are not bottlenecked when trying out new hypotheses or confirming existing ones. Moreover, data science must be safe, ensuring that data scientists, especially those with limited programming or analysis experience, avoid making incorrect inferences. Safety and interactivity are typically at odds with one another, since various notions of safety often eschew "shortcuts" that make working with large-scale data tractable. In this dissertation, we aim to meet the dual objectives of interactivity and safety at scale by leveraging distributional context, specifically the distributions of the data and the operations performed by data scientists. We apply this "recipe" to five different key data science settings: (i) for machine learning development, we provide context-aware caching algorithms that allow model developers to benefit from interactive iteration times during model development, while not requiring error-prone manual tracking of reusable intermediates; (ii) for visualization search, we develop context-aware sampling algorithms that support interactive search for patterns in visualizations, while ensuring that the results meet rigorous quality guarantees; (iii) for browsing, we develop workload-aware learned Bloom filters optimized for multidimensional data that allow analysts to quickly identify records that have been examined before, all while guarding against false negatives; (iv) for report generation, we develop context-aware aggregate approximation algorithms that provide rigorous distribution-aware confidence intervals around aggregates, while ensuring that the intervals are "tighter", allowing analysts to make decisions sooner; and (v) finally, for error-prone interactions in computational notebooks, we demonstrate approximate lineage-capture techniques that warn data scientists of unsafe cell executions for many cases encountered in practice.

Graduation Semester

2021-08

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois

Dissertations and Theses - Computer Science

Dissertations and Theses from the Siebel School of Computer Science
