New capabilities for large-scale exploratory data analysis
Xu, Liqi
Permalink
https://hdl.handle.net/2142/107971
Description
- Title
- New capabilities for large-scale exploratory data analysis
- Author(s)
- Xu, Liqi
- Issue Date
- 2020-05-04
- Director of Research (if dissertation) or Advisor (if thesis)
- Parameswaran, Aditya
- Doctoral Committee Chair(s)
- Parameswaran, Aditya
- Committee Member(s)
- Zhai, ChengXiang
- Xie, Tao
- Cole, Richard L.
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Exploratory Data Analysis
- Data Management
- Abstract
- The ever-rising diversity of data generated, manipulated, and analyzed every day engenders a variety of data formats, ranging from one fixed dataset to multiple versions of a dataset stored across multiple data sources. This variety of formats has led to substantial challenges in data exploration. Existing systems do not effectively support querying capabilities across these formats: (i) Browsing: When exploring a single dataset, data scientists often need to examine a collection of records that satisfy arbitrary predicates. However, current exploratory data analysis tools mainly focus on visual summarization over browsing. (ii) Versioning: With the proliferation of dataset versions generated during different stages of exploration, exploratory data analysis is no longer just about exploring one static dataset. Instead, data scientists need to keep track of massive numbers of versions, as well as search for versions with specific criteria. (iii) Integrating: Nowadays, datasets are collected and stored at multiple sources (e.g., as part of the IoT). When exploring data, data scientists often need to query and join data across databases at disparate locations. In this dissertation, we propose systems that enable query capabilities to efficiently and effectively fulfill these new demands in data exploration. (i) For browsing, we develop NEEDLETAIL, a data exploration engine that employs a light-weight indexing structure along with efficient algorithms to retrieve any-k valid records for arbitrary queries as quickly as possible. (ii) For versioning, we implement and open-source ORPHEUSDB, a dataset version control system that can efficiently track and query across dataset versions. Since versioning queries in ORPHEUSDB take advantage of array operators in relational database systems, we also conduct an extensive experimental study on understanding array implementations in modern database systems. (iii) For integrating, we leverage machine learning techniques to optimize federated query processing and eventually improve the interactivity of data exploration across disparate databases.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/107971
- Copyright and License Information
- Copyright 2020 Liqi Xu
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science