A conceptual model for transparent, reusable, and collaborative data cleaning
Parulian, Nikolaus Nova
Permalink
https://hdl.handle.net/2142/121511
Description
- Title
- A conceptual model for transparent, reusable, and collaborative data cleaning
- Author(s)
- Parulian, Nikolaus Nova
- Issue Date
- 2023-07-13
- Director of Research (if dissertation) or Advisor (if thesis)
- Ludäscher, Bertram
- Doctoral Committee Chair(s)
- Ludäscher, Bertram
- Committee Member(s)
- Downie, John Stephen
- Diesner, Jana
- Bosch, Nigel
- Department of Study
- Information Sciences
- Discipline
- Information Sciences
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Data Cleaning
- Data Quality
- Provenance
- Data Preparation
- Machine Learning
- Artificial Intelligence
- Workflows
- Metadata
- Visualization
- Abstract
- Data cleaning is an essential component of data preparation in machine learning and other data science workflows. It is a time-consuming and error-prone task that can greatly affect the reliability of subsequent analyses. To keep data-cleaning processes transparent and auditable, tools must capture provenance information. However, existing provenance models are limited in their ability to trace and query changes at different levels of granularity. To address this, we proposed a new conceptual model that captures fine-grained retrospective provenance and extends it with prospective provenance, which represents the operations or workflows that change the datasets. This hybrid model allows powerful queries and supports advanced use cases such as auditing data-cleaning workflows. We further extended the model to address reusability and collaboration in data cleaning, covering scenarios where multiple users contribute changes to a dataset. The extended model enables tracking curator actions, identifying dependencies between cleaning operations, and facilitating collaboration. Through an experimental case study, we demonstrated the reusability of data-cleaning workflows, the contributions of different users, and the effectiveness of collaboration in improving data quality.
- Graduation Semester
- 2023-08
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2023 Nikolaus Parulian
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois