Accelerating discovery from high performance computing applications using descriptive metadata management
Lawson, Margaret
Permalink
https://hdl.handle.net/2142/115510
Description
- Title
- Accelerating discovery from high performance computing applications using descriptive metadata management
- Author(s)
- Lawson, Margaret
- Issue Date
- 2022-03-09
- Director of Research (if dissertation) or Advisor (if thesis)
- Gropp, William
- Doctoral Committee Chair(s)
- Gropp, William
- Committee Member(s)
- Ludaescher, Bertram
- Winslett, Marianne
- Byna, Suren
- Lofstead, Jay
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- high performance computing
- descriptive metadata
- RDBMS
- databases
- spatial indexing
- Abstract
- High performance computing (HPC) scientists are producing unprecedented volumes of data that take a long time to load for analysis due both to the data’s size and the I/O bottleneck. This poses a significant obstacle on the path to discovery. However, we make the following observations: 1) many different types of scientific analyses only require loading in the data containing particular features of interest and 2) scientists have a wide range of approaches for identifying these features. Therefore, if we can store information (descriptive metadata) about features of interest that have been identified in-situ, in-transit, or during post-processing, then for all subsequent analyses we can use this information to only read in the data containing these features of interest. This can result in a dramatic reduction in the volume of data that scientists have to read in, thereby greatly accelerating analysis. Despite the potential benefits offered by descriptive metadata management, no prior work has created a descriptive metadata system that is designed to help scientists working with a wide range of applications, analyses and mesh types to restrict their reads to data containing features of interest. In this dissertation, we explore how to create the first such solution and use it to evaluate the following hypothesis: a general descriptive metadata management solution should be integrated into HPC workflows since it enables analyses that would otherwise be impossible by enabling scientists to only load in the data that is needed for analysis. To evaluate this hypothesis we first develop a prototype solution called EMPRESS that supports descriptive metadata for applications with regular rectangular or hexahedral meshes. For the evaluated cases, EMPRESS is able to accelerate analysis by 3–300× (depending on the fraction of the data that contains the feature of interest) while offering relatively high performance, good scalability and small storage overheads.
However, EMPRESS is limited in that it offers lower speedups than we would expect given the reduction in the amount of data that needs to be read in for analysis, requires users to choose between more than 100 system configurations and to use an API with over 60 different functions and, most importantly, does not support applications with complex mesh types. In the remainder of this dissertation, we explore solutions to these problems. First we explore how to efficiently support applications with complex meshes by examining the difficult problem of how to efficiently map from descriptive metadata attributes describing features of interest to the associated data for complex meshes (metadata-data mapping). Through extensive evaluation, we discover that the following novel solution offers the best results: generating an in-memory mesh index using the R-tree from either the Boost Geometry or CGAL libraries. Using these mesh index implementations, we can support any mesh type. Next, we address the problem of how to enable efficient storage and retrieval of descriptive metadata while providing sufficient flexibility to support the wide range of applications and analyses used in HPC. We create a novel solution that provides support for extensible, user-defined metadata and utilizes a flat namespace, flexibly-typed columns, a domain-independent metadata model, and a storage schema that is designed for efficient index usage to provide the necessary combination of high performance and flexibility. We then perform a comprehensive analysis to determine which backend store can best implement our solution. We find that SQLite is the best choice, offering the best overall performance, scalability and storage overheads. Finally, we present EMPRESSA, the first general descriptive metadata solution. 
EMPRESSA incorporates our novel solutions to the problems of how to efficiently perform metadata-data mapping for complex meshes and how to provide efficient yet flexible metadata storage and retrieval. Our evaluation demonstrates that EMPRESSA is able to produce speedups that are determined by the fraction of data that contains a feature of interest while offering high performance, good scalability, and minimal storage overheads. We thus demonstrate that a general descriptive metadata management solution should be integrated into HPC workflows since it enables analyses that would otherwise be impossible by enabling scientists to only load in the data that is needed for analysis. EMPRESSA is a production-ready implementation of this solution that scientists can integrate into their workflows today due to its free, permissive open-source license. Given EMPRESSA’s proven scalability and broader technological trends, the benefits offered should increase in the exascale era.
- Graduation Semester
- 2022-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2022 Margaret Lawson
Owning Collections
Graduate Dissertations and Theses at Illinois PRIMARY
Graduate Theses and Dissertations at Illinois