Accelerating discovery from high performance computing applications using descriptive metadata management

Lawson, Margaret

Permalink

https://hdl.handle.net/2142/115510

Description

Title

Accelerating discovery from high performance computing applications using descriptive metadata management

Author(s)

Lawson, Margaret

Issue Date

2022-03-09

Director of Research (if dissertation) or Advisor (if thesis)

Gropp, William

Doctoral Committee Chair(s)

Gropp, William

Committee Member(s)

Ludaescher, Bertram
Winslett, Marianne
Byna, Suren
Lofstead, Jay

Department of Study

Computer Science

Discipline

Computer Science

Degree Granting Institution

University of Illinois at Urbana-Champaign

Degree Name

Ph.D.

Degree Level

Dissertation

Keyword(s)

high performance computing
descriptive metadata
RDBMS
databases
spatial indexing

Abstract

High performance computing (HPC) scientists are producing unprecedented volumes of data that take a long time to load for analysis due both to the data’s size and the I/O bottleneck. This poses a significant obstacle on the path to discovery. However, we make the following observations: 1) many different types of scientific analyses only require loading in the data containing particular features of interest and 2) scientists have a wide range of approaches for identifying these features. Therefore, if we can store information (descriptive metadata) about features of interest that have been identified in-situ, in-transit, or during post-processing, then for all subsequent analyses we can use this information to only read in the data containing these features of interest. This can result in a dramatic reduction in the volume of data that scientists have to read in, thereby greatly accelerating analysis. Despite the potential benefits offered by descriptive metadata management, no prior work has created a descriptive metadata system that is designed to help scientists working with a wide range of applications, analyses and mesh types to restrict their reads to data containing features of interest. In this dissertation, we explore how to create the first such solution and use it to evaluate the following hypothesis: a general descriptive metadata management solution should be integrated into HPC workflows since it enables analyses that would otherwise be impossible by enabling scientists to only load in the data that is needed for analysis. To evaluate this hypothesis we first develop a prototype solution called EMPRESS that supports descriptive metadata for applications with regular rectangular or hexahedral meshes. For the evaluated cases, EMPRESS is able to accelerate analysis by 3–300× (depending on the fraction of the data that contains the feature of interest) while offering relatively high performance, good scalability and small storage overheads.
However, EMPRESS is limited in that it offers lower speedups than we would expect given the reduction in the amount of data that needs to be read in for analysis, requires users to choose between more than 100 system configurations and to use an API with over 60 different functions and, most importantly, does not support applications with complex mesh types. In the remainder of this dissertation, we explore solutions to these problems. First we explore how to efficiently support applications with complex meshes by examining the difficult problem of how to efficiently map from descriptive metadata attributes describing features of interest to the associated data for complex meshes (metadata-data mapping). Through extensive evaluation, we discover that the following novel solution offers the best results: generating an in-memory mesh index using the R-tree from either the Boost Geometry or CGAL libraries. Using these mesh index implementations, we can support any mesh type. Next, we address the problem of how to enable efficient storage and retrieval of descriptive metadata while providing sufficient flexibility to support the wide range of applications and analyses used in HPC. We create a novel solution that provides support for extensible, user-defined metadata and utilizes a flat namespace, flexibly-typed columns, a domain-independent metadata model, and a storage schema that is designed for efficient index usage to provide the necessary combination of high performance and flexibility. We then perform a comprehensive analysis to determine which backend store can best implement our solution. We find that SQLite is the best choice, offering the best overall performance, scalability and storage overheads. Finally, we present EMPRESSA, the first general descriptive metadata solution. 
EMPRESSA incorporates our novel solutions to the problems of how to efficiently perform metadata-data mapping for complex meshes and how to provide efficient yet flexible metadata storage and retrieval. Our evaluation demonstrates that EMPRESSA is able to produce speedups that are determined by the fraction of data that contains a feature of interest while offering high performance, good scalability, and minimal storage overheads. We thus demonstrate that a general descriptive metadata management solution should be integrated into HPC workflows since it enables analyses that would otherwise be impossible by enabling scientists to only load in the data that is needed for analysis. EMPRESSA is a production-ready implementation of this solution that scientists can integrate into their workflows today due to its free, permissive open-source license. Given EMPRESSA’s proven scalability and broader technological trends, the benefits offered should increase in the exascale era.
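
The core idea in the abstract (record where features of interest live, then read only those regions) can be illustrated with a minimal sketch. This is not EMPRESSA's API or schema: the table layout, the function names, and the 2-D bounding-box representation are all assumptions made for illustration, using only Python's standard sqlite3 module and a regular rectangular mesh.

```python
import sqlite3

# Hypothetical schema, NOT EMPRESSA's actual one: a single flat table of
# user-defined attributes, each tied to a bounding box in a 2-D regular mesh.
def build_catalog(conn):
    conn.execute(
        """CREATE TABLE feature (
               id INTEGER PRIMARY KEY,
               timestep INTEGER NOT NULL,
               name TEXT NOT NULL,
               value,  -- no declared type, so SQLite stores any value type
               x0 INTEGER, x1 INTEGER, y0 INTEGER, y1 INTEGER)"""
    )
    # Index chosen so lookups by attribute name and timestep avoid a full scan.
    conn.execute("CREATE INDEX feature_by_name ON feature (name, timestep)")

def tag_feature(conn, timestep, name, value, box):
    """Record that `name` was found inside box = (x0, x1, y0, y1)."""
    conn.execute(
        "INSERT INTO feature (timestep, name, value, x0, x1, y0, y1) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (timestep, name, value, *box),
    )

def regions_to_read(conn, name, timestep):
    """Return the bounding boxes an analysis would need to load."""
    rows = conn.execute(
        "SELECT x0, x1, y0, y1 FROM feature "
        "WHERE name = ? AND timestep = ? ORDER BY id",
        (name, timestep),
    )
    return [tuple(row) for row in rows]

def fraction_read(regions, domain_shape):
    """Fraction of the domain covered by the regions (assumes no overlap)."""
    cells = sum((x1 - x0) * (y1 - y0) for x0, x1, y0, y1 in regions)
    return cells / (domain_shape[0] * domain_shape[1])

conn = sqlite3.connect(":memory:")
build_catalog(conn)
tag_feature(conn, 0, "vortex", 0.93, (10, 20, 30, 50))
tag_feature(conn, 0, "max_temperature", "above threshold", (0, 5, 0, 5))
tag_feature(conn, 1, "vortex", 0.88, (12, 22, 32, 52))

regions = regions_to_read(conn, "vortex", 0)
fraction = fraction_read(regions, (100, 100))
print(regions, fraction)
```

In this toy case a query for the "vortex" attribute at timestep 0 returns one box covering 2% of a 100 × 100 domain, so a subsequent analysis would only need to read that region. Complex meshes would need a spatial index rather than axis-aligned index ranges, which is the role the abstract assigns to the R-tree mesh index.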

Graduation Semester

2022-05

Type of Resource

Thesis

Owning Collections

Graduate Dissertations and Theses at Illinois PRIMARY

Graduate Theses and Dissertations at Illinois
