Learning-based scheduling for ray-based Hybrid HPC-Cloud Systems
Lu, Yicheng
This item is only available for download by members of the University of Illinois community. Students, faculty, and staff at the U of I may log in with their NetID and password to view the item. If you are trying to access an Illinois-restricted dissertation or thesis, you can request a copy through your library's Inter-Library Loan office or purchase a copy directly from ProQuest.
Permalink
https://hdl.handle.net/2142/124562
Description
Title
Learning-based scheduling for ray-based Hybrid HPC-Cloud Systems
Author(s)
Lu, Yicheng
Issue Date
2024-04-30
Director of Research (if dissertation) or Advisor (if thesis)
Kindratenko, Volodymyr
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Cloud bursting
HPC
Data movement
Scheduling
Abstract
Hybrid HPC-Cloud systems are becoming increasingly popular within the scientific community because they can absorb sudden increases in demand and thereby shorten the processing times of HPC workloads. However, current systems lack workload scheduling strategies suited to these hybrid environments, and they are difficult to deploy because of the intricate configuration required, particularly for data transfer between HPC and the cloud. To address these issues, we developed an HPC-Cloud bursting system built on Ray, a widely used open-source distributed computing framework. Our system applies learning-based scheduling at the function level through a dynamic label-based architecture and automatically manages data movement between the cloud and HPC. Specifically, the scheduler proactively prefetches data based on anticipated demand and analyzes patterns of data movement and task execution to inform future scheduling decisions. The system improves the processing times of HPC workloads by hiding data transfer time and by making learning-based scheduling decisions. We evaluated it with two workloads: machine learning model training and image processing. We compared its performance against conventional data retrieval methods and the default Ray scheduler under various network conditions and storage configurations. In every tested scenario, our system significantly outperformed these baselines.
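The abstract does not give implementation details, so the following is only a minimal sketch of what label-based, history-informed placement could look like; it is not the thesis's scheduler and it does not use the Ray API. The class name `LabelScheduler`, the labels `"hpc"` and `"cloud"`, the moving-average estimator, and the rule that prefetched data costs zero transfer time are all assumptions made for illustration.

```python
class LabelScheduler:
    """Hypothetical scheduler: picks a site label per function from past timings."""

    def __init__(self, labels, alpha=0.5):
        self.labels = list(labels)   # candidate placements, e.g. "hpc", "cloud"
        self.alpha = alpha           # weight given to the newest observation
        self.exec_avg = {}           # (function, label) -> average execution seconds
        self.transfer_avg = {}       # (function, label) -> average transfer seconds

    def _update(self, table, key, sample):
        # Exponentially weighted moving average; the first sample is taken as-is.
        old = table.get(key)
        if old is None:
            table[key] = sample
        else:
            table[key] = self.alpha * sample + (1 - self.alpha) * old

    def record(self, func_name, label, exec_seconds, transfer_seconds):
        """Store the observed cost of one finished task."""
        key = (func_name, label)
        self._update(self.exec_avg, key, exec_seconds)
        self._update(self.transfer_avg, key, transfer_seconds)

    def estimate(self, func_name, label, prefetched=False):
        """Estimated completion time, or None if this pairing was never observed."""
        key = (func_name, label)
        if key not in self.exec_avg:
            return None
        # Assumption: data that was prefetched adds no transfer time.
        transfer = 0.0 if prefetched else self.transfer_avg[key]
        return self.exec_avg[key] + transfer

    def choose(self, func_name, prefetched_labels=()):
        """Return the label with the lowest estimate; try unseen labels first."""
        best_label, best_cost = None, None
        for label in self.labels:
            cost = self.estimate(func_name, label, label in prefetched_labels)
            if cost is None:
                return label  # no history yet, so explore this placement
            if best_cost is None or cost < best_cost:
                best_label, best_cost = label, cost
        return best_label


sched = LabelScheduler(["hpc", "cloud"])
sched.record("train", "hpc", exec_seconds=10.0, transfer_seconds=0.0)
sched.record("train", "cloud", exec_seconds=6.0, transfer_seconds=8.0)
without_prefetch = sched.choose("train")
with_prefetch = sched.choose("train", prefetched_labels={"cloud"})
```

In this toy setup the cloud runs the function faster but pays a transfer cost, so the placement flips to the cloud only once the data has been prefetched there, which is the sense in which prefetching can hide data transfer time.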