Optimizing performance under thermal and power constraints for HPC data centers
Sarood, Osman
Permalink
https://hdl.handle.net/2142/49478
Description
- Title
- Optimizing performance under thermal and power constraints for HPC data centers
- Author(s)
- Sarood, Osman
- Issue Date
- 2014-05-30T16:46:14Z
- Director of Research (if dissertation) or Advisor (if thesis)
- Kale, Laxmikant V.
- Doctoral Committee Chair(s)
- Kale, Laxmikant V.
- Committee Member(s)
- de Supinski, Bronis
- Garzaran, Maria J.
- Abdelzaher, Tarek F.
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- green computing
- load balancing
- energy efficiency
- High performance computing (HPC) application
- power constraint
- power cap
- Performance optimization
- thermal constraint
- temperature aware load balancing
- frequency aware load balancing
- fault tolerance
- improving reliability
- Abstract
- Energy, power, and resilience are the major challenges that the HPC community faces in moving to larger supercomputers. Data centers worldwide consumed 235 billion kWh of energy in 2010. A significant portion of that energy and power consumption is devoted to cooling. This thesis proposes a scheme that combines limiting processor temperatures using Dynamic Voltage and Frequency Scaling (DVFS) with frequency-aware load balancing to reduce cooling energy consumption and prevent hot spot formation. Recent reports have expressed concern that reliability at the exascale level could degrade to the point where failures become the norm rather than the exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability has also been making progress independently. A second component of this thesis tries to bridge this gap by exploring the potential of combining software and hardware approaches to improve the reliability of HPC machines. Finally, the 10 MW power consumption of present-day HPC systems is becoming a bottleneck. Although energy bills will increase significantly with machine size, power consumption is a hard constraint that must be addressed. Intel’s Running Average Power Limit (RAPL) toolkit is a recent feature that enables power capping of CPU and memory subsystems on modern hardware. The ability to constrain the maximum power consumption of these subsystems below the vendor-assigned Thermal Design Point (TDP) value allows us to add more nodes in an overprovisioned system while ensuring that the total power consumption of the data center does not exceed its power budget. The final component of this thesis proposes an interpolation scheme that uses an application profile to optimize the number of nodes and the distribution of power between the CPU and memory subsystems so as to minimize execution time under a strict power budget.
We also present a resource management scheme, including a scheduler, that uses CPU power capping, hardware overprovisioning, and job malleability to improve the throughput of a data center under a strict power budget.
- Graduation Semester
- 2014-05
- Copyright and License Information
- Copyright 2014 Osman Sarood
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Graduate Theses and Dissertations at Illinois
Dissertations and Theses - Computer Science
Dissertations and Theses from the Dept. of Computer Science