Timed execution in distributed machine learning
Hashemi, Sayed Hadi
Description
- Title
- Timed execution in distributed machine learning
- Author(s)
- Hashemi, Sayed Hadi
- Issue Date
- 2020-05-08
- Director of Research (if dissertation) or Advisor (if thesis)
- Campbell, Roy H
- Doctoral Committee Chair(s)
- Campbell, Roy H
- Committee Member(s)
- Kindratenko, Volodymyr
- Gropp, William D
- Godfrey, Philip B
- Diamos, Gregory F
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- Ph.D.
- Degree Level
- Dissertation
- Keyword(s)
- Machine Learning Systems, Distributed Systems, Deep Learning, Deep Neural Networks, TensorFlow, Scheduling
- Abstract
- Deep learning powers many transformative core technologies, including autonomous driving, natural language translation, and automatic medical diagnosis. Its exceptional ability to extract intricate structure from high-dimensional data is credited with major advances in machine learning. The essential ingredients that make deep learning possible are the availability of massive curated data, a well-designed model, and readily available high-performance computation. The computation used in training deep neural networks has doubled every 3.4 months since 2012, five times faster than Moore's law. Fulfilling this massive computational demand, which has long outgrown the capability of a single high-end node, is vital to keep extending the flow of innovations. For example, in 2018, training the AlphaGoZero model required 1.89 exaFLOP/s-days of computation, while the state-of-the-art GPU at the time, the NVIDIA V100, could deliver only 125 teraFLOP/s. Meanwhile, Summit, the fastest supercomputer in the world, could sustain 1 exaFLOP/s using 27,360 NVIDIA V100 GPUs through distributed computation.
This dissertation studies the challenges of scaling out an ML job to keep up with this computational demand, specifically the problems stemming from the complex interplay of various resources in distributed ML. In the first stage of this research, we developed methods and systems to properly observe a distributed ML environment by tracing a distributed execution, visualizing the results, and extending that observability to production infrastructure. Using these methods and systems, we then built systems that address scalability challenges through precise execution timing across the spectrum of resources (illustrative sketches follow this record):
- Network: TicTac reduces internal computation delays by enforcing a near-optimal order on network transfers, increasing throughput by up to 37.7%.
- Computation: Caramel increases network utilization and decreases network congestion by modifying the order of computation and choosing the collective primitive best suited to the workload, cutting training time by a factor of up to 3.8. While computation and network scheduling suggest an order of execution, TimedRPC addresses the problem of correctly enforcing that order by implementing priority-based scheduling with preemption, in which an ongoing transfer can be paused when a higher-priority transfer is requested.
- I/O: Diot maximizes I/O throughput by tuning knobs such as the number of concurrent I/O requests and the read size in the I/O pipeline. It also detects I/O delivery unfairness, which can leave a worker straggling behind due to slow I/O throughput.
Thesis statement: Heuristic timing of distributed machine learning execution leads to utilization optimizations for computation, network, and storage, which in turn improve overall throughput. In general, any multi-resource optimization involving parallelism is at least NP-hard.
- Graduation Semester
- 2020-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/107856
- Copyright and License Information
- Copyright 2020 Sayed Hadi Hashemi
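As a quick sanity check on the compute figures quoted in the abstract, the short sketch below redoes the arithmetic. The only inputs are the numbers from the abstract itself (1.89 exaFLOP/s-days for AlphaGoZero, 125 teraFLOP/s per V100); everything else is unit conversion.

```python
# Back-of-the-envelope check of the compute figures in the abstract.
# An "exaFLOP/s-day" means sustaining 1 exaFLOP/s for one full day.

SECONDS_PER_DAY = 86_400
alphago_zero_flops = 1.89e18 * SECONDS_PER_DAY  # 1.89 exaFLOP/s-days, in FLOPs
v100_peak = 125e12                              # NVIDIA V100 peak, FLOP/s

# How long a single V100 would need to deliver that much computation:
seconds = alphago_zero_flops / v100_peak
print(f"single V100: {seconds / SECONDS_PER_DAY:,.0f} days "
      f"(~{seconds / SECONDS_PER_DAY / 365:.1f} years)")

# GPUs needed to hit 1 exaFLOP/s at peak; Summit used 27,360 in practice,
# since sustained throughput falls well short of peak.
print(f"GPUs for 1 exaFLOP/s at peak: {1e18 / v100_peak:,.0f}")
```

A single V100 would need roughly 15,120 days (about 41 years), which is why distributing the computation is unavoidable at this scale.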
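The TicTac and TimedRPC items describe ordering network transfers and correctly enforcing that order via preemption. The toy scheduler below is a minimal sketch of that idea, not the actual TensorFlow/RPC implementation: transfers are served in priority order, in small chunks, so a higher-priority transfer effectively pauses a lower-priority one. All names here (Transfer, PreemptiveScheduler, the chunk size) are illustrative assumptions.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Transfer:
    priority: int                          # lower value = more urgent tensor
    name: str = field(compare=False)
    remaining: int = field(compare=False)  # bytes left to send

class PreemptiveScheduler:
    CHUNK = 64 * 1024  # send in small chunks so a transfer can be paused

    def __init__(self):
        self.queue = []

    def submit(self, transfer: Transfer):
        heapq.heappush(self.queue, transfer)

    def run(self):
        while self.queue:
            # Always resume the highest-priority pending transfer; in a real
            # system, submit() is called concurrently, so a newly arrived
            # urgent transfer preempts the one currently being chunked out.
            t = heapq.heappop(self.queue)
            t.remaining -= min(self.CHUNK, t.remaining)
            print(f"sent chunk of {t.name}, {t.remaining} bytes left")
            if t.remaining > 0:
                heapq.heappush(self.queue, t)

sched = PreemptiveScheduler()
sched.submit(Transfer(2, "conv1/grad", 128 * 1024))
sched.submit(Transfer(1, "fc/grad", 64 * 1024))  # more urgent: sent first
sched.run()
```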
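Caramel's choice of the collective primitive best suited to the workload can be illustrated with a size-based rule of thumb: small, latency-bound tensors favor a tree-style scheme, while large, bandwidth-bound tensors favor ring all-reduce. The threshold and primitive names below are assumptions for the sketch, not Caramel's actual selection logic.

```python
# Hypothetical size-based selection of an all-reduce primitive.
# Small tensors are latency-bound (fewer, shorter messages win);
# large tensors are bandwidth-bound (ring keeps every link busy).

def pick_allreduce(tensor_bytes: int, threshold: int = 256 * 1024) -> str:
    """Return an all-reduce variant for a tensor of the given size."""
    return "recursive-doubling" if tensor_bytes < threshold else "ring"

for size in (4 * 1024, 256 * 1024, 64 * 1024 * 1024):
    print(f"{size:>10} bytes -> {pick_allreduce(size)}")
```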
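Diot's knob tuning can be sketched as a simple sweep over request concurrency and read size, keeping the best-throughput pair. The sweep below uses plain Python threads over a placeholder `files` list; the candidate knob values are assumptions, and a real measurement would also have to control for the OS page cache between runs.

```python
import os
import time
from concurrent.futures import ThreadPoolExecutor

def read_all(files, read_size, workers):
    """Read every file with the given knobs; return bytes/second."""
    def read_file(path):
        with open(path, "rb") as f:
            while f.read(read_size):  # drain the file in read_size chunks
                pass
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(read_file, files))
    total = sum(os.path.getsize(p) for p in files)
    return total / (time.time() - start)

def tune(files):
    """Grid-search the two I/O knobs and return the best setting found."""
    best = None
    for workers in (1, 4, 16, 64):
        for read_size in (64 * 1024, 1024 * 1024, 16 * 1024 * 1024):
            bps = read_all(files, read_size, workers)
            if best is None or bps > best[0]:
                best = (bps, workers, read_size)
    return best  # (throughput, concurrency, read size)
```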
Owning Collections
Graduate Dissertations and Theses at Illinois (Primary)
Dissertations and Theses - Computer Science