Auto-parallelization of machine-learning dataflow graphs for CPU multicores
Das, Srinjoy
Permalink
https://hdl.handle.net/2142/121343
Description
Title
Auto-parallelization of machine-learning dataflow graphs for CPU multicores
Author(s)
Das, Srinjoy
Issue Date
2023-07-17
Director of Research (if dissertation) or Advisor (if thesis)
Rauchwerger, Lawrence
Department of Study
Computer Science
Discipline
Computer Science
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Parallelization
Clustering
Machine Learning
Graph optimization
Compiler optimization
Dataflow Graph
Inference
Multicores
PyTorch
Abstract
Several methods exist today to accelerate Machine Learning (ML) and Deep Learning (DL) model performance for training and inference. However, modern techniques built on graph and operator parallelism depend on search-space optimizations that are costly in terms of power and hardware usage. Especially for inference, when the batch size is 1 and execution is on Central Processing Units (CPUs) or at the edge, current techniques can become costly, complicated, or inapplicable. To address this, we present a Critical-Path-based Linear Clustering approach that exploits the inherent parallel paths in ML dataflow graphs. We augment it with a new hyperclustering mechanism for small batch sizes greater than 1, which may be typical in inference scenarios. Our task-parallelization approach further optimizes the structure of graphs via cloning and simplifies them via dead-code elimination. In contrast to other work, we generate readable and executable parallel PyTorch+Python code from input ONNX models using Ramiel, a new tool that we have built. Generating code in this form allows us to benefit from other downstream acceleration techniques such as intra-op parallelism and, potentially, pipeline parallelism. Our preliminary results on several ML graphs demonstrate up to 1.9× speedup over serial execution and outperform some of the current mechanisms in both compile time and run time. Lastly, our methods are lightweight and fast enough to be used effectively for Artificial Intelligence (AI) at the edge.
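The idea of critical-path-based linear clustering mentioned in the abstract can be illustrated with a small sketch. This is not the Ramiel implementation and does not reproduce the thesis's algorithm; it is a minimal textbook-style version, assuming a dataflow graph given as an edge list and a hypothetical per-node cost table. It repeatedly peels off the most expensive remaining path and treats each path as one sequential cluster, so that separate clusters could run in parallel. The function name `linear_clusters` and the toy graph are illustrative assumptions.

```python
from collections import defaultdict


def linear_clusters(edges, cost):
    """Partition a DAG into linear clusters by repeatedly removing its critical path.

    edges: list of (src, dst) pairs describing data dependencies.
    cost:  dict mapping every node to a positive execution cost (hypothetical units).
    Returns a list of clusters; each cluster is a list of nodes forming one path.
    """
    succ = defaultdict(list)
    for u, v in edges:
        succ[u].append(v)

    remaining = set(cost)
    clusters = []
    while remaining:
        # best[n] = (cost of the heaviest path starting at n, next node on that path),
        # considering only nodes that have not been clustered yet.
        best = {}

        def longest(n):
            if n not in best:
                tail_cost, nxt = 0, None
                for s in succ[n]:
                    if s in remaining:
                        c = longest(s)[0]
                        if c > tail_cost:
                            tail_cost, nxt = c, s
                best[n] = (cost[n] + tail_cost, nxt)
            return best[n]

        # The critical path starts at the node with the heaviest downstream path.
        # Sorting first makes tie-breaking deterministic.
        start = max(sorted(remaining), key=lambda n: longest(n)[0])

        path, node = [], start
        while node is not None:
            path.append(node)
            node = best[node][1]

        clusters.append(path)
        remaining -= set(path)
    return clusters


# Toy diamond-shaped graph: a feeds b and c, which both feed d.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
cost = {"a": 1, "b": 3, "c": 2, "d": 1}
print(linear_clusters(edges, cost))  # [['a', 'b', 'd'], ['c']]
```

In this toy graph the heaviest path a, b, d becomes the first cluster and the leftover node c becomes a second cluster that could execute concurrently with b. A practical tool would also have to account for communication between clusters and for merging small clusters, which this sketch omits.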