Accelerating graph attention network inference on CPUs with layer fusion
Yao, Yao
Permalink
https://hdl.handle.net/2142/124697
Description
- Title
- Accelerating graph attention network inference on CPUs with layer fusion
- Author(s)
- Yao, Yao
- Issue Date
- 2024-04-29
- Director of Research (if dissertation) or Advisor (if thesis)
- Torrellas, Josep
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Graph Attention Networks
- CPU
- Layer Fusion
- Abstract
- Graphs are an increasingly popular data structure used in many fields. Recently, the Graph Attention Network (GAT), a special type of Graph Neural Network (GNN), has emerged as a powerful tool for processing graph-structured data, offering state-of-the-art performance on graph-related tasks such as node classification. Existing work mostly focuses on domain-specific accelerators to optimize GAT inference. However, CPUs are also an attractive choice for GAT inference because they are widely available and offer large memory capacity. Layer Fusion is a technique introduced in Graphite [35] that combines the memory-intensive aggregation and compute-intensive update phases of a GNN layer to overlap memory accesses with computation, thereby reducing memory stress when executing GNN workloads on CPUs. While this technique benefits general GNN models, it does not directly apply to GATs because of their additional attention-calculation phase. We posit that this added complexity in GAT presents an opportunity for optimization with Layer Fusion: by fusing the attention-calculation and aggregation phases, we can overlap memory accesses with computation, reducing both DRAM traffic and execution time. Hence, this thesis explores how Layer Fusion can optimize GAT inference on CPUs. The thesis begins with an overview of the research context, providing a historical perspective on the evolution of GATs and the rationale for accelerating GAT inference on CPUs with Layer Fusion. It then covers the theoretical background of GATs and details their implementation in the DGL framework, which serves as the baseline for comparison. Methodologies for incorporating Layer Fusion into GATs are discussed, including three variations that differ in the placement of the attention-head iteration.
Experimental results comparing the Layer Fusion approach against the DGL baseline show significant improvements in execution times across various datasets, with up to a 2.81x speedup. Sensitivity analyses explore the impact of factors like the number of attention heads and graph characteristics on performance, providing insights into the performance improvement achieved by the Layer Fusion approach.
- Graduation Semester
- 2024-05
- Type of Resource
- Thesis
- Copyright and License Information
- Copyright 2024 Yao Yao
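The abstract describes a GAT layer in terms of three phases (feature update, attention calculation, and aggregation) and proposes fusing the latter two. The following is a minimal, hypothetical NumPy sketch of that idea; the function names, single-head formulation, and edge-list representation are illustrative assumptions, not code from the thesis or from DGL. The unfused variant materializes every edge's attention score before aggregating; the fused variant scores and aggregates each node's neighborhood in one pass, so the projected neighbor features are reused while still hot in cache.

```python
import numpy as np

def gat_layer_unfused(X, W, a_src, a_dst, edges):
    """Phase-by-phase single-head GAT layer (illustrative sketch):
    all edge attention scores are materialized before aggregation,
    so projected features H are read from memory twice."""
    H = X @ W  # update phase: linear projection of node features
    # attention-calculation phase: one LeakyReLU'd score per edge
    scores = {}
    for (u, v) in edges:
        e = a_src @ H[u] + a_dst @ H[v]
        scores[(u, v)] = e if e > 0 else 0.2 * e  # LeakyReLU, slope 0.2
    # aggregation phase: softmax-normalize per destination, then sum
    out = np.zeros_like(H)
    for v in range(X.shape[0]):
        nbrs = [u for (u, w) in edges if w == v]
        if not nbrs:
            continue
        e = np.array([scores[(u, v)] for u in nbrs])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()
        for a_uv, u in zip(alpha, nbrs):
            out[v] += a_uv * H[u]
    return out

def gat_layer_fused(X, W, a_src, a_dst, edges):
    """Fused attention-calculation + aggregation (illustrative sketch):
    each destination's scores are computed and immediately consumed,
    reusing the rows of H while they are still cache-resident."""
    H = X @ W  # update phase unchanged
    out = np.zeros_like(H)
    for v in range(X.shape[0]):
        nbrs = [u for (u, w) in edges if w == v]
        if not nbrs:
            continue
        e = np.array([a_src @ H[u] + a_dst @ H[v] for u in nbrs])
        e = np.where(e > 0, e, 0.2 * e)           # LeakyReLU
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()
        out[v] = alpha @ H[nbrs]  # H[nbrs] reused right after scoring
    return out
```

Both variants compute the same result; the fused one simply restructures the loops so attention scoring and weighted aggregation share one traversal of each neighborhood, which is the kind of memory-access/computation overlap the abstract attributes to Layer Fusion.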
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)