Sven: Communication-Redundancy-Free Training Framework for Distributed Temporal Graph Neural Networks

Research Summary



As Graph Neural Networks (GNNs) extend to dynamic graph data, Temporal Graph Neural Networks (TGNNs) have demonstrated exceptional capabilities in handling dynamic graphs. However, in distributed TGNN training, efficiently managing the extensive cross-device communication caused by temporal dependencies poses a critical challenge, often resulting in significant redundant data transfer and high communication overhead. Existing systems struggle to effectively eliminate redundancy in data reuse and transmission, leading to severe communication bottlenecks in distributed environments. To address this, we propose Sven, a co-designed algorithm and system library specifically built to accelerate TGNN training on multi-GPU platforms. Sven leverages the dependency patterns of TGNN models to develop a redundancy-free graph organization, fundamentally reducing redundant data transfer. Additionally, we formalize the graph partitioning problem as minimizing the maximum communication cost, proving it to be NP-hard, and propose an approximate algorithm, Re-FlexBiCut, to solve it. Furthermore, Sven incorporates prefetching, adaptive micro-batch pipelining, and asynchronous pipelining mechanisms to construct a hierarchical pipelining approach that mitigates communication overhead.
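For concreteness, the partitioning objective described above can be written as a min-max problem over the k device partitions. The formalization below is our own illustrative reading of the summary; in particular, the cost function C(V_p), which counts cut edges incident to partition p, and the balance constraint are assumptions rather than Sven's exact cost model.

```latex
% Illustrative min-max communication partitioning objective (assumed form).
% P = {V_1, ..., V_k} splits the vertex set V of G = (V, E) across k devices.
\[
  \min_{P = \{V_1, \dots, V_k\}} \; \max_{1 \le p \le k} C(V_p),
  \qquad
  C(V_p) = \bigl|\{ (u, v) \in E : u \in V_p,\ v \notin V_p \}\bigr|,
  \qquad
  \text{s.t. } |V_p| \le (1 + \epsilon)\,\tfrac{|V|}{k}.
\]
```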

Research Background



Figure 1: Temporal Graph Neural Network Training Process



Graphs are a powerful data structure widely used across domains to model real-world objects and their relationships. In recent years, Graph Neural Networks (GNNs) have gained significant attention for their exceptional performance on graph data, with applications in node classification, link prediction, and graph classification. Although methods such as GCN and GraphSAGE have made notable progress on static graphs (graphs with fixed nodes and edges), they cannot handle the dynamic graph data prevalent in real-world scenarios. To overcome this limitation, Temporal Graph Neural Networks (TGNNs) have emerged: they capture the temporal information of dynamic graphs while simultaneously learning their topological structure. Research shows that TGNNs significantly outperform traditional static GNN methods, providing a more powerful tool for dynamic graph analysis.
However, processing the rich information in dynamic graphs often requires large-scale parallel and distributed computing to support efficient TGNN training. To capture temporal dependencies, many TGNNs (e.g., TGN, APAN, and JODIE) adopt memory-based architectures that summarize the historical behavior of each vertex through node memory and message modules. This approach introduces two key performance issues when scaling TGNN training. First, TGNNs must keep temporal dependencies up to date for every vertex interaction; because a vertex typically participates in many interactions, the same dependency data is fetched and written repeatedly, producing substantial redundancy. Second, in large GPU clusters, temporal dependencies and model parameters are distributed across multiple devices, requiring all-to-all and all-gather communication to distribute and aggregate dependencies, as well as all-reduce to synchronize model parameter gradients. These communication operations significantly increase training time, with measurements showing that communication can account for over 70% of total TGNN training time. Eliminating redundant data and hiding communication behind hierarchical pipelining therefore form the core motivation of this research.
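As a concrete illustration of the memory-based design described above, the following minimal PyTorch sketch mimics a TGN-style update step on a single process. The names (node_memory, update_cell, train_step) and the simplified GRU update are our own assumptions, not the API of any of the cited systems; the comments mark where the reads and writes turn into all-gather/all-to-all traffic once the memory is sharded across GPUs.

```python
# Single-process sketch of a memory-based TGNN training step (illustrative).
import torch

num_nodes, mem_dim, edge_dim = 1000, 100, 16
node_memory = torch.zeros(num_nodes, mem_dim)      # per-vertex temporal state
last_update = torch.zeros(num_nodes)               # last interaction timestamp
update_cell = torch.nn.GRUCell(2 * mem_dim + edge_dim + 1, mem_dim)

def train_step(src, dst, t, edge_feat):
    # 1) Distribute dependencies: read the current memory of the batch endpoints.
    #    (becomes all-gather / all-to-all when memory is sharded across GPUs)
    mem_src, mem_dst = node_memory[src], node_memory[dst]

    # 2) Build messages from both endpoints' memory, the edge feature,
    #    and the time gap since the source vertex was last updated.
    delta_t = (t - last_update[src]).unsqueeze(-1)
    msg = torch.cat([mem_src, mem_dst, edge_feat, delta_t], dim=-1)

    # 3) Update the source memory with a GRU cell (simplified TGN-style update).
    new_mem = update_cell(msg, mem_src)

    # 4) Aggregate dependencies: write the newest memory back to its owner.
    #    (the reverse communication step in distributed training)
    with torch.no_grad():
        node_memory[src] = new_mem
        last_update[src] = t
    return new_mem

# Example batch of interactions (src, dst, timestamps, edge features):
src = torch.tensor([0, 1, 4])
dst = torch.tensor([2, 3, 0])
t = torch.tensor([1.0, 2.0, 3.0])
out = train_step(src, dst, t, torch.randn(3, edge_dim))
```

In a sharded deployment, steps 1 and 4 correspond to the dependency distribution and aggregation phases whose collectives dominate the communication cost measured above.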

Current Research Status



Several GNN system frameworks, such as PyG and AGL, already support efficient and flexible static GNN training, but they offer limited or no support for dynamic graphs. To improve dynamic graph training efficiency, several specialized frameworks have been proposed. DGL provides basic interfaces for dynamic graph training, but these interfaces are inefficient and lack support for distributed environments. TGL supports large-scale offline TGNN training and can efficiently train various TGNN models on single-node multi-GPU setups, yet it incurs significant communication overhead from frequent host-device memory transfers. ESDG proposes a graph-difference-based algorithm to reduce communication traffic and extends TGNN training to multi-node multi-GPU systems, but it supports only discrete-time dynamic graphs (DTDGs) and does not generalize to continuous-time dynamic graph (CTDG) models.

Recently, dependency caching (DepCache) and dependency communication (DepComm) mechanisms have been proposed for dependency maintenance in distributed GNN training. However, our experiments show that both are inefficient for TGNN training, primarily because of extensive cross-device communication overhead. Moreover, since TGNN dependency management places higher demands on communication traffic and computational throughput, current solutions fall short of the requirements for efficient distributed training.
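To make the distinction concrete, the sketch below contrasts the two strategies in terms of the collectives they rely on, using plain torch.distributed calls. It is a schematic illustration under simplifying assumptions (an already-initialized process group, symmetric send/receive sizes), not the actual interface of Sven or of the cited mechanisms.

```python
# Schematic contrast of DepCache vs. DepComm expressed with torch.distributed
# collectives. Assumes a process group initialized elsewhere (e.g., torchrun).
import torch
import torch.distributed as dist

def depcache_refresh(local_memory: torch.Tensor) -> torch.Tensor:
    """DepCache-style maintenance: every rank keeps a full replica of the
    vertex memory and refreshes it with an all-gather after updates."""
    world = dist.get_world_size()
    shards = [torch.empty_like(local_memory) for _ in range(world)]
    dist.all_gather(shards, local_memory)       # every rank receives every shard
    return torch.cat(shards, dim=0)             # redundant full copy on each GPU

def depcomm_exchange(send_per_rank: list[torch.Tensor]) -> torch.Tensor:
    """DepComm-style maintenance: each rank ships only the dependencies the
    other ranks request, via an all-to-all exchange (sizes assumed symmetric)."""
    recv_per_rank = [torch.empty_like(t) for t in send_per_rank]
    dist.all_to_all(recv_per_rank, send_per_rank)   # pairwise exchange
    return torch.cat(recv_per_rank, dim=0)
```

Either way, if the same vertex appears many times in a batch, its memory is shipped many times; this duplicated traffic is the redundancy that Sven removes before the collective is issued.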
In summary, existing frameworks and methods still face significant limitations in supporting dynamic graph training, particularly for continuous-time dynamic graphs. There is an urgent need for new approaches that eliminate redundant data and reduce communication overhead, enabling more efficient distributed TGNN training.

Research Methodology



Figure 2: Sven System Architecture: Algorithm-System Co-Design



This study proposes Sven, an algorithm-system co-designed framework for high-performance distributed TGNN training on multi-node multi-GPU systems, specifically targeting continuous-time dynamic graphs (CTDGs). Sven jointly addresses the two issues identified above: redundant dependency data and the imbalance between communication and computation time.
First, we analyze how TGNN models handle temporal dependencies: a model must distribute the current dependencies before computation and aggregate the updated dependencies afterwards. Since a vertex may interact with others many times, both the distribution and aggregation phases carry substantial redundant data. We measure the redundancy ratio across multiple dynamic graph datasets and show that extracting a redundancy-free graph from the original data effectively removes this redundancy from temporal dependency communication.
Second, we enhance the existing DepCache and DepComm mechanisms to maintain temporal dependencies in distributed TGNN training. Because both suffer from heavy redundancy in their collectives (all-gather in DepCache, all-to-all in DepComm), we reduce the communication volume at its source. To further mitigate communication overhead, we introduce an adaptive micro-batch pipelining method that lowers the system overhead of all-to-all operations, and an asynchronous pipelining method that hides all-reduce communication behind computation.
Finally, our analysis reveals that existing graph partitioning methods produce imbalanced inter-device communication loads, which significantly degrades distributed TGNN training performance. We therefore formalize the partitioning problem rigorously and design an approximate algorithm, Re-FlexBiCut, that optimizes for communication load. Re-FlexBiCut constructs source/sink graphs for each partition and allows vertices to be reassigned between partitions, yielding a near-balanced minimum-cut solution that improves inter-partition communication balance and overall distributed TGNN training performance.
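The batch-level deduplication behind the redundancy-free organization can be illustrated in a few lines of PyTorch. This is only a sketch of the principle, with illustrative names (deduplicate_endpoints, remote_memory); Sven's actual redundancy-free graph organization and partition-aware communication are more involved.

```python
# Minimal sketch of the redundancy-elimination idea: collapse all endpoints
# of a batch to the set of unique vertices whose memory actually has to cross
# devices, and keep the inverse mapping needed to re-expand the fetched
# memory locally. Names are illustrative, not Sven's API.
import torch

def deduplicate_endpoints(src: torch.Tensor, dst: torch.Tensor):
    endpoints = torch.cat([src, dst])            # every endpoint in the batch
    uniq, inverse = torch.unique(endpoints, return_inverse=True)
    return uniq, inverse

# Example: 5 interactions touch 10 endpoints but only 5 distinct vertices,
# so 5 memory rows are communicated instead of 10.
src = torch.tensor([7, 3, 7, 5, 7])
dst = torch.tensor([2, 7, 3, 2, 9])
uniq, inverse = deduplicate_endpoints(src, dst)
# uniq -> tensor([2, 3, 5, 7, 9])
# fetched = remote_memory[uniq]          # one transfer per unique vertex
# per_endpoint = fetched[inverse]        # re-expanded locally, no extra traffic
```

Eliminating these duplicates before the all-gather/all-to-all step is what reduces the communication volume at its source, as described above.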

Research Achievements



Figure 3: End-to-End Training Time Comparison



The Sven framework proposed in this study has achieved significant results in distributed TGNN training, addressing both redundant dependency data and communication load imbalance, and providing a novel solution for high-performance training on large-scale dynamic graphs. As the first comprehensive optimization solution for memory-based TGNN training, Sven achieves, on a 64-GPU cluster, a 1.9x to 3.5x training speedup over state-of-the-art methods, a 5.26x improvement in communication efficiency, and a reduction of up to 59.2% in communication imbalance.
The related research findings have gained widespread recognition in the high-performance computing community and have been published in the following top-tier conferences and journals:

1. The paper Redundancy-Free High-Performance Dynamic GNN Training with Hierarchical Pipeline Parallelism was presented at HPDC '23 (ACM International Symposium on High-Performance Parallel and Distributed Computing, CCF-B) and received the Best Paper Runner-Up Award.
2. The paper Redundancy-Free and Load-Balanced TGNN Training with Hierarchical Pipeline Parallelism was published in IEEE Transactions on Parallel and Distributed Systems (TPDS), a CCF-A top-tier journal.