Research Summary
Efficiently training Graph Neural Networks (GNNs) on large-scale graphs (with billions of edges) presents significant challenges, particularly due to memory constraints and data transfer bottlenecks; these issues are especially pronounced in GPU-based sampling. Traditional methods either suffer from severe performance bottlenecks caused by data transfer between CPU and GPU, or are limited by excessive data rearrangement and synchronization overhead in multi-GPU setups. To address these problems, we propose HyDRA, a framework that substantially improves the efficiency of sampling-based mini-batch training for large-scale GNNs. HyDRA introduces innovations in multi-GPU memory sharing and multi-node feature retrieval, redesigning cross-GPU sampling by fusing sampling and data transfer into a single kernel operation. The framework employs a hybrid pointer-driven data placement technique to improve neighbor retrieval efficiency, a precise replication strategy for high-degree vertices to reduce communication overhead, and dynamic cross-batch data orchestration with pipelining to minimize redundant data transfer.
Research Background

Figure 1: GNN Training Process Based on Sampling
Graph-structured data is extremely common in real-world scenarios, where vertices and edges represent complex entity relationships, such as user interactions in social networks or molecular structures in biological networks. Traditional machine learning methods cannot fully utilize the topological information of graphs, whereas GNNs leverage techniques like graph convolution and attention mechanisms to extract deep insights from structure and features, enabling efficient processing of tasks like node classification and link prediction.
Despite their potential, training GNNs efficiently on large-scale graph data remains a significant challenge. Traditional full-batch methods require loading the entire graph into memory for training. When a graph reaches billions of edges, however, the memory requirement easily exceeds 60 GB, beyond the capacity of even high-end GPUs (e.g., the 40 GB A100), making full-batch training impractical. Sampling-based GNN methods have emerged in response: by randomly selecting subgraphs (i.e., mini-batch topologies) during training, they significantly reduce memory overhead while offering advantages in model accuracy, computational complexity, and generalization ability.
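As a rough back-of-envelope illustration of why that 40 GB limit is exceeded (the sizes below are our own hypothetical assumptions, not figures from the paper):

```python
# Hypothetical sizes for illustration only: 100M vertices, 2B edges,
# CSR topology (int32 neighbors, int64 row offsets), 128-dim float32 features.
num_nodes = 100_000_000
num_edges = 2_000_000_000
topology_bytes = num_edges * 4 + (num_nodes + 1) * 8
feature_bytes = num_nodes * 128 * 4
print(f"topology ~ {topology_bytes / 1e9:.1f} GB")                    # ~8.8 GB
print(f"features ~ {feature_bytes / 1e9:.1f} GB")                     # ~51.2 GB
print(f"total    ~ {(topology_bytes + feature_bytes) / 1e9:.1f} GB")  # ~60.0 GB
```

Note that the topology is only a small fraction of the total; the feature matrix dominates, which motivates the separate treatment of the two in what follows.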
Sampling-based GNNs typically consist of two key steps: graph sampling, where a subgraph containing a subset of nodes and edges is selected from the large original graph to form the mini-batch topology; and feature retrieval, where the attribute information of the selected vertices and their neighbors is accessed from storage. While this process addresses the memory challenges posed by large-scale graph data, it also introduces new technical bottlenecks.
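These two steps can be sketched in a few lines of Python. Everything below (graph sizes, fanout, function names, the uniform sampling policy) is an illustrative stand-in, not code from any particular framework:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, fanout, feat_dim = 1000, 5, 16

# A random CSR graph and a node feature matrix.
degrees = rng.integers(1, 20, num_nodes)
indptr = np.concatenate(([0], np.cumsum(degrees)))
indices = rng.integers(0, num_nodes, indptr[-1])
features = rng.standard_normal((num_nodes, feat_dim)).astype(np.float32)

def sample_minibatch(seeds):
    """Step 1: graph sampling -- pick up to `fanout` neighbors per seed."""
    src, dst = [], []
    for v in seeds:
        nbrs = indices[indptr[v]:indptr[v + 1]]
        picked = rng.choice(nbrs, size=min(fanout, len(nbrs)), replace=False)
        src.extend(picked)
        dst.extend([v] * len(picked))
    return np.array(src), np.array(dst)

seeds = rng.choice(num_nodes, size=64, replace=False)
src, dst = sample_minibatch(seeds)

# Step 2: feature retrieval -- gather attributes of every touched vertex.
batch_nodes = np.unique(np.concatenate([seeds, src]))
batch_feats = features[batch_nodes]   # at billion-edge scale, this gather dominates data movement
print(batch_feats.shape)
```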
Current Research Status
Researchers have proposed various methods for GNN training on large-scale graphs, progressing from early CPU-based sampling frameworks such as DGL and PyG, to GPU virtual-memory techniques such as DGL-UVA, and on to more efficient multi-GPU distributed sampling methods such as DSP. However, these methods still face bottlenecks in the following areas:
1. Inefficient CPU-GPU Data Transfer
Most existing frameworks store graph data in CPU memory, perform graph sampling on the CPU, and transfer the sampled results to the GPU over PCIe for training. CPU sampling is slow, however, and the limited PCIe bandwidth leaves GPU resources underutilized, preventing the hardware's computational potential from being fully exploited.
2. GPU Memory Limitations
In recent years, various GPU-based caching methods (e.g., PaGraph, BGL, and GNNLab) have attempted to reduce data movement by caching critical data in GPU memory. However, because a single GPU's memory capacity is limited, these methods only partially mitigate data transfer overhead: uncached data must still be transferred over PCIe, so the performance gains are limited.
3. Communication Overhead in Distributed Sampling
As a pioneer in multi-GPU sampling, DSP partitions the graph topology across multiple GPUs and performs cross-GPU neighbor retrieval through collective communication. However, its design relies on frequent all-to-all communication and extensive data rearrangement (e.g., distributing target vertices to different GPUs and then re-aggregating the results), which further drives up communication costs.
Research Methodology

Figure 2: HyDRA Architecture: Memory-Sharing Graph Sampling and Hierarchical Cross-Batch Feature Retrieval
This study proposes a novel framework called HyDRA, designed to address the limitations of existing sampling-based GNN training methods when handling graph data with billions of edges. The core design of HyDRA lies in two key innovations: single-node multi-GPU memory sharing and multi-node feature retrieval, which effectively optimize storage and communication efficiency during graph sampling.
First, we propose a novel multi-GPU graph sampling method that uses GPU memory sharing to significantly reduce the data rearrangement and synchronization overhead of frameworks such as DSP. Because graph topology accounts for only about 10% of the overall graph data, we partition it across the GPUs within a single node, avoiding costly cross-node communication. To optimize sampling efficiency, we fuse sampling and remote neighbor retrieval into a single kernel operation, eliminating extensive data rearrangement and frequent cross-GPU synchronization. Furthermore, we design a hybrid pointer-driven data management mechanism that lets graph topology data be shared across multiple GPUs while each GPU maintains local address mappings for neighbor lists.
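To make the hybrid pointer idea concrete, here is a minimal CPU-side sketch (names such as `owner`, `local_off`, and `fused_sample` are our own illustrative choices; the real system dereferences peer-GPU memory inside one CUDA kernel rather than indexing Python arrays):

```python
import numpy as np

rng = np.random.default_rng(1)
num_nodes, num_gpus = 1000, 4

# A toy CSR graph.
degrees = rng.integers(1, 10, num_nodes)
indptr = np.concatenate(([0], np.cumsum(degrees)))
indices = rng.integers(0, num_nodes, indptr[-1])

# Partition neighbor lists round-robin across "GPUs". Each partition stores
# its shard of the adjacency, and every GPU holds the *same* global pointer
# table (owner, local offset) so vertices never need to be shuffled around.
owner = np.arange(num_nodes) % num_gpus
shards = [[] for _ in range(num_gpus)]
local_off = np.zeros(num_nodes, dtype=np.int64)
for v in range(num_nodes):
    g = owner[v]
    local_off[v] = len(shards[g])
    shards[g].extend(indices[indptr[v]:indptr[v + 1]])
shards = [np.array(s) for s in shards]

def fused_sample(v, fanout=5):
    """One 'kernel' does lookup + sampling: follow the pointer table straight
    into the owning partition's memory instead of exchanging vertex lists."""
    g, off, deg = owner[v], local_off[v], indptr[v + 1] - indptr[v]
    nbrs = shards[g][off:off + deg]   # in HyDRA this is a direct peer-memory read
    return rng.choice(nbrs, size=min(fanout, deg), replace=False)

print(fused_sample(42))
```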
However, while implementing GPU memory sharing, we observed that naive graph partitioning can inflate communication volume to as much as 8 times the original amount. In-depth analysis attributed this to high-degree vertices (i.e., vertices with very many neighbors). To remove this bottleneck, we developed a partitioning strategy based on high-degree vertex replication: the neighbor lists of these vertices are replicated in every GPU partition, converting remote accesses into local reads. This strategy effectively reduces communication overhead and significantly improves multi-GPU sampling performance.
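A minimal sketch of the replication strategy, assuming a synthetic power-law degree distribution and an arbitrary "top 1% by degree" cutoff (our own illustrative threshold, not necessarily the paper's selection policy):

```python
import numpy as np

rng = np.random.default_rng(2)
num_nodes, num_gpus = 1000, 4
degrees = rng.zipf(1.5, num_nodes).clip(max=200)   # skewed, power-law-like degrees
indptr = np.concatenate(([0], np.cumsum(degrees)))
indices = rng.integers(0, num_nodes, indptr[-1])

hot = set(np.argsort(degrees)[-num_nodes // 100:])  # replicate the top 1% by degree

partitions = [dict() for _ in range(num_gpus)]      # vertex -> neighbor array
for v in range(num_nodes):
    nbrs = indices[indptr[v]:indptr[v + 1]]
    targets = range(num_gpus) if v in hot else [v % num_gpus]
    for g in targets:                               # hot vertices go everywhere
        partitions[g][v] = nbrs

def lookup(gpu, v):
    """Local read if v is replicated or owned here; otherwise a remote access."""
    if v in partitions[gpu]:
        return partitions[gpu][v], "local"
    return partitions[v % num_gpus][v], "remote"

_, where = lookup(0, int(np.argmax(degrees)))
print(where)  # -> 'local': the hottest vertex is readable on every GPU
```

Because access frequency in neighbor sampling is roughly proportional to degree, replicating this small hot set converts a disproportionate share of remote reads into local ones at modest memory cost.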
Second, to optimize storage and computational efficiency, we partition the feature data, which accounts for the bulk of storage consumption, across the nodes and GPUs of the cluster. For multi-node feature retrieval, we design a hierarchical data reuse strategy to address the inefficiency of frequent data movement in traditional methods. During mini-batch training, especially with larger batch sizes, consecutive batches overlap substantially in the data they touch; traditional methods discard the previous batch's data and reload everything, overlooking this optimization opportunity. We therefore propose a dynamic cross-batch data scheduling mechanism that captures and reuses the overlapping data, significantly improving data utilization.
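The sketch below illustrates cross-batch reuse with a deliberately simple keep-only-the-last-batch cache policy (our own simplification; HyDRA's scheduler is more dynamic than this):

```python
import numpy as np

rng = np.random.default_rng(3)
num_nodes, feat_dim = 10_000, 32
store = rng.standard_normal((num_nodes, feat_dim)).astype(np.float32)  # "remote" feature store

cache = {}   # vertex id -> cached feature row

def fetch_batch(batch_nodes):
    """Serve cache hits locally; transfer only the misses from the store."""
    hit_rows = {v: cache[v] for v in batch_nodes if v in cache}
    misses = [v for v in batch_nodes if v not in cache]
    fetched = store[misses] if misses else np.empty((0, feat_dim), np.float32)
    # Refresh the cache with exactly this batch (simplest possible policy).
    cache.clear()
    cache.update(hit_rows)
    cache.update(zip(misses, fetched))
    print(f"reused {len(hit_rows)}/{len(batch_nodes)} feature rows")
    # Row order is not preserved here; fine for a sketch.
    return np.vstack([*hit_rows.values(), fetched]) if hit_rows else fetched

prev = rng.choice(num_nodes, 2048, replace=False)
curr = rng.choice(num_nodes, 2048, replace=False)
fetch_batch(prev)   # cold: everything is transferred
fetch_batch(curr)   # the overlap with prev is served from the cache
```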
In multi-node, multi-GPU systems, communication bandwidth is uneven across the hardware hierarchy (intra-node GPU interconnects are typically far faster than cross-node networks), so we extend data reuse hierarchically across three levels: within a single GPU, among the GPUs of one node, and across nodes. By serving each request through the highest-bandwidth channel that holds the data, this hierarchical method effectively reduces communication latency. This systematic reuse not only exploits data locality and redundancy but also aligns data flow with network capabilities, significantly reducing overall communication overhead and providing strong support for efficient GNN training.
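A compact sketch of the three-level resolution order (the sets `local` and `peers` stand in for cached feature partitions; this is illustrative, not HyDRA's actual API):

```python
def locate(vertex, local, peers):
    """Resolve a feature at the cheapest level that holds it:
    1) local GPU memory, 2) a peer GPU in the same node, 3) a remote node."""
    if vertex in local:
        return "level 1: local GPU"
    for p in peers:
        if vertex in p:
            return "level 2: peer GPU (same node)"
    return "level 3: remote node"   # last resort: cross-node fetch

print(locate(7, local={1, 2}, peers=[{7, 8}]))   # -> level 2
print(locate(9, local={1, 2}, peers=[{7, 8}]))   # -> level 3
```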
Research Achievements

Figure 3: End-to-End Training Time Comparison
The HyDRA framework proposed in this study makes significant progress in large-scale GNN training, successfully addressing key challenges faced by sampling-based GNN methods on graphs with billions of edges. Through its single-node multi-GPU memory sharing and multi-node feature retrieval strategies, HyDRA significantly improves graph sampling and training efficiency, overcoming the memory and communication bottlenecks of existing frameworks. Evaluated on systems with up to 64 A100 GPUs, HyDRA outperformed current leading methods, training 1.4x to 5.3x faster than DSP and DGL-UVA and improving multi-GPU scalability by up to 42x. HyDRA sets a new benchmark for high-performance training of large-scale GNNs.
The related research findings have been accepted at SC 24 (The International Conference for High Performance Computing, Networking, Storage, and Analysis), a top-tier CCF-A conference, under the title "Scaling New Heights: Transformative Cross-GPU Sampling for Training Billion-Edge Graphs."