Computing Systems for AI

Memory-Efficient Training of Mixture-of-Experts Models with Adaptive Pipelining
Mixture-of-Experts (MoE) models have become a popular approach to scaling up pre-trained models. By dynamically activating experts for conditional computation, MoE models greatly increase the number of parameters in a neural network, allowing it to absorb far more knowledge without a proportional increase in per-token computation. However, even with existing system and algorithm optimizations, low communication efficiency and high memory consumption remain significant challenges. This paper introduces MPipeMoE, which accelerates MoE training through adaptive and memory-efficient pipelining. Building on adaptive pipelining, MPipeMoE applies memory-reuse strategies that eliminate memory redundancy and reduce overall memory requirements.
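As a rough illustration of the pipelining idea, the sketch below splits a token batch into micro-batches and overlaps the asynchronous all-to-all dispatch of one micro-batch with the expert computation of the previous one, while reusing two fixed receive buffers instead of allocating per micro-batch. Names such as expert_ffn are placeholders; this is not MPipeMoE's implementation, and the combine all-to-all is omitted for brevity.

    # Hedged sketch: pipelined MoE dispatch with reusable buffers (PyTorch).
    # Assumes torch.distributed is initialized and the batch divides evenly.
    import torch
    import torch.distributed as dist

    def pipelined_moe_layer(tokens, expert_ffn, n_micro=4):
        micro = list(tokens.chunk(n_micro))            # split batch into micro-batches
        # Two receive buffers are reused across all micro-batches (double buffering)
        # instead of allocating a fresh buffer for every dispatch.
        recv = [torch.empty_like(micro[0]) for _ in range(2)]
        handles, outputs = [None, None], []
        for i, m in enumerate(micro):
            slot = i % 2
            # Asynchronous all-to-all dispatch overlaps with the expert FFN
            # running on the previous micro-batch.
            handles[slot] = dist.all_to_all_single(recv[slot], m.contiguous(),
                                                   async_op=True)
            if i > 0:
                prev = (i - 1) % 2
                handles[prev].wait()
                outputs.append(expert_ffn(recv[prev]))
        last = (len(micro) - 1) % 2
        handles[last].wait()
        outputs.append(expert_ffn(recv[last]))
        return torch.cat(outputs)                      # combine all-to-all omitted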
HyDRA: Efficient Training Framework for Large-Scale Graph Neural Networks
This paper introduces HyDRA, a framework designed to address the challenges of training graph neural networks (GNNs) on large-scale graphs, such as memory constraints and data-transfer bottlenecks. By fusing sampling and data transfer into a single kernel operation, and by using mechanisms such as multi-GPU memory sharing and multi-node feature retrieval, HyDRA significantly improves the efficiency of sampling-based mini-batch training.
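The core step of retrieving sharded features for a sampled mini-batch can be sketched roughly as follows; the partitioning scheme and helper names here are assumptions for illustration, not HyDRA's fused kernels.

    # Hedged sketch: gather features for sampled nodes from per-GPU shards (PyTorch).
    import torch

    def gather_sharded_features(node_ids, feature_shards, part_size):
        """node_ids: sampled global node ids on the training GPU.
        feature_shards: shard g (on GPU g) holds rows [g*part_size, (g+1)*part_size)."""
        out = torch.empty(node_ids.numel(), feature_shards[0].shape[1],
                          device=node_ids.device)
        owner = node_ids // part_size                  # which GPU owns each node
        for g, shard in enumerate(feature_shards):
            mask = owner == g
            local = (node_ids[mask] - g * part_size).to(shard.device)
            # Gather rows on the owning GPU, then move only the small result
            # to the training GPU (peer-to-peer copy).
            out[mask] = shard.index_select(0, local).to(node_ids.device)
        return out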
Sven: Communication-Redundancy-Free Training Framework for Distributed Temporal Graph Neural Networks
As graph neural networks (GNNs) are extended to dynamic graphs, temporal graph neural networks (TGNNs) have shown remarkable capabilities in modeling time-evolving graph data. In distributed TGNN training, however, temporal dependencies trigger substantial cross-device communication, much of it consisting of redundant data transfers, and existing systems struggle to eliminate this redundancy in data reuse and transmission, leaving severe communication bottlenecks in distributed environments. To address this, we propose Sven, an algorithm-system co-designed library built to accelerate TGNN training on multi-GPU platforms. Sven exploits the dependency patterns of TGNN models to build a communication-redundancy-free graph organization that fundamentally reduces redundant data transfers.
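One simple way to see what removing transfer redundancy means: when many temporal dependencies point to the same remote vertices, each unique vertex should be fetched only once per step. The fragment below is a generic deduplication sketch, not Sven's graph organization; fetch_fn stands in for the actual cross-device transfer.

    # Hedged sketch: fetch each unique remote temporal neighbor once (PyTorch).
    import torch

    def fetch_remote_once(remote_ids, fetch_fn):
        """remote_ids: ids of remote temporal neighbors, typically with duplicates."""
        uniq, inverse = torch.unique(remote_ids, return_inverse=True)
        fetched = fetch_fn(uniq)       # one transfer per unique id
        return fetched[inverse]        # re-expand to the original layout locally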
Raptor-T: Memory-Efficient Sparse Transformer Model for Long and Variable Sequence Processing
Transformer-based models have made significant progress in many fields, largely because the self-attention mechanism captures contextual relationships across the input sequence. However, processing long sequences remains computationally expensive, primarily due to the O(n²) complexity of self-attention. Sparse attention was proposed to reduce this quadratic dependency to a linear one. Nevertheless, efficiently deploying sparse Transformers still faces two major obstacles: 1) the irregular sparsity introduced by the algorithm's approximation leads to suboptimal system-level optimizations; 2) the variability of input sequence lengths results in inefficient computation and memory access. This paper introduces Raptor-T, a Transformer framework dedicated to addressing these system challenges of sparse attention models for long and variable-length sequence processing.
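For intuition about why sparsity brings the cost down to linear, the toy sliding-window attention below restricts each query block to nearby keys. It is a conceptual sketch of one common sparse pattern, not Raptor-T's fused kernels or its handling of variable-length batches.

    # Hedged sketch: block-local sliding-window attention with linear cost (PyTorch).
    import torch
    import torch.nn.functional as F

    def sliding_window_attention(q, k, v, window=256):
        """q, k, v: (seq_len, dim); each query block attends only to keys within
        `window` positions, so cost grows linearly with sequence length."""
        seq_len, dim = q.shape
        out = torch.empty_like(q)
        for start in range(0, seq_len, window):
            end = min(start + window, seq_len)
            lo, hi = max(0, start - window), min(seq_len, end + window)
            scores = q[start:end] @ k[lo:hi].T / dim ** 0.5
            out[start:end] = F.softmax(scores, dim=-1) @ v[lo:hi]
        return out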
CCFuser: Seamless Communication-Compute Fusion for MoE Models with GPU Shared Memory
Mixture of Experts (MoE) architectures improve model quality by scaling up parameter counts. However, their use in distributed training is constrained by significant communication overheads and expert load imbalance. Existing methods allow only coarse-grained overlap of communication and computation, which slightly alleviates communication costs but significantly reduces computational efficiency, and current solutions to load imbalance often compromise model quality. We propose CCFuser, a new framework designed for efficient MoE model training. CCFuser replaces the expensive All2All operation typical of MoE architectures with efficient GPU shared-memory access, enabling local and remote data to be computed concurrently in a fused kernel and significantly boosting the achieved FLOPS of GEMM (general matrix multiplication) operations.
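At a very high level, the fusion can be pictured as feeding locally resident tokens and peer-visible remote tokens into one large expert GEMM, so the expert weights are read once for all inputs. The fragment below only conveys that dataflow; the actual fused kernel operates on GPU shared/peer memory rather than explicit copies, and the names here are illustrative.

    # Hedged sketch: one GEMM over local + remote tokens for a single expert (PyTorch).
    import torch

    def fused_expert_gemm(local_tokens, remote_token_views, expert_weight):
        """remote_token_views: tensors exposed by peer GPUs holding tokens routed
        to this expert (placeholder for peer-accessible memory)."""
        rows = [local_tokens] + [t.to(local_tokens.device, non_blocking=True)
                                 for t in remote_token_views]
        inputs = torch.cat(rows, dim=0)
        return inputs @ expert_weight   # expert weights read once for all rows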
Optimizing Deep Learning Recommendation Models with TT Decomposition
Deep Learning Recommendation Models (DLRMs) play a crucial role in personalized recommendation, ad serving, and e-commerce. Their training, however, is limited by the high memory consumption of embedding tables and by communication overhead in distributed settings, leading to inefficient computation. Existing methods such as Tensor Train (TT) decomposition can effectively compress embedding tables but introduce additional computational overhead, and traditional distributed training frameworks also face data-transfer bottlenecks. To address these issues, this study proposes the EcoRec framework, which combines TT decomposition with distributed training and reduces redundant computation by optimizing the computation pattern.
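To make the compression concrete, the sketch below reconstructs one embedding row from three TT cores; the factorization sizes and core shapes are illustrative assumptions, not EcoRec's layout.

    # Hedged sketch: TT-compressed embedding lookup (PyTorch).
    import torch

    n, d, r = (8, 8, 8), (4, 4, 4), 16       # 512 rows, 64-dim embeddings, TT rank 16
    G1 = torch.randn(n[0], d[0], r)          # core 1
    G2 = torch.randn(r, n[1], d[1], r)       # core 2
    G3 = torch.randn(r, n[2], d[2])          # core 3

    def tt_embedding_lookup(idx):
        # Decompose the flat row index into one sub-index per core.
        i3 = idx % n[2]; idx //= n[2]
        i2 = idx % n[1]; i1 = idx // n[1]
        a, b, c = G1[i1], G2[:, i2], G3[:, i3]
        row = torch.einsum('ap,pbq,qc->abc', a, b, c)
        return row.reshape(-1)               # full embedding row, rebuilt on the fly

    vec = tt_embedding_lookup(137)           # cores store ~9K values vs. 32K dense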
Automatic Scheduling and Optimization for Compute-Intensive Operators
In deep learning, operator fusion is a key approach to improving computational efficiency, but traditional methods fall short when handling chains of compute-intensive operators, creating performance bottlenecks. The MCFuser framework addresses these challenges with an efficient fusion method for memory-bound, compute-intensive (MBCI) operator chains, which improves data locality and eliminates redundant memory accesses. On NVIDIA GPUs, MCFuser achieves up to a 5.9x speedup while reducing tuning time by a factor of 70, providing robust support for deep learning computation optimization.
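The benefit of fusing a chain of compute-intensive operators is easiest to see on a two-GEMM chain: if the intermediate is produced and consumed tile by tile, it never needs to be written back to global memory between operators. The sketch below shows only that tiling idea in plain PyTorch, not MCFuser's automatically generated and tuned schedules.

    # Hedged sketch: tiled fusion of D = (A @ B) @ C (PyTorch).
    import torch

    def fused_gemm_chain(A, B, C, tile=128):
        M, _ = A.shape
        P = C.shape[1]
        D = torch.empty(M, P, device=A.device, dtype=A.dtype)
        for m in range(0, M, tile):
            # The (tile x N) intermediate is consumed immediately, so the full
            # M x N intermediate matrix is never materialized.
            t = A[m:m + tile] @ B
            D[m:m + tile] = t @ C
        return D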