Computing Systems for AI

Memory-Efficient Training of Mixture-of-Experts Models with Adaptive Pipelining
Mixture-of-Experts (MoE) models have become a popular approach to scaling up pre-trained models. By dynamically activating experts for conditional computation, MoE models greatly increase the number of parameters in a neural network, allowing it to absorb far more knowledge without a proportional increase in per-token computation. However, even with existing system and algorithm optimizations, low communication efficiency and high memory consumption remain significant challenges. This paper introduces MPipeMoE, which accelerates MoE training through adaptive and memory-efficient pipelining. Building on adaptive pipelining, MPipeMoE applies memory-reuse strategies that eliminate memory redundancy and reduce overall memory requirements.
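As a rough illustration of the pipelining idea, the sketch below splits a token batch into micro-batches and overlaps the asynchronous all-to-all dispatch of one micro-batch with the expert computation of the previous one, while reusing two fixed receive buffers instead of allocating per micro-batch. Names such as expert_ffn are placeholders; this is not MPipeMoE's implementation, and the combine all-to-all is omitted for brevity.

    # Hedged sketch: pipelined MoE dispatch with reusable buffers (PyTorch).
    # Assumes torch.distributed is initialized and the batch divides evenly.
    import torch
    import torch.distributed as dist

    def pipelined_moe_layer(tokens, expert_ffn, n_micro=4):
        micro = list(tokens.chunk(n_micro))            # split batch into micro-batches
        # Two receive buffers are reused across all micro-batches (double buffering)
        # instead of allocating a fresh buffer for every dispatch.
        recv = [torch.empty_like(micro[0]) for _ in range(2)]
        handles, outputs = [None, None], []
        for i, m in enumerate(micro):
            slot = i % 2
            # Asynchronous all-to-all dispatch overlaps with the expert FFN
            # running on the previous micro-batch.
            handles[slot] = dist.all_to_all_single(recv[slot], m.contiguous(),
                                                   async_op=True)
            if i > 0:
                prev = (i - 1) % 2
                handles[prev].wait()
                outputs.append(expert_ffn(recv[prev]))
        last = (len(micro) - 1) % 2
        handles[last].wait()
        outputs.append(expert_ffn(recv[last]))
        return torch.cat(outputs)                      # combine all-to-all omitted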
HyDRA: Efficient Training Framework for Large-Scale Graph Neural Networks
This paper introduces HyDRA, a framework designed to address the challenges of training graph neural networks (GNNs) on large-scale graphs, such as memory constraints and data-transfer bottlenecks. By fusing sampling and data transfer into a single kernel operation, and by using mechanisms such as multi-GPU memory sharing and multi-node feature retrieval, HyDRA significantly improves the efficiency of sampling-based mini-batch training.
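The core step of retrieving sharded features for a sampled mini-batch can be sketched roughly as follows; the partitioning scheme and helper names here are assumptions for illustration, not HyDRA's fused kernels.

    # Hedged sketch: gather features for sampled nodes from per-GPU shards (PyTorch).
    import torch

    def gather_sharded_features(node_ids, feature_shards, part_size):
        """node_ids: sampled global node ids on the training GPU.
        feature_shards: shard g (on GPU g) holds rows [g*part_size, (g+1)*part_size)."""
        out = torch.empty(node_ids.numel(), feature_shards[0].shape[1],
                          device=node_ids.device)
        owner = node_ids // part_size                  # which GPU owns each node
        for g, shard in enumerate(feature_shards):
            mask = owner == g
            local = (node_ids[mask] - g * part_size).to(shard.device)
            # Gather rows on the owning GPU, then move only the small result
            # to the training GPU (peer-to-peer copy).
            out[mask] = shard.index_select(0, local).to(node_ids.device)
        return out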
Sven: Communication-Redundancy-Free Training Framework for Distributed Temporal Graph Neural Networks
As graph neural networks (GNNs) are extended to dynamic graphs, temporal graph neural networks (TGNNs) have shown remarkable capabilities in modeling time-evolving graph data. In distributed TGNN training, however, temporal dependencies trigger substantial cross-device communication, much of it consisting of redundant data transfers, and existing systems struggle to eliminate this redundancy in data reuse and transmission, leaving severe communication bottlenecks in distributed environments. To address this, we propose Sven, an algorithm-system co-designed library built to accelerate TGNN training on multi-GPU platforms. Sven exploits the dependency patterns of TGNN models to build a communication-redundancy-free graph organization that fundamentally reduces redundant data transfers.
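One simple way to see what removing transfer redundancy means: when many temporal dependencies point to the same remote vertices, each unique vertex should be fetched only once per step. The fragment below is a generic deduplication sketch, not Sven's graph organization; fetch_fn stands in for the actual cross-device transfer.

    # Hedged sketch: fetch each unique remote temporal neighbor once (PyTorch).
    import torch

    def fetch_remote_once(remote_ids, fetch_fn):
        """remote_ids: ids of remote temporal neighbors, typically with duplicates."""
        uniq, inverse = torch.unique(remote_ids, return_inverse=True)
        fetched = fetch_fn(uniq)       # one transfer per unique id
        return fetched[inverse]        # re-expand to the original layout locally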
Raptor-T: Memory-Efficient Sparse Transformer Model for Long and Variable Sequence Processing
Transformer-based models have made significant progress in many fields, largely because the self-attention mechanism captures contextual relationships across the input sequence. However, processing long sequences remains computationally expensive, primarily due to the O(n²) complexity of self-attention. Sparse attention was proposed to reduce this quadratic dependency to a linear one. Nevertheless, efficiently deploying sparse Transformers still faces two major obstacles: 1) the irregular sparsity introduced by the algorithm's approximation leads to suboptimal system-level optimizations; 2) the variability of input sequence lengths results in inefficient computation and memory access. This paper introduces Raptor-T, a Transformer framework dedicated to addressing these system challenges of sparse attention models for long and variable-length sequence processing.
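For intuition about why sparsity brings the cost down to linear, the toy sliding-window attention below restricts each query block to nearby keys. It is a conceptual sketch of one common sparse pattern, not Raptor-T's fused kernels or its handling of variable-length batches.

    # Hedged sketch: block-local sliding-window attention with linear cost (PyTorch).
    import torch
    import torch.nn.functional as F

    def sliding_window_attention(q, k, v, window=256):
        """q, k, v: (seq_len, dim); each query block attends only to keys within
        `window` positions, so cost grows linearly with sequence length."""
        seq_len, dim = q.shape
        out = torch.empty_like(q)
        for start in range(0, seq_len, window):
            end = min(start + window, seq_len)
            lo, hi = max(0, start - window), min(seq_len, end + window)
            scores = q[start:end] @ k[lo:hi].T / dim ** 0.5
            out[start:end] = F.softmax(scores, dim=-1) @ v[lo:hi]
        return out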
CCFuser: Seamless Communication-Compute Fusion for MoE Models with GPU Shared Memory
Mixture of Experts (MoE) architectures improve model quality by scaling up parameter counts. However, their use in distributed training is constrained by significant communication overheads and expert load imbalance. Existing methods allow only coarse-grained overlap of communication and computation, which slightly alleviates communication costs but significantly reduces computational efficiency, and current solutions to load imbalance often compromise model quality. We propose CCFuser, a new framework designed for efficient MoE model training. CCFuser replaces the expensive All2All operation typical of MoE architectures with efficient GPU shared-memory access, enabling local and remote data to be computed concurrently in a fused kernel and significantly boosting the achieved FLOPS of GEMM (general matrix multiplication) operations.
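At a very high level, the fusion can be pictured as feeding locally resident tokens and peer-visible remote tokens into one large expert GEMM, so the expert weights are read once for all inputs. The fragment below only conveys that dataflow; the actual fused kernel operates on GPU shared/peer memory rather than explicit copies, and the names here are illustrative.

    # Hedged sketch: one GEMM over local + remote tokens for a single expert (PyTorch).
    import torch

    def fused_expert_gemm(local_tokens, remote_token_views, expert_weight):
        """remote_token_views: tensors exposed by peer GPUs holding tokens routed
        to this expert (placeholder for peer-accessible memory)."""
        rows = [local_tokens] + [t.to(local_tokens.device, non_blocking=True)
                                 for t in remote_token_views]
        inputs = torch.cat(rows, dim=0)
        return inputs @ expert_weight   # expert weights read once for all rows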
Optimizing Deep Learning Recommendation Models with TT Decomposition
Deep Learning Recommendation Models (DLRMs) play a crucial role in personalized recommendation, ad serving, and e-commerce. Their training, however, is limited by the high memory consumption of embedding tables and by communication overhead in distributed settings, leading to inefficient computation. Existing methods such as Tensor Train (TT) decomposition can effectively compress embedding tables but introduce additional computational overhead, and traditional distributed training frameworks also face data-transfer bottlenecks. To address these issues, this study proposes the EcoRec framework, which combines TT decomposition with distributed training and reduces redundant computation by optimizing the computation pattern.
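To make the compression concrete, the sketch below reconstructs one embedding row from three TT cores; the factorization sizes and core shapes are illustrative assumptions, not EcoRec's layout.

    # Hedged sketch: TT-compressed embedding lookup (PyTorch).
    import torch

    n, d, r = (8, 8, 8), (4, 4, 4), 16       # 512 rows, 64-dim embeddings, TT rank 16
    G1 = torch.randn(n[0], d[0], r)          # core 1
    G2 = torch.randn(r, n[1], d[1], r)       # core 2
    G3 = torch.randn(r, n[2], d[2])          # core 3

    def tt_embedding_lookup(idx):
        # Decompose the flat row index into one sub-index per core.
        i3 = idx % n[2]; idx //= n[2]
        i2 = idx % n[1]; i1 = idx // n[1]
        a, b, c = G1[i1], G2[:, i2], G3[:, i3]
        row = torch.einsum('ap,pbq,qc->abc', a, b, c)
        return row.reshape(-1)               # full embedding row, rebuilt on the fly

    vec = tt_embedding_lookup(137)           # cores store ~9K values vs. 32K dense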
Automatic Scheduling and Optimization for Compute-Intensive Operators
In deep learning, operator fusion is a key approach to improving computational efficiency, but traditional methods fall short when handling chains of compute-intensive operators, creating performance bottlenecks. The MCFuser framework addresses these challenges with an efficient fusion method for memory-bound, compute-intensive (MBCI) operator chains, which improves data locality and eliminates redundant memory accesses. On NVIDIA GPUs, MCFuser achieves up to a 5.9x speedup while reducing tuning time by a factor of 70, providing robust support for deep learning computation optimization.
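The benefit of fusing a chain of compute-intensive operators is easiest to see on a two-GEMM chain: if the intermediate is produced and consumed tile by tile, it never needs to be written back to global memory between operators. The sketch below shows only that tiling idea in plain PyTorch, not MCFuser's automatically generated and tuned schedules.

    # Hedged sketch: tiled fusion of D = (A @ B) @ C (PyTorch).
    import torch

    def fused_gemm_chain(A, B, C, tile=128):
        M, _ = A.shape
        P = C.shape[1]
        D = torch.empty(M, P, device=A.device, dtype=A.dtype)
        for m in range(0, M, tile):
            # The (tile x N) intermediate is consumed immediately, so the full
            # M x N intermediate matrix is never materialized.
            t = A[m:m + tile] @ B
            D[m:m + tile] = t @ C
        return D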