HetCCL Accelerates LLM Training with Heterogeneous GPU Clusters
arXiv.org
· February 02, 2026
Heehoon Kim et al. introduce HetCCL, a collective communication library that enables RDMA-based communication between GPUs from different vendors without driver modifications.
- HetCCL unifies vendor-specific backends and enables RDMA-based communication across NVIDIA and AMD GPUs while still using the vendor libraries NCCL (NVIDIA) and RCCL (AMD). It requires no modifications to existing deep learning applications or GPU drivers, and the authors present it as a solution for multi-vendor GPU clusters.
- Publication: arXiv:2601.22585, submitted 30 Jan 2026.
- Artifacts: PDF, HTML, TeX source, and DOI link provided.
- License: CC BY-NC-ND 4.0.
- Evaluation: Experiments on a multi-vendor GPU cluster show that HetCCL matches NCCL and RCCL performance in homogeneous setups and scales in heterogeneous environments.