Fault-tolerant Collective Communication for LLM Training and Serving

arXiv.org · January 01, 2026

Wei Wang and co-authors present R^2CCL, a fault-tolerant collective communication library designed to provide lossless, low-overhead failover for LLM training and inference by exploiting multi-NIC hardware.

  • Main announcement: The paper introduces R^2CCL, which combines rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to keep jobs progressing through network and NIC failures. It was evaluated on two 8-GPU H100 InfiniBand servers and in large-scale ML simulators modeling hundreds of GPUs. The authors report overheads below 1% for training and below 3% for inference, and that R^2CCL outperforms the baselines AdapCC and DejaVu by 12.18× and 47×, respectively.
  • Background and details: The authors quantify the cost of slow recovery: network faults can waste 10–15% of GPU hours. Submission details: arXiv:2512.25059, submitted 31 Dec 2025 (v1); PDF and TeX source are available.
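
To make "bandwidth-aware load redistribution" concrete, here is a minimal sketch of one way traffic could be re-split across the NICs that survive a failure. It is not taken from the paper: the proportional-split policy, the `Nic` class, and the `redistribute` function are all hypothetical, chosen only to illustrate the general idea of weighting each healthy NIC by its available bandwidth.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Nic:
    """Hypothetical description of one NIC (not an R^2CCL type)."""
    name: str
    bandwidth_gbps: float
    healthy: bool = True


def redistribute(total_bytes: int, nics: list[Nic]) -> dict[str, int]:
    """Split total_bytes across healthy NICs in proportion to bandwidth.

    Assumed policy for illustration only; the paper's actual algorithm
    may differ.
    """
    alive = [n for n in nics if n.healthy and n.bandwidth_gbps > 0]
    if not alive:
        raise RuntimeError("no healthy NIC available")

    total_bw = sum(n.bandwidth_gbps for n in alive)
    # Round each share down, then hand the leftover bytes to the fastest NIC
    # so that every byte is assigned exactly once.
    shares = {
        n.name: int(total_bytes * n.bandwidth_gbps / total_bw) for n in alive
    }
    remainder = total_bytes - sum(shares.values())
    fastest = max(alive, key=lambda n: n.bandwidth_gbps)
    shares[fastest.name] += remainder
    return shares


if __name__ == "__main__":
    nics = [
        Nic("nic0", 400.0),
        Nic("nic1", 400.0),
        Nic("nic2", 200.0),
        Nic("nic3", 400.0, healthy=False),  # simulated failure
    ]
    print(redistribute(1000, nics))
```

In this sketch the failed `nic3` receives no traffic and its share is absorbed by the remaining NICs according to their bandwidth, so no data is dropped; a real library would also have to migrate in-flight connections, which this example does not model.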