Fault-tolerant Collective Communication for LLM Training and Serving

arXiv.org · January 01, 2026

Wei Wang and co-authors present R^2CCL, a fault-tolerant collective communication library designed to provide lossless, low-overhead failover for LLM training and inference by exploiting multi-NIC hardware.

  • Main announcement: The paper introduces R^2CCL, which combines rapid connection migration, bandwidth-aware load redistribution, and resilient collective algorithms to keep jobs making progress through network and NIC failures. It was evaluated on two 8-GPU H100 InfiniBand servers and in large-scale ML simulators modeling hundreds of GPUs. The authors report overheads below 1% for training and below 3% for inference, and say R^2CCL outperforms the baselines AdapCC and DejaVu by 12.18× and 47×, respectively.
  • Background and details: The authors estimate that network faults can waste 10–15% of GPU hours because recovery is slow. R^2CCL targets this cost by using multi-NIC hardware for lossless failover. Submission details: arXiv:2512.25059, submitted 31 Dec 2025 (v1); PDF and TeX source are available.
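
To make the idea of bandwidth-aware load redistribution concrete, here is a minimal Python sketch. It is not R^2CCL's algorithm or API, which this summary does not describe. It assumes a simple proportional policy: when a NIC fails, the bytes of a transfer are split across the surviving NICs in proportion to their link bandwidth. The function name `redistribute`, the NIC labels, and the bandwidth figures are all hypothetical.

```python
def redistribute(total_bytes, nic_bandwidth_gbps, failed):
    """Split total_bytes across healthy NICs in proportion to bandwidth.

    Illustrative assumption only: a proportional policy, not the paper's method.
    nic_bandwidth_gbps maps a NIC label to an integer bandwidth in Gb/s.
    failed is a set of NIC labels that are currently unusable.
    """
    # Keep only NICs that are still up and have usable bandwidth.
    healthy = {
        nic: bw
        for nic, bw in nic_bandwidth_gbps.items()
        if nic not in failed and bw > 0
    }
    if not healthy:
        raise RuntimeError("no healthy NIC left to carry the traffic")

    total_bw = sum(healthy.values())

    # Integer proportional share for each surviving NIC (rounded down).
    shares = {nic: total_bytes * bw // total_bw for nic, bw in healthy.items()}

    # Rounding down can leave a few bytes unassigned; give them to the
    # fastest NIC so that every byte is still sent (nothing is dropped).
    remainder = total_bytes - sum(shares.values())
    fastest = max(healthy, key=healthy.get)
    shares[fastest] += remainder
    return shares


# Hypothetical 4-NIC host where one 400 Gb/s NIC has failed.
nics = {"nic0": 400, "nic1": 400, "nic2": 200, "nic3": 200}
plan = redistribute(1000, nics, failed={"nic1"})
print(plan)
```

In this example the failed NIC's traffic is absorbed by the remaining three, with the faster NIC taking twice the share of each slower one. A real library would also have to migrate in-flight connections and adapt the collective algorithm itself, which this sketch leaves out.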