Reinforcement learning scheduler RLTune optimizes DL GPU clusters
arXiv.org
· December 12, 2025
The authors introduce RLTune, a framework that combines reinforcement learning (RL) with mixed-integer linear programming (MILP) to dynamically schedule deep learning (DL) workloads on heterogeneous GPU clusters.
- RLTune pairs RL-driven job prioritization with MILP-based job-to-node mapping and requires no per-job profiling. Trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba, it achieves up to 20% higher GPU utilization, 81% lower queueing delay, and 70% shorter job completion time (JCT).
- The work targets modern cloud platforms that host large-scale DL workloads, where GPU heterogeneity and limited visibility into applications make scheduling difficult. The authors position RLTune as application-agnostic and suitable for deployment at cloud-provider scale, with the goal of more efficient and fair DL workload management.
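
To make the two-stage idea concrete (priority scores first, then an integer job-to-node assignment), here is a minimal sketch. It is not the authors' implementation. The priority values are hand-picked stand-ins for the output of a trained RL policy, the objective (priority times relative node speed) is an assumed toy objective, and the small integer program is solved by exhaustive enumeration rather than by a real MILP solver. All names (`assign_jobs`, `jobs`, `nodes`, `priorities`) are hypothetical.

```python
from itertools import product


def assign_jobs(jobs, nodes, priorities):
    """Exhaustively solve a tiny job-to-node assignment problem.

    jobs:       dict job name -> GPUs requested
    nodes:      dict node name -> (free GPUs, relative speed)
    priorities: dict job name -> score (stand-in for an RL policy's output)

    Returns (best objective value, dict job name -> node name or None).
    A job mapped to None stays in the queue.
    """
    job_names = list(jobs)
    choices = [None] + list(nodes)  # each job goes to one node or waits
    best_value, best_map = -1.0, {}

    # Enumerate every assignment; feasible only for toy-sized inputs.
    for combo in product(choices, repeat=len(job_names)):
        used = {name: 0 for name in nodes}
        value = 0.0
        for job, node in zip(job_names, combo):
            if node is None:
                continue
            used[node] += jobs[job]
            # Assumed objective: favor high-priority jobs on fast nodes.
            value += priorities[job] * nodes[node][1]

        # Capacity constraint: a node cannot hand out more GPUs than it has.
        fits = all(used[name] <= nodes[name][0] for name in nodes)
        if fits and value > best_value:
            best_value, best_map = value, dict(zip(job_names, combo))

    return best_value, best_map


jobs = {"a": 4, "b": 2, "c": 4}                   # GPUs requested
nodes = {"fast": (4, 2.0), "slow": (4, 1.0)}      # (free GPUs, speed)
priorities = {"a": 3.0, "b": 1.0, "c": 2.0}       # placeholder RL scores

value, mapping = assign_jobs(jobs, nodes, priorities)
print(value, mapping)
```

In this toy instance the highest-priority job lands on the faster node, the next one takes the slower node, and the lowest-priority job waits. At realistic cluster sizes enumeration is infeasible, which is where a MILP solver would take over the second stage.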