Reinforcement learning scheduler RLTune optimizes DL GPU clusters

arXiv.org · December 12, 2025

The authors introduce RLTune, a framework that combines reinforcement learning (RL) with mixed-integer linear programming (MILP) to dynamically schedule deep learning (DL) workloads on heterogeneous GPU clusters.

RLTune pairs RL-driven job prioritization with MILP-based job-to-node mapping and is trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba. The authors report up to 20% higher GPU utilization, up to 81% lower queueing delay, and up to 70% shorter job completion time (JCT), all without per-job profiling.
The work targets modern cloud platforms that host large-scale DL workloads, where GPU heterogeneity and limited visibility into applications make scheduling difficult. The authors position RLTune as application-agnostic and suitable for deployment at cloud-provider scale, with the goal of more efficient and fairer DL workload management.
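
The two-stage idea described above (prioritize jobs first, then map them to nodes) can be illustrated with a toy sketch. This is not the authors' implementation, and every name in it is hypothetical: `prioritize` uses a hand-written score where RLTune would use a learned RL policy, and `map_jobs_to_nodes` enumerates all placements of a tiny batch where RLTune would hand the same kind of 0/1 assignment problem to a MILP solver. The job fields, the per-node `speed` table, and the objective are assumptions chosen only for illustration.

```python
from itertools import product


def prioritize(jobs, score):
    """Order jobs by a priority score (hypothetical stand-in for an RL policy)."""
    return sorted(jobs, key=score, reverse=True)


def map_jobs_to_nodes(jobs, free_gpus, speed):
    """Pick the feasible placement with the highest total weighted throughput.

    Each job is placed on exactly one node or left queued (None). Exhaustive
    enumeration stands in for a MILP solver and only works for tiny batches.
    """
    nodes = list(free_gpus)
    best_value, best_plan = -1.0, {}
    for choice in product([None] + nodes, repeat=len(jobs)):
        used = {n: 0 for n in nodes}
        value = 0.0
        for job, node in zip(jobs, choice):
            if node is None:
                continue  # job stays in the queue
            used[node] += job["gpus"]
            value += speed[node] * job["gpus"]
        # Capacity constraint: no node may be oversubscribed.
        feasible = all(used[n] <= free_gpus[n] for n in nodes)
        if feasible and value > best_value:
            best_value = value
            best_plan = {
                job["id"]: node
                for job, node in zip(jobs, choice)
                if node is not None
            }
    return best_plan, best_value


# Assumed toy data: three queued jobs, two node types of different speed.
jobs = [
    {"id": "a", "gpus": 4, "wait": 10},
    {"id": "b", "gpus": 2, "wait": 30},
    {"id": "c", "gpus": 4, "wait": 5},
]
free_gpus = {"v100": 4, "a100": 4}
speed = {"v100": 1.0, "a100": 2.0}

# Hypothetical score: favor jobs that have waited long relative to their size.
ordered = prioritize(jobs, score=lambda j: j["wait"] / j["gpus"])
batch = ordered[:2]  # schedule only the two highest-priority jobs this round
plan, value = map_jobs_to_nodes(batch, free_gpus, speed)
print(plan, value)
```

Separating the stages keeps the search small: the prioritizer decides which jobs enter the batch, and the mapper only has to place that batch under the capacity constraints. At realistic scale, the enumeration would be replaced by an actual MILP solver.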
