Reinforcement learning scheduler RLTune optimizes DL GPU clusters

arXiv.org · December 12, 2025

The authors introduce RLTune, a framework that combines reinforcement learning (RL) with mixed-integer linear programming (MILP) to dynamically schedule deep learning (DL) workloads on heterogeneous GPU clusters.

  • RLTune combines RL-driven job prioritization with MILP-based job-to-node mapping and is trained on large-scale production traces from Microsoft Philly, Helios, and Alibaba. The authors report up to 20% higher GPU utilization, up to 81% lower queueing delay, and up to 70% shorter job completion time (JCT), all without per-job profiling.
  • The work targets cloud platforms that host large-scale DL workloads, where GPU heterogeneity and limited visibility into applications complicate scheduling. The authors position RLTune as application-agnostic and suitable for deployment at cloud-provider scale, with the aim of more efficient and fairer DL workload management.
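
As a rough illustration of the two-stage idea in the first bullet (priority scores feeding a job-to-node mapping), here is a toy Python sketch. It is not RLTune's implementation: the priority values are hard-coded stand-ins for what an RL policy might output, the objective (priority-weighted GPUs placed) is an assumption, and brute-force enumeration stands in for a MILP solver. All names and data are hypothetical.

```python
from itertools import product

# Hypothetical toy data; not taken from the paper or its traces.
# "priority" stands in for a score an RL policy might assign to each job.
jobs = {
    "a": {"gpus": 4, "priority": 0.9},
    "b": {"gpus": 2, "priority": 0.5},
    "c": {"gpus": 4, "priority": 0.7},
}
nodes = {"n1": 4, "n2": 4}  # free GPUs per node


def best_mapping(jobs, nodes):
    """Exhaustively search job-to-node assignments (None = job stays queued).

    Assumed objective: maximize the sum of priority * GPUs over placed jobs,
    subject to each node's GPU capacity. A real scheduler would pass an
    equivalent formulation to a MILP solver instead of enumerating.
    """
    job_ids = sorted(jobs)
    choices = [None] + sorted(nodes)
    best_score, best_assign = -1.0, {}
    for combo in product(choices, repeat=len(job_ids)):
        used = {n: 0 for n in nodes}
        score = 0.0
        for job_id, node in zip(job_ids, combo):
            if node is None:
                continue  # job is left in the queue
            used[node] += jobs[job_id]["gpus"]
            score += jobs[job_id]["priority"] * jobs[job_id]["gpus"]
        feasible = all(used[n] <= nodes[n] for n in nodes)
        if feasible and score > best_score:
            best_score, best_assign = score, dict(zip(job_ids, combo))
    return best_score, best_assign


if __name__ == "__main__":
    score, assignment = best_mapping(jobs, nodes)
    print(score, assignment)
```

With this toy input, jobs a and c each fill one node and job b stays queued, because no node has room for it alongside a 4-GPU job. Enumeration grows exponentially with the number of jobs, which is one reason a production scheduler would rely on a solver rather than this approach.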