SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via LLM

arXiv.org · February 02, 2026

The authors (Jianchang Su et al.) announce the SAIR autoscaling framework for multi-stage ML inference pipelines.

Main announcement: SAIR uses an LLM as an in-context reinforcement learning controller, combining Pareto-dominance reward shaping, surprisal-guided experience retrieval, and fine-grained GPU rate control via user-space CUDA interception. Evaluated on four ML serving pipelines under three workload patterns, it reports up to 50% P99 latency improvement, up to 97% effective cost reduction (under GPU rate-control assumptions), and 86% bottleneck detection accuracy, all without offline training.
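
To make the term "Pareto-dominance reward shaping" concrete, here is a minimal sketch of a Pareto-dominance check over two objectives, P99 latency and cost. It is an illustration only: the summary does not give the paper's actual reward definition, so the `Outcome` type, its field names, and the +1/0/-1 reward values are all assumptions made for this example.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    """Hypothetical measurement of one scaling decision (not from the paper)."""
    p99_latency_ms: float  # lower is better
    cost_per_hour: float   # lower is better


def dominates(a: Outcome, b: Outcome) -> bool:
    """Return True if a is no worse than b on both objectives
    and strictly better on at least one of them."""
    no_worse = (a.p99_latency_ms <= b.p99_latency_ms
                and a.cost_per_hour <= b.cost_per_hour)
    strictly_better = (a.p99_latency_ms < b.p99_latency_ms
                       or a.cost_per_hour < b.cost_per_hour)
    return no_worse and strictly_better


def pareto_reward(new: Outcome, previous: Outcome) -> int:
    """Assumed reward: +1 if the new outcome dominates the previous one,
    -1 if it is dominated, 0 if the two are incomparable or equal."""
    if dominates(new, previous):
        return 1
    if dominates(previous, new):
        return -1
    return 0
```

Under these assumptions, an action that lowers latency while raising cost is scored 0 instead of being forced onto a single weighted scale, which is the usual motivation for a dominance-based signal.
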
Context and details: The paper was submitted to arXiv (v1) on 29 Jan 2026 by Jianchang Su and six co-authors. It includes a regret analysis that decomposes error into retrieval coverage and LLM selection components. The full text is available as PDF, HTML, and TeX source, and is licensed under CC BY 4.0.