SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via LLM
arXiv.org
· February 02, 2026
The authors (Jianchang Su et al.) announce the SAIR autoscaling framework for multi-stage ML inference pipelines.
- Main announcement: SAIR uses an LLM as an in-context reinforcement learning controller, combining Pareto-dominance reward shaping, surprisal-guided experience retrieval, and fine-grained GPU rate control via user-space CUDA interception. Evaluated on four ML serving pipelines under three workload patterns, it reports up to 50% P99 latency improvement, up to 97% effective cost reduction (under GPU rate-control assumptions), and 86% bottleneck detection accuracy, all without offline training.
- Context and details: Submitted to arXiv (v1) on 29 Jan 2026 by Jianchang Su and six co-authors. The paper includes a regret analysis that decomposes error into retrieval coverage and LLM selection components. The full text is available as PDF, HTML, and TeX source under a CC BY 4.0 license.
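
To make the idea of Pareto-dominance reward shaping concrete, here is a minimal Python sketch of the general technique for two objectives, P99 latency and cost. It is not taken from the paper: the `Outcome` type, its field names, and the +1/0/-1 reward scale are assumptions chosen for illustration, and SAIR's actual reward definition may differ.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    """Hypothetical result of one scaling action (lower is better for both fields)."""
    p99_latency_ms: float
    cost_per_hour: float


def dominates(a: Outcome, b: Outcome) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = (a.p99_latency_ms <= b.p99_latency_ms
                and a.cost_per_hour <= b.cost_per_hour)
    strictly_better = (a.p99_latency_ms < b.p99_latency_ms
                       or a.cost_per_hour < b.cost_per_hour)
    return no_worse and strictly_better


def pareto_reward(new: Outcome, old: Outcome) -> int:
    """Assumed reward scale: +1 if the new outcome dominates, -1 if dominated, 0 for a trade-off."""
    if dominates(new, old):
        return 1
    if dominates(old, new):
        return -1
    return 0


if __name__ == "__main__":
    before = Outcome(p99_latency_ms=200.0, cost_per_hour=10.0)
    after = Outcome(p99_latency_ms=150.0, cost_per_hour=8.0)
    print(pareto_reward(after, before))
```

In this sketch, an action that lowers latency while raising cost (or the reverse) earns a neutral reward, so the controller is only rewarded for changes that improve one objective without hurting the other.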