ORBITFLOW: SLO-Aware Long-Context LLM Serving with KV Reconfiguration

arXiv.org · January 19, 2026

Xinyue Ma and co-authors have introduced ORBITFLOW, a fine-grained and adaptive KV cache management system designed to meet latency SLOs in long-context LLM serving.

  • Main announcement: The paper presents ORBITFLOW, which combines three mechanisms. A lightweight ILP solver selects, per request, which layers’ KV caches stay on the GPU within the GPU memory constraints. KV placements are then refined continuously based on runtime feedback. Under heavy load, a fallback mechanism temporarily defers large in-flight requests to preserve SLO attainment. The paper was accepted at the 52nd International Conference on Very Large Data Bases (VLDB 2026) and was submitted to arXiv on Mon, 5 Jan 2026 (arXiv:2601.10729).

  • Performance and implementation details: Compared with existing offloading methods, the authors report SLO attainment improvements of up to 66% for time per output token (TPOT) and 48% for time between tokens (TBT), a 38% reduction in 95th-percentile latency, and up to 3.3x higher throughput. The key mechanisms are per-request GPU KV retention, runtime reconfiguration, and a defer/fallback policy for heavy requests. The paper is available under CC BY 4.0, with PDF, HTML, and TeX source links on arXiv.
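
The per-request placement decision described above can be illustrated with a toy sketch. This is not the paper's formulation: the summary gives no details of the ILP, so the cost model (a fixed base step time plus a transfer cost for each offloaded layer), the Request fields, the objective, and all numbers below are assumptions made for illustration. Exhaustive search stands in for an ILP solver, which is only practical for very small instances.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Request:
    """Hypothetical request description; field names are illustrative."""
    name: str
    num_layers: int          # transformer layers in the model
    kv_bytes_per_layer: int  # KV cache size of one layer for this request
    slo_ms: float            # per-token latency target


def step_latency_ms(req, layers_on_gpu, base_ms, pcie_bytes_per_ms):
    """Assumed cost model: base compute time plus time to fetch offloaded layers."""
    offloaded = req.num_layers - layers_on_gpu
    return base_ms + offloaded * req.kv_bytes_per_layer / pcie_bytes_per_ms


def choose_placement(requests, gpu_budget_bytes, base_ms, pcie_bytes_per_ms):
    """Pick how many layers of each request keep their KV cache on the GPU.

    Tries every combination that fits the memory budget, prefers plans that
    meet the most SLOs, and breaks ties by lowest total estimated latency.
    """
    best_key, best_plan = None, None
    for plan in product(*(range(r.num_layers + 1) for r in requests)):
        used = sum(k * r.kv_bytes_per_layer for k, r in zip(plan, requests))
        if used > gpu_budget_bytes:
            continue  # plan does not fit in GPU memory
        latencies = [
            step_latency_ms(r, k, base_ms, pcie_bytes_per_ms)
            for k, r in zip(plan, requests)
        ]
        met = sum(lat <= r.slo_ms for lat, r in zip(latencies, requests))
        key = (-met, sum(latencies))
        if best_key is None or key < best_key:
            best_key, best_plan = key, plan
    return best_plan


# Made-up example: a short-context and a long-context request share one GPU.
requests = [
    Request("short", num_layers=4, kv_bytes_per_layer=100, slo_ms=12.0),
    Request("long", num_layers=4, kv_bytes_per_layer=300, slo_ms=20.0),
]
plan = choose_placement(requests, gpu_budget_bytes=800,
                        base_ms=10.0, pcie_bytes_per_ms=100.0)
print(dict(zip((r.name for r in requests), plan)))
```

In this sketch, runtime reconfiguration would correspond to calling choose_placement again whenever measured latencies or the set of active requests change. A deferral fallback could be approximated by removing the largest request from the list when no plan meets enough SLOs.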