CONCUR: Congestion-Based Concurrency Control for Agentic LLM Inference
arXiv.org
· February 02, 2026
Qiaoling Chen et al. have introduced CONCUR, a congestion-based concurrency control layer for agentic batch LLM inference (arXiv submission v1, 30 Jan 2026).
- Main announcement: The paper presents CONCUR, a lightweight control layer that applies agent-level admission control, inspired by congestion control, to bound aggregate GPU KV cache pressure, prevent middle-phase thrashing, and preserve execution continuity. Reported throughput improvements reach up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3, with evaluation across large models and real-world agent workloads.
- Background and details: The authors identify middle-phase thrashing as a collapse in cache efficiency caused by long-lived agents accumulating state. CONCUR uses a cache-aware control algorithm that dynamically adjusts the number of active agents based on runtime cache signals, and it is compatible with existing LLM serving systems. The paper is available on arXiv (arXiv:2601.22705) under CC BY 4.0.
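
To make the idea of congestion-style admission control concrete, the following is a minimal illustrative sketch, not the algorithm from the paper. It assumes an additive-increase/multiplicative-decrease (AIMD) rule borrowed from TCP congestion control and a single runtime signal, KV cache utilization in the range 0 to 1. The class name `AgentAdmissionController`, its parameters, and the 0.90 threshold are all hypothetical; the summary above does not specify CONCUR's actual signals or update rule.

```python
from dataclasses import dataclass


@dataclass
class AgentAdmissionController:
    """Hypothetical AIMD-style cap on active agents (not CONCUR's actual rule)."""

    window: float = 4.0            # current cap on concurrently active agents
    min_window: float = 1.0        # always allow at least one agent to run
    max_window: float = 256.0      # assumed upper bound on concurrency
    high_watermark: float = 0.90   # assumed utilization treated as congestion
    increase_step: float = 1.0     # additive increase when the cache has headroom
    decrease_factor: float = 0.5   # multiplicative decrease under cache pressure

    def update(self, kv_cache_utilization: float) -> int:
        """Adjust the cap from one cache reading and return it as an integer."""
        if kv_cache_utilization >= self.high_watermark:
            # Cache pressure: shrink the window quickly to avoid thrashing.
            self.window = max(self.min_window, self.window * self.decrease_factor)
        else:
            # Headroom: probe for more concurrency one agent at a time.
            self.window = min(self.max_window, self.window + self.increase_step)
        return int(self.window)

    def can_admit(self, active_agents: int) -> bool:
        """Admit a new agent only while the active count is below the cap."""
        return active_agents < int(self.window)


if __name__ == "__main__":
    controller = AgentAdmissionController()
    for reading in (0.50, 0.70, 0.95, 0.60):
        cap = controller.update(reading)
        print(f"utilization={reading:.2f} -> cap={cap}")
```

In this sketch a scheduler would call `update` once per control interval and consult `can_admit` before starting or resuming an agent, so agents that are already running keep their cached state while new ones wait. A real deployment would likely combine several signals (for example cache hit rate or eviction counts) rather than a single utilization threshold.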