CONCUR: Congestion-Based Concurrency Control for Agentic LLM Inference

arXiv.org · February 02, 2026
          

Qiaoling Chen et al. have introduced CONCUR, a congestion-based concurrency control layer for agentic batch LLM inference (arXiv submission v1, 30 Jan 2026).

Main announcement: CONCUR is a lightweight control layer that applies agent-level admission control, inspired by congestion control, to bound aggregate GPU KV cache pressure, prevent middle-phase thrashing, and preserve execution continuity. The authors report throughput improvements of up to 4.09x on Qwen3-32B and 1.9x on DeepSeek-V3, evaluated across large models and real-world agent workloads.
Background and details: The authors identify middle-phase thrashing as a collapse in cache efficiency caused by long-lived agents accumulating state. CONCUR uses a cache-aware control algorithm to dynamically adjust the number of active agents based on runtime cache signals, and it is compatible with existing LLM serving systems. The paper is available on arXiv (arXiv:2601.22705) under CC BY 4.0.
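
To make the idea of adjusting the number of active agents from cache signals concrete, here is a minimal sketch in Python. It is not the algorithm from the paper, whose details this summary does not give. It substitutes a generic additive-increase/multiplicative-decrease (AIMD) rule of the kind used in TCP congestion control. The class name, the two signals (cache hit rate and cache usage), and all thresholds are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class AimdAgentLimiter:
    """Hypothetical AIMD cap on concurrently active agents (not CONCUR itself)."""

    limit: float = 4.0           # current cap on active agents
    min_limit: float = 1.0       # never drop below one active agent
    max_limit: float = 256.0     # assumed upper bound
    increase: float = 1.0        # additive step while the cache looks healthy
    decrease: float = 0.5        # multiplicative factor under congestion
    hit_rate_floor: float = 0.6  # assumed threshold, illustrative only
    usage_ceiling: float = 0.9   # assumed threshold, illustrative only

    def update(self, cache_hit_rate: float, cache_usage: float) -> int:
        """Adjust the cap from one sample of runtime cache signals."""
        # Treat a nearly full cache with a poor hit rate as congestion.
        congested = (
            cache_usage >= self.usage_ceiling
            and cache_hit_rate < self.hit_rate_floor
        )
        if congested:
            self.limit = max(self.min_limit, self.limit * self.decrease)
        else:
            self.limit = min(self.max_limit, self.limit + self.increase)
        return int(self.limit)

    def can_admit(self, active_agents: int) -> bool:
        """Admit a new agent only while the active count is below the cap."""
        return active_agents < int(self.limit)


if __name__ == "__main__":
    limiter = AimdAgentLimiter()
    # Hypothetical (hit rate, usage) samples: healthy, healthy, congested.
    for hit_rate, usage in [(0.95, 0.5), (0.90, 0.7), (0.30, 0.97)]:
        print(limiter.update(hit_rate, usage))
```

In this sketch, control happens at the agent level: a paused agent simply waits for admission, while agents that are already running are left alone. That choice is meant to mirror the summary's point about preserving execution continuity, but it is an interpretation and not a description of the authors' implementation.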
