Bottlenecks for Efficient LLM Inference with KV Offloading

arXiv.org · January 29, 2026

William Meng and co-authors (University of Pennsylvania; Intel) submitted an arXiv paper analysing KV cache offloading bottlenecks for long-context LLM inference.

Main announcement: The paper derives $\kappa_{\text{crit}}$, the critical cached-to-prefill token ratio at which execution becomes memory-bound, and reports that typical workloads exceed this threshold by orders of magnitude. Its empirical measurements show 99% of latency spent on transfers and GPUs consuming only 28% of their rated TDP. The paper also proposes optimizations for hardware interconnects, model architectures, and scheduling algorithms.
Background and details: The paper was submitted to MLSys 2026 (arXiv:2601.19910, submitted 16 Dec 2025). arXiv provides the full-text PDF, HTML, and TeX source; the DOI is issued via DataCite, and the paper is licensed under CC BY 4.0.
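
To give a feel for what a critical cached-to-prefill ratio means, the sketch below uses a simple roofline-style model. It is not the paper's derivation: it assumes prefill compute costs roughly 2 FLOPs per parameter per new token, that offloaded KV entries must cross a single host-to-GPU link, and that attention FLOPs and compute/transfer overlap can be ignored. All function names and hardware numbers are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int) -> int:
    """Size of one token's KV cache entry: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kappa_crit(params: float, peak_flops: float, link_bytes_per_s: float,
               kv_bytes: int) -> float:
    """Cached-to-prefill token ratio at which transfer time equals compute time.

    Assumed model (not taken from the paper):
      compute time  = 2 * params * n_prefill / peak_flops
      transfer time = n_cached * kv_bytes / link_bytes_per_s
    Setting the two equal and solving for n_cached / n_prefill gives the value
    returned here.
    """
    return (2.0 * params * link_bytes_per_s) / (peak_flops * kv_bytes)


def is_transfer_bound(n_cached: int, n_prefill: int, crit: float) -> bool:
    """True when the workload's ratio exceeds the critical ratio."""
    return n_cached / n_prefill > crit


# Hypothetical setup: 8B-parameter model, 32 layers, 8 KV heads of dimension
# 128, fp16 cache, 1e15 FLOP/s peak compute, 64e9 B/s host-to-GPU link.
kv = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2)
crit = kappa_crit(params=8e9, peak_flops=1e15, link_bytes_per_s=64e9, kv_bytes=kv)

# A request that reuses 100,000 cached tokens and prefills 500 new ones.
print(f"kappa_crit = {crit:.2f}, transfer-bound = {is_transfer_bound(100_000, 500, crit)}")
```

Under these assumed numbers the threshold is a single-digit ratio, while a long-context request that reuses a large cache and adds a short prompt has a ratio in the hundreds, which is the kind of gap the summary describes.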
