Bottlenecks for Efficient LLM Inference with KV Offloading
arXiv.org
· January 29, 2026
William Meng and co-authors (University of Pennsylvania; Intel) submitted an arXiv paper analysing KV cache offloading bottlenecks for long-context LLM inference.
- Main announcement: The paper derives $\kappa_{\text{crit}}$, the critical ratio of cached to prefill tokens at which execution becomes memory-bound, and reports that typical workloads exceed this threshold by orders of magnitude. Its empirical measurements show 99% of latency spent on transfers, with GPUs drawing only 28% of their rated TDP. The authors propose optimizations for hardware interconnects, model architectures, and scheduling algorithms.
- Background and details: Submitted to MLSys 2026 (arXiv:2601.19910, submitted 16 Dec 2025). The full-text PDF, HTML, and TeX source are available on arXiv, a DOI is issued via DataCite, and the paper is licensed under CC BY 4.0.
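
To make the idea of a critical cached-to-prefill ratio concrete, here is a minimal roofline-style sketch. It is not the paper's derivation: the summary above does not give the formula, so this assumes the simplest possible model, in which transfer time scales with the number of cached tokens and compute time scales with the number of prefill tokens. All function names and all hardware and model numbers are hypothetical, chosen only for illustration.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache per token: one key and one value vector per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def critical_ratio(flops_per_token, peak_flops, bytes_per_token, link_bandwidth):
    """Cached-to-prefill token ratio at which transfer time equals compute time.

    Assumed model (not taken from the paper):
      compute time  = prefill_tokens * flops_per_token / peak_flops
      transfer time = cached_tokens  * bytes_per_token / link_bandwidth
    Setting the two equal and solving for cached_tokens / prefill_tokens
    gives the ratio returned here.
    """
    compute_s_per_prefill_token = flops_per_token / peak_flops
    transfer_s_per_cached_token = bytes_per_token / link_bandwidth
    return compute_s_per_prefill_token / transfer_s_per_cached_token


def is_transfer_bound(cached_tokens, prefill_tokens, kappa_crit):
    """True when the request's cached-to-prefill ratio exceeds the threshold."""
    return cached_tokens / prefill_tokens > kappa_crit


# Hypothetical configuration: a model with about 8e9 parameters
# (roughly 2 FLOPs per parameter per token), 32 layers, 8 KV heads of
# dimension 128 in fp16, a 1e15 FLOP/s accelerator, and a 64e9 B/s host link.
bytes_tok = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
kappa = critical_ratio(
    flops_per_token=2 * 8e9,
    peak_flops=1e15,
    bytes_per_token=bytes_tok,
    link_bandwidth=64e9,
)
print(f"KV bytes per token: {bytes_tok}")
print(f"critical ratio (assumed model): {kappa:.2f}")
print("transfer-bound:", is_transfer_bound(100_000, 100, kappa))
```

Under these assumed numbers the threshold comes out in the single digits, while a request that reuses 100,000 cached tokens for 100 new ones has a ratio of 1,000. That gap is the kind of "orders of magnitude" excess the summary describes, although the paper's actual model and figures may differ.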