Near-Storage Processing Boosts Offline Inference for Long-Context LLMs

arXiv.org · February 09, 2026

HILOS framework introduces attention near storage to accelerate offline inference for long-context LLMs.

  • Main announcement: The authors present HILOS, a near-storage processing framework built around attention near storage. Three optimizations (cooperative X-cache, delayed KV cache writeback, and a memory-efficient attention accelerator) reduce interconnect traffic and KV cache I/O. Implemented and evaluated on a real system with 16 SmartSSDs, HILOS achieves up to 7.86x throughput and up to 85% lower energy consumption relative to state-of-the-art offloading-based inference frameworks.
  • Background and details: The paper has been submitted to ASPLOS’26 (arXiv:2502.09921, v2 revised on 6 Feb 2026), and source code is available at https://github.com/hongsunjang/HILOS. The evaluation runs on 16 SmartSSDs and reports performance and energy metrics against state-of-the-art offloading-based inference frameworks.
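
For context on why KV cache I/O and interconnect traffic matter for long-context inference, the following back-of-the-envelope sketch estimates KV cache size using the standard formula (keys plus values, per layer, per KV head, per token). The model shape and context length are hypothetical values chosen for illustration and are not taken from the HILOS paper; the function name `kv_cache_bytes` is likewise an assumption.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1,
                   bytes_per_elem: int = 2) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer.

    The leading factor of 2 accounts for storing both keys and values.
    bytes_per_elem=2 corresponds to FP16/BF16 storage.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape (NOT from the paper): 32 layers, 32 KV heads,
# head dimension 128, FP16 storage, one sequence of 131,072 tokens.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
total = kv_cache_bytes(32, 32, 128, seq_len=131072)

print(f"KV cache per token: {per_token / 2**10:.0f} KiB")
print(f"KV cache at 128K tokens: {total / 2**30:.0f} GiB")
```

Under these assumed parameters a single long sequence already exceeds the memory of most single GPUs, which is why offloading-based frameworks move the KV cache to host memory or storage, and why the volume of data crossing the interconnect becomes the quantity to reduce.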