MatKV trades Compute for Flash Storage in LLM Inference

arXiv.org · December 30, 2025

Kun-Woo Shin et al. (Seoul National University and Samsung Electronics) propose MatKV, a system that precomputes and materializes key-value (KV) vectors of RAG objects on flash storage to avoid repeated GPU-based KV computation at inference time.

Main announcement: The authors introduce MatKV, which precomputes KVs, stores them on inexpensive, fast, and power-efficient flash SSDs, and reuses them at inference time. They claim that MatKV halves both inference time and power consumption for RAG workloads. Experiments use Hugging Face's Transformers library across state-of-the-art GPUs and flash SSDs. (Paper submitted 20 Dec 2025; accepted for publication at ICDE 2026.)
Additional details / methods: The paper focuses on making the prefill phase efficient by materializing KVs. It demonstrates two optimizations: (1) concurrent decoding and KV loading, in which the GPU decodes the current instance while loading the materialized KVs for the next one to reduce load latency; and (2) using low-end GPUs for decoding once the KVs are loaded into GPU memory, with reported minimal impact on throughput and QA accuracy.
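
Illustrative sketch (not the authors' code): the idea of materializing KVs once and then overlapping the next load with the current decode can be mimicked in a few lines of standard-library Python. Everything here is an assumption made for illustration: `compute_kv` and `decode` are toy stand-ins for GPU prefill and decoding, a temporary directory plays the role of the flash SSD, and the function names are hypothetical rather than taken from the paper or from Transformers.

```python
import pickle
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def compute_kv(text):
    # Toy stand-in for the expensive prefill step: one fake "KV" pair per token.
    return [(len(tok), sum(map(ord, tok))) for tok in text.split()]


def materialize(store, doc_id, text):
    # Offline step: compute the KVs once and persist them to storage.
    path = store / f"{doc_id}.kv"
    with path.open("wb") as f:
        pickle.dump(compute_kv(text), f)
    return path


def load_kv(store, doc_id):
    # Online step: read precomputed KVs back instead of recomputing them.
    with (store / f"{doc_id}.kv").open("rb") as f:
        return pickle.load(f)


def decode(kv, question):
    # Toy stand-in for decoding against the loaded KVs.
    return f"{question}: {len(kv)} cached tokens"


def serve(store, requests):
    # requests is a list of (doc_id, question) pairs.
    answers = []
    if not requests:
        return answers
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_kv, store, requests[0][0])
        for i, (_, question) in enumerate(requests):
            kv = pending.result()
            if i + 1 < len(requests):
                # Start loading the next instance's KVs before decoding this one.
                pending = loader.submit(load_kv, store, requests[i + 1][0])
            answers.append(decode(kv, question))
    return answers


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        materialize(store, "doc1", "flash storage is cheap")
        materialize(store, "doc2", "GPUs are expensive to keep busy")
        return serve(store, [("doc1", "q1"), ("doc2", "q2")])
```

In a real system the loaded tensors would be handed to the model as a precomputed KV cache and the overlap would involve storage-to-GPU transfers; the sketch only shows the control flow of prefetching one request ahead.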
