LLM Inference Efficiency: Quantization, Batching, and Serving Strategies
arXiv.org
· February 02, 2026
A paper by Julien Delavande, Regis Pierrard, and Sasha Luccioni (arXiv:2601.22362) presents an empirical study of LLM inference energy and latency and reports specific system-level optimizations.
- Main announcement/action: The paper reports a detailed empirical evaluation on NVIDIA H100 GPUs covering quantization, batching, and serving with Hugging Face's Text Generation Inference (TGI) server, across both compute-bound and memory-bound phases of inference. It finds that system-level design choices (quantization, batch size, serving configuration, and request scheduling) can change inference energy by orders of magnitude. Notably, arrival shaping, meaning structured request timing, can reduce per-request energy by up to 100×.
- Background and details: The authors report that lower-precision formats yield energy gains only in compute-bound regimes, while batching improves energy efficiency particularly during memory-bound decoding. The paper advocates phase-aware energy profiling and system-level orchestration for sustainable LLM deployment. No monetary values or external funding amounts are announced.
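To make the arrival-shaping and batching claims concrete, here is a minimal toy sketch in Python. It is not the paper's methodology. The function `per_request_energy`, the fixed window batching rule, and the joule values are all hypothetical assumptions chosen for illustration: each batch is assumed to cost a fixed amount of energy plus a small marginal amount per request, so requests that arrive close together share the fixed cost.

```python
from collections import Counter

def per_request_energy(arrival_times_s, window_s, fixed_j, marginal_j):
    """Average energy per request in a toy batching model.

    Assumption (not from the paper): requests arriving in the same
    window of length window_s are served as one batch, and each batch
    costs fixed_j joules plus marginal_j joules per request.
    """
    if not arrival_times_s:
        raise ValueError("need at least one request")
    # Group requests by the index of the window they arrive in.
    batch_sizes = Counter(int(t // window_s) for t in arrival_times_s)
    total_j = sum(fixed_j + marginal_j * n for n in batch_sizes.values())
    return total_j / len(arrival_times_s)

# Hypothetical workloads: the same 8 requests, spread out vs. shaped into two bursts.
spread = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
shaped = [0.0, 0.1, 0.2, 0.3, 4.0, 4.1, 4.2, 4.3]

e_spread = per_request_energy(spread, window_s=1.0, fixed_j=100, marginal_j=5)
e_shaped = per_request_energy(shaped, window_s=1.0, fixed_j=100, marginal_j=5)
print(e_spread, e_shaped)  # 105.0 30.0
```

With these made-up numbers, shaping cuts per-request energy from 105 J to 30 J, a factor of 3.5. The gap widens as the fixed per-batch cost grows relative to the marginal cost, which is one plausible intuition for why request timing matters; the toy does not attempt to reproduce the up-to-100× figure reported in the paper.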