LLM Inference Efficiency: Quantization, Batching, and Serving Strategies

arXiv.org · February 02, 2026 · ✓ verified

A paper by Julien Delavande, Regis Pierrard, and Sasha Luccioni (arXiv:2601.22362) presents an empirical study of LLM inference energy and latency and examines how system-level design choices affect both.

  • Main announcement/action: The paper provides a detailed empirical evaluation on NVIDIA H100 GPUs showing that system-level design choices (quantization, batch size, serving configuration, and request scheduling) can produce orders-of-magnitude differences in inference energy. Notably, arrival shaping (structured request timing) can reduce per-request energy by up to 100×. The study evaluates quantization, batching, and Hugging Face’s Text Generation Inference server across the compute-bound and memory-bound phases of inference.
  • Background and details: The authors report that lower-precision formats yield energy gains only in compute-bound regimes, while batching improves energy efficiency, particularly during memory-bound decoding. The paper advocates phase-aware energy profiling and system-level orchestration for sustainable LLM deployment. No monetary values or external funding amounts are announced.
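
To make the arrival-shaping idea concrete, here is a minimal Python sketch under a deliberately simple assumption: each batch pays a fixed energy cost plus a small marginal cost per request, so grouping requests into shared time windows amortizes the fixed cost. The function names, the window length, and the joule figures are hypothetical illustrations, not the paper's method or measurements.

```python
from collections import defaultdict


def shape_arrivals(arrival_times, window_s):
    """Group arrival timestamps (seconds) into fixed windows.

    Each window is released to the server as one batch. This is a
    hypothetical shaping policy, not the one evaluated in the paper.
    """
    batches = defaultdict(list)
    for t in arrival_times:
        batches[int(t // window_s)].append(t)
    return [batches[k] for k in sorted(batches)]


def energy_per_request(batches, fixed_j_per_batch, marginal_j_per_request):
    """Toy energy model (assumed): fixed cost per batch plus a per-request cost."""
    total_requests = sum(len(b) for b in batches)
    total_energy = sum(
        fixed_j_per_batch + marginal_j_per_request * len(b) for b in batches
    )
    return total_energy / total_requests


# Illustrative arrival times in seconds (made up for this sketch).
arrivals = [0.1, 0.4, 0.9, 1.2, 1.7, 2.3, 2.8, 3.5]

# Unshaped: every request is served alone as soon as it arrives.
unshaped = [[t] for t in arrivals]

# Shaped: requests are held and released together every 2 seconds.
shaped = shape_arrivals(arrivals, window_s=2.0)

unshaped_j = energy_per_request(unshaped, fixed_j_per_batch=100.0, marginal_j_per_request=5.0)
shaped_j = energy_per_request(shaped, fixed_j_per_batch=100.0, marginal_j_per_request=5.0)

print(f"unshaped: {unshaped_j:.1f} J/request")
print(f"shaped:   {shaped_j:.1f} J/request")
```

With these assumed costs, the shaped schedule forms two batches (five and three requests) and the per-request figure drops from 105 J to 30 J. The size of the gain depends entirely on the assumed ratio of fixed to marginal cost, and holding requests for a window adds queueing latency, a trade-off this toy model does not capture.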