Energy Efficiency Sweet Spots in Production LLM Inference

arXiv.org · February 06, 2026 · ✓ verified

Hiari Pizzini Cavagna and co-authors present an analytical model that identifies LLM inference energy-efficiency “Sweet Spots” and validates it empirically.

  • Main announcement: The authors introduce an analytical model, derived from the compute and memory-access complexity of the Transformer architecture, and validate it with TensorRT-LLM on NVIDIA H100 GPUs. The validation covers models from 1B to 9B parameters (OPT, LLaMA, Gemma, Falcon, Qwen2, Granite) and input/output lengths from 64 to 4096 tokens, and reports a mean MAPE (mean absolute percentage error) of 1.79%. The paper was submitted on 5 Feb 2026 and is to appear at ICPE 2026.

  • Background and details: The evaluation uses production-oriented inference tooling (TensorRT-LLM) and characterizes empirical energy-consumption regimes: efficiency peaks with short-to-moderate inputs and medium-length outputs. The full text is available as PDF and HTML on arXiv, and the submission is released under CC BY 4.0. No operational deployment timeline or commercial contracts are announced.
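
For readers unfamiliar with the two quantities mentioned above, the following is a minimal Python sketch and not the authors' code. It shows how a MAPE between measured and model-predicted energy could be computed, and how a "sweet spot" could be located on a grid of input/output lengths. The function names `mape` and `sweet_spot` are hypothetical, the numbers are made-up placeholders rather than results from the paper, and the sketch assumes efficiency is scored as energy per generated token, which this summary does not specify.

```python
def mape(measured, predicted):
    """Mean absolute percentage error between two equal-length sequences, in percent."""
    if len(measured) != len(predicted) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs((m - p) / m) for m, p in zip(measured, predicted))
    return 100.0 * total / len(measured)


def sweet_spot(energy_joules):
    """Return the (input_len, output_len) pair with the lowest energy per output token.

    `energy_joules` maps (input_len, output_len) -> total energy for one request.
    Scoring by energy per generated token is an assumption of this sketch.
    """
    return min(energy_joules, key=lambda cfg: energy_joules[cfg] / cfg[1])


if __name__ == "__main__":
    # Placeholder values for illustration only; not taken from the paper.
    measured = [100.0, 200.0, 400.0]
    predicted = [98.0, 204.0, 400.0]
    print(f"MAPE: {mape(measured, predicted):.2f}%")

    grid = {(64, 64): 32.0, (64, 512): 128.0, (4096, 512): 512.0}
    print("Lowest energy per output token at:", sweet_spot(grid))
```

In practice the measured values would come from GPU power telemetry integrated over each request, and the grid would span the 64 to 4096 token range described above.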