NVIDIA software stack cuts token costs on Blackwell GPUs
NVIDIA · June 30, 2026 · ✓ verified
NVIDIA announces that its inference software stack reduces token cost and increases throughput on Blackwell GPUs.
- Main announcement: NVIDIA’s full-stack inference software for the Blackwell platform cut token costs on the DeepSeek V4 model by up to 5x in about one month. Combining system-level optimizations (disaggregated serving, large expert parallelism, NVFP4, multi-token prediction) can increase throughput by up to 20x. The blog also cites partner results: Baseten reports up to 50% more tokens/sec, and DigitalOcean / Hippocratic AI reports roughly 30% higher inference throughput while keeping time to first response under half a second. It also points to day-zero deployment recipes for vLLM and SGLang.
- Background and implementation details: The announcement explains that the stack connects three layers (Production Operation, Application Acceleration, and Infrastructure Access) and builds on open source frameworks (PyTorch, vLLM, SGLang) and runtimes (TensorRT-LLM, NVIDIA Dynamo). It names specific features and optimizations, including DFlash speculative decoding (up to 15x throughput), the NVLink interconnect, and NVFP4 precision, and notes that the improvements were observed in production-focused tests and partner deployments over about a month.
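
The throughput and token-cost figures above are linked by simple arithmetic, which the following minimal sketch illustrates. It assumes a fixed hourly GPU price and a fully utilized GPU, so cost per token falls in direct proportion to throughput. The function name, the hourly price, and the baseline throughput are hypothetical and are not taken from the announcement; real deployments also pay for idle capacity, networking, and latency targets, so reported cost reductions need not match throughput gains one to one.

```python
def cost_per_million_tokens(gpu_hour_usd: float, tokens_per_sec: float) -> float:
    """USD per one million generated tokens for one GPU at steady throughput."""
    if tokens_per_sec <= 0:
        raise ValueError("tokens_per_sec must be positive")
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hour_usd / tokens_per_hour * 1_000_000


# Hypothetical inputs chosen for illustration only.
GPU_HOUR_USD = 4.00
BASELINE_TOKENS_PER_SEC = 1_000.0

baseline = cost_per_million_tokens(GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC)

# Throughput multipliers of the same size as those quoted in the summary.
for speedup in (1.3, 1.5, 5.0, 20.0):
    improved = cost_per_million_tokens(
        GPU_HOUR_USD, BASELINE_TOKENS_PER_SEC * speedup
    )
    print(
        f"{speedup:>4.1f}x throughput -> ${improved:.4f} per 1M tokens "
        f"({baseline / improved:.1f}x cheaper than baseline)"
    )
```

Under these assumptions, the cost ratio equals the throughput ratio. Any gap between a quoted throughput gain and a quoted cost reduction would then come from factors the sketch leaves out, such as utilization or differences in hardware price.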