TRACE: Lossless Compression and Precision Scaling for CXL Bandwidth
arXiv.org · February 02, 2026
Rui Xie et al. (arXiv:2509.03377 v3) present TRACE, a device-internal layout and KV-specific transform that enables lossless compression and precision-proportional fetch to unlock effective CXL bandwidth for LLM inference.
- Main announcement: TRACE keeps the CXL interface unmodified and changes only the device-internal representation. Data is stored in a channel-major, disaggregated bit-plane layout, and a KV-specific transform is applied before compression. This enables precision-proportional fetch (reading only the bit-planes a request requires). The compression is lossless and reduces the BF16 weight footprint by 25.2% and the BF16 KV footprint by 46.9%, with per-layer KV compression ratios of up to 2.69×.
- Background and evaluation details: In the paper's system modeling, once KV spills to CXL, GPT-OSS-120B-MXFP4 throughput at 128k tokens improves from 16.28 to 68.99 tok/s (4.24×). DRAMSim3 shows up to 40.3% lower DRAM access energy under plane-aligned fetch. A 7 nm SystemVerilog implementation sustains 256 GB/s of device bandwidth. Relative to a CXL controller with generic inline lossless compression, TRACE adds 7.2% area, 4.7% power, and 6.0% load-to-use latency (evaluated at 2 GHz, 0.7 V).
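
The bit-plane layout and precision-proportional fetch described above can be illustrated with a small software sketch. It is not TRACE's hardware design. It only shows why storing bit p of every value together lets a reader fetch fewer planes when lower precision is acceptable. All function names here are hypothetical, BF16 conversion uses simple truncation of a float32 as an assumption, and the channel-major grouping and the KV-specific transform are omitted.

```python
import struct

BITS = 16  # BF16: 1 sign bit, 8 exponent bits, 7 mantissa bits


def to_bf16_bits(x: float) -> int:
    """Return the upper 16 bits of the float32 encoding (truncation, no rounding)."""
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16


def from_bf16_bits(b: int) -> float:
    """Expand a 16-bit BF16 pattern back to a Python float."""
    (f,) = struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))
    return f


def to_bit_planes(words: list[int]) -> list[int]:
    """Split 16-bit words into 16 planes; bit i of planes[p] is bit p of words[i]."""
    planes = [0] * BITS
    for i, w in enumerate(words):
        for p in range(BITS):
            planes[p] |= ((w >> p) & 1) << i
    return planes


def fetch(planes: list[int], count: int, top_planes: int) -> list[int]:
    """Rebuild `count` words from only the `top_planes` most significant planes.

    Planes that are not read contribute zero bits, so low mantissa bits are dropped.
    """
    words = [0] * count
    for p in range(BITS - top_planes, BITS):
        for i in range(count):
            words[i] |= ((planes[p] >> i) & 1) << p
    return words


# Sample values chosen to be exactly representable in BF16.
values = [1.0, -2.5, 1.0078125, 3.140625]
words = [to_bf16_bits(v) for v in values]
planes = to_bit_planes(words)

full = fetch(planes, len(words), 16)     # all planes: bit-exact round trip
reduced = fetch(planes, len(words), 12)  # top 12 planes: low 4 mantissa bits dropped
print([from_bf16_bits(w) for w in full])
print([from_bf16_bits(w) for w in reduced])
```

In this sketch, reading all 16 planes reproduces the stored words exactly, while reading 12 of 16 planes touches 75% of the plane data and returns values with a shortened mantissa. Separately, planes such as the sign and high exponent bits tend to be more regular across neighbouring values than interleaved words, which is the usual motivation for compressing planes individually. How TRACE actually selects and compresses planes is described in the paper and is not modelled here.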