STAR spatial accelerator for sparse Transformer attention
arXiv.org
· December 24, 2025
The authors propose STAR, an algorithm-hardware co-designed accelerator architecture for sparse attention in Transformer-based large language model inference under large-scale token parallelism (LTPP).
- STAR combines leading-zero-based sparsity prediction, distributed sorting, and a sorted-updating FlashAttention mechanism with coordinated cross-stage tiling. Together these reduce redundant computation, memory access, and latency, and improve compute and energy efficiency over existing dynamic-sparsity accelerators and the NVIDIA A100.
- A dedicated STAR accelerator and a multi-core spatial architecture, Spatial-STAR, are evaluated. Reported results include up to 9.2× speedup and 71.2× higher energy efficiency than the A100, up to 16.1× energy-efficiency and 27.1× area-efficiency gains over state-of-the-art accelerators, and a 20.1× throughput improvement for ultra-long sequence processing relative to a baseline spatial design.
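
To make the idea of leading-zero-based sparsity prediction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' design: the summary does not specify STAR's predictor, so the 8-bit operand width, the power-of-two rounding, and every function name below are hypothetical. The sketch rounds each integer operand down to a power of two using its leading-zero count, so each approximate product becomes a single shift, and then keeps the keys with the largest estimated scores.

```python
from typing import List

BITS = 8  # assumed operand width; a hypothetical choice for illustration


def leading_zeros(x: int, bits: int = BITS) -> int:
    """Count the leading zeros of |x| in a `bits`-wide unsigned field."""
    mag = abs(x)
    if mag >= (1 << bits):
        raise ValueError("magnitude does not fit in the assumed bit width")
    return bits - mag.bit_length()


def approx_product(q: int, k: int, bits: int = BITS) -> int:
    """Approximate q * k by rounding each magnitude down to a power of two.

    The exponent of each operand is (bits - 1 - leading_zeros), so the
    product reduces to adding two exponents and performing one shift.
    """
    if q == 0 or k == 0:
        return 0
    exp = (bits - 1 - leading_zeros(q, bits)) + (bits - 1 - leading_zeros(k, bits))
    sign = -1 if (q < 0) != (k < 0) else 1
    return sign * (1 << exp)


def predict_scores(query: List[int], keys: List[List[int]], bits: int = BITS) -> List[int]:
    """Estimate one attention score per key using approximate products only."""
    return [
        sum(approx_product(q, k, bits) for q, k in zip(query, key))
        for key in keys
    ]


def top_k_indices(scores: List[int], k: int) -> List[int]:
    """Return the indices of the k largest estimated scores, largest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


if __name__ == "__main__":
    query = [5, 3]
    keys = [[7, 1], [1, 6], [-8, 2]]
    scores = predict_scores(query, keys)
    print(scores, top_k_indices(scores, 2))
```

Only the keys selected by `top_k_indices` would then go through exact attention. Returning them in descending score order is a software stand-in for the sorting stage the summary mentions; how STAR distributes that sort in hardware is not described here.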