STAR spatial accelerator for sparse Transformer attention

arXiv.org · December 24, 2025

The authors propose STAR, an algorithm-hardware co-designed accelerator architecture for sparse attention in Transformer-based large language model inference under large-scale token parallelism (LTPP).

  • STAR introduces leading-zero-based sparsity prediction, distributed sorting, and a sorted updating FlashAttention mechanism, all tied together by coordinated cross-stage tiling. Together these reduce redundant computation, memory access, and latency, and improve compute and energy efficiency relative to existing dynamic sparsity accelerators and the NVIDIA A100.
  • The authors evaluate a dedicated STAR accelerator and a multi-core spatial architecture, Spatial-STAR. They report up to 9.2× speedup and 71.2× higher energy efficiency than the A100, up to 16.1× energy-efficiency and 27.1× area-efficiency gains over state-of-the-art accelerators, and a 20.1× throughput improvement for ultra-long sequence processing compared with a baseline spatial design.
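
The summary names the mechanisms without describing how they work, so the following is only a rough software sketch of one plausible reading, not the authors' design. It assumes 8-bit integer operands, approximates each operand by its leading power of two (derived from its leading-zero count) to rank keys cheaply, and then accumulates the softmax over the selected keys in descending predicted order so that the running maximum is fixed by the first score. All function names (`leading_zeros`, `pow2_floor`, `estimate_score`, `predict_topk`, `softmax_sorted`, `sparse_attention`) and the 8-bit width are illustrative assumptions; tiling and distributed sorting are not modeled.

```python
import math

def leading_zeros(x, bits=8):
    """Leading-zero count of |x| in a `bits`-wide unsigned field (assumed width)."""
    mag = abs(x)
    if mag >= (1 << bits):
        raise ValueError("magnitude does not fit in `bits` bits")
    return bits - mag.bit_length()

def pow2_floor(x, bits=8):
    """Round |x| down to its leading power of two, keeping the sign; 0 stays 0."""
    if x == 0:
        return 0
    exponent = bits - 1 - leading_zeros(x, bits)
    mag = 1 << exponent
    return mag if x > 0 else -mag

def estimate_score(q, k, bits=8):
    """Cheap estimate of dot(q, k): every product is a product of powers of two,
    which hardware could realize as a shift instead of a multiply."""
    return sum(pow2_floor(a, bits) * pow2_floor(b, bits) for a, b in zip(q, k))

def predict_topk(q, keys, k, bits=8):
    """Indices of the k keys with the largest estimated score, in descending order."""
    est = [estimate_score(q, key, bits) for key in keys]
    order = sorted(range(len(keys)), key=lambda i: est[i], reverse=True)
    return order[:k]

def softmax_sorted(scores, values):
    """Softmax-weighted sum of `values`, using the first score as the fixed shift.
    If scores arrive in descending order, the first one is the maximum, so no
    running-max update or accumulator rescaling is needed. The result is still
    mathematically correct when the order is only approximately descending."""
    shift = scores[0]
    denom = 0.0
    acc = [0.0] * len(values[0])
    for s, v in zip(scores, values):
        w = math.exp(s - shift)
        denom += w
        acc = [a + w * x for a, x in zip(acc, v)]
    return [a / denom for a in acc]

def sparse_attention(q, keys, values, k, bits=8):
    """Predict the top-k keys cheaply, then compute exact attention over them only."""
    idx = predict_topk(q, keys, k, bits)
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in idx]
    return softmax_sorted(scores, [values[i] for i in idx])

if __name__ == "__main__":
    q = [5, -3, 12, 0]
    keys = [[4, 4, 4, 4], [-8, 1, -16, 2], [7, -6, 9, 1], [1, 1, 1, 1]]
    print(predict_topk(q, keys, 2))  # [2, 0]
```

In this toy example the power-of-two estimate ranks the keys in the same order as the exact dot products, but that is not guaranteed in general; a real predictor trades some ranking accuracy for avoiding full-precision multiplies on keys that are later discarded.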