Abstract: The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention incurs heavy computational and memory burdens. Sparse attention techniques, including both static and dynamic sparsity, reduce this quadratic complexity by computing attention over only a subset of query-key pairs. Static and dynamic methods exhibit a tradeoff between efficiency and adaptability, making them suitable for different scenarios. However, existing accelerators either target specific domains or suffer performance degradation on long sequences, and none of them supports static and dynamic sparse attention mechanisms simultaneously. To this end, we propose SALO2, a hardware–software co-design framework that enables efficient static and dynamic sparse attention computation and can be applied to various scenarios, tasks, and inputs. Experiments show that SALO2 achieves $104.80\times$, $13.65\times$, and $1.38\times$ speedup over an Intel Xeon CPU, an NVIDIA RTX 4090 GPU, and SALO (the state-of-the-art accelerator exploiting static sparsity) on tasks with long input sequences, and $76.17\times$, $8.98\times$, and $1.71\times$ speedup over an Intel Xeon CPU, an NVIDIA RTX 4090 GPU, and Sanger (the state-of-the-art accelerator exploiting dynamic sparsity) on tasks with shorter sequences. The source code is available at https://github.com/sjtu-zhao-lab/SALO.git.
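To make the distinction between static and dynamic sparsity concrete, the following minimal Python sketch (not SALO2's implementation; the window size, threshold quantile, and function names are illustrative assumptions) masks out query-key pairs with a fixed sliding-window pattern versus an input-dependent, score-based pattern:

```python
# Illustrative sketch of static vs. dynamic sparse attention masks.
# A real accelerator would skip the masked-out query-key pairs entirely
# rather than computing dense scores and masking them afterwards.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention(Q, K, V, mask):
    # Dense scores are computed here only for clarity of exposition.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V

n, d, w = 8, 16, 2                      # sequence length, head dim, window radius (assumed)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Static sparsity: a fixed pattern (here a sliding window), independent of the input.
idx = np.arange(n)
static_mask = np.abs(idx[:, None] - idx[None, :]) <= w

# Dynamic sparsity: a pattern decided at run time from the input,
# e.g. keeping only query-key pairs whose approximate score is large.
approx = Q @ K.T / np.sqrt(d)
dynamic_mask = approx > np.quantile(approx, 0.75, axis=-1, keepdims=True)

out_static = sparse_attention(Q, K, V, static_mask)
out_dynamic = sparse_attention(Q, K, V, dynamic_mask)
```

The static mask can be fixed at compile time and maps naturally to a regular dataflow, while the dynamic mask adapts to each input at the cost of an extra prediction step; this is the efficiency-versus-adaptability tradeoff the abstract refers to.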