Towards Interpretable and Efficient Attention: Compressing All by Contracting a Few
Abstract: Attention mechanisms in Transformers have achieved significant empirical success, yet the mathematical principles from which they derive remain unclear. Meanwhile, self-attention in Transformers incurs a heavy computational burden due to its quadratic complexity. Unlike prior work that addresses interpretability or efficiency separately, in this paper we propose a unified optimization objective, which compresses all inputs by contracting a few representatives of them, and derive from it an attention mechanism that tackles both issues simultaneously. Specifically, we mathematically derive an interpretable and efficient self-attention operator by unfolding gradient-based optimization steps on the proposed objective; the operator contracts the representatives and broadcasts the contractions back to all inputs to yield compact and structured representations. We therefore refer to it as Contract-and-Broadcast Self-Attention (CBSA). Given a fixed number of representatives, the computational overhead of CBSA scales linearly with the number of input tokens. Moreover, by specifying different sets of representatives, we can derive more efficient variants at the cost of some expressive capacity. In particular, we demonstrate that: a) full attention derives from CBSA when the inputs are self-expressive; b) channel attention derives from CBSA when the representatives are fixed and orthogonal. We conduct extensive experiments on both synthetic and real-world data, and the results demonstrate the effectiveness of the proposed CBSA.
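To make the contract-and-broadcast idea and its linear scaling concrete, below is a minimal sketch in PyTorch. It is not the paper's operator (which is obtained by unrolling gradient steps on the compression objective); the function name cbsa_sketch, the softmax-based cross-attention form, the temperature tau, and the residual update are all illustrative assumptions. It only shows the two phases, contraction of m representatives from n tokens followed by a broadcast back to all tokens, with O(n·m·d) cost that is linear in n for fixed m.

```python
import torch
import torch.nn.functional as F

def cbsa_sketch(X, R, tau=1.0):
    """Illustrative (hypothetical) Contract-and-Broadcast sketch.

    X: (n, d) input tokens; R: (m, d) representatives with m << n.
    Cost is O(n * m * d): linear in the number of tokens n
    for a fixed number of representatives m.
    """
    # Contract: each representative aggregates (compresses) all tokens
    # according to token-to-representative affinities.
    A = F.softmax(R @ X.T / tau, dim=-1)   # (m, n) affinities
    R_contracted = A @ X                   # (m, d) contracted representatives

    # Broadcast: every token is updated from the contracted representatives,
    # propagating the compression of a few back to all inputs.
    B = F.softmax(X @ R_contracted.T / tau, dim=-1)   # (n, m)
    X_out = X + B @ R_contracted                      # (n, d) residual update
    return X_out

# Usage: 1024 tokens with 64-dim features and 16 representatives.
X = torch.randn(1024, 64)
R = torch.randn(16, 64)
Y = cbsa_sketch(X, R)
print(Y.shape)  # torch.Size([1024, 64])
```

Under this sketch, letting the tokens themselves serve as the representatives (R = X) recovers a full-attention-like update with quadratic cost, which mirrors the abstract's claim that full attention is a special case of CBSA.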