- Abstract: Transformers employ dense attention mechanisms over text, which can fail to capture or utilize the strong intrinsic structures present in natural language. This paper presents the Combiner model, a new Transformer architecture that learns tree-structured attention patterns inductively from language. Instead of relying on dense or pre-specified structures, Combiner automatically learns tree-structured attention connections using a novel sparse residual attention mechanism. It first employs a sparsity-inducing gate that learns to prune attention connections in each network layer, determining which nodes to combine. The learned connections are then propagated through layers using hierarchical attention blocks, which combine the sub-tree nodes in a bottom-up manner. Our experiments demonstrate the robust modeling performance of Combiner and the usefulness of the structures it learns on various information retrieval and unsupervised sentence parsing tasks. By leveraging search session structures, Combiner outperforms other pre-trained Transformers in generative query suggestion. Moreover, the learned tree structures align well with linguistic structures and improve the previous state-of-the-art in unsupervised constituency parsing by 8 points in average sentence-level F1.
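To make the idea of a sparsity-inducing attention gate concrete, here is a minimal NumPy sketch. It is not the paper's learned gate: as a hypothetical stand-in, each query simply keeps its top-`keep` attention connections and masks the rest before renormalizing, illustrating how pruned connections restrict which nodes get combined. The function name `gated_sparse_attention` and the `keep` parameter are illustrative assumptions, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_sparse_attention(q, k, v, keep=2):
    """Toy sketch: prune each query's attention to its top-`keep`
    scores (a hypothetical stand-in for Combiner's learned
    sparsity-inducing gate), renormalize, and combine values."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # gate: keep the top-`keep` scores per row, mask the rest to -inf
    thresh = np.sort(scores, axis=-1)[:, -keep][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    weights = softmax(masked, axis=-1)  # rows sum to 1 over kept links
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))   # 4 query nodes
k = rng.normal(size=(6, 8))   # 6 key nodes
v = rng.normal(size=(6, 8))
out, w = gated_sparse_attention(q, k, v, keep=2)
```

After the gate, each query attends to exactly `keep` nodes; in the full model such pruned connections would be propagated upward through the hierarchical attention blocks.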