Abstract: Fine-tuning transformer models for multi-document summarization is a widely applied approach due to their ability to capture complex relationships across documents. However, full-attention transformer models often struggle with the long-sequence problem, where computational complexity grows quadratically with sequence length. In addition, the optimization (fine-tuning) cost of transformers is high. To address these challenges, we propose a novel vertical scaling approach, in which we conditionally factorize the multi-document output probability into lower-complexity components. These components are estimated by estimators optimized on single-document data. Unlike the full-attention approach, vertical scaling has a complexity that grows linearly with the number of single documents, making it more efficient for long documents or large collections of documents. To further enhance the efficiency and effectiveness of our approach, we introduce the Multi-Channel Attention architecture. This architecture allows us to fully reuse BART's single-document pre-optimized parameters without re-optimization, enabling a zero-cost transition. Our approach maintains promising accuracy and computational efficiency. We publish our implementation and related data at https://github.com/nbtpj/MCA.
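To make the factorization idea concrete, the following is a minimal sketch of how a multi-document output distribution could be assembled from single-document components of a pre-optimized BART model, with no re-optimization. The checkpoint name, the uniform averaging rule, and the greedy decoding loop are illustrative assumptions, not the paper's exact formulation; see the linked repository for the actual implementation.

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

# Hypothetical sketch: approximate p(y | d_1, ..., d_k) by mixing the per-document
# next-token distributions of a single-document summarizer.
model_name = "facebook/bart-large-cnn"  # single-doc pre-optimized BART (assumed checkpoint)
tokenizer = BartTokenizer.from_pretrained(model_name)
model = BartForConditionalGeneration.from_pretrained(model_name).eval()

documents = ["First source document ...", "Second source document ..."]

# Encode each document independently: cost grows linearly with the number of documents.
enc = [tokenizer(d, return_tensors="pt", truncation=True) for d in documents]
encoder_outs = [model.get_encoder()(**e) for e in enc]

decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
max_len = 60

with torch.no_grad():
    for _ in range(max_len):
        # One decoder pass per document, each attending only to its own encoding.
        step_probs = []
        for e, enc_out in zip(enc, encoder_outs):
            out = model(
                attention_mask=e["attention_mask"],
                encoder_outputs=enc_out,
                decoder_input_ids=decoder_ids,
            )
            step_probs.append(out.logits[:, -1, :].softmax(-1))
        # Combine the single-document components (uniform average is an assumption).
        mixed = torch.stack(step_probs).mean(0)
        next_id = mixed.argmax(-1, keepdim=True)
        decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
        if next_id.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_ids[0], skip_special_tokens=True))
```

Because each document is encoded and attended to separately, memory and compute scale with the number of documents rather than with the square of the concatenated sequence length, and the pre-trained single-document parameters are used as-is.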