Attend, Select and Eliminate: Accelerating Multi-turn Response Selection with Dual-attention-based Content EliminationDownload PDF

Anonymous

16 Oct 2022 (modified: 05 May 2023)ACL ARR 2022 October Blind SubmissionReaders: Everyone
Keywords: response selection, token elimination, accelerate inference, dual-attention
Abstract: Although the incorporation of pre-trained language models (PLMs) significantly pushes the research frontier of multi-turn response selection, it brings a new issue of heavy computation costs. To alleviate this problem and make the PLM-based response selection model both effective and efficient, we propose an inference framework together with a post-training strategy that builds upon any pre-trained transformer-based response selection models to accelerate inference by progressively selecting and eliminating unimportant content under the guidance of context-response dual-attention.Specifically, at each transformer layer, we first identify the importance of each word based on context-to-response and response-to-context attention, then select a number of unimportant words to be eliminated following a retention configuration derived from evolutionary search while passing the rest of the representations into deeper layers.To mitigate the training-inference gap posed by content elimination, we introduce a post-training strategy where we use knowledge distillation to force the model with progressively eliminated content to mimic the predictions of the original model with no content elimination.Experiments on three benchmarks indicate that our method can effectively speeds-up SOTA models without much performance degradation and shows a better trade-off between speed and performance than previous methods.
Paper Type: long
Research Area: Dialogue and Interactive Systems
0 Replies

Loading