Abstract: Natural language processing applications require fast inference from pre-trained models such as BERT. Dynamic early exiting is one of the most common methods for accelerating inference, but it has two primary limitations: it is inefficient because it does not reuse repetitive representations, and it is not flexible enough to adapt inference speed to different application scenarios. In this paper, we propose SAP-BERT, a novel approach designed to overcome these issues. SAP-BERT improves efficiency by integrating a skip-layer computation module with a cache mechanism that reuses repetitive representations. To address the inflexibility problem, we introduce an adaptive patient early exiting mechanism that combines a patience counter with confidence scores to dynamically adjust inference speed, which further improves inference efficiency. Experimental results demonstrate that SAP-BERT achieves an 82% speedup in BERT inference while maintaining 99% accuracy.
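The abstract does not specify how the patience counter and the confidence scores are combined. As a minimal sketch of one plausible reading, the code below exits once the same label has been predicted with confidence at or above a threshold for a given number of consecutive layers. The function name `early_exit_layer`, the parameters `patience` and `threshold`, and the combination rule itself are assumptions for illustration, not the SAP-BERT implementation; the skip-layer cache is not modeled.

```python
from typing import List, Sequence, Tuple


def early_exit_layer(
    layer_probs: Sequence[Sequence[float]],
    patience: int = 2,
    threshold: float = 0.9,
) -> Tuple[int, int]:
    """Return (exit_layer_index, predicted_label).

    layer_probs[i] holds the class probabilities produced by a
    hypothetical classifier attached to layer i. The exit rule is an
    assumption: stop when the same label has been predicted with
    confidence >= threshold for `patience` consecutive layers.
    """
    if not layer_probs:
        raise ValueError("layer_probs must contain at least one layer")

    streak = 0          # consecutive confident, agreeing predictions
    prev_label = None   # label predicted by the previous layer
    label = 0
    for i, probs in enumerate(layer_probs):
        label = max(range(len(probs)), key=probs.__getitem__)
        confident = probs[label] >= threshold
        if confident and label == prev_label:
            streak += 1
        else:
            # Restart the streak; a confident prediction counts as 1.
            streak = 1 if confident else 0
        prev_label = label
        if streak >= patience:
            return i, label

    # No early exit: fall back to the final layer's prediction.
    return len(layer_probs) - 1, label


# Illustrative per-layer probabilities (made up, not experimental data).
example: List[List[float]] = [
    [0.60, 0.40],
    [0.05, 0.95],
    [0.03, 0.97],
    [0.01, 0.99],
]
exit_layer, predicted = early_exit_layer(example, patience=2, threshold=0.9)
```

In this sketch, raising `patience` or `threshold` delays the exit in favor of accuracy, while lowering either one trades accuracy for speed, which is the kind of adjustable speed the abstract describes.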
External IDs: dblp:conf/iconip/JiangTRPQTJ24