Self-Attention Augmentation with Smoothed Noise Injection for Enhancing Transformer Fine-Tuning on Temporally Structured Health Data

TMLR Paper 3953 Authors

11 Jan 2025 (modified: 20 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: Pre-training Transformer models on self-supervised tasks and fine-tuning them on downstream tasks, even with limited labeled samples, has achieved state-of-the-art performance across various domains. However, learning effective representations from complex, temporally structured health data and fine-tuning for clinical risk prediction remains challenging. While self-attention mechanisms are powerful for capturing relationships within sequences, they can struggle to model intricate dependencies in event sequences, especially when training data is limited. Existing solutions often rely on expensive modifications to the pre-training phase. In this work, we propose a novel method, Smoothed Noise Injection Self-attention Augmentation (SNSA), to augment Transformer models during training. Our approach encourages the self-attention mechanism to effectively learn complex dependencies within input sequences. This is achieved by injecting noise into the self-attention and then smoothing it by convolving with a 2D Gaussian kernel. The injected noise perturbs the attention between events, encouraging the model to explore diverse attention patterns, while the Gaussian smoothing adaptively filters this noise, allowing the model to focus on more relevant events within each sequence. With SNSA, we observe enhanced model performance on downstream tasks. Furthermore, our method sheds light on the model's ability to learn complex relations within a sequence of medical events, providing valuable insights into its behavior within the attention mechanism.
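To make the mechanism concrete, the sketch below illustrates one plausible reading of the abstract: Gaussian noise is added to the pre-softmax attention scores and smoothed over the (query, key) grid with a 2D Gaussian kernel. This is not the authors' implementation; the placement of the perturbation, the kernel size, and the hyperparameters `noise_std`, `smooth_sigma`, and `kernel_size` are all assumptions introduced here for illustration (the paper's σ_eh presumably plays a related role, but its exact definition is not given in this abstract).

```python
# Hypothetical sketch of SNSA-style smoothed noise injection into self-attention.
# Assumptions: noise is added to pre-softmax scores and smoothed with a 2D Gaussian.
import math
import torch
import torch.nn.functional as F


def gaussian_kernel2d(size: int = 5, sigma: float = 1.0) -> torch.Tensor:
    """Build a normalized 2D Gaussian kernel of shape (1, 1, size, size)."""
    coords = torch.arange(size, dtype=torch.float32) - (size - 1) / 2.0
    g = torch.exp(-(coords ** 2) / (2 * sigma ** 2))
    kernel = torch.outer(g, g)
    kernel /= kernel.sum()
    return kernel.view(1, 1, size, size)


def snsa_attention(q, k, v, noise_std=0.1, smooth_sigma=1.0, kernel_size=5, training=True):
    """q, k, v: (batch, heads, seq_len, head_dim); returns attention output of the same shape."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)  # (B, H, L, L) attention scores

    if training:
        # 1) Perturb the attention between events with Gaussian noise.
        noise = noise_std * torch.randn_like(scores)
        # 2) Smooth the noise with a 2D Gaussian kernel so the perturbation
        #    is spatially correlated over the attention map.
        b, h, seq_len, _ = noise.shape
        kernel = gaussian_kernel2d(kernel_size, smooth_sigma).to(noise.device)
        noise = F.conv2d(noise.view(b * h, 1, seq_len, seq_len), kernel, padding=kernel_size // 2)
        scores = scores + noise.view(b, h, seq_len, seq_len)

    attn = scores.softmax(dim=-1)
    return attn @ v
```

In this reading, the augmentation is applied only during training; at inference the standard scaled dot-product attention is recovered by setting `training=False`.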
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. Title Updated: The title has been revised for clarity and to better reflect the contribution of the proposed method.
2. Introduction Improved: Paragraphs 4, 5, and 6 of the introduction have been rewritten to better articulate the motivation, describe the limitations of previous work, and explain the relevance of SNSA.
3. Figure 1 Revised: Updated for improved clarity and alignment with the revised introduction.
4. Related Work Expanded: This section has been revised to better contextualize SNSA, with deeper discussion of prior methods and their limitations.
5. Clarified SEP Selection: Additional explanation on the selection of the structured event prediction (SEP) task is provided in the problem formulation.
6. Baselines Moved: The discussion of baseline models has been moved to the appendix, as suggested.
7. Figure 3 Updated: Gridlines have been added to improve readability of AUC values.
8. Section Reordering: Sections 5.6 and 5.7 have been reordered and rewritten for improved coherence and flow.
9. New Figure Added: Figure 4 has been introduced to illustrate attention structure limitations in pre-trained Transformers. Section 5.8 was rewritten to describe and analyze this figure.
10. Implementation Details Added: Details on pretraining and fine-tuning have been added in Appendix A.1.
11. Hyperparameter Sensitivity Analysis: A new section (A.3) has been added to provide sensitivity analysis of the key hyperparameter σ_eh.
12. Justification of SNSA: Appendix Sections A.5 and A.6 provide theoretical and empirical justification for why and how SNSA improves attention behavior and downstream performance.
Assigned Action Editor: ~Tianbao_Yang1
Submission Number: 3953
