Abstract: In many deep-learning tasks, performance improvements have been achieved by fully fine-tuning pre-trained models for downstream tasks. Numerous studies insert an additional layer into a pre-trained model when designing a model for fine-tuning. This additional layer helps optimize the pre-trained model for downstream tasks. In some cases, the additional layer may need to be inserted between the existing middle layers of the pre-trained model. However, most studies add the additional layer outside the pre-trained model, because inserting a layer between the pre-trained layers can cause performance degradation. In this study, we assume the following reason for this degradation: initializing the additional layer with an existing, randomly characterized initialization method and applying an activation function changes the output values. We experimentally verified this assumption by varying the number of additional layers and the activation functions. To address this problem, we propose a methodology that initializes the additional layer as a unit tensor and modifies how the activation function is applied, so that the output vector is not modified during the initial stage of full fine-tuning. We conducted experiments on various NLP and CV datasets to verify whether the proposed methodology solves this problem. The code used for the experiments is available on GitHub.
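Illustrative sketch (not from the submission): one way to realize "unit-tensor initialization with a modified activation" for a layer inserted between pre-trained layers is shown below. The module name, the square linear shape, and the zero-initialized gate on the activation branch are assumptions made for illustration only; the abstract does not specify these details.

import torch
import torch.nn as nn

class IdentityInitAdapter(nn.Module):
    """Hypothetical layer inserted between pre-trained layers.

    The linear weight starts as the identity matrix (a "unit tensor") and the
    bias as zeros; the activation branch is scaled by a gate initialized to
    zero, so the module is an exact identity map at the start of fine-tuning.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        nn.init.eye_(self.linear.weight)          # identity ("unit tensor") initialization
        nn.init.zeros_(self.linear.bias)
        self.gate = nn.Parameter(torch.zeros(1))  # activation contributes nothing at init
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.linear(x)                 # equals x at initialization
        return h + self.gate * self.act(h) # output == input while gate == 0

Under these assumptions, inserting such a module between two pre-trained transformer blocks leaves the model's outputs unchanged at the first fine-tuning step; the weights and gate then move away from identity as training proceeds.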
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: fine-tuning
Contribution Types: NLP engineering experiment
Languages Studied: English
Section 2 Permission To Publish Peer Reviewers Content Agreement: Authors grant permission for ACL to publish peer reviewers' content
Submission Number: 75