Is it Fine to Tune? Evaluating SentenceBERT Fine-tuning for Brazilian Portuguese Text Stream Classification

Published: 01 Jan 2024, Last Modified: 23 Jun 2025 · IEEE Big Data 2024 · CC BY-SA 4.0
Abstract: Pre-trained language models (LMs) have been used in several scenarios and data mining tasks due to their good-quality representations and their readiness for use. Although LMs constitute a significant gain in usability, they are frequently used statically over time, meaning that these models can suffer from concept drift and semantic shift, which correspond to changes in data distribution and word meanings, respectively. These phenomena become more noticeable as new texts gradually arrive. This paper evaluates the impact of updating pre-trained SentenceBERT models over time on a Brazilian news post classification task in a text streaming fashion, a paradigm suitable for learning from data streams. We update the SBERT model yearly with a reduced number of recent posts and compare it with scenarios using static LMs. We used an adaptive random forest classifier and evaluated it in terms of macro F1-score and elapsed time. The experimental results show that regularly leveraging sampled texts from the recent past to fine-tune LMs can improve performance metrics over time, reaching better results than static LMs in most of the years analyzed. We also evaluated run times, and the results suggest that fine-tuning LMs over time provides a good trade-off between performance and run time.
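The abstract describes a streaming (prequential, test-then-train) evaluation measured with macro F1-score. As a minimal stdlib-only sketch of that protocol, the snippet below uses a hypothetical `MajorityClassifier` as a stand-in for the paper's adaptive random forest over SBERT embeddings (the SBERT fine-tuning step itself is omitted); the function names are illustrative, not from the paper's code.

```python
from collections import Counter, defaultdict

def macro_f1(y_true, y_pred):
    """Macro F1: unweighted mean of per-class F1 scores."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in labels:
        prec = tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if (tp[c] + fn[c]) else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if (prec + rec) else 0.0)
    return sum(f1s) / len(f1s)

class MajorityClassifier:
    """Illustrative stand-in for the adaptive random forest used in the paper."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        # Predict the most frequent label seen so far (or 0 before any data).
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def learn(self, x, y):
        self.counts[y] += 1

def prequential_eval(stream, model):
    """Test-then-train: predict each arriving item, then update the model."""
    trues, preds = [], []
    for x, y in stream:
        preds.append(model.predict(x))
        model.learn(x, y)
        trues.append(y)
    return macro_f1(trues, preds)
```

In the paper's setting, `x` would be an SBERT embedding of a news post, and the encoder producing it would be fine-tuned yearly on a sample of recent texts before the stream continues.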