Performance Metrics for Multilabel Emotion Classification: Comparing Micro, Macro, and Weighted F1-Scores

Maria Cristina Hinojosa Lee; Johan Braet; Johan Springael

Published: 15 Oct 2025, Last Modified: 30 Apr 2026. Venue: BNAIC/BeNeLearn 2025 Poster. License: CC BY 4.0

Track: Type B (Encore Abstracts)

Keywords: Emotion analysis, Multilabel classification, F1-score, performance metrics

Abstract: We compare micro, macro, and weighted F1-scores for multilabel emotion classification under class imbalance. Using distilled lexicons built from GoEmotions and XED, we compute the three variants at eleven thresholds and cross-evaluate each lexicon on the opposite dataset, as well as on aggregated versions of both datasets. Each measure offers a different picture of model performance: weighted F1 offers the most balanced assessment, micro F1 overestimates performance, and macro F1 penalizes majority classes. We recommend reporting all three performance measures.

1 Introduction

Emotion analysis is used in diverse fields such as recommender systems, health, marketing, and e-learning. Because several emotions can co-occur in a single message, the task is framed as multilabel classification [3,1]. The F1-measure variants (micro, macro, and weighted) are popular for evaluating these tasks and for comparing results across studies [1,5,3]. However, the literature often fails to justify which variant is reported. When class frequencies diverge, the three variants can rank the same system differently, which jeopardizes fair comparison between models and downstream decisions [3]. The contribution of this study is to provide insights into how the F1-score variants compare under class imbalance and how they could improve the evaluation of multilabel emotion classifiers.

2 Data & Methodology

Two popular corpora were used for this study.

– GoEmotions: 58,000 manually annotated English Reddit comments, originally labelled with 27 fine-grained emotions.
The GoEmotions authors grouped the emotions hierarchically into Ekman's six basic emotions: anger, disgust, fear, joy, sadness, and surprise [2].
– XED: 30,000 English emotion-annotated movie subtitles labelled with 8 emotions [6].

During preprocessing, we used the hierarchical grouping from the GoEmotions authors to map the 27 emotions onto Ekman's six-emotion scheme. To make the results on both datasets comparable, we dropped anticipation and trust from the XED dataset and the neutral category from both datasets. The GoEmotions dataset is imbalanced (e.g., 21,730 'joy' messages vs. 929 'fear' messages), while the XED dataset is more balanced across its classes [3]. We also applied an alternative aggregated form, which merges anger, disgust, fear, and sadness into a single negative class, yielding a more balanced 3-label task.

We created 11 distilled emotion lexicons per source from the datasets and their aggregated versions by extracting unigram lexicons from each corpus at different thresholds. Diagrams explaining the process can be found in Sections 3.2 and 3.3 of the published article. Each lexicon was then applied to the other dataset, and the three F1 variants were calculated at the different thresholds.

3 Results

Tables 1 and 2 show the results with the highest scores across the different F1 variants. The comprehensive results can be found in Section 4 of the published paper [3].

Table 1. XED distilled lexicon applied to GoEmotions (left) and XED aggregated applied to GoEmotions aggregated (right).

| Threshold | Macro F1 (6 labels) | Micro F1 (6 labels) | Weighted F1 (6 labels) | Macro F1 (3 labels, aggregated) | Micro F1 (3 labels, aggregated) | Weighted F1 (3 labels, aggregated) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0.200 | 0.290 | 0.346 | 0.522 | 0.524 | 0.556 | 0.601 |
| 0.300 | 0.308 | 0.384 | 0.530 | 0.525 | 0.569 | 0.603 |
| 0.400 | 0.317 | 0.416 | 0.528 | 0.516 | 0.569 | 0.594 |
| 0.500 | 0.317 | 0.426 | 0.519 | 0.509 | 0.563 | 0.584 |

Table 2. GoEmotions distilled lexicon applied to XED (left) and GoEmotions aggregated applied to XED aggregated (right).

| Threshold | Macro F1 (6 labels) | Micro F1 (6 labels) | Weighted F1 (6 labels) | Macro F1 (3 labels, aggregated) | Micro F1 (3 labels, aggregated) | Weighted F1 (3 labels, aggregated) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0.000 | 0.328 | 0.330 | 0.337 | 0.478 | 0.520 | 0.626 |
| 0.100 | 0.307 | 0.339 | 0.319 | 0.481 | 0.526 | 0.629 |
| 0.200 | 0.289 | 0.341 | 0.305 | 0.484 | 0.549 | 0.632 |
| 0.300 | 0.257 | 0.311 | 0.272 | 0.474 | 0.560 | 0.624 |

4 Conclusions and discussion

Following results in the literature, we considered a benchmark of 0.53 for the F1-score. We found that this benchmark was reached on the non-aggregated datasets by the weighted F1-score [3]. We recommend reporting all three F1 variants to show whether improvements stem from majority-label recall or from balanced gains. We also consider it relevant to adopt aggregated emotion schemes where robustness outweighs granularity. The study shows that metric choice alone can shift the conclusions drawn about multilabel emotion classification. In future work, we plan to repeat the experiment using the CancerEMO dataset (clinical forum posts) [4].
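
For readers less familiar with the three F1 variants compared above, the following is a minimal sketch of how they can be computed for a multilabel task using their standard definitions. It is not the authors' evaluation code: the function names (`per_label_counts`, `f1_from_counts`, `f1_variants`), the three-label layout, and the toy data are all hypothetical and chosen only to illustrate class imbalance. Labels are assumed to be binary indicator vectors, one column per emotion.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = Sequence[Sequence[int]]  # rows = messages, columns = labels (0/1)


def per_label_counts(y_true: Matrix, y_pred: Matrix) -> List[Tuple[int, int, int]]:
    """Return one (tp, fp, fn) tuple per label column."""
    n_labels = len(y_true[0])
    counts = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    return counts


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def f1_variants(y_true: Matrix, y_pred: Matrix) -> Dict[str, float]:
    counts = per_label_counts(y_true, y_pred)
    per_label = [f1_from_counts(*c) for c in counts]
    support = [tp + fn for tp, _, fn in counts]  # true occurrences per label

    # Macro: unweighted mean of per-label F1 (every label counts equally).
    macro = sum(per_label) / len(per_label)

    # Micro: pool TP/FP/FN over all labels, then compute a single F1.
    micro = f1_from_counts(
        sum(c[0] for c in counts),
        sum(c[1] for c in counts),
        sum(c[2] for c in counts),
    )

    # Weighted: mean of per-label F1, weighted by each label's support.
    total = sum(support)
    weighted = (
        sum(f * s for f, s in zip(per_label, support)) / total if total else 0.0
    )
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Hypothetical toy data, columns = [joy, fear, sadness]; "joy" dominates.
Y_TRUE = [[1, 0, 0], [1, 0, 0], [1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
Y_PRED = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 1]]

scores = f1_variants(Y_TRUE, Y_PRED)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

On this toy data the classifier never predicts the rare "fear" label, so macro F1 is the lowest of the three and micro F1 the highest, with weighted F1 in between. The ordering depends on the data and need not match the ordering in Tables 1 and 2.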


Submission Number: 1
