%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy


% added for the results tables
\usepackage{tabularx}
\usepackage{array}
% left-aligned X column
\newcolumntype{Y}{>{\raggedright\arraybackslash}X}
\usepackage{bm}
\usepackage{booktabs}
\usepackage[table]{xcolor}
\usepackage{multirow}
\usepackage{makecell}
\usepackage{caption}
\usepackage{wrapfig}
\usepackage{adjustbox}
\usepackage{hyperref}
\usepackage{todonotes}
%%
%% Minted listings support 
%% Need pygment <http://pygments.org/> <http://pypi.python.org/pypi/Pygments>
\usepackage{listings}
%% auto break lines
\lstset{breaklines=true}
\usepackage{microtype}

\usepackage{subcaption}


%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% The "title" command
\title{GSI:detect at EVALITA 2026: Overview of the Task on Detecting Gender Stereotypes in Italian}

\tnotemark[1]
%\tnotetext[1]{You can use this document as the template for preparing your publication. We recommend using the latest version of the ceurart style.}

%%
%% The "author" command and its associated commands are used to define the authors and their affiliations.
\author[1]{Gloria Comandini}[%
orcid=0000-0003-3406-2819,
email=comandini@studigermanici.it,
url=https://huggingface.co/GloriaComandini,
]
\cormark[1]
%\fnmark[1]
\address[1]{Italian Institute of Germanic Studies (IISG), Rome, Italy}
%\address[2]{Joint Institute for Nuclear Research,
%  6 Joliot-Curie, Dubna, Moscow region, 141980, Russian Federation}

\author[2]{Manuela Speranza}[%
%orcid=0000-0001-7116-9338,
email=manspera@fbk.eu,
%url=https://kmitd.github.io/ilaria/,
]
%\fnmark[1]
\address[2]{Fondazione Bruno Kessler (FBK), Trento, Italy}

\author[2,3]{Sofia Brenna}[%
orcid=0009-0001-3748-1448,
email=sbrenna@fbk.eu,
%url=http://conceptbase.sourceforge.net/mjf/,
]
%\fnmark[1]
\address[3]{Free University of Bozen-Bolzano, Bolzano, Italy}

\author[2,4]{Davide Testa}[%
orcid=0009-0002-2489-5323,
email=dtesta@fbk.eu,
url = https://linktr.ee/davide.testa,
]
%\fnmark[1]
\address[4]{University of Rome La Sapienza, Rome, Italy}

\author[5]{Stefania Cavagnoli}[%
orcid=0000-0003-1677-6455,
email=stefania.cavagnoli@uniroma2.it,
%url=http://conceptbase.sourceforge.net/mjf/,
]
%\fnmark[1]
\address[5]{University of Rome Tor Vergata, Rome, Italy}

\author[2]{Bernardo Magnini}[%
orcid=0000-0002-0740-5778,
email=magnini@fbk.eu,
%url=http://conceptbase.sourceforge.net/mjf/,
]
%\fnmark[1]



%% Footnotes
\cortext[1]{Corresponding author}
%\fntext[1]{DA SISTEMARE}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
GSI:detect is a new shared task for the recognition and classification of gender stereotypes (GSs) presented at EVALITA 2026. The task adopts a perspectivist approach in order to embrace the high subjectivity of GS recognition and analysis on a dataset of challenging short texts in Italian. GSI:detect is organized into: A) a Main Task (GS Detection), in which systems have to assign each text a GS value, a numerical score that quantifies the extent to which the text exhibits or refers to a GS; B) an optional Subtask (GS Classification), in which systems,
% when a text contains a GS, must recognize its category from six possible options (role, relational, etc.). 
% quanto sotto non e' preciso, abbiamo chiesto di assegnarla sempre
given six pre-defined categories (e.g., role, relational), must assign one to each text.
Seven teams from academic and non-academic environments took part in the challenge, with a total of 50 submitted runs for the Main Task and a total of 43 submitted runs for the optional Subtask. We first present an overview of the GSI:detect task, the dataset and the evaluation criteria, and then outline and discuss the participants' results.
%In this context, we organised the GSI:detect task at EVALITA 2026 on detecting gender stereotypes, based on a dataset of naturally occurring language, encompassing non-hateful contexts and able to capture multiple viewpoints as well as the diversity and disagreement inherent in human perception. The task aims at evaluating  systems’ ability to detect and classify GSs across different kinds of short texts.  
%For the subtask on the semantic classification of gender stereotypes, we propose a systematic six-class taxonomy. 
%Seven teams participated in the task and various models were evaluated. In this evaluation, \textbf{TODO: SUMMARY of RESULTS}. 
  \textbf{Content warning}: Examples taken from the GSI:detect dataset may contain sensitive, offensive, or otherwise distressing content.
\end{abstract}
%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  gender stereotypes \sep
  perspectivism \sep
  linguistic resource \sep
  evaluation \sep
  LLMs
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle

\section{Introduction and Motivation}
\label{sec:motivation}

%Can machines detect gender stereotypes in Italian texts? That's the challenge! Why GSI:detect? Because stereotypes are everywhere! ... and we need to know if AI systems can spot them!

%Related works on gender stereotypes and their detection in NLP. Perspectivist approach.
%Short version of "Related works" in LREC paper.


The GSI:detect Task, organised within EVALITA 2026 \cite{evalita2026overview}, aims to advance the state of the art in the detection of gender stereotypes (GSs)\footnote{We define GSs as socially constructed beliefs about the `appropriate' roles, behaviours, and appearances of a person based on their gender. Although GSI:detect focuses on GSs about men and women for the sake of simplicity, we regard gender as ``a nonbinary construct'' \cite{APA2015} and intend to include GSs regarding non-binary people in future research.}.
GSs have recently been the object of extensive research, both in the automatic recognition of GSs in misogynistic hate speech \cite{Fersinietal2018, Kirketal2023, Plazaetal2023} and in the study of stereotyped and biased material produced by large language models in text generation \cite{Caoetal2022, Ovalleetal2023} and translation \cite{Savoldietal2025}.

%\todo[inline]{Sbaglio o questi lavori non sono sull'italiano? io toglierei "in Italian scritto sopra e aggiungerei un paper di Beatrice Savoldi}

However, with GSI:detect we first aimed to expand the focus beyond hate speech, because GSs can also appear in non-hateful communication
(even as a compliment: ``women's nurturing nature will save the world!''). In fact, GSs are sometimes produced by their own targets (e.g., a man who says ``You know, men only have one thing in mind'', or a woman who says ``I failed the math exam. Oh well, girls don't do well in math anyway, ah-ah''), who have internalized these biased views on gender.


%In natural language processing (NLP) applied to Italian, gender stereotypes (GSs) have been the object of extensive research, in the context of both automatic recognition of GSs in misogynistic hate speech \cite{Fersinietal2018, Kirketal2023, Plazaetal2023}, and in the production of stereotyped and biased material from large language models (LLMs) \cite{Caoetal2022, Ovalleetal2023}.

%However, for the GSI:detect Task at Evalita 2026 
%% Manuela: ho aggiunto io la citazione, possiamo spostarla
%\cite{evalita2026overview}
%we wanted to take a step further in the state-of-the-art detection of GSs in Italian. First of all, we aimed to expand our focus beyond the context of hate speech, because GSs can also appear in non-hateful communication (even as a compliment: "women's nurturing nature will save the world!"). In fact, GSs are sometimes produced even by their own targets (e.g. a man who says "You know, men only have one thing in mind", or a woman who says "I failed in the math exam. Oh well, girls don't do well in math anyway, ah-ah"), who have internalized these biased views on gender.

%As it can be seen from these examples, 
Second, we want to underline that recognizing, and therefore analysing, stereotypes can be deeply subjective, as shown for example by the low inter-annotator agreement (IAA; Cohen's $\kappa = 0.41$) reported for the recognition of racist stereotypes by \cite{Sanguinettietal2018}.
This situation is common to most highly subjective NLP tasks, such as hate speech recognition \cite{Wojatzkietal2018}, irony detection \cite{Sanguinettietal2018} or sentiment analysis applied to complex texts (e.g., newspaper articles or Latin poetry) \cite{Krusic2024, Comandini2025, Sprugnolietal2023}, where annotators' judgments can be influenced by several factors, such as agreement or disagreement with a statement, whether the annotator belongs to the group targeted by hate or stereotypes \cite{Wojatzkietal2018}, personal opinions \cite{Klenneretal2020} and internalized biases \cite{Basileetal2023, Muscatoetal2024}.

For all these reasons, we decided to adopt a perspectivist approach \cite{Basile2020, Basileetal2023, RizosSchuller2020, Madedduetal2023, Muscatoetal2024}, which treats the inherent diversity in judgments and perceptions as a valuable asset rather than as a problem to solve. In fact, it has been observed that the classical majority-voting methods used to create gold-standard corpora may suppress perfectly valid points of view, which may even add interesting information about a task \cite{Klenneretal2020}.

%GLORIA: sto guardando altri report di Evalita e noto che altri non mettono nell'intro il recap del paper. Per risparmiare spazio, farei lo stesso ed eviterei.



%****************************************************
%****************************************************
%****************************************************



\section{Task Description}
\label{sec:task-description}


GSI:detect aims to evaluate systems' ability to detect and classify GSs across different kinds of short texts; it includes both a compulsory main task and an optional subtask.

\subsection{Main Task: Gender Stereotype Detection}
\label{sec:GS-detection}


\begin{table*}[ht]\scriptsize
\centering
\renewcommand{\arraystretch}{1.2} % More space between rows
\captionsetup{font=footnotesize}
\begin{adjustbox}{max width=0.90\textwidth}
\begin{tabular}{p{0.1\textwidth} p{0.7\textwidth}}
\toprule
\textbf{GS value} & \textbf{Example} \\ 
\midrule
 \centering 0 & \scriptsize Non comprendo come si possano paragonare due fenomeni, gravissimi entrambi e concordo, come femminicidi e morti sul lavoro. \textit{(I don't understand how one can compare two phenomena, both very serious and I agree, such as feminicides and deaths at work.)} \\
\centering0.25 & \scriptsize Tenete duro ancora qualche giorno e i vostri fidanzati partiranno in vacanza con le loro mogli. \textit{(Hold on for a few more days and your boyfriends will be going on vacation with their wives.)}\\
\centering0.50 & \scriptsize Io rimango dell'idea che un figlio ha sempre bisogno della sua mamma, anche per dire buongiorno e buona notte. E la mamma idem. Soprattutto la mamma \textit{(I still think an [adult] child always needs his/her mother, even to say good morning and good night. And the mother too. Especially the mother)}\\
\centering0.75 & \scriptsize [Commento ad articolo di giornale dal titolo "Negli Usa quasi un manager su due è donna. In Italia meno di 1 su 3"] Infatti il Made usa va’ peggio del Made italy \textit{([Comment on a newspaper article titled "In the US, almost one in two managers is a woman. In Italy, less than one in three"] In fact, Made in USA is doing worse than Made in Italy)}\\
 \centering1 & \scriptsize [Rivolto a una utente donna] fatevi voi una doccia e copritevi. Le donne vere si coprono. Gli animali vanno in giro nudi. \textit{([Addressed to a female user] Take a shower and cover up. Real women cover up. Animals walk around naked.)}\\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{\label{GSvalueExamples}
Sample texts annotated with GS values.}
%\caption{\label{GSvalueComputation} Corresponding GS value to the five different combinations of label's choice made by the four annotators (non aggregated labels), with example items taken from the Development set.}
\end{table*}




Given a short text, the main task requires systems to assign to it a numerical score, the GS value, that quantifies the degree to which the text contains or refers to a gender stereotype.

This is formulated as a regression task in which GS values are real numbers in the range $[0,1]$, expressed with two decimal places, where 0 indicates the minimum and 1 the maximum degree of stereotypical content.
A few examples of short texts annotated with different GS values are reported in Table \ref{GSvalueExamples}.

%Le donne non sono portate per la matematica e la logica, sono meglio nelle materie umanistiche. GS value: 1



Notice that we consider two types of texts:
\begin{itemize}
    \item \textsc{no context}: texts that can be understood without any additional contextual information (see the first three examples in Table \ref{GSvalueExamples});
    \item \textsc{with context}: texts that are not self-contained and are therefore enriched with contextual information in the form of standardized metadata (see the last two examples in Table \ref{GSvalueExamples}).
\end{itemize}





\subsection{Subtask: Gender Stereotype Classification}
\label{sec:GS-classification}



The subtask on GS classification is formulated as a multi-class classification task where, given a short text, systems are required to assign a 
%predefined 
GS category to that text; 
assignment must be performed for every text, independently of the GS value assigned by the system in the main task.
Participation in the subtask was not compulsory; still, participants were strongly encouraged to submit their results to allow for a more comprehensive evaluation of the phenomenon.

% eg: Le donne non sono portate per la matematica e la logica, sono meglio nelle materie umanistiche. GS category: Competence


\begin{table*}[ht]\scriptsize
\centering
\renewcommand{\arraystretch}{1.2} % More space between rows
\captionsetup{font=small}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabular}{p{0.10\textwidth}  p{0.8\textwidth}}
\toprule
\textbf{Category}  & \textbf{Example} \\ 
\midrule
\textsc{Role} &  \scriptsize Cento uomini possono creare un accampamento, ma serve una donna per fare una casa. \textit{(A hundred men can build a camp, but it takes a woman to make a home.)}\\
\textsc{Personality} &  \scriptsize Sentivo qualcosa di speciale e sai, una donna non sbaglia mai le sensazioni. \textit{(I felt something special and you know, a woman never mistakes her feelings.)}\\
\textsc{Competence} &  \scriptsize [Commento ad articolo con titolo ``La pilota della British Airways ubriaca in volo: cacciata dall’aereo, aggredisce pure i poliziotti''] Come si possono affidare le sorti di un aereo ad una donna? ....scherzo, naturalmente... \textit{([Comment on an article titled ``British Airways pilot drunk on flight: kicked off plane, she even attacks police''] How can you trust a plane's fate to a woman? ....just kidding, of course...)}\\
\textsc{Physical} &  \scriptsize Oppure c'hanno le 5\^{}, vanno in giro scollate come i manifesti messi d'inverno,e poi se rimani ``attirato'' dalle loro protuberanze ci rimangono male Povere cucciole. \textit{(Or they are a size D, they walk around with low-cut clothes like winter posters, and then if you get ``attracted'' by their protuberances, they get upset. Poor little things.)}\\
\textsc{Sexual}  &  \scriptsize [Rivolto a una utente donna] fatevi voi una doccia e copritevi. Le donne vere si coprono. Gli animali vanno in giro nudi. \textit{([Addressed to a female user] Take a shower and cover up. Real women cover up. Animals walk around naked.)}\\
\textsc{Relational} &  \scriptsize [Commento a meme con testo ``Aspettavo che mi mandassi tu un messaggio'' e sotto l'immagine di un uomo vestito da principessa] Tipico post da zitella \textit{([Comment on a meme with the text ``I was waiting for you to text me'' and underneath a picture of a man dressed as a princess] Typical spinster post)}\\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{\label{table:GScategoryValues} Examples of texts assigned to the different GS categories.}
\end{table*}



The classification we propose was developed to capture the variety of ways in which stereotypes manifest in language and to support both linguistic analysis and automatic detection tasks. It comprises the following GS types (examples of texts assigned to the different categories are provided in Table \ref{table:GScategoryValues})\footnote{Note that this taxonomy is not intended to be fully exhaustive, as it is derived from an abstraction over the direct observation of the examples contained in the dataset.}:

\begin{itemize}    
\item \textsc{role} stereotypes:
social and cultural expectations about what women and men should do and about how they should be;
\item \textsc{personality} stereotypes: 
emotional and behavioural traits assigned to men and women based on their gender;
\item \textsc{competence} stereotypes: 
generalized judgments of a person's abilities based on their gender;
\item \textsc{physical} stereotypes: 
expectations about the physical appearance of men and (especially) women, and all aspects of personal care in general;
\item \textsc{sexual} stereotypes:  
attitudes and behaviours that men and women are expected to have regarding sexuality;
\item \textsc{relational} stereotypes:  
the way in which women and men should behave in interpersonal/sentimental relations.
\end{itemize}


For a more accurate understanding of the task, participants were able to refer to the official guidelines for stereotype classification, which were followed during the manual annotation of the dataset.\footnote{The annotation guidelines are available for download at this \href{https://drive.google.com/file/d/17TQPHkDQDBFilcl88fprVC43sKSl4DVA/preview}{link}.}






%**********************************************************

\section{Dataset}
\label{sec:dataset}

%NOTA di GLORIA a Gloria: (DA FARE ASSOLUTAMENTE DOMANI!) Cambia la forma perché è identica al paper di LREC


The GSI:detect dataset\footnote{The GSI:detect dataset is distributed under a CC BY-NC-SA 4.0 license and is publicly available at this \href{https://github.com/Caput97/GSI_detect}{link}. Besides the GS values, the distributed dataset also includes the individual, non-aggregated labels assigned by all annotators, in order to enable systems to learn from annotator disagreement \cite{Madedduetal2023}.} consists of 1,010 short written texts in Italian (for a total of 52,118 tokens), collected from social media and information websites.
%As 
%it will be 
%explained 
%below,
%in Saction \ref{sec:data_coll}, 
%these texts were gathered in order to both capture how language occurs in natural and authentic situations, and represent a wide range of communicative contexts in which GSs may appear at different levels of prototypicality, or not at all. 



%\subsection{Data Collection}
%\label{sec:data_coll}

The texts have been manually collected from a diverse array of online spaces to provide a balanced representation of formal and informal written Italian, a variety that also allows us to explore the theme of GSs in different contexts. 
%family, romantic relationships, sports, politics, entertainment, social rights, physical and mental abilities
%The texts of GSI:detect 
In fact, they include excerpts from information websites, as well as users' comments from Facebook, Instagram and Reddit pages and groups discussing both gender-related issues and more generic topics (e.g., feminist influencers, pick-up artists, ``mom influencers'', parodic pages, math groups, gossip pages, and major Italian newspapers).
%gossip or pseudo-scientific speculations; b) from major Italian newspapers on social media; c) Facebook groups about math and chess; d) Reddit groups discussing dating and relationships. 

%The texts of GSI:detect include:
%\todo{secondo me tutta questa lista seguente può benissimo essere riassunta in brevissimo e recuperare spazio. GLORIA: Ok, lasciatemela esportare in LREC, perché lì ci sta bene.}



%\textbf{Manuela}: ADD ANCHE GENERICAMENTE SITI, non solo social 

 

% Furthermore, the dataset includes GSs regarding not only women, 
% %(Example~\ref{ex_women})
% but also men,
% %(Example~\ref{ex_men})
% or both,
% %men and women. 
% %(Example~\ref{ex_both})
% %\footnote{This choice was deliberate, as the aim of this work is to represent the issue of GSs as comprehensively as possible, encompassing all its manifestations and targets.}
% encompassing
% %the GSI:detect dataset includes not only 
% texts that clearly convey GSs as well as non stereotypical examples, and instances in which the recognition of GSs might be a matter of personal opinion and sensitivity. 
Furthermore, the dataset covers GSs about women, men, or both, and includes texts that clearly express GSs, non-stereotypical examples, and cases where recognizing a GS may be a matter of personal opinion and sensitivity.
%as it will be explained in Section \ref{sec:Annotation}.

\begin{comment}
    

\begin{enumerate}
    \item \label{ex_women} \textsc{women as target:}
    Vabbè oggettivamente le femmine su alcune cose non sono in grado. XD Fagli cambiare una ruota di scorta XD (\textit{Come on objectively women are not capable of some things. XD Make them change a tire XD})
    \item \label{ex_men} \textsc{men as target:}
    [Commento ad articolo di giornale dal titolo "Il corpo di ballo di Marco Mengoni balla al ristorante sulle note di 'Mi fiderò'"] Li vedo bene in guerra contro i russi. XDXDXD (\textit{[Comment to a newspaper article titled "Marco Mengoni's dance troupe dances at the restaurant to the notes of 'Mi fiderò'"] I can see them doing well at war against the Russians XDXDXD})
    \item \label{ex_both} \textsc{men and women as target:}
    Paga l'uomo... Se paga lei è lei l'uomo della coppia. (\textit{The man should pay... If she pays, she is the man of the couple.})
\end{enumerate}
\end{comment}


\begin{comment}

\begin{enumerate}
    \item \textit{Le parole sono femmine e i fatti sono maschi} (Words are female and facts are male)
    \item \textit{[Commento ad articolo di giornale dal titolo "Meloni: "Rispetto il conflitto, ma non se antagonista per principio""] Questa è totalmente andata.} ([Comment to a newspaper article titled "Meloni: "I respect conflict, but not if antagonizing by principle""] She is totally gone)
\end{enumerate}

\begin{enumerate}
\setcounter{enumi}{2}
    \item \textit{Vabbè oggettivamente le femmine su alcune cose non sono in grado. XD Fagli cambiare una ruota di scorta XD} (Come on objectively the females are not capable on some things. XD Make them change a tire XD)
    \item \textit{[Commento ad articolo di giornale dal titolo "Il corpo di ballo di Marco Mengoni balla al ristorante sulle note di 'Mi fiderò'"] Li vedo bene in guerra contro i russi. XDXDXD} ([Comment to a newspaper article titled "Marco Mengoni's dance troup dances at the restaurant to the notes of 'Mi fiderò'"] I can see them doing well at war against the Russians XDXDXD)
    \item \textit{Paga l'uomo... Se paga lei è lei l'uomo della coppia} (The man should pay... If she pays, she is the man of the couple)
\end{enumerate}

\end{comment}


\subsection{Data Annotation}
\label{sec:Annotation}


The GSI:detect dataset has been manually annotated by four expert annotators.
%\footnote{In line with the guidelines proposed by \citet{Basileetal2023}, we include more detailed information about the annotators. The team consists of four  Italian native speakers, all of them attentive to gender-related issues. The annotators are all cisgender, three women and one man, with two people aged between 20 and 30, one between 30 and 40, and one above 40. Regarding educational background, one annotator holds a PhD, while the remaining three have completed a master’s degree.}
%who have spent around three weeks training on the subject and discussing the annotation guidelines.\footnote{The annotation guidelines are available for download at this \href{https://drive.google.com/file/d/17TQPHkDQDBFilcl88fprVC43sKSl4DVA/preview}{link}.} 
%The combination of an extensive training phase and the involvement of multiple expert annotators ensured a shared understanding of the task and consistency in the application of the criteria, thereby contributing to the overall quality and reliability of the dataset.
%The annotation effort, not including training and guidelines definition, can be quantified in a total of fifteen working days.

%Furthermore, as one of the key contributions of this work, we propose a new taxonomy for the semantic classification of gender stereotypes, with each category representing a different dimension of this phenomenon.
%This classification, which is outlined in the annotation guidelines above, was developed to capture the variety of ways in which stereotypes manifest in language and to support both linguistic analysis and automatic detection tasks.

%For each text, the following information is provided:
%\begin{itemize}
    %\item \textbf{GS value}: a number in the interval [0-1] indicating the degree to which the text reflects or refers to a gender stereotype (where 1 is the maximum and 0 is the minimum GS degree); 
%\item \textbf{GS category}: the category to which the gender stereotype (if present) belongs. 
%\end{itemize}

%\todo[]{Reinserire qualche parola in più?} -> Gloria: no, perché non abbiamo lo spazio


\paragraph{\textbf{GS Value Annotation.}}

Although all four annotators were experts and followed the annotation guidelines specifically created for GSI:detect (as mentioned in Section \ref{sec:GS-classification}), the inherent subjectivity of the task inevitably introduced a certain level of disagreement. Following the perspectivist approach introduced in Section \ref{sec:motivation}, we opted to merge all annotations into a numerical GS value, rather than selecting a binary label obtained by aggregating annotations through majority voting.
This choice aligns with recent findings indicating that leveraging disagreement is more fruitful than striving to eliminate it \cite{Basileetal2023,Muscatoetal2024}.

%\textbf{Manuela}Check reference a sezione su perspettivismo
The overall annotation procedure consists of two steps: (i) each annotator manually assigns, for each short text, a binary label \textit{yes}/\textit{no} indicating whether or not the text contains or refers to a GS; (ii) the final GS value is computed by combining the four individual annotations. 
The underlying assumption is that full IAA (all four annotators choose the label \textit{no} or the label \textit{yes}) corresponds to the endpoints of the continuum, while disagreement between annotators indicates intermediate GS values, such as 0.25 (three \textit{no} labels and one \textit{yes} label), 0.5 (two \textit{yes} labels and two \textit{no} labels), and 0.75 (three \textit{yes} labels and one \textit{no} label). 
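
This aggregation can be written compactly as follows. The notation is introduced here for illustration only: $a_j(t)$ denotes the label of annotator $j$ on text $t$, with \textit{yes} mapped to 1 and \textit{no} to 0. The formula is a sketch consistent with the values listed above:
\[
  \mathrm{GS}(t) \;=\; \frac{1}{4}\sum_{j=1}^{4} a_j(t), \qquad a_j(t)\in\{0,1\}, \qquad \mathrm{GS}(t)\in\{0,\,0.25,\,0.5,\,0.75,\,1\}.
\]
For instance, a text labelled \textit{yes} by one annotator and \textit{no} by the other three receives $\mathrm{GS}(t) = (1+0+0+0)/4 = 0.25$.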
%as shown in Table \ref{GSvalueComputation}. 

%\begin{wraptable}{r}{0.30\textwidth}
%\centering
%\captionsetup{font=small}
%\begin{tabular}{ll}
%\hline
%\textbf{Labels} & \textbf{GS value} \\ \hline
%no-no-no-no & 0 \\
%yes-no-no-no & 0.25  \\
%yes-yes-no-no & 0.50  \\
%yes-yes-yes-no & 0.75  \\
%yes-yes-yes-yes & 1  \\ \hline
%\end{tabular}
%\caption{\label{GSvalueComputation}GS values corresponding to the possible combinations of annotators' labels (non aggregated labels).}
%\end{wraptable}


%if all four annotators agree that there is no GS, the resulting GS value is 0; if all four annotators agree that there is a GS indeed, then the resulting GS value is 1.
%On the other hand, disagreement between the annotators in the selection of the binary label is supposed to indicate intermediate GS values, such as 0.25 (three \textit{no} labels and one \textit{yes} label), 0.5 (two \textit{yes} labels and two \textit{no} labels), and 0.75 (three \textit{yes} labels and one \textit{no} label).
The overall IAA among the four annotators on the choice of the \textit{yes} or \textit{no} label is 0.61 (Fleiss' $\kappa$), which indicates moderate agreement, as is common in highly subjective tasks \cite{Artstein2017} such as those discussed in Section \ref{sec:motivation}.


\begin{comment}
    

\begin{table}[ht]
\begin{tabular}{ll}
\hline
\textbf{Labels} &\textbf{GS value} \\ \hline
no-no-no-no & 0 \\
yes-no-no-no & 0.25  \\
yes-yes-no-no & 0.50  \\
yes-yes-yes-no & 0.75  \\
yes-yes-yes-yes & 1  \\ \hline
\end{tabular}
\caption{\label{GSvalueComputation}GS values corresponding to the possible combinations of annotators' labels (non aggregated labels).}
\end{table}

\end{comment}


\paragraph{\textbf{GS Category Annotation.}}

When annotators identify a text as containing or referring to a GS, they additionally assign it to one of the six GS categories, following the classification outlined in the annotation guidelines and summarized in Section \ref{sec:GS-classification}. Due to the more exploratory nature of GS category annotation compared to the preceding annotation level, we adopted a more conventional strategy: in case of disagreement between the annotators, a single category is determined by majority vote, with ties resolved by a GS expert acting as a super-judge (required for 6\% of the dataset).
%\footnote{This procedure was required for only 60 of the 1,010 entries, corresponding to approximately 6\% of the entire dataset.}.
Regarding the GS category annotation, 
%(Figure~\ref{fig:heatmaps_IAA}b), 
the four experts scored 
%another 
moderate agreement \cite{Artstein2017}, with an IAA of 0.61 (Fleiss' \textit{k}).
%\todo[]{Controllare i Fleiss k di questo e del GS value, che qui sono uguali 0.61} -> GLORIA: Ma sono cose diverse!

%\begin{figure*}[h]
%    \centering
    % --- Sottofigura A ---
%    \begin{minipage}{0.40\linewidth}
%        \centering
%        \includegraphics[width=\linewidth]{media/heatmap_GS_value3.pdf}
%        \caption*{(a) IAA -- GS\_value}
%        \label{fig:heatmap_value}
%    \end{minipage}
%    \hfill
    % --- Sottofigura B ---
%    \begin{minipage}{0.40\linewidth}
%        \centering
%        \includegraphics[width=\linewidth]{media/heatmap_GS_category3.pdf}
%        \caption*{(b) IAA -- GS\_category}
%        \label{fig:heatmap_category}
%    \end{minipage}

%    \caption{Comparison of IAA across the two annotation tasks, i.e.   (a) GS value and (b) GS category annotation. Each heatmap visualizes pairwise agreement between annotators.}
%    \label{fig:heatmaps_IAA}
% \end{figure*}
 
%While in this work we consider the diversity of annotators' perspectives as an added value rather than a limitation, we nonetheless interpret the lower pairwise agreement values (e.g., between A1 and A4) as partially resulting from the annotators' differing personal and demographic profiles, briefly introduced earlier in Section
%\textbf{FIND or ADD}.
%~\ref{sec:Annotation}.
%GLORIA: direi di togliere questa parte sopra, perché non mostriamo l'accordo tra singoli annotatori (e non penso sia molto rilevante per questo paper) (teniamocelo per LREC!)


%\textbf{Inter-Annotator Agreement}

%\textbf{Manuela}: io metterei solo il valore di agreement generale  e terrei i dettagli per avere qualcosa di nuovo nel paper per il workshop LREC.
%Gloria: idem, sono d'accordo. E poi, allungheremmo troppo il testo, che già è lungo di suo

%The total IAA between the four annotators on the choice of the \textit{yes} or \textit{no} label is 0.61 (Fleiss' \textit{k}), which is a moderate agreement.

%Figure~\ref{fig:heatmaps_IAA} provides an overview of the inter-annotator agreement (IAA) among the four experts, visualized through pairwise Cohen’s \textit{k} values for both the GS value and GS category annotations. The two heatmaps respectively illustrate the agreement patterns for the numerical and categorical scoring schemes.

%As shown in Figure~\ref{fig:heatmaps_IAA}a, A2 and A3 have the highest agreement (0.679), A1 and A4 have the lowest agreement (0.486). A1 has the lowest average pairwise IAA (0.579), followed by A4 (0.583), A3 (0.631) and A2 (0.659).


%Regarding the GS category annotation (Figure~\ref{fig:heatmaps_IAA}b), the four experts scored another moderate agreement, with a IAA of 0.61 (Fleiss' \textit{k}). 
%Inspecting again the pairwise IAA values (Cohen's \textit{k}), A2 and A3 have again the highest agreement (0.684), while A1 and A4 have the lowest agreement (0.509). In this case, however, A4 is the one with the lowest average pairwise IAA (0.573), followed by A1 (0.587), A3 (0.622) and A2 (0.657).

%While in this work we consider the diversity of annotators' perspectives as an added value rather than a limitation, we nonetheless interpret the lower pairwise agreement values (e.g., between A1 and A4) as partially resulting from the annotators' differing personal and demographic profiles, briefly introduced earlier in Section~\ref{sec:Annotation}.


%\subsection{Statistics}
%\label{sec:Analysis}


\subsection{Data Statistics: Test and Development Split}
\label{sec:Analysis}


The complete dataset 
%contains 1,010 texts and 
is divided as follows: 80\% is allocated to the test set, used for the official evaluation and ranking of participant systems, while the remaining 20\% constitutes the development set (dev set; see Table~\ref{table:summarySize} for more details).
These proportions were chosen to balance the need for adequate data for model tuning with the goal of maintaining a larger and more representative test set for the final evaluation.
%\todo[]{and to simulate a low-resource annotated data scenario?} -> Da approfondire e ripensarci per LREC

\begin{wraptable}{r}{0.48\columnwidth}
\centering
\captionsetup{font=small}

\begin{adjustbox}{max width=0.85\linewidth}
\footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lrrr}
\toprule
                    & \textbf{Dev set} & \textbf{Test set} & \textbf{Total}\\ 
\midrule
\textsc{With Context} texts  & 82  & 323 & 405  \\
\textsc{No Context} texts    & 118 & 487 & 605  \\
All Texts               & 200 & 810 & 1010 \\
Tokens  & 10,055 & 42,063 &52,118 \\
Av. Length & 50.27 & 51.93 & 51.6\\
\bottomrule
\end{tabular}
\end{adjustbox}

\caption{\label{table:summarySize}Dataset's statistics.}
\end{wraptable}

Table~\ref{table:summarySize} also reports the size of the dataset in tokens\footnote{The token count was computed using the Italian rule-based tokenizer included in the \textit{spaCy} library (\url{https://spacy.io}, version 3.8.7) as part of the \textit{it\_core\_news\_sm} linguistic model. The average length of texts is 51.6 tokens.}. In both the dev and test sets, approximately 40\% of texts are \textsc{with context} and 60\% are \textsc{no context} (see Section \ref{sec:GS-detection}).
In the creation of the dev and test sets, particular care was taken to ensure a balanced distribution of examples across both subsets, as shown in Table~\ref{table:summaryGSvalue}. 
%As shown in , the overall proportion of items associated with each GS value (i.e., the percentages shown in the \textit{Total \%} column) is approximately mirrored in the composition of both the dev and test set. 
%when considering the ratio between the number of items per GS value and the total size of the respective subset.
%tolgo questa parte perché mi sembra ridondante
%This suggests that the 
Therefore, the split preserves the original distribution of GS values, thereby guaranteeing a consistent representation of varying degrees of stereotypicality in both subsets.
A comparable level of balance is also observed for GS categories (see Table~\ref{table:summaryCategory}).
%with the relative proportions of the six stereotype dimensions remaining consistent between the dev and test sets.


%33.3 if we consider pure original text and 51.6 in we include the added contextual information. The context metadata (added to 405 texts) consist on average of 45.5 tokens.

\begin{comment}
    
%*********************************
\begin{table}[t]
\centering

\begin{minipage}{0.48\columnwidth}
\centering
\footnotesize
\setlength{\tabcolsep}{4pt}
\captionsetup{font=small}
\begin{tabular}{lrrr}
\toprule
                    & \textbf{Dev set} & \textbf{Test set} & \textbf{Total}\\ 
\midrule
\textsc{With Context} texts  & 82  & 323 & 405  \\
\textsc{No Context} texts    & 118 & 487 & 605  \\
Total               & 200 & 810 & 1010 \\
\bottomrule
\end{tabular}
\caption{Dataset's split.}
\label{table:summarySize}
\end{minipage}
\hfill
\begin{minipage}{0.48\columnwidth}
\centering
\footnotesize
\setlength{\tabcolsep}{4pt}
\captionsetup{font=small}
\begin{tabular}{lrrr}
\toprule
              & \textbf{Dev set} & \textbf{Test set} & \textbf{Whole} \\ 
\midrule
Tokens    & 10,055 & 42,063  & 52,118 \\
Texts & 200 & 810 & 1010 \\
Av. length & 50.27 & 51.93 & 51.6 \\
\bottomrule
\end{tabular}
\caption{Dataset's size.}
\label{table:summaryData2}
\end{minipage}

\end{table}
%*********************************
\end{comment}


\begin{comment}

%%%%% ULTIMA VERSIONE UTILIZZATA PER TABELLA 3 E 4 SU DATASET SIZE

\begin{table}[t]
\centering

\begin{subtable}[t]{0.48\columnwidth}
\centering
\captionsetup{font=small}
\begin{adjustbox}{max width=0.60\linewidth}
\footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lrrr}
\toprule
                    & \textbf{Dev set} & \textbf{Test set} & \textbf{Total}\\ 
\midrule
\textsc{With Context} texts  & 82  & 323 & 405  \\
\textsc{No Context} texts    & 118 & 487 & 605  \\
Total               & 200 & 810 & 1010 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{Dataset's split.}
\label{table:summarySize}
\end{subtable}
\hfill
\begin{subtable}[t]{0.48\columnwidth}
\centering
\captionsetup{font=small}
\begin{adjustbox}{max width=0.60\linewidth}
\footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lrrr}
\toprule
              & \textbf{Dev set} & \textbf{Test set} & \textbf{Whole} \\ 
\midrule
Tokens    & 10,055 & 42,063  & 52,118 \\
Texts & 200 & 810 & 1010 \\
Av. length & 50.27 & 51.93 & 51.6 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{Dataset's size.}
\label{table:summaryData2}
\end{subtable}
\caption{Summary statistics of the dataset.}
\label{tab:dataset_summary}
\end{table}


\end{comment}


This careful selection ensures that both subsets are representative of the overall GSI:detect dataset, 
%therefore offering a reliable foundation for model tuning and evaluation, 
preventing unintended biases in the distribution of categories.

\begin{comment}
    

\begin{table}[t]
\centering

\begin{minipage}{0.48\columnwidth}
\centering
\begin{tabular}{lrrr}
\toprule
                    & \textbf{Dev set} & \textbf{Test set} & \textbf{Total}\\ 
\midrule
\textsc{With Context} texts  & 82  & 323 & 405  \\
\textsc{No Context} texts    & 118 & 487 & 605  \\
Total               & 200 & 810 & 1010 \\
\bottomrule
\end{tabular}
\caption{Dataset's size and split.}
\label{table:summarySize}
\end{minipage}
\hfill
\begin{minipage}{0.48\columnwidth}
\centering
\begin{tabular}{lrrr}
\toprule
              & \textbf{Tokens} & \textbf{Items} & \textbf{Av. length} \\ 
\midrule
Texts only    & 33,673 & 1010 & 33.3 tok. \\
Contexts only & 18,445 & 405  & 45.5 tok. \\
Whole dataset & 52,118 & 1010 & 51.6 tok. \\
\bottomrule
\end{tabular}
\caption{Dataset's size in details.}
\label{table:summaryData2}
\end{minipage}

\end{table}

\end{comment}







\begin{comment}
\begin{table}
\centering
\begin{tabular}{lrrrr}
\toprule
GS value & \textbf{Dev set} & \textbf{Test set} & \textbf{Total} &  \textbf{Total\%} \\ \midrule
0    &  60 & 242 & 302 & 29.90\% \\
0.25 &  25 &  84 & 109 & 10.79\% \\
0.50 &  27 &  85 & 112 & 11.09\% \\
0.75 &  25 & 105 & 130 & 12.87\% \\ 
1    &  63 & 294 & 357 & 35.35\% \\ \midrule
     & 200 & 810 & 1010 \\
\bottomrule
\end{tabular}
\caption{\label{table:summaryGSvalue} Dataset distribution by GS value.}
\end{table}

\begin{table}
\centering
\begin{tabular}{lrrrr}
\toprule
Category & \textbf{Dev set} & \textbf{Test set} & \textbf{Total} & \textbf{Total\%}  \\ \midrule
Role           & 30 & 107 & 137 & 13.56 \% \\
Personality    & 29 & 108 & 137 & 13.56\% \\
Competence     & 34 & 120 & 154 & 15.25\% \\
Physical       & 20 & 90 & 110 & 10.89\% \\
Sexual         & 14 & 72 & 86 & 8.52\% \\
Relational     & 13 & 71 & 84 & 8.32\% \\ 
GS value = 0   & 60 & 242 & 302 & 29.90\%   \\    \midrule
            & 200 & 810 & 1010 & \\
\bottomrule
\end{tabular}
\caption{\label{table:summaryCategory} Dataset distribution by GS category.}
\end{table}


\end{comment}


\begin{comment}
    

\begin{table}[t]
\centering
%\scriptsize

% ---------- LEFT TABLE ----------
\begin{minipage}{0.48\textwidth}
\centering
\footnotesize
\setlength{\tabcolsep}{4pt}
\captionsetup{font=small}
\begin{tabular}{lrrrr}
\toprule
GS value & \textbf{Dev set} & \textbf{Test set} & \textbf{Total} & \textbf{Total\%} \\ 
\midrule
0    &  60 & 242 & 302 & 29.90\% \\
0.25 &  25 &  84 & 109 & 10.79\% \\
0.50 &  27 &  85 & 112 & 11.09\% \\
0.75 &  25 & 105 & 130 & 12.87\% \\ 
1    &  63 & 294 & 357 & 35.35\% \\ 
\midrule
     & 200 & 810 & 1010 \\
\bottomrule
\end{tabular}
\caption{\label{table:summaryGSvalue}Dataset distribution by GS value.}
\end{minipage}
\hfill
% ---------- RIGHT TABLE ----------
\begin{minipage}{0.48\textwidth}
\centering
\footnotesize
\setlength{\tabcolsep}{4pt}
\captionsetup{font=small}
\begin{tabular}{lrrrr}
\toprule
Category & \textbf{Dev set} & \textbf{Test set} & \textbf{Total} & \textbf{Total\%}  \\ 
\midrule
Role           & 30 & 107 & 137 & 13.56\% \\
Personality    & 29 & 108 & 137 & 13.56\% \\
Competence     & 34 & 120 & 154 & 15.25\% \\
Physical       & 20 & 90 & 110 & 10.89\% \\
Sexual         & 14 & 72 & 86 & 8.52\% \\
Relational     & 13 & 71 & 84 & 8.32\% \\ 
GS value = 0   & 60 & 242 & 302 & 29.90\% \\    
\midrule
               & 200 & 810 & 1010 & \\
\bottomrule
\end{tabular}
\caption{\label{table:summaryCategory}Dataset distribution by GS category.}
\end{minipage}

\end{table}

\end{comment}


\begin{table}[t]
\centering

% ---------- LEFT TABLE ----------
\begin{subtable}[t]{0.45\columnwidth}
\centering
\captionsetup{font=small}
\begin{adjustbox}{max width=0.85\linewidth}
\footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lrrrr}
\toprule
GS value & \textbf{Dev set} & \textbf{Test set} & \textbf{Total} & \textbf{Total\%} \\ 
\midrule
0    &  60 & 242 & 302 & 29.90\% \\
0.25 &  25 &  84 & 109 & 10.79\% \\
0.50 &  27 &  85 & 112 & 11.09\% \\
0.75 &  25 & 105 & 130 & 12.87\% \\ 
1    &  63 & 294 & 357 & 35.35\% \\ 
\midrule
     & 200 & 810 & 1010 \\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{\label{table:summaryGSvalue}Dataset distribution by GS value.}
\end{subtable}
% ---------- RIGHT TABLE ----------
\begin{subtable}[t]{0.45\columnwidth}
\centering
\captionsetup{font=small}
\begin{adjustbox}{max width=0.85\linewidth}
\footnotesize
\setlength{\tabcolsep}{4pt}
\begin{tabular}{lrrrr}
\toprule
Category & \textbf{Dev set} & \textbf{Test set} & \textbf{Total} & \textbf{Total\%}  \\ 
\midrule
Role           & 30 & 107 & 137 & 13.56\% \\
Personality    & 29 & 108 & 137 & 13.56\% \\
Competence     & 34 & 120 & 154 & 15.25\% \\
Physical       & 20 & 90 & 110 & 10.89\% \\
Sexual         & 14 & 72 & 86 & 8.52\% \\
Relational     & 13 & 71 & 84 & 8.32\% \\ 
GS value = 0   & 60 & 242 & 302 & 29.90\% \\    
\midrule
               & 200 & 810 & 1010 & \\
\bottomrule
\end{tabular}
\end{adjustbox}
\caption{\label{table:summaryCategory}Dataset distribution by GS category.}
\end{subtable}

% ---------- MAIN CAPTION ----------
\caption{Dataset distribution statistics.}
\label{tab:dataset_distribution}

\end{table}


%\subsection{Data Distribution}

%The GSI:detect dataset is distributed under a CC BY-NC-SA 4.0 Licence\footnote{The dataset is publicly available at this \href{https://github.com/Caput97/GSI_detect}{link}. 
%A content warning applies, however, as some items may contain sensitive, offensive, or otherwise distressing content. -> Lo toglierei perché abbiamo messo il warning anche nell'abstract e dobbiamo recuperare spazio.
%}. Importantly, 
%the dataset preserves 
%the distributed dataset includes,
%besides the GS values,
%also the individual, non-aggregated 
%judgments of: riprendo la terminologia della tabella, se no occorrerebbe modificare quella parte la' 
%labels assigned by
%all annotators, in order to enable systems to learn from annotator disagreement \cite{Madedduetal2023}.
%\textit{Creative Commons NonCommmercial-ShareAlike License}.




%\todo[inline]{Manuela: dire che la distribuzione include anche le label non aggregate. GLORIA: fatto, aggiungere altro se serve}


\section{Evaluation} \label{sec:evaluation}


%The performance of participant systems in GS detection will be evaluated using a metric based on \textit{Mean Squared Error} (MSE), which measures the average of the squared differences between the actual and predicted GS values. To make system ranking more intuitive, the MSE will be normalized, so that higher values correspond to better performance.

%The performance of participant systems in GS classification will be evaluated using the F1 score, which balances precision and recall in a single metric.

%\todo[inline]{\textbf{Davide: Proposta di sezione evaluation. CONTROLLARE}}

Evaluation in GSI:detect is designed to reflect the 
%different 
specific nature of both the main task on GS detection, formulated as a regression problem, and the subtask on GS classification, formulated as a multi-class classification problem. Accordingly, we adopt task-specific evaluation criteria to ensure meaningful comparison and reliable system ranking.

\paragraph{GS Detection.} Participant systems' performance
%in the GS value detection task 
is assessed using a score derived from the \textit{Mean Squared Error} (MSE), which measures the average squared distance between predicted and
%gold-standard GS values
%original 
annotated
GS values, penalizing larger deviations more heavily. 
To improve interpretability and comparability across systems, the MSE is normalized with respect to the variance of the target distribution (\textit{Normalized Mean Squared Error}, NMSE). Since lower NMSE values indicate better performance, we further transform this quantity into a bounded score defined as $ \tfrac{1}{1 + \mathrm{NMSE}}$ \vspace{1mm}
%\[
%\mathrm{Score} = \frac{1}{1 + \mathrm{NMSE}}
%\]
so that higher values correspond to better predictive accuracy. This formulation enables an intuitive ranking of systems while preserving the relative performance differences.\\ In addition to the scores described above, we also report the \textit{Concordance Correlation Coefficient} (CCC) as a complementary measure of agreement between predictions and reference values, capturing both correlation and potential systematic bias.
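
As a sketch of the quantities involved, let $y_i$ denote the annotated GS value of the $i$-th test text, $\hat{y}_i$ the corresponding prediction, and $N$ the number of test texts (this notation is introduced here only for illustration). Under these assumptions, the ranking score can be written as
\[
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2, \qquad
\mathrm{NMSE} = \frac{\mathrm{MSE}}{\sigma_y^2}, \qquad
\mathrm{Score} = \frac{1}{1 + \mathrm{NMSE}},
\]
where $\sigma_y^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \bar{y})^2$ is the variance of the annotated values (whether the population or the sample estimate is used is an assumption here; the difference is negligible at this sample size). The score lies in $(0, 1]$, equals 1 for perfect predictions, and equals 0.5 when the MSE matches the variance of the targets, i.e., for a system that always predicts the mean annotated value. The CCC, in its standard formulation, is
\[
\mathrm{CCC} = \frac{2\,\sigma_{y\hat{y}}}{\sigma_y^2 + \sigma_{\hat{y}}^2 + \left(\bar{y} - \bar{\hat{y}}\right)^2},
\]
where $\sigma_{y\hat{y}}$ is the covariance between annotated and predicted values. It ranges in $[-1, 1]$, reaches 1 only when predictions coincide with the annotations, and is 0 for any constant prediction, since the covariance vanishes.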

\paragraph{GS Classification.}
GS classification is evaluated using the \textit{F1} score, which combines precision and recall into a single performance indicator.\footnote{As annotators assigned a GS category only to texts containing or referring to a stereotype, texts with GS value = 0 (i.e., no-no-no-no annotation) do not have a category. In the GS classification subtask, systems' performance was therefore evaluated on 568 out of 810 texts.} To account for possible class imbalance while maintaining sensitivity to per-class behaviour, we report both \textit{Macro F1}, which weights all categories equally regardless of their frequency, and \textit{Micro F1}, which reflects the overall performance at the instance level. While both measures are reported for analysis purposes, \textit{Micro F1} is adopted as the official metric for system ranking, as it provides an instance-level estimate of overall classification performance% and reduces the impact of variability in less frequent categories
.
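
For reference, the two averages can be sketched under their standard definitions, writing $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ for the true positives, false positives, and false negatives of category $c$ in the set $C$ of the six GS categories (notation introduced here only for illustration):
\[
\mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}, \qquad
\mathrm{Macro\ F1} = \frac{1}{|C|}\sum_{c \in C} \mathrm{F1}_c,
\]
\[
\mathrm{Micro\ F1} = \frac{2\sum_{c \in C}\mathrm{TP}_c}{2\sum_{c \in C}\mathrm{TP}_c + \sum_{c \in C}\mathrm{FP}_c + \sum_{c \in C}\mathrm{FN}_c}.
\]
If every evaluated text receives exactly one predicted category among the six, each misclassification counts once as a false positive and once as a false negative, so that \textit{Micro F1} reduces to the proportion of correctly classified texts.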


\paragraph{Baselines.} 
For both GS detection and classification, participant systems' performance is compared against a set of four baselines\footnote{Baselines were computed only for the zero-shot track.}. They include a simple heuristic baseline that assigns a constant GS value of 0.5 and a random GS category (B1 in Table~\ref{tab:results_zeroshot-task1}), as well as two LLMs, namely \textit{Qwen3-14B} (B2) and \textit{GPT-5 nano} (\textit{gpt-5-nano-2025-08-07}).
\textit{GPT-5 nano}, in particular, is evaluated under two prompting configurations: one where GS detection and classification are jointly addressed within a single prompt (B3), and one where the two tasks are solved independently with two separate prompts (B4).
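
As an illustrative consistency check for B1 (a sketch, assuming that the NMSE is the MSE divided by the variance of the annotated test values, and using the test-set counts in Table~\ref{table:summaryGSvalue}), the mean and variance of the annotated GS values are
\[
\bar{y} = \frac{0 \cdot 242 + 0.25 \cdot 84 + 0.5 \cdot 85 + 0.75 \cdot 105 + 1 \cdot 294}{810} \approx 0.539, \qquad
\sigma_y^2 \approx 0.179 .
\]
For the constant prediction $\hat{y} = 0.5$, the error decomposes as
\[
\mathrm{MSE} = \sigma_y^2 + (\bar{y} - 0.5)^2 \approx 0.179 + 0.0015 \approx 0.180, \qquad
\mathrm{NMSE} \approx 1.008, \qquad
\frac{1}{1 + \mathrm{NMSE}} \approx 0.498 .
\]
After rounding, these values agree with those reported for B1 in Table~\ref{tab:results_zeroshot-task1}. Any constant prediction also has zero covariance with the annotations, which is consistent with the CCC of 0 reported for B1.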
%For both GS detection and classification, participant systems' performance is compared against a set of baselines\footnote{Baselines were computed only for the zero-shot track.}. They include the performance of two LLMs, namely \textit{GPT-5} (gpt-5-nano-2025-08-07) and \textit{Qwen3} (14B), as well as a simple heuristic baseline obtained by assigning a constant GS value of 0.5 and a random GS category prediction. The baselines are evaluated under different prompting configurations, namely a \textit{split prompt} setting, where Task~1 and Task~2 are solved independently, and a \textit{unified prompt} setting, where both tasks are jointly addressed within a single prompt. Specifically, in Table \ref{tab:results_zeroshot-task1} and \ref{tab:results_zeroshot-task2} we will refer to \textit{GPT-5 nano} as Run~1 (split prompt) and Run~2 (unified prompt), while \textit{Qwen3} is evaluated in a unified prompt configuration (Run~3). The heuristic baseline corresponds to Run~4.\\

%This evaluation setup provides a balanced view of system effectiveness, highlighting both global classification accuracy and robustness across GS categories.



\section{Participants}
\label{sec:participants}

The GSI:detect shared task attracted seven teams from both academic and non-academic environments. Participants were allowed to submit multiple runs for each task, exploring different model architectures, prompting strategies, and technical configurations. The evaluation campaign included four different tracks for both the main task and the subtask: \textit{zero-shot}, \textit{few-shot}, \textit{fine-tuning} of LLMs, and \textit{encoder-only models}. Not all teams submitted runs to all tracks and tasks. Table \ref{tab:participants} presents the participating teams and reports the number of runs each of them submitted for each track in both the GS Detection and GS Classification tasks. Further analysis of the impact of the different runs and of their behavior across the two tasks, which leads to different performance trends, is presented in Section~\ref{sec:results}. 
For a detailed description of the individual system configurations %architectures used, prompting strategies, and training configurations 
and methodologies associated with each run, we refer the reader to the participants’ %system description 
reports, cited in Table~\ref{tab:participants}.


\begin{table*}[t]
\centering
\footnotesize
\setlength{\tabcolsep}{4pt}
\captionsetup{font=small}
\begin{adjustbox}{max width=\textwidth}
\resizebox{\textwidth}{!}{
\begin{tabular}{l cccc | cccc}
\toprule
\multirow{2}{*}{\textbf{Team Name}} 
& \multicolumn{4}{c}{\textbf{GS Detection}} 
& \multicolumn{4}{c}{\textbf{GS Classification}} \\
\cmidrule(lr){2-5}\cmidrule(lr){6-9}
& \makecell{Zero-shot} 
& \makecell{Few-shot} 
& \makecell{Fine-Tuning} 
& \makecell{Encoder-only}
& \makecell{Zero-shot} 
& \makecell{Few-shot} 
& \makecell{Fine-Tuning} 
& \makecell{Encoder-only} \\
\midrule
DIAG-Sapienza \cite{DIAG_report} & 1 & 1 & - & 4 & 1 & 1 & - & 4 \\
Festa \cite{Festa_report}   & 5 & 5 & - & 5 & 5 & 5 & -  & 5  \\
MINDS \cite{MINDS_report}         & - & 2 & - & - & - & - & - & -  \\
Prisma \cite{Prisma_report}        & 5 & 1 & - & - & 5 & 1 & - & -  \\
StereoBusters \cite{Stereobusters_report} & 5 & 5 & - & - & 5  & 5  & - & -  \\
Tiz \cite{Tiz_report}       & 5 & - & 5 & - & 5 & - & - & -\\
VellaAsta\footnotemark     & - & - & - & 1 & -  & - & - & 1  \\
\midrule
Total & 21 & 14 & 5 & 10 & 21 & 12 & 0 & 10 \\
\bottomrule
\end{tabular}
}
\end{adjustbox}

\caption{Number of runs submitted by each team for each track in the GS Detection and GS Classification tasks.}
\label{tab:participants}
\end{table*}
%\footnotetext{Results for the VellaAsta team were provided directly to the organizers. Therefore, we report the obtained results, although no external system description report was submitted to EVALITA 2026 for this team.}
\footnotetext{As the VellaAsta team submitted the output of their system correctly and on time for the official evaluation, we report their results even though they did not submit a report describing the system.}









\section{Results and Discussion}
\label{sec:results}
%\section{Task Overview: Participation and Results}
%Recap


%This section presents and discusses the results achieved by the participating systems and their different configurations. -> Questo potrebbe andare nell'intro
The results are organized according to the four main tracks of the shared task, under which participants submitted multiple runs for both the main task and the subtask. %(see Section~\ref{sec:participants} for an overview of participation)
This experimental setting results in eight distinct rankings, corresponding to the combination of the two tasks and the four tracks.  

%For a detailed description of the individual system architectures, prompting strategies, and training configurations referred to in the following analysis, we direct the reader to the participants’ system description papers cited in the previous section.

\subsection{Zero-Shot Track}

Tables~\ref{tab:results_zeroshot-task1} and~\ref{tab:results_zeroshot-task2} report the scores obtained by the participating systems on the main task and on the subtask, respectively, in the zero-shot setting. Overall, in both tasks several participant systems outperform the proposed baselines, which are positioned approximately in the middle of the ranking and therefore act as a rough boundary between higher- and lower-performing approaches in this track.

For our main task, the best-performing system was submitted by the DIAG-Sapienza team \cite{DIAG_report}. It is based on \textit{GPT-5}, in a configuration that generates the predicted GS value, the GS category, and a one-sentence explanation within a single model call (see run~1 in Tables~\ref{tab:results_zeroshot-task1} and~\ref{tab:results_zeroshot-task2}).
Interestingly, while this configuration achieves the top rank in GS Detection, it falls to ninth place in GS Classification, highlighting the increased difficulty of fine-grained stereotype categorization and of the underlying reasoning process. Nevertheless, in the main task this configuration substantially outperforms our \textit{GPT-5 nano} baseline B4 (0.70 vs.\ 0.61), whereas in the classification task the performance gap becomes much smaller (0.58 Micro F1 for DIAG-Sapienza vs.\ 0.53 for B4).

Another team showing consistently strong performance in this track is StereoBusters \cite{Stereobusters_report}, which evaluates several models of different sizes and configurations. In particular, \textit{Llama~3.3} (70B) achieves competitive results in GS Detection, approaching the performance of the closed-source \textit{GPT-5} model by DIAG-Sapienza (0.63 vs.\ 0.70). Moreover, %unlike \textit{GPT-5}, 
\textit{Llama~3.3} also maintains strong performance in GS Classification (0.64 Micro F1), outperforming both proprietary models (\textit{GPT}- and \textit{Gemini}-based) and the systems submitted by the other teams. Additionally, the ensemble strategy adopted by this team (i.e., run~4), which combines the predictions of four LLMs to determine the stereotype category, proves particularly effective in the classification task: it achieves the highest Micro F1 score (0.646) in this subtask, an improvement of roughly ten positions in the ranking compared to the main task.

The Festa team \cite{Festa_report} explores multiple configurations of \textit{Gemini~2.5~Flash}%. In most cases, these systems 
, outperforming most of our baselines in %the main task and consistently doing so in the subtask.
both tasks. For GS Detection, the best-performing configuration %among the five submitted runs 
(i.e., run~4) relies on an English prompt, despite the data being in Italian, combined with negative constraints aimed at minimizing false positives. By contrast, Chain-of-Thought prompting in both English and Italian (runs~1 and~3) appears less beneficial in this task, yielding scores six to seven points lower than run~4 (0.56 and 0.55 vs.\ 0.62). In the classification task, however, Chain-of-Thought prompting proves more effective, with performance comparable to \textit{Llama~3.3} (70B) by StereoBusters. %and also their LLMs' ensemble in the top positions of the ranking. 
This suggests that explicitly encouraging intermediate reasoning steps may support finer-grained category discrimination.

Finally, in this setting, the systems submitted by the Tiz \cite{Tiz_report} and Prisma \cite{Prisma_report} teams show substantially lower performance, falling below the LLM-based baselines (i.e., B2, B3, and B4) in both the main task and the subtask; the only exception is Tiz run~5, which matches B3 and exceeds B2 in GS Detection. In particular, Prisma explores multiple configurations of \textit{Claude~3.5 Sonnet} based on different annotator personas and their aggregation; however, these highly polarized configurations do not yield competitive results, possibly due to a mismatch between the persona biases induced by the configuration and the annotator perspectives underlying the dataset. This suggests that strong persona conditioning, when misaligned with the annotator distribution of the dataset, may introduce biases that increase the distance from the target judgments.

Overall, what emerges from this scenario is that, although \textit{GPT-5} (DIAG-Sapienza) represents the best-performing system in the main GS Detection task, open-source models with different configurations and modeling strategies are able to closely approach, match, or even surpass closed-source systems across both tasks. In particular, StereoBusters’ open-source models narrow the gap with \textit{GPT-5} in the main task and outperform both proprietary systems and our baselines in GS Classification, regardless of whether large-scale (\textit{Llama~3.3} 70B) or mid-sized (\textit{Gemma~3} 12B) models are used.
This trend suggests that stereotype categorization, which inherently involves subjective interpretation and nuanced reasoning, benefits less from pure model scale and more from diversified modeling choices and decision aggregation strategies.




% ---------- ZERO-SHOT Task 1(tabularx, no shot column) ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{3pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYrrrr}
\toprule
\textbf{Team name}  & \textbf{Model} & \textbf{Run id}  &
 $\bm{1/(1+NMSE)}$ $\uparrow$ & MSE $\downarrow$ & NMSE $\downarrow$ &  CCC $\uparrow$ \\
\midrule
DIAG-Sapienza  & GPT-5 & 1  & 0.70 & 0.077 & 0.43 & 0.78 \\
StereoBusters  & Llama\_3.3\_70B & 1  & 0.63 & 0.11 & 0.60 & 0.60 \\
StereoBusters  & Llama\_3.3\_70B & 2  & 0.62 & 0.11 & 0.61 & 0.59 \\
Festa  & Gemini\_2.5\_Flash & 4  & 0.62 & 0.11 & 0.61 & 0.70 \\
Festa  & Gemini\_2.5\_Flash & 5  & 0.62 & 0.11 & 0.62 & 0.69 \\
StereoBusters  & Gemma\_3\_12B & 0  & 0.61 & 0.11 & 0.64 & 0.56 \\
Festa  & Gemini\_2.5\_Flash & 2  & 0.61 & 0.11 & 0.64 & 0.69 \\
\rowcolor{gray!12}
BASELINE &  GPT-5 nano & B4  & 0.61 & 0.11 & 0.64 & 0.60 \\
Tiz  & Gemma\_3\_12B & 5  & 0.59 & 0.12 & 0.69 & 0.63 \\
\rowcolor{gray!12}
BASELINE  & GPT-5 nano & B3  & 0.59 & 0.12 & 0.69 & 0.57 \\
StereoBusters  & Panel\_4LLMs & 4  & 0.57 & 0.13 & 0.75 & 0.62 \\
StereoBusters  & Panel\_4LLMs & 3  & 0.56 & 0.14 & 0.77 & 0.60 \\
Festa  & Gemini\_2.5\_Flash & 1  & 0.56 & 0.14 & 0.79 & 0.63 \\
Festa  & Gemini\_2.5\_Flash & 3  & 0.55 & 0.14 & 0.80 & 0.62 \\
\rowcolor{gray!12}
BASELINE  & Qwen-3\_14B & B2  & 0.54 & 0.15 & 0.84 & 0.46 \\
\rowcolor{gray!12}
BASELINE  & N/A & B1  & 0.50 & 0.18 & 1.01 & 0 \\
Tiz  & Gemma\_3\_12B & 3  & 0.48 & 0.19 & 1.06 & 0.55 \\
Tiz  & Gemma\_3\_12B & 2  & 0.48 & 0.19 & 1.07 & 0.54 \\
Tiz  & Gemma\_3\_12B & 1  & 0.47 & 0.20 & 1.12 & 0.52 \\
Tiz  & Gemma\_3\_12B & 4  & 0.43 & 0.24 & 1.33 & 0.42 \\
Prisma  & Claude\_3.5\_Sonnet & 2  & 0.33 & 0.36 & 2.04 & 0.15 \\
Prisma  & Claude\_3.5\_Sonnet & 4  & 0.32 & 0.38 & 2.11 & 0.08 \\
Prisma  & Claude\_3.5\_Sonnet & 1  & 0.32 & 0.38 & 2.11 & 0.09 \\
Prisma  & Claude\_3.5\_Sonnet & 5  & 0.31 & 0.40 & 2.23 & 0.09 \\
Prisma  & Claude\_3.5\_Sonnet & 3  & 0.31 & 0.40 & 2.24 & 0.08 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{zero-shot} track [Main Task].}
\label{tab:results_zeroshot-task1}
\end{table}

% ---------- ZERO-SHOT Task2 ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{2pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYrr}
\toprule
\textbf{Team name} & \textbf{Model} & \textbf{Run id} &
\textbf{F1 Micro} $\uparrow$ & F1 Macro $\uparrow$ \\
\midrule
StereoBusters & Panel\_4LLMs & 4 & 0.65 & 0.64 \\
StereoBusters & Llama\_3.3\_70B & 1 & 0.64 & 0.64 \\
StereoBusters & Panel\_4LLMs & 3 & 0.64 & 0.63 \\
StereoBusters & Llama\_3.3\_70B & 2 & 0.63 & 0.63 \\
Festa & Gemini\_2.5\_Flash & 1 & 0.60 & 0.56 \\
Festa & Gemini\_2.5\_Flash & 5 & 0.60 & 0.55 \\
Festa & Gemini\_2.5\_Flash & 4 & 0.59 & 0.55 \\
Festa & Gemini\_2.5\_Flash & 2 & 0.59 & 0.54 \\
DIAG-Sapienza & GPT-5 & 1 & 0.58 & 0.52 \\
Festa & Gemini\_2.5\_Flash & 3 & 0.57 & 0.52 \\
StereoBusters & Gemma\_3\_12B & 0 & 0.54 & 0.53 \\
\rowcolor{gray!12}
BASELINE & GPT-5 nano & B4 & 0.53 & 0.52 \\
\rowcolor{gray!12}
BASELINE & GPT-5 nano & B3 & 0.52 & 0.50 \\
\rowcolor{gray!12}
BASELINE & Qwen-3\_14B & B2 & 0.39 & 0.39 \\
Tiz & Gemma\_3\_12B & 2 & 0.23 & 0.23 \\
Tiz & Gemma\_3\_12B & 3 & 0.23 & 0.23 \\
Tiz & Gemma\_3\_12B & 1 & 0.22 & 0.22 \\
Prisma & Claude\_3.5\_Sonnet & 1 & 0.22 & 0.29 \\
Tiz & Gemma\_3\_12B & 5 & 0.22 & 0.22 \\
Tiz & Gemma\_3\_12B & 4 & 0.21 & 0.21 \\
\rowcolor{gray!12}
BASELINE & N/A & B1 & 0.18 & 0.18 \\
Prisma & Claude\_3.5\_Sonnet & 4 & 0.18 & 0.23 \\
Prisma & Claude\_3.5\_Sonnet & 2 & 0.17 & 0.24 \\
Prisma & Claude\_3.5\_Sonnet & 3 & 0.13 & 0.18 \\
Prisma & Claude\_3.5\_Sonnet & 5 & 0.12 & 0.17 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{zero-shot} track [Subtask].}
\label{tab:results_zeroshot-task2}
\end{table}

% ---------- FEW-SHOT Task 1(tabularx) ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{3pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYYrrrr}
\toprule
\textbf{Team name}  & \textbf{Model} & \textbf{Run id} & \textbf{Shots} &
 $\bm{1/(1+NMSE)}$ $\uparrow$ & MSE $\downarrow$ & NMSE $\downarrow$ &  CCC $\uparrow$ \\
\midrule
DIAG-Sapienza  & GPT-5 & 1 & 4  & 0.68 & 0.08 & 0.46 & 0.76 \\
StereoBusters  & Gemma\_3\_27B & 2 & 5  & 0.64 & 0.10 & 0.55 & 0.65 \\
StereoBusters  & Gemma\_3\_12B & 0 & 5  & 0.63 & 0.10 & 0.59 & 0.60 \\
StereoBusters  & Gemma\_3\_12B & 1 & 5  & 0.61 & 0.11 & 0.63 & 0.61 \\
Prisma  & Claude\_3.5\_Sonnet & 6 & 35 & 0.59 & 0.12 & 0.70 & 0.63 \\
StereoBusters  & Panel\_4LLMs & 4 & 5  & 0.59 & 0.13 & 0.71 & 0.65 \\
MINDS  & Qwen2.5\_14B & 1 & 20  & 0.58 & 0.13 & 0.72 & 0.46 \\
Festa  & Gemini\_2.5\_Flash & 5 & non-fixed  & 0.57 & 0.13 & 0.75 & 0.63 \\
Festa  & Gemini\_2.5\_Flash & 1 & 12  & 0.56 & 0.14 & 0.77 & 0.62 \\
MINDS  & Qwen2.5\_14B & 2 & 20  & 0.56 & 0.14 & 0.77 & 0.42 \\
Festa  & Gemini\_2.5\_Flash & 2 & 10  & 0.56 & 0.14 & 0.78 & 0.62 \\
Festa  & Gemini\_2.5\_Flash & 4 & 14  & 0.55 & 0.14 & 0.82 & 0.60 \\
StereoBusters  & Panel\_4LLMs & 3 & 5  & 0.55 & 0.15 & 0.82 & 0.56 \\
Festa  & Gemini\_2.5\_Flash & 3 & 6 & 0.50 & 0.18 & 1.00 & 0.55 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{few-shot} track [Main Task].}
\label{tab:results_fewshot-task1}
\end{table}

\subsection{Few-shot Track}

Most of the teams participating in the zero-shot track also submitted systems to the few-shot track, except for the Tiz team, which participated exclusively in the former setting. This partial overlap allows for a direct comparison of model behavior across the two settings, highlighting how the same architectures can exhibit substantially different performance when provided with in-context examples.

According to Table~\ref{tab:results_fewshot-task1}, for the GS Detection task, the DIAG-Sapienza team again achieves the best performance with \textit{GPT-5} (i.e., run~1, four-shot), followed by the \textit{Gemma~3} family of models (12B and 27B) evaluated by StereoBusters. However, the systems of both teams do not maintain the same ranking in the subtask (see Table~\ref{tab:results_fewshot-task2}), dropping several positions, with the exception of \textit{Gemma~3} (27B). This finding suggests that, within the same model family, model size may positively influence performance across both tasks.

Notably, in contrast with the zero-shot setting, the Prisma team shows a marked improvement: its system (i.e., run~6) benefits significantly from a configuration based on selecting 5 examples per category through stratified sampling. In this case, \textit{Claude~3.5 Sonnet} appears to react positively both to the presence of in-context examples and to the inclusion of additional information derived from the annotation guidelines. This effect is even more evident in the GS Classification task (Table~\ref{tab:results_fewshot-task2}), where the same configuration outperforms all other submissions by four points in Micro F1 (0.71 vs.\ 0.67).

By contrast, \textit{Gemini~2.5 Flash} exhibits a more heterogeneous behavior: while the 14-shot configuration yields one of the worst performances in GS Detection, the presence of examples proves beneficial for GS Classification, although the improvement over the zero-shot setting remains limited (0.67 vs.\ 0.60).

A different trend is observed for the MINDS team \cite{MINDS_report}, which evaluates \textit{Qwen~2.5} (14B) exclusively in the few-shot setting for GS Detection. Two configurations (runs~1 and~2), each using 20 in-context examples, extract logits from the model and feed them into downstream statistical predictors (i.e., Linear Regression and KNN) to estimate the numerical GS value. This hybrid approach, however, remains well below the best-performing system, with a gap of ten points compared to the top model in this track (0.58 vs.\ 0.68).
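
As a rough check of how this gap arises, the ranking score can be recomputed from the NMSE column of Table~\ref{tab:results_fewshot-task1}. The symbol $S$ is introduced here only for illustration, and the inputs are the rounded values reported in the table, so the results hold only up to rounding:
\[
S = \frac{1}{1+\mathrm{NMSE}}, \qquad
S_{\mathrm{top}} = \frac{1}{1+0.46} \approx 0.68, \qquad
S_{\mathrm{MINDS}} = \frac{1}{1+0.72} \approx 0.58 .
\]
Since $S$ is a decreasing, non-linear function of NMSE, the ten-point difference in the ranking score corresponds here to a difference of $0.72 - 0.46 = 0.26$ in NMSE.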



Moreover, the trend shown here closely mirrors that of the zero-shot setting, further confirming that performance gains are driven less by model scale alone and more by how in-context examples are selected, structured, and integrated into the reasoning process.
In support of this observation, we note that (i) \textit{GPT-5} achieves the highest performance in GS Detection, yet it drops sharply in GS Classification (from first to tenth place), despite the use of a four-shot prompt; (ii) the perspectivist approach adopted by Prisma in the GS Classification subtask, combined with a careful selection of representative examples for each category, proves particularly effective; and (iii) in the main task, an open-source mid-sized model such as \textit{Gemma~3} (27B) by StereoBusters achieves performance comparable to that of \textit{GPT-5} (DIAG-Sapienza).



% ---------- Task2 FEW-SHOT ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{2pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYYrr}
\toprule
\textbf{Team name} & \textbf{Model} & \textbf{Run id} & \textbf{Shots} &
\textbf{F1 Micro} $\uparrow$ & F1 Macro $\uparrow$ \\
\midrule
Prisma & Claude\_3.5\_Sonnet & 6 & 35 & 0.71 & 0.61 \\
Festa & Gemini\_2.5\_Flash & 4 & 14 & 0.67 & 0.67 \\
StereoBusters & Gemma\_3\_27B & 2 & 5 & 0.67 & 0.66 \\
Festa & Gemini\_2.5\_Flash & 5 & non-fixed & 0.67 & 0.66 \\
StereoBusters & Panel\_4LLMs & 3 & 5 & 0.66 & 0.65 \\
StereoBusters & Panel\_4LLMs & 4 & 5 & 0.66 & 0.65 \\
Festa & Gemini\_2.5\_Flash & 1 & 12 & 0.66 & 0.65 \\
Festa & Gemini\_2.5\_Flash & 2 & 10 & 0.65 & 0.63 \\
Festa & Gemini\_2.5\_Flash & 3 & 6 & 0.64 & 0.63 \\
DIAG-Sapienza & GPT-5 & 1 & 4 & 0.61 & 0.55 \\
StereoBusters & Gemma\_3\_12B & 0 & 5 & 0.57 & 0.57 \\
StereoBusters & Gemma\_3\_12B & 1 & 5 & 0.56 & 0.56 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{few-shot} track [Subtask].}
\label{tab:results_fewshot-task2}
\end{table}





\subsection{Fine-tuning}

Although potentially interesting for comparison with the other tracks, this setting was explored only by the Tiz team, exclusively through different configurations of the same model (i.e., \textit{Gemma~3} 12B), and only for the main task. Nevertheless, the results obtained in this setup are promising (Table~\ref{tab:results_finetuning_llms}), as the best-performing system (i.e., run~2) achieves a score of 0.64, ranking behind the best zero-shot and few-shot systems by only 6 and 4 points, respectively. Moreover, it is worth emphasizing that these fine-tuning results were obtained with an open model, which nearly matches the performance of a state-of-the-art proprietary model (i.e., \textit{GPT-5}) under both of the aforementioned settings.



% ---------- FINE-TUNING (LLMs) (tabularx, no shot, no additional_data) ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{3pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYrrrr}
\toprule
\textbf{Team name}  & \textbf{Model} & \textbf{Run id} &
 $\bm{1/(1+NMSE)}$ $\uparrow$ & MSE $\downarrow$ & NMSE $\downarrow$ &  CCC $\uparrow$ \\
\midrule
Tiz  & Gemma\_3\_12B & 2 & 0.64 & 0.10 & 0.57 & 0.63 \\
Tiz  & Gemma\_3\_12B & 4 & 0.62 & 0.11 & 0.60 & 0.58 \\
Tiz  & Gemma\_3\_12B & 1 & 0.62 & 0.11 & 0.61 & 0.61 \\
Tiz  & Gemma\_3\_12B & 5 & 0.62 & 0.11 & 0.61 & 0.57 \\
Tiz  & Gemma\_3\_12B & 3 & 0.60 & 0.12 & 0.67 & 0.52 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{fine-tuning (LLMs)} track [Main Task].}
\label{tab:results_finetuning_llms}
\end{table}


\subsection{Encoder-only models}

The use of encoder-only models represents a valuable reference for analyzing discrimination-oriented tasks. Although such systems can often achieve performance comparable to or even better than LLMs, especially when fine-tuned on the target task \cite{gottfert-etal-2025-nlpeace}, this trend does not emerge in our track, as shown by the results of the Festa, DIAG-Sapienza, and VellaAsta teams reported in Tables~\ref{tab:results_encoder_only-task1} and~\ref{tab:results_encoder_only-task2}.
Overall, their results fall below our strongest baselines on the main task and consistently rank among the lowest in the GS Classification subtask.
When comparing models within the same track, the BERT-based \textit{UmBERTo} (Festa) achieves the best performance, with a clear margin over \textit{RoBERTa} (DIAG-Sapienza) and \textit{bertweet-sexism}\footnote{This model is available on the Hugging Face Hub \href{https://huggingface.co/tum-nlp/bertweet-sexism}{here}.} (VellaAsta), likely due to its Italian-specific tokenization and masking strategies. Conversely, the \textit{bertweet-sexism} model outperforms \textit{RoBERTa} in the GS Detection task, plausibly because it has been fine-tuned for sexism detection on Twitter data. However, this specialization does not appear sufficient to achieve competitive performance on the more fine-grained GS Classification task.
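
When reading Table~\ref{tab:results_encoder_only-task1}, it may help to recall that the ranking score and CCC need not order systems identically. The following is a minimal sketch of why, assuming that CCC denotes Lin's concordance correlation coefficient computed with population moments; the symbols are introduced here for illustration only. For predictions $\hat{y}$ and gold scores $y$ with means $\mu_{\hat{y}}$ and $\mu_{y}$, variances $\sigma_{\hat{y}}^{2}$ and $\sigma_{y}^{2}$, and covariance $\sigma_{\hat{y}y}$,
\[
\mathrm{CCC} = \frac{2\,\sigma_{\hat{y}y}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + (\mu_{\hat{y}} - \mu_{y})^{2}},
\qquad
\mathrm{MSE} = \sigma_{\hat{y}}^{2} + \sigma_{y}^{2} - 2\,\sigma_{\hat{y}y} + (\mu_{\hat{y}} - \mu_{y})^{2},
\]
so that
\[
\mathrm{CCC} = 1 - \frac{\mathrm{MSE}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + (\mu_{\hat{y}} - \mu_{y})^{2}} .
\]
Under this assumption, a system with a larger denominator (for instance, more dispersed predictions) can obtain a higher CCC despite a higher MSE. This would be consistent with \textit{RoBERTa} showing a CCC (0.36--0.37) at least as high as that of every \textit{UmBERTo} run (0.31--0.36) while ranking lower on the primary score.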

Overall, the consistently lower performance of encoder-only models across both tasks suggests that the limitations observed are not merely due to model capacity, but rather to architectural constraints. Unlike LLM-based systems, encoder-only models lack an explicit generative and reasoning component, which appears crucial for capturing the contextual, interpretative, and often implicit nature of gender stereotypes.
While task-specific pre-training or fine-tuning (e.g., sexism detection on social media) can provide advantages in coarse-grained detection, as observed for the \textit{bertweet-sexism} model (VellaAsta team) in the main task, such specialization does not translate into robust performance on fine-grained GS Classification. This highlights the difficulty for encoder-only architectures to model subjective and socially grounded distinctions without access to richer contextual reasoning mechanisms.

% ---------- Task 1ENCODER-ONLY MODELS ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{3pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYrrrr}
\toprule
\textbf{Team name}  & \textbf{Model} & \textbf{Run id} &
 $\bm{1/(1+NMSE)}$ $\uparrow$ & MSE $\downarrow$ & NMSE $\downarrow$ &  CCC $\uparrow$ \\
\midrule
Festa  & UmBERTo & 1  & 0.56 & 0.14 & 0.78 & 0.31 \\
Festa  & UmBERTo & 2  & 0.56 & 0.14 & 0.78 & 0.31 \\
Festa  & UmBERTo & 5  & 0.56 & 0.14 & 0.78 & 0.31 \\
Festa  & UmBERTo & 4  & 0.55 & 0.14 & 0.81 & 0.36 \\
Festa  & UmBERTo & 3  & 0.54 & 0.15 & 0.85 & 0.32 \\
VellaAsta  & tum-nlp/bertweet-sexism & 1  & 0.49 & 0.19 & 1.05 & 0.24 \\
DIAG-Sapienza  & RoBERTa & 4  & 0.47 & 0.20 & 1.13 & 0.37 \\
DIAG-Sapienza  & RoBERTa & 3  & 0.46 & 0.21 & 1.16 & 0.37 \\
DIAG-Sapienza  & RoBERTa & 1  & 0.45 & 0.21 & 1.21 & 0.36 \\
DIAG-Sapienza  & RoBERTa & 2  & 0.45 & 0.22 & 1.24 & 0.36 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{encoder-only models} track [Main Task].}
\label{tab:results_encoder_only-task1}
\end{table}

% ---------- Task2 ENCODER-ONLY MODELS ----------
\begin{table}[t]
\centering
\scriptsize
\setlength{\tabcolsep}{2pt}
\captionsetup{font=small}
\renewcommand{\arraystretch}{1.15}
\begin{adjustbox}{max width=0.80\textwidth}
\begin{tabularx}{\textwidth}{YYYrr}
\toprule
\textbf{Team name} & \textbf{Model} & \textbf{Run id} &
\textbf{F1 Micro} $\uparrow$ & F1 Macro $\uparrow$ \\
\midrule
Festa & UmBERTo & 1 & 0.52 & 0.50 \\
Festa & UmBERTo & 2 & 0.49 & 0.47 \\
Festa & UmBERTo & 3 & 0.48 & 0.46 \\
Festa & UmBERTo & 5 & 0.47 & 0.45 \\
Festa & UmBERTo & 4 & 0.45 & 0.43 \\
DIAG-Sapienza & RoBERTa & 4 & 0.37 & 0.29 \\
DIAG-Sapienza & RoBERTa & 3 & 0.36 & 0.29 \\
DIAG-Sapienza & RoBERTa & 2 & 0.35 & 0.29 \\
DIAG-Sapienza & RoBERTa & 1 & 0.35 & 0.29 \\
VellaAsta & tum-nlp/bertweet-sexism & 1 & 0.10 & 0.11 \\
\bottomrule
\end{tabularx}
\end{adjustbox}
\caption{Results for \textit{encoder-only models} track [Subtask].}
\label{tab:results_encoder_only-task2}
\end{table}






\section{Conclusions} \label{sec:conclusions}
In this paper, we presented the EVALITA shared task GSI:detect, which focuses on the automatic identification and classification of gender stereotypes in Italian. The task was structured into two parts: (i) a main task targeting the detection of GSs, and (ii) a fine-grained subtask aimed at classifying the type of GS expressed in the text.
The dataset was constructed by explicitly leveraging disagreement among human annotators, 
%with the goal of preserving 
to preserve the intrinsic subjectivity of the phenomenon rather than enforcing a single, fully convergent label. This design choice allows the benchmark to better reflect the variability of human judgments in sensitive and socially grounded tasks.\\
The results obtained by the seven participating teams on the main task, and by a subset of them on the subtask, highlight how different model architectures and configurations respond differently to a task in which subjectivity is a core component. Across zero-shot, few-shot, and encoder-only settings, we observe that performance is not solely determined by model scale, but is strongly influenced by architectural choices, prompting strategies, and the way contextual information is integrated into the reasoning process.
In particular, in such fine-grained and inherently subjective settings, the performance gap between open-source and state-of-the-art proprietary systems consistently narrows, and in several cases reverses, especially when open models are combined with diversified perspectives or carefully selected in-context examples. Conversely, encoder-only architectures struggle to achieve competitive performance, suggesting that generative and reasoning capabilities play a crucial role in modeling socially grounded and interpretative distinctions.

These findings suggest that model behavior cannot be fully characterized in terms of raw accuracy alone, but should also be analyzed in relation to how models implicitly encode and reproduce subjective judgments.
More broadly, the results highlight the importance of evaluation frameworks that explicitly account for subjectivity, disagreement, and modeling diversity when assessing AI systems on sensitive social phenomena.
For this reason, investigating the interplay between model and human subjectivity represents a promising research direction for explaining some of the observed dynamics. \\
Future work will focus on socio-profiling the models to relate their prediction patterns to specific socio-demographic groups, shedding light on the sources and structure of their biases and sensitivities.


















%% The acknowledgments section is defined using the "acknowledgments" environment
%% (and NOT an unnumbered section). This ensures the proper
%% identification of the section in the article metadata, and the
%% consistent spelling of the heading.




%\clearpage
\begin{acknowledgments}

This paper and the GSI:detect task are the result of collaboration among all authors. Gloria Comandini wrote Sections~\ref{sec:motivation} and~\ref{sec:dataset}; Manuela Speranza wrote Section~\ref{sec:task-description}; Sofia Brenna wrote Section~\ref{sec:evaluation}; Davide Testa wrote Sections~\ref{sec:participants}, \ref{sec:results}, and~\ref{sec:conclusions}.\\
This work was carried out while Davide Testa was
enrolled in the Italian National Doctorate on Artificial
Intelligence run by Sapienza University of Rome %in collaboration with
together with Fondazione Bruno Kessler (FBK).
\end{acknowledgments}

%% The declaration on generative AI comes in effect
%% in January 2025. See also
%% https://ceur-ws.org/GenAI/Policy.html
\section*{Declaration on Generative AI}

%During the preparation of this work, the authors used ChatGPT (GPT-5.2) in order to: refine the writing style of the manuscript and support the improvement of selected parts of the code developed for system evaluation, particularly during the implementation of the scoring module used to assess the participants’ systems. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.
%During the preparation of this work, the 

The authors used ChatGPT (GPT-5.2) to refine the manuscript’s writing style and to support the development of parts of the evaluation code. They reviewed and edited all outputs and take full responsibility for the content.



%% Define the bibliography file to be used
\bibliography{sample-ceur}

%% If your work has an appendix, this is the place to put it.
\appendix

\end{document}

%%
%% End of file
