%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Code listings support (via the listings package)
\usepackage{listings}
%% auto break lines
\lstset{breaklines=true}

% CUSTOM DEFINITIONS: NAME
\newcommand{\name}{\textsc{FadeIT}}

\newcommand{\vaccines}{\raisebox{-3pt}{\includegraphics[width=1.2em]{img/emojis/vaccines.png}}}

\definecolor{Gray}{gray}{0.95}
\definecolor{yellow}{rgb}{0.98, 0.91, 0.71}

\newcolumntype{a}{>{\columncolor{Gray}}r}
\newcolumntype{b}{>{\columncolor{Gray}}c}
\newcolumntype{d}{>{\columncolor{yellow}}c}

\DeclareRobustCommand{\hlgray}[1]{{\sethlcolor{Gray}\hl{#1}}}
\DeclareRobustCommand{\hlyel}[1]{{\sethlcolor{yellow}\hl{#1}}}

%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% The "title" command
\title{FadeIT at EVALITA 2026: Overview of the Fallacy Detection in Italian Social Media Texts Task}

%\tnotemark[1]
%\tnotetext[1]{You can use this document as the template for preparing your
%  publication. We recommend using the latest version of the ceurart style.}

%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[1]{Alan Ramponi}[%
orcid=0000-0002-4305-2404,
email=alramponi@fbk.eu,
]
\cormark[1]
\address[1]{Fondazione Bruno Kessler,
  Digital Humanities Unit -- Trento, Italy}

\author[1]{Sara Tonelli}[%
orcid=0000-0001-8010-6689,
email=satonelli@fbk.eu,
]

%% Footnotes
\cortext[1]{Corresponding author.}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
\name~is the first shared task on fallacy detection in social media texts in Italian, an understudied language for this task. \name~relies on \textsc{Faina}, a fallacy detection dataset that includes span-level annotations with overlaps for 20 fallacy types in social media texts about migration, climate change, and public health over a 4-year time period. The shared task is organized into two subtasks at different granularities: i) \emph{post-level fallacy detection}, which aims to predict the fallacy types expressed in each input post, and ii) \emph{span-level fallacy detection}, which aims to predict all text segments expressing any given fallacy type in each input post. Participants' systems are evaluated against two equally valid gold standards (i.e., parallel annotations in \textsc{Faina}) to account for natural disagreement, in line with recent work advocating the importance of considering human label variation in subjective tasks. 
\name~has attracted wide interest at Evalita 2026 with a total of 25 runs submitted by 7 participant teams. 
In this paper, we present the task setup, including the data used and the evaluation criteria, as well as the results obtained by all participant teams, an analysis of their approaches, and insights for future research on the topic.
\end{abstract}

%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  Natural language processing \sep
  fallacy detection \sep
  argumentation mining \sep
  human label variation
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle

\section{Introduction}

Fallacies are arguments that seem valid but are not~\citep{hamblin2022fallacies,tindale2007fallacies}; namely, statements that are logically flawed or in which evidence is replaced by emotional cues. Fallacious argumentation frequently occurs in everyday discourse, either intentionally -- to persuade the audience -- or unintentionally. With the widespread use of online platforms, fallacious social media posts have the potential to mislead a large audience, in some cases leading to the proliferation of misinformation~\citep{musi2022developing}. Therefore, recognizing fallacies is paramount not only to limit the spread of misleading content, but also to develop individuals' critical thinking skills and promote democratic debate~\citep{ecker2024misinformation}.

Motivated by the intrinsic difficulty of the fallacy detection task for automated systems, including large language models (LLMs)~\citep{alhindi-etal-2022-multitask,ramponi-etal-2025-fine}, we present \name, the first shared task on fallacy detection in Italian social media texts. \name~has been organized as part of the Evalita 2026 evaluation campaign~\citep{evalita2026overview}, and comprises two subtasks of increasing complexity: a subtask that deals with the detection of fallacies at the post level (i.e., a multi-label text classification problem) and a more challenging subtask requiring the detection of fallacies at the span level (i.e., a multi-label span classification problem). \name~relies on \textsc{Faina}~\citep{ramponi-etal-2025-fine}, the first dataset for fallacy detection in Italian embracing multiple plausible answers and natural disagreement, with annotations across an inventory of 20 fallacy types at the fine-grained level of text segments with potential overlaps. It~covers public discourse on migration, climate change, and public health issues in social media posts over a time frame of 4 years. Participants' systems are evaluated by comparing their submitted predictions with multiple gold standards (included in the \textsc{Faina} dataset as parallel annotations) in order to account for human label variation~\citep{plank-2022-problem}, i.e., the genuine disagreement that naturally occurs in subjective NLP tasks.

In the following sections, we present details on the task (Section~\ref{sec:task}), the data used for the competition (Section~\ref{sec:data}), the evaluation setup (Section~\ref{sec:eval}), the participant teams and their results (Section~\ref{sec:participants-and-results}), followed by an analysis and discussion (Section~\ref{sec:analysis-discussion}) and our conclusions (Section~\ref{sec:conclusions}).


\section{Task description} \label{sec:task}

The \name~shared task focuses on detecting fallacies expressed in Italian social media texts about migration, climate change, and public health issues at different granularities: at the post level (subtask A; Section~\ref{sec:subtask-a}) and at the span level (subtask B; Section~\ref{sec:subtask-b}). Each post can contain zero, one, or more fallacies from an inventory of 20 fallacy types, to be detected at the chosen granularity. We refer the reader to Section~\ref{sec:annotation} for the list of fallacy types and to the \textsc{Faina} dataset~\citep{ramponi-etal-2025-fine} for detailed descriptions.

\subsection{Subtask A: Post-level fallacy detection} \label{sec:subtask-a}

Given the text of a social media post, predict all the fallacy types expressed in it. This is a multi-label classification task (20 classes) and represents the easier of the two setups: there is no need to \emph{locate} each occurring fallacy type within the text, only to \emph{detect} which ones (if any) are expressed in it.

\subsection{Subtask B: Span-level fallacy detection} \label{sec:subtask-b}

Given the text of a social media post, predict all the text segments expressing fallacies and assign a fallacy type to each of them. This is a challenging, multi-label span classification task (20 classes) and represents the harder of the two setups: fallacies must be located within the text, and different fallacies may overlap with each other, partially or in full. The text of the posts provided to participants is already divided into tokens.


\section{Data} \label{sec:data}

In this section, we summarize how the \textsc{Faina} dataset used for the \name~task has been collected (Section~\ref{sec:collection}) and annotated (Section~\ref{sec:annotation}). Moreover, we provide information on data splits (Section~\ref{sec:splits}) and format (Section~\ref{sec:format}).
Further details are provided in the original paper introducing the dataset~\citep{ramponi-etal-2025-fine}.

\subsection{Data collection} \label{sec:collection}

Data collection was conducted using the Twitter APIs in February 2023.\footnote{At that time, the Twitter (now X) APIs for research purposes were still available for free.} We collected social media posts covering discourse on migration, climate change, and public health using a manually curated list of keywords. The posts in the dataset span the period from \texttt{2019-01-01} to \texttt{2022-12-31}. From this collection, we selected the posts with at least 5 tokens and the greatest potential impact on society -- i.e., those with the highest number of retweets and likes, as in~\citet{nakov2022overview}. We mitigated topic and temporal biases by keeping the top 10 posts for each month and topic combination (e.g., 10 posts about ``migration'' posted in \texttt{2021-01}). We further mitigated authors' stylistic bias by excluding multiple posts by the same user and resampling until we obtained 10 posts for each combination. As a result, the \textsc{Faina} dataset consists of 1,440 posts balanced across topics (480 per topic) and time (360 per year). 
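
\noindent As a quick check of these figures, the sampling scheme over 3 topics and the 48 months of the 4-year period gives
\[
3\ \mathrm{topics} \times 48\ \mathrm{months} \times 10\ \mathrm{posts} = 1{,}440\ \mathrm{posts},
\]
that is, $1{,}440 / 3 = 480$ posts per topic and $1{,}440 / 4 = 360$ posts per year.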

\subsection{Data annotation} \label{sec:annotation}

The collected posts underwent fine-grained annotation at the span level with overlaps. Due to the difficulty of the task, we devised an annotation protocol that consisted of five rounds of annotation and discussion between two expert annotators ($\mathcal{A}_1$ and $\mathcal{A}_2$). At each round, each annotator individually located and classified text segments expressing fallacies. Then, the annotators met to discuss the instances on which they diverged in the assigned fallacy type, span extent, or both. Instead of forcing a ``single ground truth'' in the data, the goal of the discussion phase was to minimize annotation errors (e.g., due to attention drops) whilst keeping signals of human label variation (e.g., genuine disagreement, such as multiple plausible annotations due to different interpretations of the text). 
Overall, \textsc{Faina} consists of 11,064 annotated spans (5,532$_{\pm253}$/annotator) across 58,490 tokens. 
To enable a post-level version of the task (i.e., subtask A), span-level annotations have also been transposed to the post level.\footnote{This was done by assigning to each post the set of unique fallacy span types occurring in it.}
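
\noindent In symbols (the notation is ours and introduced only for illustration), if $S_j(p)$ denotes the set of spans annotated by $\mathcal{A}_j$ in post $p$ and $l(s)$ the fallacy type of span $s$, the post-level label set is
\[
Y_j(p) = \{\, l(s) : s \in S_j(p) \,\}.
\]
For the post shown in Table~\ref{tab:data-examples}, the four spans marked by $\mathcal{A}_1$ yield the set \{\emph{Appeal to authority}, \emph{Evading the burden of proof}, \emph{Hasty generalization}, \emph{Vagueness}\}, whereas the three spans marked by $\mathcal{A}_2$ yield \{\emph{Appeal to authority}, \emph{Doubt}, \emph{Vagueness}\}.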

\paragraph{Inventory of fallacy types} The 20 fallacy categories are: \emph{Ad hominem} (\textsc{ah}), \emph{Appeal to authority} (\textsc{aa}), \emph{Appeal to emotion} (\textsc{ae}), \emph{Causal oversimplification} (\textsc{co}), \emph{Cherry picking} (\textsc{cp}), \emph{Circular reasoning} (\textsc{cr}), \emph{Doubt} (\textsc{do}), \emph{Evading the burden of proof} (\textsc{ep}), \emph{False analogy} (\textsc{fa}), \emph{False dilemma} (\textsc{fd}), \emph{Flag waving} (\textsc{fw}), \emph{Hasty generalization} (\textsc{hg}), \emph{Loaded language} (\textsc{ll}), \emph{Name calling or labeling} (\textsc{nc}), \emph{Red herring} (\textsc{rh}), \emph{Slippery slope} (\textsc{ss}), \emph{Slogan} (\textsc{sl}), \emph{Strawman} (\textsc{st}), \emph{Thought-terminating cliché} (\textsc{tc}), and \emph{Vagueness} (\textsc{va}). For fallacy definitions, inter-annotator agreement scores before and after discussions, annotation guidelines, and statistics, please refer to the paper introducing \textsc{Faina}~\citep{ramponi-etal-2025-fine}. We provide examples from \textsc{Faina} in Figure~\ref{fig:examples}.

\begin{figure}[h]
     \centering
         \includegraphics[width=0.98\linewidth]{img/examples.png}
         \caption{Examples of annotated posts from the \textsc{Faina} dataset~\citep{ramponi-etal-2025-fine}, keeping genuine disagreement between two expert annotators ($\mathcal{A}_1$, $\mathcal{A}_2$). English translations: ``\emph{American study: mutation spreads four times faster, but \vaccines~are needed}'' (\emph{left}); ``\emph{Italy welcomes them all and becomes Europe's refugee camp, goal achieved as promised!}'' (\emph{right}).}
         \label{fig:examples}
\end{figure}

\subsection{Data splits} \label{sec:splits}
For the purpose of the shared task, the dataset has been split into two official sets: one for training/development (80\%; 1,152 posts)\footnote{Participant teams have been left free to decide how to split the training/development set to tune and select their systems.} and one for testing (20\%; 288 posts). 
These have been created by paying particular attention to label, time, and topic distribution across the splits to ensure reliability in the official evaluation. The posts represented in the splits are the same for both subtask A and B.

\subsection{Data format} \label{sec:format}

\paragraph{Subtask A} For \emph{post-level fallacy detection}, the data is in a tab-separated format with a header line. Each subsequent line contains information about one post (i.e., id, date, topic, text, labels). Post-level annotations by each annotator are provided in separate columns, and multiple annotations for the same post and annotator are separated by a pipe. Specifically, each post is represented as shown in Table~\ref{tab:format} (\emph{top}).

\paragraph{Subtask B} For \emph{span-level fallacy detection}, the data format is based on the CoNLL format. Posts are separated by blank lines, and each post consists of a header with post information followed by one line per token with tab-separated fields. Token annotations follow the BIO scheme (i.e., B: begin, I: inside, O: outside), and multiple annotations for the same token and annotator are separated by a pipe. Specifically, a post in \textsc{Faina} is represented as shown in Table~\ref{tab:format} (\emph{bottom}).

\begin{table}[h]
\centering
\caption{\textsc{Faina} format for subtask A (\emph{top}) and subtask B (\emph{bottom}). Each variable is described in Section~\ref{sec:format}.}
\resizebox{1\linewidth}{!}{%
    \centering
    \begin{tabular}{l}
    \toprule
    \texttt{\$POST\_ID~~~~\$POST\_DATE~~~~\$POST\_TOPIC~~~~\$POST\_TEXT~~~~\$LABELS\_BY\_ANN\_1~~~~\$LABELS\_BY\_ANN\_2}\\
    \bottomrule \\[-5pt]
    
    \toprule
    \texttt{\# post\_id = \$POST\_ID}\\
    \texttt{\# post\_date = \$POST\_DATE}\\
    \texttt{\# post\_topic\_keywords = \$POST\_TOPIC}\\
    \texttt{\# post\_text = \$POST\_TEXT}\\
    \texttt{\$TOKEN\_1~~~~\$TOKEN\_1\_TEXT~~~~\$TOKEN\_1\_LABELS\_BY\_ANN\_1~~~~\$TOKEN\_1\_LABELS\_BY\_ANN\_2}\\
    
    \texttt{...}\\
    \texttt{\$TOKEN\_N~~~~\$TOKEN\_N\_TEXT~~~~\$TOKEN\_N\_LABELS\_BY\_ANN\_1~~~~\$TOKEN\_N\_LABELS\_BY\_ANN\_2}\\
    \bottomrule
    \end{tabular}
}%
    \label{tab:format}
\end{table}

\noindent The variables in Table~\ref{tab:format} are defined as follows:
\begin{itemize}
    \item \textbf{\texttt{\$POST\_ID}}: the identifier of the post, different from the Twitter one to preserve users' anonymity;
    \item \textbf{\texttt{\$POST\_DATE}}: the date of the post (in the \texttt{YYYY-MM} format);
    \item \textbf{\texttt{\$POST\_TOPIC}}: the topic of the post (i.e., ``migration'', ``climate change'', or ``public health'');
    \item \textbf{\texttt{\$POST\_TEXT}}: the text of the post, anonymized with placeholders;\footnote{User mentions, URLs, email addresses, and phone numbers are replaced with \texttt{[USER]}, \texttt{[URL]}, \texttt{[EMAIL]}, and \texttt{[PHONE]} placeholders, respectively.}
    \item \textbf{\texttt{\$LABELS\_BY\_ANN\_j}}: the fallacy label(s) assigned by annotator \emph{j} for the post (e.g., ``Vagueness'', ``Strawman''). In the case where multiple labels for the post are assigned by the same annotator \emph{j}, these are separated by a pipe and ordered lexicographically, e.g., ``Strawman|Vagueness''. In the case where no labels for the post are assigned by annotator \emph{j}, the label is empty;
    \item \textbf{\texttt{\$TOKEN\_i}}: the index of the token within the post (i.e., an incremental integer);
    \item \textbf{\texttt{\$TOKEN\_i\_TEXT}}: the text of the \emph{i}-th token within the post;
    \item \textbf{\texttt{\$TOKEN\_i\_LABELS\_BY\_ANN\_j}}: the fallacy label(s) assigned by annotator \emph{j} for the \emph{i}-th token within the post. Each label follows the format \texttt{\$BIO-\$LABEL}, where \texttt{\$BIO} is the BIO tag and \texttt{\$LABEL} is the fallacy label (e.g., ``B-Vagueness'', ``I-Strawman''); tokens outside any fallacy span are labeled ``O''. In the case where multiple labels for the \emph{i}-th token are assigned by the same annotator \emph{j}, these are separated by a pipe and ordered lexicographically by \texttt{\$LABEL}, e.g., ``I-Strawman|B-Vagueness''.
\end{itemize}

\noindent We left it up to participant teams to decide whether to aggregate gold annotations by different annotators (e.g., using majority voting), use only one of them, or leverage all of them for designing their systems. Nevertheless, to account for human label variation, systems are evaluated against all gold standards (Section~\ref{sec:metrics}). An example of a post following the aforementioned data formats is presented in Table~\ref{tab:data-examples}.

\begin{table}[h]
\caption{Post from Figure~\ref{fig:examples} (\emph{left}) following the data format for subtask A (\emph{top}) and B (\emph{bottom}). In both cases, the last two columns contain the annotations provided by annotators $\mathcal{A}_1$ and $\mathcal{A}_2$, which differ due to different (equally valid) interpretations.}
\resizebox{1\linewidth}{!}{%
    \centering
    \begin{tabular}{l}
    \toprule
    \texttt{658~~~~2021-06~~~~public~~~~Studio americano: la mutazione~~~~Appeal-to-authority|~~~~~~~~~~~~~~Appeal-to-authority|}\\
    \texttt{~~~~~~~~~~~~~~~~~~health~~~~si diffonde quattro volte più~~~~~Evading-the-burden-of-proof|~~~~~~Doubt|Vagueness}\\
    \texttt{~~~~~~~~~~~~~~~~~~~~~~~~~~~~velocemente, ma i \vaccines~servono~~~~~~Hasty-generalization|Vagueness}\\
    \bottomrule\\[-5pt]
    
    \toprule
    \texttt{\# post\_id = 658}\\
    \texttt{\# post\_date = 2021-06}\\
    \texttt{\# post\_topic\_keywords = public health}\\
    \texttt{\# post\_text = Studio americano: la mutazione si diffonde quattro volte più velocemente, ma i \vaccines~servono}\\
    \texttt{1~~~~~Studio~~~~~~~B-Appeal-to-authority|B-Vagueness~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~B-Appeal-to-authority|B-Vagueness}\\
    \texttt{2~~~~~americano~~~~I-Appeal-to-authority|I-Vagueness~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~I-Appeal-to-authority|I-Vagueness}\\
    \texttt{3~~~~~:~~~~~~~~~~~~I-Appeal-to-authority~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~I-Appeal-to-authority}\\
    \texttt{4~~~~~la~~~~~~~~~~~B-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{5~~~~~mutazione~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{6~~~~~si~~~~~~~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{7~~~~~diffonde~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{8~~~~~quattro~~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{9~~~~~volte~~~~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{10~~~~più~~~~~~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{11~~~~velocemente~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{12~~~~,~~~~~~~~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~O}\\
    \texttt{13~~~~ma~~~~~~~~~~~I-Evading-the-burden-of-proof~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~B-Doubt}\\
    \texttt{14~~~~i~~~~~~~~~~~~I-Evading-the-burden-of-proof|B-Hasty-generalization~~~~~~~~~~~~I-Doubt}\\
    \texttt{15~~~~\vaccines~~~~~~~~~~~I-Evading-the-burden-of-proof|I-Hasty-generalization~~~~~~~~~~~~I-Doubt}\\
    \texttt{16~~~~servono~~~~~~I-Evading-the-burden-of-proof|I-Hasty-generalization~~~~~~~~~~~~I-Doubt}\\
    
    \bottomrule
    \end{tabular}
}%
    \label{tab:data-examples}
\end{table}


\section{Evaluation} \label{sec:eval}

Each team was allowed to submit up to 3 runs (i.e., predictions on the test set) for each subtask. We here introduce the metrics used for assessing performance (Section~\ref{sec:metrics}) and our baselines (Section~\ref{sec:baselines}).

\subsection{Metrics} \label{sec:metrics}

We employ different metrics for evaluating participants' runs in subtask A and B, as detailed below.

\paragraph{Subtask A} The submitted runs are evaluated using micro- and macro-averaged precision, recall, and F$_1$ score, averaged over the two equally valid gold standard annotations of the 20\% held-out test set. Runs are then ranked by micro F$_1$ score.
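
\noindent As a sketch of how these scores combine (the notation below is ours, and we assume the standard definitions of micro and macro averaging), let $F$ be the set of 20 fallacy types and let $\mathrm{TP}_c^{(j)}$ and $\mathrm{FP}_c^{(j)}$ be the true positive and false positive counts for type $c \in F$ when a run is compared with the gold standard of annotator $\mathcal{A}_j$. Precision against $\mathcal{A}_j$ then reads
\[
P_{\mathrm{micro}}^{(j)} = \frac{\sum_{c \in F} \mathrm{TP}_c^{(j)}}{\sum_{c \in F} \bigl(\mathrm{TP}_c^{(j)} + \mathrm{FP}_c^{(j)}\bigr)},
\qquad
P_{\mathrm{macro}}^{(j)} = \frac{1}{|F|} \sum_{c \in F} \frac{\mathrm{TP}_c^{(j)}}{\mathrm{TP}_c^{(j)} + \mathrm{FP}_c^{(j)}},
\]
with recall defined analogously using false negatives in place of false positives. Each reported metric $M$ is the mean over the two gold standards:
\[
M = \frac{1}{2}\left(M^{(1)} + M^{(2)}\right).
\]
Under this reading, each metric is averaged separately, so a reported F$_1$ score need not coincide exactly with the harmonic mean of the reported precision and recall.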

\paragraph{Subtask B} We evaluate the runs using metrics designed for span-level annotations with potential overlaps, averaged over the two equally valid gold standard annotations of the 20\% held-out test set. We adopt the micro- and macro-averaged precision, recall, and F$_1$ score variants proposed by~\citet{da-san-martino-etal-2019-fine}, extended to work at the token level. Partial credit is therefore given to partial span matches, proportional to the length of the match in terms of tokens. To account for the severity of labeling errors (e.g., predicting \emph{Red herring} instead of \emph{Appeal to authority} is less problematic than predicting \emph{False dilemma}), results are also computed in a ``soft'' evaluation mode, which gives partial credit (i.e., 0.5 instead of 1.0) if the predicted label is an immediate parent of the actual label in the taxonomy of fallacy types by~\citet{ramponi-etal-2025-fine}. Runs are then ranked by micro F$_1$ score in the \emph{soft} evaluation mode.
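
\noindent For reference, the span-level scores of~\citet{da-san-martino-etal-2019-fine} can be sketched as follows. The notation is ours, and the exact token-level extension and soft weighting implemented in the official scorer may differ in detail. Let $S$ be the set of predicted spans and $T$ the set of gold spans of one annotator, where each span is a set of token positions with a label $l(\cdot)$. The credit for a predicted span $s$ and a gold span $t$, normalized by a length $h$, is
\[
C(s, t, h) = \frac{|s \cap t|}{h}\, \omega\bigl(l(s), l(t)\bigr),
\]
where $\omega = 1$ if the two labels coincide and $\omega = 0$ otherwise; in the soft mode, $\omega$ additionally takes the value $0.5$ when $l(s)$ is an immediate parent of $l(t)$ in the taxonomy. Precision and recall are then
\[
P = \frac{1}{|S|} \sum_{s \in S} \sum_{t \in T} C(s, t, |s|),
\qquad
R = \frac{1}{|T|} \sum_{s \in S} \sum_{t \in T} C(s, t, |t|).
\]
For instance, a predicted span of 4 tokens that shares 3 tokens with a gold span of 6 tokens of the same type contributes $3/4$ to the precision sum and $3/6$ to the recall sum.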

\subsection{Baselines} \label{sec:baselines}

As baseline systems, we provided two encoder-based models for each subtask. These models have been previously described in the paper introducing \textsc{Faina}~\citep{ramponi-etal-2025-fine} and are summarized below.\footnote{The code for the baselines is available in the original \textsc{Faina} repository: \url{https://github.com/dhfbk/faina}.}

\paragraph{\textsc{MVML-alb} and \textsc{MVML-umb} models} A multi-view, multi-label (\textsc{mvml}) model for post-level detection that relies on a shared encoder (either AlBERTo~\citep{polignano-etal-2019-alberto} or UmBERTo~\citep{parisi-etal-2020-umberto}, i.e., \textsc{alb} or \textsc{umb}), uses $D=|A|$ decoders (one for each annotation view in $A$, i.e., for the labels assigned by each annotator), and outputs $D$ sets of predicted labels, each containing all fallacy labels whose score exceeds a threshold $\tau$ (with $\tau=0.7$).
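
\noindent The thresholding step can be sketched as follows (our notation; we assume that the decoder for annotation view $a \in A$ produces a score $p_{a,f} \in [0,1]$ for each fallacy type $f$, a detail not spelled out here). The set of labels predicted for view $a$ is
\[
\hat{Y}_a = \{\, f : p_{a,f} > \tau \,\}, \qquad \tau = 0.7,
\]
so a post receives no label for view $a$ when none of its scores exceeds $\tau$.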

\paragraph{\textsc{MVMD-alb} and \textsc{MVMD-umb} models} A multi-view, multi-decoder (\textsc{mvmd}) model for span-level detection that relies on a shared encoder (either AlBERTo~\citep{polignano-etal-2019-alberto} or UmBERTo~\citep{parisi-etal-2020-umberto}, i.e., \textsc{alb} or \textsc{umb}), uses a separate decoder for each combination of annotation view in $A$ and fallacy type in $F$ (i.e., $D = |A \times F|$), and outputs $D$ sets of predicted labels (i.e., either `B', `I', or `O' for each fallacy label and annotation view). All decoders are given equal importance in the computation of the multi-task learning loss.
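
\noindent In formula form (a sketch; the symbols $\mathcal{L}$ and $\mathcal{L}_{a,f}$ are ours), equal weighting of the decoders corresponds to
\[
\mathcal{L} = \sum_{a \in A} \sum_{f \in F} \mathcal{L}_{a,f},
\]
possibly up to a constant normalization factor, where $\mathcal{L}_{a,f}$ is the loss of the decoder for annotation view $a$ and fallacy type $f$. With two annotation views and 20 fallacy types, this amounts to $D = 2 \times 20 = 40$ decoders.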


\section{Participants and results} \label{sec:participants-and-results}

The \name~shared task attracted a total of 25 runs by 7 participant teams. Specifically, for subtask A we received 16 runs by 6 teams,\footnote{These include a late run by the team \emph{MALTO} (run 3).} whereas for the more challenging subtask B we received 9 runs by 3 teams. Overall, \name~has been one of the shared tasks with the highest participation at Evalita 2026 and attracted the interest of teams from both academia and industry, representing institutions across five countries (i.e., India, Italy, Japan, the Netherlands, and Vietnam). An overview of participant teams' approaches is provided in Section~\ref{sec:overview-participants}, whereas results for both subtasks are presented in Section~\ref{sec:results}.

\subsection{Overview of participant teams' approaches} \label{sec:overview-participants}

A summary of the approaches adopted by each team is provided in the following, alphabetically ordered by team name. For additional details on each submitted run (e.g., model versions, hyper-parameter choices, data and prompt variations), we refer the reader to the system description paper of each team.

\paragraph{Kenji-Endo~\citep{team-kenjiendo}} The team proposed a model with a focus on efficiency to tackle multiple Evalita tasks, including \name's subtask A. The system is based on a decoder-only causal language model that was first pretrained on a mixture of Italian corpora, and then fine-tuned on \textsc{Faina} data in a discriminative setting by taking into account class imbalance in the loss function. The team submitted a single run.

\paragraph{Label~\citep{team-label}} The team participated in subtask A and experimented with a two-step prompting approach using LLMs -- namely, by first generating a text discussing the potential presence of each fallacy type and then producing a score indicating the likelihood of each of them being present, based on both the generated analysis and the original text of the post. The scores are used either to determine whether a fallacy type should be output (based on a tuned threshold value; run 3) or as features, together with topic information and statistical features about the distribution of scores across fallacy types, for training a multi-layer perceptron that models and predicts individual annotators' labels (runs 1 and 2).

\paragraph{MALTO~\citep{team-malto}} The team participated in subtask A by fine-tuning an encoder-based model pretrained on Italian data using a global multi-label classification threshold. The submitted runs reflect different hyper-parameter configurations and data compositions for fine-tuning. Specifically, in run 2, the team used a binary cross-entropy loss without class weights, whereas in run 3, they used a binary cross-entropy loss with positive class weights and paraphrase-based augmented data for fine-tuning.\footnote{At the time of writing, we do not have details about run 1; we refer the reader to MALTO's technical report.}

\paragraph{PuDy~\citep{team-pudy}} 
The team participated in subtask B. They adopted a span-based modeling approach using an encoder-based model and representing spans of up to 20 tokens as the concatenation of boundary span embeddings and a learned span length embedding. The system employs a hierarchy-based label propagation strategy, so that fallacy types that are immediate parents of the gold fallacy sub-types also receive supervision. Moreover, the surrounding context of fallacious spans is perturbed using an LLM (in a 2-shot setting) to reduce overfitting to contextual lexical cues. The three runs reflect different hyper-parameter and data configurations of the same system.

\paragraph{RBG-AI~\citep{team-rbgai}} The team participated in both subtasks with three runs each using a unified prompt-based framework. They tested instruction-tuned LLMs in few-shot settings (namely, using 3, 5, and 10 shots), selecting examples that maximize diversity in terms of fallacy types, multi-label instances, and annotation views. During prediction, outputs are constrained to coarse-grained groups of fallacy types derived from semantic closeness and corpus-level distributional statistics of \textsc{Faina}'s fallacy inventory.\footnote{At the time of writing, we do not have details on the exact approach adopted in each run; we refer the reader to RBG-AI's report.}

\paragraph{TiGRO~\citep{team-tigro}} The team submitted three runs for each of the subtasks. For subtask A, they experimented with a one-vs-rest strategy by fine-tuning 20 binary classifiers -- one per fallacy type -- using a multilingual encoder (run 1), and with a multi-task learning approach, using an encoder pretrained on Italian data and a decoder per fallacy type, either considering post-level annotations (run 2) or jointly accounting for post- and span-level labels using a total of 40 decoders (run 3). 
For subtask B, the team employed a multi-task learning approach with 20 span-level decoders and an encoder pretrained on Italian data (run 1), employed the same system used for run 3 in subtask A (run 2), and experimented with a variant of this system by substituting the backbone encoder with a multilingual one (run 3).

\paragraph{UNICA~\citep{team-unica}} The team participated in subtask A and proposed different approaches based on fine-tuning and retrieval-augmented generation (RAG). All approaches used training data that was previously augmented via LLM prompting and back-translation. In run 1, a closed-weight LLM from the OpenAI family was instructed with few-shot examples. The examples were dynamically selected based on their semantic similarity with the input text and through RAG, using a closed-weight text embedding model from the same family. In runs 2 and 3, fine-tuning of LLMs was conducted using models from the Gemma and Mixtral families, respectively.

\subsection{Results} \label{sec:results}

In this section, we provide the results on the official test set for all runs submitted by participant teams in the post-level (Section~\ref{sec:results-a}) and span-level (Section~\ref{sec:results-b}) fallacy detection setups.

\subsubsection{Subtask A: Post-level fallacy detection} \label{sec:results-a} 

The results on the test set for all the runs submitted by teams participating in subtask A are reported in Table~\ref{tab:results-a}. In terms of micro-averaged F$_1$ score, the \emph{TiGRO} team achieves the best results on the task (56.39 micro F$_1$; run 3) by modeling post- and span-level annotations jointly in a multi-task learning framework. \emph{MALTO} follows with a fine-tuned encoder-based model pretrained on Italian data (54.63 micro F$_1$; run 2; see Section~\ref{sec:overview-participants} for the configuration of each run). The other runs by \emph{TiGRO} place third (52.41 micro F$_1$; run 2) and fifth (48.65 micro F$_1$; run 1), with a multi-task learning approach considering post-level annotations and a one-vs-rest strategy, respectively. Interestingly, none of these systems relies on decoder-based LLMs; all of them use encoder-based models, confirming what has been observed in previous work about the challenges of this task for LLMs~\citep{alhindi-etal-2022-multitask,ramponi-etal-2025-fine}. The highest-ranked run among those using LLMs is run 1 by \emph{UNICA} (49.71 micro F$_1$). It places fourth by instructing a model from the OpenAI family in a few-shot manner, using examples that were dynamically selected using semantic similarity and retrieval-augmented generation methods. 

\begin{table}[h]
  \centering
  \caption{Test set results for subtask A (\emph{post-level fallacy detection}). P: Precision, R: Recall, F$_{1}$: F$_{1}$ score. All scores are reported in their micro- and macro-averaged flavors. Runs are ranked by decreasing micro-averaged F$_{1}$ score, and the best score for each metric is in bold. Baselines are highlighted in yellow. `*' indicates late submissions.}
  % \resizebox{1\linewidth}{!}{%
  \begin{tabular}{clccrracrra}
    \toprule
       &  &  &  & \multicolumn{3}{c}{\textbf{\footnotesize{micro-averaged}}} &  & \multicolumn{3}{c}{\textbf{\footnotesize{macro-averaged}}} \\
       & \textbf{Team} & \textbf{Run} & & \textbf{P} & \textbf{R} & \multicolumn{1}{r}{\textbf{F$_{1}$}} &  & \textbf{P} & \textbf{R} & \multicolumn{1}{r}{\textbf{F$_{1}$}} \\
      \midrule
       1 & TiGRO & \footnotesize{3} & & 
       53.24 & 59.99 & \textbf{56.39} & & 
       34.47 & 34.74 & 33.35 \\
       2 & MALTO & \footnotesize{2} & & 
       55.75 & 53.60 & 54.63 & & 
       41.95 & 30.95 & 32.63 \\
       3 & TiGRO & \footnotesize{2} & & 
       53.43 & 51.48 & 52.41 & & 
       35.91 & 27.21 & 27.94 \\
       4 & UNICA & \footnotesize{1} & & 
       55.22 & 45.23 & 49.71 & & 
       \textbf{47.13} & 37.28 & \textbf{36.75} \\
       5 & TiGRO & \footnotesize{1} & & 
       \textbf{62.52} & 39.85 & 48.65 & & 
       27.46 & 18.11 & 20.57 \\
       6 & UNICA & \footnotesize{3} & & 
       51.19 & 44.26 & 47.45 & & 
       35.08 & 24.95 & 26.14 \\
       7 & Kenji-Endo & \footnotesize{1} & & 
       49.38 & 45.55 & 47.37 & & 
       14.25 & 17.17 & 13.71 \\
       8 & UNICA & \footnotesize{2} & & 
       58.70 & 38.44 & 46.44 & & 
       35.68 & 19.53 & 22.76 \\
       9 & MALTO & \footnotesize{1} & & 
       46.56 & 43.65 & 45.05 & & 
       28.11 & 23.33 & 23.17 \\
       \midrule
       \rowcolor{yellow} & \emph{\textsc{MVML-alb}} &  & & 
      \emph{64.29} & \emph{34.41} & \emph{44.82} & & 
      \emph{37.80} & \emph{15.42} & \emph{19.68} \\
      \midrule
       10 & RBG-AI & \footnotesize{2} & & 
       33.09 & 57.78 & 42.07 & & 
       26.54 & 46.21 & 29.32 \\
       *  & MALTO & \footnotesize{3} & & 
       37.45 & 45.04 & 40.88 & & 
       23.53 & 30.02 & 25.64 \\
       11 & RBG-AI & \footnotesize{3} & & 
       30.65 & 57.62 & 40.00 & & 
       31.08 & 54.90 & 31.31 \\
       12 & Label & \footnotesize{3} & & 
       27.60 & \textbf{68.08} & 39.26 & & 
       22.01 & \textbf{56.97} & 29.04 \\
       13 & Label & \footnotesize{1} & & 
       52.82 & 30.94 & 38.96 & & 
       14.50 & 10.13 & 10.31 \\
       14 & RBG-AI & \footnotesize{1} & & 
       36.35 & 41.11 & 38.57 & & 
       32.77 & 29.60 & 23.42 \\
       15 & Label & \footnotesize{2} & & 
       52.76 & 30.11 & 38.32 & & 
       14.62 & 10.17 & 10.82 \\
      \midrule
      \rowcolor{yellow} & \emph{\textsc{MVML-umb}} &  & & 
      \emph{38.53} & \emph{14.28} & \emph{20.84} & & 
      \emph{15.13} & \emph{3.45} & \emph{5.10} \\
    \bottomrule
  \end{tabular}
  % }%
  \label{tab:results-a}
\end{table}

In contrast, when looking at macro-averaged scores, \emph{UNICA} ranks first (36.75 macro F$_1$; run 1). The per-fallacy scores (Figure~\ref{fig:per-class-scores-a}) show that \emph{UNICA} (run 1) obtains more balanced scores across fallacy types than other teams that ranked higher according to the micro F$_1$ metric. The high macro F$_1$ score obtained by the system could therefore be attributed to its good performance on fallacy types that are under-represented in the data (e.g., \emph{Causal oversimplification}, \emph{Slippery slope}), which fine-tuned encoder-based models typically struggle to capture due to the limited number of instances available for training. Nevertheless, the winning \emph{TiGRO} run and the second-ranked \emph{MALTO} run still take second (33.35 macro F$_1$; run 3) and third (32.63 macro F$_1$; run 2) place, respectively, in terms of macro F$_1$ score, indicating high robustness and good performance for most under-represented classes (see~Figure~\ref{fig:per-class-scores-a}). Finally, the competitive macro-averaged scores of \emph{RBG-AI} (31.31 and 29.32 macro F$_1$; run 3 and 2) and \emph{Label} (29.04 macro F$_1$; run 3), despite these runs placing eleventh, tenth, and twelfth according to micro F$_1$, respectively, are likewise attributable to good performance on minority fallacy categories and little performance degradation on majority labels such as \emph{Loaded language} and \emph{Name calling or labeling}. All runs outperform the \textsc{MVML-umb} baseline, whereas \textsc{MVML-alb} is still competitive, especially when looking at micro-averaged scores.
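
For reference, the following sketch contrasts the two averaging schemes under their standard definitions, which helps explain why the two rankings can diverge. It is not necessarily the exact formulation implemented by the official scorer (see Section~\ref{sec:metrics}), and the symbols $\mathcal{C}$, $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ are introduced here for illustration only. Let $\mathcal{C}$ denote the set of fallacy types and let $\mathrm{TP}_c$, $\mathrm{FP}_c$, and $\mathrm{FN}_c$ be the numbers of true positives, false positives, and false negatives for type $c \in \mathcal{C}$. Then
\[
  \mathrm{F}_1^{\mathrm{micro}} = \frac{2 \sum_{c \in \mathcal{C}} \mathrm{TP}_c}{\sum_{c \in \mathcal{C}} \left( 2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c \right)},
  \qquad
  \mathrm{F}_1^{\mathrm{macro}} = \frac{1}{|\mathcal{C}|} \sum_{c \in \mathcal{C}} \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}.
\]
Since the micro average pools counts over all types, frequent types dominate it, whereas the macro average weighs every type equally, so gains on rare types have a proportionally larger effect.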

Further details on per-class scores obtained by participant teams' runs can be found in Figure~\ref{fig:per-class-scores-a}.

\begin{figure}[h]
     \centering
         \includegraphics[width=1\linewidth]{img/matrix_a.pdf}
         \caption{Test set results divided by fallacy type for subtask A (\emph{post-level fallacy detection}) in terms of F$_1$ score. Participant teams (with run numbers within parentheses) are on the rows, and fallacy types are on the columns. \textsc{ah}: Ad hominem; \textsc{aa}: Appeal to authority; \textsc{ae}: Appeal to emotion; \textsc{co}: Causal oversimplification; \textsc{cp}: Cherry picking; \textsc{cr}: Circular reasoning; \textsc{do}: Doubt; \textsc{ep}: Evading the burden of proof; \textsc{fa}: False analogy; \textsc{fd}: False dilemma; \textsc{fw}: Flag waving; \textsc{hg}: Hasty generalization; \textsc{ll}: Loaded language; \textsc{nc}: Name calling or labeling; \textsc{rh}: Red herring; \textsc{ss}: Slippery slope; \textsc{sl}: Slogan; \textsc{st}: Strawman; \textsc{tc}: Thought-terminating cliché; \textsc{va}: Vagueness.}
         \label{fig:per-class-scores-a}
\end{figure}

\subsubsection{Subtask B: Span-level fallacy detection} \label{sec:results-b} 
Test set results for all runs submitted for subtask B are shown in Table~\ref{tab:results-b}. 
All teams outperform the \textsc{MVMD-umb} baseline, but only the \emph{TiGRO} team achieves higher performance than \textsc{MVMD-alb} in all runs in the \emph{strict} mode.
According to the official shared task metric (micro-averaged span-level F$_1$ score in the \emph{soft} mode), the three runs by the \emph{PuDy} team take the top three positions, followed by those by \emph{TiGRO} and those by \emph{RBG-AI}. Specifically, the best system (50.92 micro F$_1$, \emph{soft}; \emph{PuDy}, run 1) uses a span-based modeling approach and relies on the UmBERTo encoder-based model. Looking at micro-averaged scores in the \emph{strict} mode (i.e., when requiring the prediction of the exact fallacy types, without granting partial scores for non-severe errors, see Section~\ref{sec:metrics}), we observe that the best system is the multi-task model with 40 decoders and mmBERT as encoder by \emph{TiGRO} (42.13 micro F$_1$, \emph{strict}; run 3). According to this metric, this \emph{TiGRO} system outperforms the best \emph{PuDy} system (31.97 micro F$_1$, \emph{strict}; run 1) by about 10 points. When looking at macro-averaged scores, \emph{TiGRO} also achieves the best results (26.05 macro F$_1$, \emph{strict}; run 1), as also shown by individual fallacy scores in Figure~\ref{fig:per-class-scores-b}. Overall, we observe that systems by both \emph{PuDy} and \emph{TiGRO} are competitive and suit different use cases. For instance, if there are no strict requirements on the identification of exact fallacy categories, \emph{PuDy}'s system would be the one to choose; if the goal is to recognize precise fallacy types, the \emph{TiGRO} system would be preferable.

More generally, the choice of a system often depends on the fallacy types of interest. Further details on the scores for each fallacy type obtained by participant teams' runs can be found in Figure~\ref{fig:per-class-scores-b}.

\begin{table}[h]
  \centering
  \caption{Test set results for subtask B (\emph{span-level fallacy detection}). P: Precision, R: Recall, F$_{1}$: F$_{1}$ score (span-level variants). All scores are reported in both \emph{strict} and \emph{soft} evaluation modes in their micro- and macro-averaged flavors, where applicable. Runs are ranked by decreasing micro-averaged F$_{1}$ score in the \emph{soft} evaluation mode, and the best score for each metric is in bold. Baselines are highlighted in yellow.}
  \resizebox{1\linewidth}{!}{%
  \begin{tabular}{clccrracrracrra}
    \toprule
       &  &  &  & \multicolumn{7}{c}{\textbf{\textsc{strict mode}}} &  & \multicolumn{3}{c}{\textbf{\textsc{soft mode}}} \\
       \cline{5-11} \cline{13-15}
       &  &  &  & \multicolumn{3}{c}{\textbf{\footnotesize{micro-averaged}}} &  & \multicolumn{3}{c}{\textbf{\footnotesize{macro-averaged}}} &  & \multicolumn{3}{c}{\textbf{\footnotesize{micro-averaged}}} \\
       & \textbf{Team} & \textbf{Run} &  & \textbf{P} & \textbf{R} & \multicolumn{1}{r}{\textbf{F$_{1}$}} &  & \textbf{P} & \textbf{R} & \multicolumn{1}{r}{\textbf{F$_{1}$}} &  & \textbf{P} & \textbf{R} & \multicolumn{1}{r}{\textbf{F$_{1}$}} \\
      \midrule
       1 & PuDy & \footnotesize{1} &  & 
       27.68 & 37.83 & 31.97 &  & 
       26.24 & 22.39 & 21.11 &  & 
       43.76 & 60.88 & \textbf{50.92} \\
       2 & PuDy & \footnotesize{2} &  & 
       25.96 & \textbf{40.80} & 31.73 &  & 
       19.24 & 23.73 & 18.95 &  & 
       40.30 & \textbf{64.79} & 49.69 \\
       3 & PuDy & \footnotesize{3} &  & 
       29.90 & 32.96 & 31.36 &  & 
       20.99 & 18.83 & 18.20 &  & 
       45.75 & 52.36 & 48.83 \\
       4 & TiGRO & \footnotesize{3} &  & 
       \textbf{47.82} & 37.67 & \textbf{42.13} &  & 
       \textbf{35.69} & 23.23 & 25.79 &  & 
       \textbf{51.01} & 40.25 & 44.98 \\
       5 & TiGRO & \footnotesize{1} &  & 
       38.33 & 40.35 & 39.30 &  & 
       31.52 & \textbf{25.30} & \textbf{26.05} &  & 
       42.50 & 45.20 & 43.80 \\
       6 & TiGRO & \footnotesize{2} &  & 
       38.23 & 40.05 & 39.11 &  & 
       29.85 & 24.16 & 24.76 &  & 
       42.68 & 44.87 & 43.74 \\
       \midrule
       \rowcolor{yellow} & \emph{\textsc{MVMD-alb}} &  &  & 
      \emph{48.83} & \emph{26.87} & \emph{34.66} &  & 
      \emph{36.13} & \emph{16.42} & \emph{20.87} &  & 
      \emph{52.98} & \emph{29.48} & \emph{37.89} \\
       \midrule
       7 & RBG-AI & \footnotesize{3} &  & 
       19.47 & 25.25 & 21.99 &  & 
       17.68 & 16.43 & 15.18 &  & 
       24.18 & 32.44 & 27.71 \\
       8 & RBG-AI & \footnotesize{2} &  & 
       19.21 & 18.24 & 18.71 &  & 
       17.52 & 12.98 & 12.66 &  & 
       24.67 & 23.96 & 24.31 \\
       9 & RBG-AI & \footnotesize{1} &  & 
       17.13 & 11.01 & 13.41 &  & 
       16.66 & 7.67 & 9.55 &  & 
       20.70 & 13.27 & 16.17 \\
      \midrule
      \rowcolor{yellow} & \emph{\textsc{MVMD-umb}} &  &  & 
      \emph{60.94} & \emph{3.05} & \emph{5.80} &  & 
      \emph{10.51} & \emph{3.21} & \emph{3.80} &  & 
      \emph{65.97} & \emph{3.28} & \emph{6.25} \\
    \bottomrule
  \end{tabular}
  }%
  \label{tab:results-b}
\end{table}

\begin{figure}[h!]
     \centering
         \includegraphics[width=1\linewidth]{img/matrix_b.pdf}
         \caption{Test set results divided by fallacy type for subtask B (\emph{span-level fallacy detection}) in terms of F$_1$ score (span-level variant, \emph{strict} evaluation mode). Participant teams (with run numbers within parentheses) are on the rows, and fallacy types are on the columns. \textsc{ah}: Ad hominem; \textsc{aa}: Appeal to authority; \textsc{ae}: Appeal to emotion; \textsc{co}: Causal oversimplification; \textsc{cp}: Cherry picking; \textsc{cr}: Circular reasoning; \textsc{do}: Doubt; \textsc{ep}: Evading the burden of proof; \textsc{fa}: False analogy; \textsc{fd}: False dilemma; \textsc{fw}: Flag waving; \textsc{hg}: Hasty generalization; \textsc{ll}: Loaded language; \textsc{nc}: Name calling or labeling; \textsc{rh}: Red herring; \textsc{ss}: Slippery slope; \textsc{sl}: Slogan; \textsc{st}: Strawman; \textsc{tc}: Thought-terminating cliché; \textsc{va}: Vagueness.}
         \label{fig:per-class-scores-b}
\end{figure}


\section{Analysis and discussion} \label{sec:analysis-discussion}

\paragraph{Models} All participant teams used transformer-based language models as part of their systems. Among encoder-based models, \emph{MALTO} used AlBERTo~\citep{polignano-etal-2019-alberto}, \emph{PuDy} employed UmBERTo~\citep{parisi-etal-2020-umberto}, and \emph{TiGRO} used both AlBERTo and mmBERT~\citep{mmbert-marone-etal-2025} in their runs. As regards decoder-based models, open-weight LLMs such as Gemma 3 12B~\citep{gemma-gemmateam-etal-2025} and Llama 3.1 8B~\citep{llama-grattafiori-etal-2024} have been used by the \emph{Label} and \emph{RBG-AI} teams, respectively, as well as Mixtral 8x7B~\citep{jiang-etal-2024-mixtral} and Gemma 3 12B~\citep{gemma-gemmateam-etal-2025} by \emph{UNICA}. As for closed-weight models, \emph{UNICA} used GPT-5, GPT-5.1, and text-embedding-3-small, \emph{PuDy} employed Gemini~\citep{gemini-geminiteam-etal-2025} for the contextual enrichment phase of their systems, and \emph{MALTO} used ChatGPT for paraphrase-based data augmentation. \emph{TiGRO} is the only team that used multi-task learning in their runs, leveraging the MaChAmp toolkit~\citep{van-der-goot-etal-2021-massive} and showing improvements in performance over a single-task setup. Finally, the \emph{Kenji-Endo} team trained a causal language model based on the Qwen3 architecture~\citep{qwen3-yang-etal-2025}. Given the different strengths of encoder- and decoder-based models in the task (Section~\ref{sec:results}), studying the interplay among them is a valuable direction for future work.

\paragraph{Human label variation and extra-linguistic information} Although we provide the training/development set with parallel annotators' labels as well as topic and time period metadata for each post, this information has not been extensively leveraged by participant teams. The only exception is the \emph{Label} team, which explicitly used the genuine disagreement in the \textsc{Faina} data for training their classifier and predicting annotator-specific fallacy labels. They also experimented with topic information, demonstrating that it is a useful feature for fallacy classification. We expect that future work will explore more approaches in this direction, both embracing human label variation~\citep{plank-2022-problem} and addressing out-of-distribution generalization~\citep{ramponi-plank-2020-neural} by leveraging topic and time period information in \textsc{Faina} data.

\paragraph{Data augmentation} Three teams used data augmentation strategies: \emph{MALTO}, \emph{PuDy}, and \emph{UNICA}. Specifically, \emph{MALTO} (run 3) employed ChatGPT to generate paraphrases of selected training data instances (i.e., those with fallacy labels appearing $<100$ times in the training set annotations by $\mathcal{A}_1$) while preserving the same post-level labels. Along with the original data instances, they then used the augmented posts to fine-tune their model. \emph{PuDy} used Gemini for perturbing the context of the span (i.e., the text before and after it) to be classified. While \emph{PuDy}'s approach seems promising for our task, \emph{MALTO}'s led to performance degradation. As noted by \emph{MALTO}, their approach is likely to introduce label noise, affecting learning. To increase the chance that the labels associated with generated paraphrases are still applicable, future work could use a classifier trained on gold data for further validation, as previously done for other tasks such as hate speech detection~\citep{casula-etal-2024-delving,wullach-etal-2021-fight-fire}. Finally, all runs by \emph{UNICA} used augmented training data obtained by instructing GPT-5.1 to generate additional examples for minority fallacy types (i.e., 50 posts for each fallacy type appearing $<3\%$ of the time in the original training set). Moreover, they further augmented the training data by back-translating -- with Spanish and French as pivot languages -- a random subset of it using OPUS-MT translation models~\citep{tiedemann-thottingal-2020-opus}, discarding augmented instances that exhibited a cosine similarity $<75\%$ with their original counterparts.
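
The similarity-based filter on back-translated instances can be sketched as follows. The notation is hypothetical and introduced here only for illustration: $x$ is an original post, $\tilde{x}$ its back-translated counterpart, and $e(\cdot)$ a text embedding function whose specifics are not detailed here. Under these assumptions, an augmented instance is retained only if
\[
  \cos\!\left( e(x), e(\tilde{x}) \right) = \frac{e(x)^{\top} e(\tilde{x})}{\lVert e(x) \rVert \, \lVert e(\tilde{x}) \rVert} \geq 0.75 ,
\]
and it is discarded otherwise. Such a threshold trades off the diversity of the augmented data against the risk that the original labels no longer apply.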

Overall, data augmentation leads to mixed results and its efficacy depends on the specifics of each approach. Besides augmentation, future work could consider enriching existing instances with additional layers of information to be leveraged, such as check-worthiness~\citep{daffara-etal-2025-worthit} and argumentation schemes~\citep{goffredo-etal-2023-argument}.


\section{Conclusions} \label{sec:conclusions}

This paper provided an overview of \name, the first shared task on fallacy detection in Italian social media posts, organized as part of EVALITA 2026. \name~attracted notable interest from the research community, registering a total of 25 runs submitted by seven participant teams from institutions across five countries. The results of the shared task and our analysis suggest that there is still ample room for improvement in fallacy detection performance, especially for the more challenging yet analytically useful span-level setup. We hope that our shared task, the dataset, and the evaluation protocol will foster further research in fallacy detection with human label variation.

%%
%% The acknowledgments section is defined using the "acknowledgments" environment
%% (and NOT an unnumbered section). This ensures the proper
%% identification of the section in the article metadata, and the
%% consistent spelling of the heading.
\begin{acknowledgments}
  This work has been funded by the European Union's Horizon Europe research and innovation programmes under grant agreement No.~101070190 (AI4Trust) and under the Marie Skłodowska-Curie grant agreement No.~101073351 (HYBRIDS).
\end{acknowledgments}

%% The declaration on generative AI comes in effect
%% in January 2025. See also
%% https://ceur-ws.org/GenAI/Policy.html
\section*{Declaration on Generative AI}
  %{\em Either:}\newline
  The author(s) have not employed any Generative AI tools.
  %\newline
  
  %\noindent{\em Or (by using the activity taxonomy in ceur-ws.org/genai-tax.html):\newline}
  %During the preparation of this work, the author(s) used X-GPT-4 and Gramby in order to: Grammar and spelling check. Further, the author(s) used X-AI-IMG for figures 3 and 4 in order to: Generate images. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content. 

%%
%% Define the bibliography file to be used
\bibliography{sample-ceur}

%%
%% If your work has an appendix, this is the place to put it.
% \appendix

% \section{Online Resources}

\end{document}

%%
%% End of file
