%% The first command in your LaTeX source must be the \documentclass command.
%%
%% Options:
%% twocolumn : Two column layout. Do not use twocolumn for papers submitted to CEUR-WS!
%% hf: enable header and footer.
\documentclass[
% twocolumn,
% hf,
]{ceurart}

%%
%% One can fix some overfulls
\sloppy

%%
%% Listings support for code snippets
\usepackage{listings}
\usepackage{array}
% \usepackage{tabularx}

% CUSTOM PACKAGES
\usepackage{bbm}

%% auto break lines
\lstset{breaklines=true}

%%
%% end of the preamble, start of the body of the document source.
\begin{document}

%%
%% Rights management information.
%% CC-BY is default license.
\copyrightyear{2026}
\copyrightclause{Copyright for this paper by its authors.
  Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).}

%%
%% This command is for the conference information
\conference{EVALITA 2026: 9th Evaluation Campaign of Natural Language
Processing and Speech Tools for Italian, Feb 26 – 27, Bari, IT}

%%
%% Eulogy-related command
\newcommand{\blfootnote}[1]{%
  \begingroup
  \renewcommand{\thefootnote}{}%
  \footnote{#1}%
  \addtocounter{footnote}{-1}%
  \endgroup
}

%%
%% The "title" command
\title{PFB at EVALITA 2026: Overview of the Prometeia Financial Benchmark} 

% \tnotemark[1]
% \tnotetext[1]{You can use this document as the template for preparing your
%  publication. We recommend using the latest version of the ceurart style.}

%%
%% The "author" command and its associated commands are used to define
%% the authors and their affiliations.
\author[2]{Alessandro P. Bardelli}[%
orcid=0009-0009-7670-5668,
email=alessandro.bardelli@prometeia.com,
% url=https://yamadharma.github.io/,
]
%\cormark[1]
\fnmark[1]

\author[3]{Tolga Çekiç}[%
orcid=0009-0008-2061-5893,
email=tolga.cekic@prometeia.com,
% url=https://kmitd.github.io/ilaria/,
]
\fnmark[1]

\author[3]{Irem Demirtaş}[%
orcid=0009-0007-5586-9773,
email=irem.demirtas@prometeia.com,
% url=http://conceptbase.sourceforge.net/mjf/,
]
\fnmark[1]

\author[2]{Michele Filannino}[%
orcid=0000-0001-8208-2238,
email=michele.filannino@prometeia.com,
% url=https://yamadharma.github.io/,
]
%\cormark[1]
\fnmark[1]

\author[1]{Simona Scala}[%
orcid=0009-0003-5324-5466,
email=simona.scala@prometeia.com,
% url=https://yamadharma.github.io/,
]
\cormark[1]
\fnmark[1]

\author[4]{Andrea Galassi}[%
orcid=0000-0001-9711-7042,
email=a.galassi@unibo.it,
% url=https://yamadharma.github.io/,
]
%\cormark[1]
\fnmark[1]

\author[4]{Gianmarco Pappacoda}[%
orcid=0009-0001-6609-4156,
email=gianmarco.pappacoda@unibo.it,
% url=https://yamadharma.github.io/,
]
%\cormark[1]
\fnmark[1]

\author[4]{Paolo Torroni}[%
orcid=0000-0002-9253-8638,
email=p.torroni@unibo.it,
% url=https://yamadharma.github.io/,
]
%\cormark[1]
\fnmark[1]

\address[1]{Prometeia, Piazza Trento e Trieste 3, 40137, Bologna, Italy}
\address[2]{Prometeia, Via Brera, 18, 20121, Milan, Italy}
\address[3]{Prometeia, River Plaza, Kat 19 Büyükdere Caddesi Bahar Sokak No. 13, 34394, Levent, Istanbul, Turkey}
\address[4]{Università di Bologna, Dipartimento Informatica - Scienza e Ingegneria, Viale del Risorgimento 2, 40136, Bologna, Italy}

%% Footnotes
\cortext[1]{Corresponding author.}
\fntext[1]{These authors contributed equally.}

%%
%% The abstract is a short summary of the work to be presented in the
%% article.
\begin{abstract}
  The Prometeia Financial Benchmark (PFB) is the EVALITA 2026 shared task on finance questions in three languages (Italian, English, and Turkish) and at three difficulty levels (easy, medium, and hard). The challenge is organized in two subtasks, one on Italian data and one on all three languages. We received two submissions for each subtask. Our main takeaways are that no significant performance differences stand out across languages and difficulty levels, and that PFB appears to be a challenging benchmark for models with 3B parameters or fewer, whereas systems built on 20B models already reach an overall accuracy close to 90\%.
\end{abstract}

%%
%% Keywords. The author(s) should pick words that accurately describe
%% the work being presented. Separate the keywords with commas.
\begin{keywords}
  Large Language Models \sep
  Finance NLP \sep
  Multiple-choice QA \sep
  Multilingual benchmark \sep
  Prometeia Financial Benchmark 
\end{keywords}

%%
%% This command processes the author and affiliation and title
%% information and builds the first part of the formatted document.

\maketitle
\blfootnote{Dedicated to the memory of Annalina Caputo.}

\section{Introduction}
\label{sec:introduction}

Language Models (LMs) are increasingly adopted as general-purpose components for knowledge-intensive workflows, and finance is a natural target for their use \cite{dong2025large,khan2025bridging}. Finance-related language and reasoning, however, impose stricter requirements than many general-purpose benchmarks capture: terminology is specialized, concepts are tightly interdependent, and answers that appear plausible at the surface level may still be incorrect. Recent evidence highlights reliability limitations in finance-specific settings, including hallucination-related issues \cite{kang2023deficiency}, motivating domain-targeted evaluation protocols.

This paper presents the \textit{Prometeia Financial Benchmark} (PFB) shared task, organized as a track within EVALITA 2026~\cite{evalita2026overview}, to benchmark LMs on finance-domain multiple-choice question answering (MCQA). The task is based on a curated dataset of 1{,}500 questions released in aligned English, Italian, and Turkish versions, and is structured into an Italian-only and a multilingual subtrack. We describe the task definition, dataset construction, and evaluation protocol, and provide an overview of participant submissions and results.

%The remainder of the paper motivates the task (Section~\ref{sec:motivation}), defines the task and evaluation setup (Section~\ref{sec:task}), discusses results (Section~\ref{sec:results}), and concludes with directions for future work (Section~\ref{sec:conclusions}).

\section{Motivation}
\label{sec:motivation}

Finance is a high-stakes domain in which language technology is expected to support decision-making, compliance-driven workflows, and knowledge access over complex documentation. Although modern LMs perform well on broad benchmarks, their fluency can mask brittle understanding and overconfident errors \cite{guo2023chatgpt}. In finance, such failures are especially problematic: incorrect statements may propagate into reports, internal procedures, or user-facing advisory contexts, increasing operational and financial risk \cite{kang2023deficiency,winder2025biased}. A dedicated benchmark is therefore needed to quantify reliability on domain content rather than generic linguistic ability.

Many finance tasks require distinguishing between closely related concepts, interpreting definitions precisely, and selecting the only option that is fully consistent with the question context. MCQA provides a controlled evaluation setting: the decision space is explicit, distractors can be designed to be plausible within the domain, and scoring is unambiguous. This helps separate genuine domain comprehension from persuasive surface generation.

Multilinguality is an additional practical driver. Financial institutions operate across markets, and key materials are often consumed in local languages, not only in English. Yet most available benchmarks remain predominantly English-centric, making it difficult to disentangle domain effects from language effects. Multilingual, aligned resources enable controlled cross-lingual comparisons and support analyses of robustness, transfer, and language-specific terminology handling \cite{jorgensen2023multifin,peng2025multifinben,xue2024famma,zhang2024dolares}.

Finally, the shared-task format provides a transparent and reproducible comparison framework. It complements prior work on financial LLM evaluation and benchmarking \cite{xie2024finben,matlin2025financial,tatarinov2025language} by fixing data, protocol, and metrics, while encouraging diverse modelling approaches and facilitating systematic error analysis.

\section{Definition of the task}
\label{sec:task}

The shared task assesses a model’s ability to \emph{understand and reason over finance-domain content} in a controlled MCQA setting. Given a question and five candidate answers, systems must return the option that best answers the question. The task is open with respect to modelling choices: participants were free to submit systems based on different paradigms (e.g., encoder-only or decoder-only LMs, instruction-tuned LLMs, prompted systems) and to rely on either open or proprietary models, including both ``small'' and ``large'' language models.

For each test instance, participants submit a single answer choice in \{A,B,C,D,E\}. Systems may optionally provide short textual justifications. These explanations are not used for ranking, but they may support qualitative analyses (e.g., recurring failure modes, reasoning patterns, and domain-specific errors).

The shared task is organized into two subtracks. The first is an \textbf{Italian-only} track, which constitutes the primary focus within EVALITA. The second is a \textbf{multilingual} track covering Italian, English, and Turkish, designed to enable controlled comparisons across aligned instances. Participants may submit to one or both subtracks. %To contextualize results, we also report reference configurations, including standard zero-shot and few-shot prompting baselines; where appropriate, these span both large and small language models. The dataset and scoring protocol are described in Sections~\ref{sec:dataset} and~\ref{sec:metrics}, respectively.

For this task, participating teams were required to submit their primary run along with information about the model architecture, inference strategy, and related implementation specifics. Additionally, each team was allowed to submit up to four secondary runs.

\subsection{Dataset}
\label{sec:dataset}

%
\begin{table}[h]
\caption{Core fields in the PFB dataset release}
\label{tab:dataset_fields}
\centering
\setlength{\tabcolsep}{3pt}
\renewcommand{\arraystretch}{1.15}
\begin{tabular}{ll}
\toprule
\textbf{Field} & \textbf{Description} \\
\midrule
\texttt{custom\_id} & Unique identifier, shared across languages for alignment. \\
%\texttt{category} & Source-family label for provenance analyses. \\
\texttt{question} & Question stem. \\
\texttt{choiceA--choiceE} & Five candidate answers. \\
\texttt{correct\_answer} & Gold option in \{A,B,C,D,E\}. \\
\texttt{difficulty\_level} & Categorical difficulty in \{easy, medium, hard\}. \\
\bottomrule
\end{tabular}
\end{table}
%

The task is based on the \textit{Prometeia Financial Benchmark} (PFB) \cite{prometeia_pfb}, a curated collection of 1{,}500 finance-domain items released in three aligned languages (English, Italian, and Turkish). Questions were derived from heterogeneous finance-related sources (e.g., reports, papers, and regulatory texts) and curated with expert review to ensure domain relevance, internal consistency, and clarity.

Each instance is identified by a language-invariant \texttt{custom\_id}, enabling one-to-one alignment across the three languages. Provenance is captured by a \texttt{category} label indicating the source family. Items follow a standard MCQA schema with the stem (\texttt{question}), answer options (\texttt{choiceA}--\texttt{choiceE}), and the gold label \texttt{correct\_answer} in \{A,B,C,D,E\}. In addition, the dataset provides a categorical indicator, \texttt{difficulty\_level} (easy, medium, hard).
%ordinal difficulty indicator, \texttt{difficulty\_domain\_relevance} (1--5), which supports stratified reporting and fine-grained error analysis. In our analyses, we group them into three categories of difficulty: easy (1-3), medium (4), and hard (5).

\begin{table}[h]
\caption{Distribution of correct choices across labels in PFB dataset}
\label{tab:correct_choices}
\begin{tabular}{lccccc}
\toprule
\textbf{Label} & \textbf{A} & \textbf{B} & \textbf{C} & \textbf{D} & \textbf{E} \\
\midrule
\textbf{Number} & 402 & 438 & 424 & 237 & 0\\

\textbf{Percentage} & 26.8 & 29.1 & 28.3 & 15.8 & 0\\
\bottomrule
\end{tabular}
\end{table}

\begin{table}[h]
\caption{Distribution of difficulty in PFB dataset}
\label{tab:difficulty}
\begin{tabular}{lcccc}
\toprule
\textbf{Difficulty} & \textbf{Easy} & \textbf{Medium} & \textbf{Hard} \\
\midrule
\textbf{Number} & 248 & 1155 & 98 \\

\textbf{Percentage} & 16.6 & 76.9 & 6.5 \\
\bottomrule
\end{tabular}
\end{table}

Dataset construction followed a two-stage process. First, a large pool of candidate questions was produced from the selected sources: financial texts (financial books and regulatory texts), academic papers, and financial reports (Table~\ref{tab:categories}). The pool was then iteratively filtered and refined through expert review. Quality control relied on structured assessment signals recorded at the item level (\textit{soundness}, \textit{ambiguity}, \textit{factuality}, and \textit{relevance}), complemented by free-text notes. These signals guided revisions of stems and distractors, removal of problematic items, and documentation of corner cases.

The dataset was originally constructed with four answer options (A--D), exactly one of which is correct. Later, a fifth option (E), labeled ``None of the above'' (with its counterparts in the Italian and Turkish datasets), was appended to all questions. This option is intentionally incorrect in every case and serves as a controlled distractor. Its inclusion makes it possible to evaluate the models' susceptibility to uncertainty and their tendency to select generic options.

\begin{table}[h]
\caption{Distribution of question categories in PFB dataset}
\label{tab:categories}
\begin{tabular}{lcccc}
\toprule
\textbf{Category} & \textbf{Financial Texts} & \textbf{Financial Papers} & \textbf{Financial Reports} \\
\midrule
\textbf{Number} & 327 & 339 & 335 \\

\textbf{Percentage} & 32.6 & 33.9 & 33.5 \\
\bottomrule
\end{tabular}
\end{table}

The benchmark is distributed in three languages with instance-level alignment. The Italian and Turkish versions were generated via automatic translation (DeepL Enterprise) and subsequently manually post-edited to correct domain terminology, preserve the semantics of both stems and answer options, and avoid language-specific artifacts that could inadvertently cue the correct answer. For the shared-task workflow, PFB is released in two splits: a public development split of 500 questions, published as examples, and a held-out test split of 1{,}001 questions used for the final evaluation. The splits are identical across the three languages and preserve the \texttt{custom\_id} alignment.


Table~\ref{tab:correct_choices} shows the distribution of the correct choices, Table~\ref{tab:difficulty} the distribution of difficulty levels, and Table~\ref{tab:categories} the distribution of question categories.





\subsection{Evaluation Measures}
\label{sec:metrics}


Systems are evaluated using \emph{accuracy}, i.e., the fraction of instances for which the selected option matches the gold label. Given a test set $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{N}$, where $y_i\in \{A,B,C,D,E\}$ is the correct option and $\hat{y}_i$ is the system prediction, the primary score is defined as follows, with $\mathbb{I}[\cdot]$ denoting the indicator function:
\begin{equation}
\mathrm{Acc}(\mathcal{D}) = \frac{1}{N}\sum_{i=1}^{N}\mathbb{I}\left[\hat{y}_i = y_i\right].
\end{equation}
In addition to overall accuracy on the full test set, we also report accuracy on predefined \emph{difficulty-based subsets} derived from the item difficulty labels (e.g., easy vs.\ hard), to support diagnostic comparisons across systems.
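
As a minimal sketch of the scoring described above (not the official scorer), the following Python snippet computes the overall accuracy and the accuracy on each difficulty subset. The record layout, with the hypothetical keys \texttt{gold}, \texttt{pred}, and \texttt{difficulty}, and the toy data are assumptions made for illustration.

\begin{lstlisting}[language=Python]
from collections import defaultdict


def accuracy(records):
    """Fraction of records whose prediction equals the gold label."""
    if not records:
        raise ValueError("empty evaluation set")
    return sum(r["pred"] == r["gold"] for r in records) / len(records)


def accuracy_by_difficulty(records):
    """Accuracy computed separately on each difficulty subset."""
    groups = defaultdict(list)
    for r in records:
        groups[r["difficulty"]].append(r)
    return {level: accuracy(items) for level, items in groups.items()}


# Toy data, invented for illustration only (hypothetical keys).
toy = [
    {"gold": "A", "pred": "A", "difficulty": "easy"},
    {"gold": "B", "pred": "C", "difficulty": "easy"},
    {"gold": "C", "pred": "C", "difficulty": "hard"},
    {"gold": "D", "pred": "D", "difficulty": "medium"},
]
overall = accuracy(toy)
per_level = accuracy_by_difficulty(toy)
\end{lstlisting}

Grouping by difficulty before averaging keeps each subset score independent of the subset sizes, which matters here because the difficulty levels are strongly imbalanced (Table~\ref{tab:difficulty}).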


\section{Results}
\label{sec:results}


% 1| G: Da portare sopra i participants maybe?
\subsection{Baselines}
As baselines, we use a similarity-based approach built on an encoder model, together with two small LLMs:
\begin{itemize}
    \item \textbf{Similarity} (0.5B): we select the answer as the one most similar to the question. The similarity score is computed using a Sentence-BERT model~\cite{reimers-2019-sentence-bert}: \textit{distiluse-base-multilingual-cased-v1}.\footnote{https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1}% , a distilled version of Universal Sentence Encoder~\cite{yang2019multilingualuniversalsentenceencoder}. We select the correct answer based on the strongest similarity with the question as computed by the model.
    \item \textbf{Qwen} (1.7B)~\cite{yang2025qwen3technicalreport}: we use \textit{Qwen3-1.7B}\footnote{https://huggingface.co/Qwen/Qwen3-1.7B} (instruction tuned).
    \item \textbf{Llama} (3B)~\cite{grattafiori2024llama3herdmodels}: we use \textit{Llama-3.2-3B-Instruct}.\footnote{https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct}
\end{itemize}
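
The selection rule of the Similarity baseline can be sketched as follows, assuming cosine similarity between sentence embeddings (the choice of cosine is an assumption made for illustration). Let $f(\cdot)$ denote the encoder, $q$ the question, and $c_k$ the text of option $k$:
\begin{equation}
\hat{y} = \arg\max_{k \in \{A,B,C,D,E\}} \frac{f(q)^{\top} f(c_k)}{\|f(q)\|\,\|f(c_k)\|}.
\end{equation}
This rule requires no training and no prompt, since it only compares each option with the question in the embedding space.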

For the LLMs, our prompt contains the question, the list of possible answers, and their labels in the following format:
% , and the string ``Answer: '' : ``Question: [question]{\textbackslash}n Choices:{\textbackslash}n letter: [choice]{\textbackslash}n {\textbackslash}n Answer: ''. We calculate the logarithmic probabilities of each option (A, B, C, D, E) using $\text{temperature}=0.5$ and take the highest as selected option.

\begin{quote}
    Question: \texttt{[question]} \\
    A: \texttt{[choiceA]}\\
    B: \texttt{[choiceB]}\\
    C: \texttt{[choiceC]}\\
    D: \texttt{[choiceD]}\\
    E: \texttt{[choiceE]}\\
    Answer: 
\end{quote}
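
A minimal Python sketch of how a prompt in this format could be assembled from a dataset item is given below. The field names follow Table~\ref{tab:dataset_fields}; the function name \texttt{build\_prompt} and the toy item are hypothetical, and the exact whitespace is an assumption.

\begin{lstlisting}[language=Python]
LABELS = ["A", "B", "C", "D", "E"]


def build_prompt(item):
    """Render one item as 'Question / A..E / Answer:' text."""
    lines = ["Question: " + item["question"]]
    for label in LABELS:
        # Dataset fields are named choiceA ... choiceE.
        lines.append(label + ": " + item["choice" + label])
    lines.append("Answer: ")
    return "\n".join(lines)


# Toy item, invented for illustration only.
toy_item = {
    "question": "What does GDP stand for?",
    "choiceA": "Gross Domestic Product",
    "choiceB": "General Debt Position",
    "choiceC": "Global Deposit Pool",
    "choiceD": "Gross Dividend Payout",
    "choiceE": "None of the above",
}
prompt = build_prompt(toy_item)
\end{lstlisting}

Ending the prompt with the string ``Answer: '' leaves the option label as the natural next token for the model to produce.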


All the baselines have been pre-trained, to some extent, on data in the three languages considered in this task: Italian, English, and Turkish.



\subsection{Systems Overview}
We received two submissions. The \textbf{UNITOR} system by Borazio et al.~\cite{pfb_UNITOR} proposes an agentic architecture that requires no fine-tuning or domain adaptation and follows a three-stage orchestration pipeline. The first stage is semantic routing, which selects one of three reasoning strategies: quantitative analysis, Boolean fact-checking, and knowledge-based research reasoning. In the second stage, a specialized reasoning phase instantiates multiple inference threads that explore alternative reasoning paths in parallel. Finally, an aggregation stage uses majority voting to select the most stable answer under sampling, following \cite{tian2023finetuning}. An iterative refinement step is triggered if the consensus is weak: the reasoning modules are re-invoked with a constrained version of the original prompt, in which low-support options are explicitly masked. The authors evaluate this architecture on three open-weight models: LLaMA 3.1 (8B) as a lightweight baseline, GPT-OSS-20B (21B) as the primary reference model, and DeepSeek v3.1 (671B) as a large upper bound. The top performers on the validation set, which were used for the official submission, are DeepSeek v3.1 for English and GPT-OSS-20B for Italian and Turkish, achieving about 0.88 average accuracy across all languages.
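
As an illustration of the aggregation stage (a sketch in our own notation, not the UNITOR implementation), let $a_1,\dots,a_M$ be the answers returned by $M$ sampled reasoning threads. Majority voting selects
\begin{equation}
\hat{y} = \arg\max_{k \in \{A,B,C,D,E\}} \sum_{j=1}^{M} \mathbb{I}\left[a_j = k\right],
\qquad
s = \frac{1}{M}\sum_{j=1}^{M} \mathbb{I}\left[a_j = \hat{y}\right],
\end{equation}
where $s$ is the support of the selected answer. A refinement round would be triggered when $s < \tau$, with $\tau$ a hypothetical consensus threshold. For example, with $M=5$ and votes $(B,B,B,A,B)$, the selected answer is $B$ with $s = 4/5 = 0.8$; with $\tau = 0.6$ (an illustrative value), no refinement is needed.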

The \textbf{AMSN} system by Mohammadabad and Nazarmohsenifakori~\cite{pfb_AMSN} results from a comparison of three approaches: fine-tuning transformer-based models for question answering, LLM prompting, and a hybrid of the two. The first approach relies on mDeBERTa-v3. The second evaluates GPT-4o and GPT-5 with several prompting strategies. The hybrid approach uses a specifically trained model to detect and correct GPT-4o's mistakes. The top performer on the validation set, which was used for the official submission, is GPT-5, with nearly 0.90 average accuracy across all languages.


\subsection{Experimental Results}

\begin{table}[ht]
\caption{Results on the two subtasks}
\label{tab:subtasks}
\centering
\begin{tabular}{@{}lcc@{}}
\toprule
 & \textbf{Subtask 1} & \textbf{Subtask 2} \\
\midrule

AMSN~\cite{pfb_AMSN}       & 0.91 & 0.90 \\
UNITOR~\cite{pfb_UNITOR}      & 0.88 & 0.88 \\
Llama~\cite{grattafiori2024llama3herdmodels}      & 0.37 & 0.37 \\
Qwen~\cite{yang2025qwen3technicalreport}       & 0.33 & 0.29 \\
Similarity~\cite{reimers-2019-sentence-bert} & 0.21 & 0.20 \\

\bottomrule
\end{tabular}
\end{table}


\begin{table}[ht]
\caption{Breakdown of the results by language and by difficulty of the instances: all (\textbf{A}), hard (\textbf{H}), medium (\textbf{M}), and easy (\textbf{E})}
\label{tab:detailed_results}
\resizebox{\textwidth}{!}{%
\begin{tabular}{@{}rrrrrrrrrrrrrrrrrrrrr@{}}
\toprule
% \multicolumn{1}{l}{\textbf{}} & \multicolumn{4}{c}{\textbf{subtask1}} & \multicolumn{16}{c}{\textbf{subtask2}} \\
Language & \multicolumn{4}{c}{\textbf{IT}} & \multicolumn{4}{c}{\textbf{EN}} & \multicolumn{4}{c}{\textbf{TR}} & \multicolumn{4}{c}{\textbf{ALL}} \\
\cmidrule(l){2-5} \cmidrule(l){6-9} \cmidrule(l){10-13} \cmidrule(l){14-17} 
Difficulty
& \textbf{A} & \textbf{H} & \textbf{M} & \textbf{E} 
& \textbf{A} & \textbf{H} & \textbf{M} & \textbf{E} 
& \textbf{A} & \textbf{H} & \textbf{M} & \textbf{E} 
& \textbf{A} & \textbf{H} & \textbf{M} & \textbf{E} \\ \midrule

AMSN~\cite{pfb_AMSN} 
& \textbf{0.91} & \textbf{0.92} & \textbf{0.92} & 0.86 
& \textbf{0.89} & 0.88 & \textbf{0.90} & 0.84 
& \textbf{0.88} & \textbf{0.88} & \textbf{0.89} & 0.88 
& \textbf{0.90} & \textbf{0.89} & \textbf{0.90} & 0.86 \\

UNITOR~\cite{pfb_UNITOR} 
& 0.88 & 0.84 & 0.88 & \textbf{0.89} 
& \textbf{0.89} & \textbf{0.91} & \textbf{0.90} & \textbf{0.88} 
& \textbf{0.88} & \textbf{0.88} & 0.88 & \textbf{0.89} 
& 0.88 & 0.88 & 0.89 & \textbf{0.89} \\

Llama~\cite{grattafiori2024llama3herdmodels}
& 0.37 & 0.41 & 0.36 & 0.38 
& 0.43 & 0.48 & 0.42 & 0.46 
& 0.31 & 0.33 & 0.31 & 0.29 
& 0.37 & 0.41 & 0.36 & 0.38 \\

Qwen~\cite{yang2025qwen3technicalreport} 
& 0.33 & 0.32 & 0.32 & 0.36 
& 0.26 & 0.25 & 0.26 & 0.29 
& 0.29 & 0.34 & 0.30 & 0.25 
& 0.29 & 0.30 & 0.29 & 0.30 \\

Similarity~\cite{reimers-2019-sentence-bert}
& 0.21 & 0.18 & 0.20 & 0.19 
& 0.21 & 0.20 & 0.21 & 0.21 
& 0.18 & 0.17 & 0.19 & 0.16 
& 0.20 & 0.18 & 0.20 & 0.19 \\

\bottomrule
\end{tabular}
}
\end{table}

\begin{table}[ht]
\caption{Breakdown of the results according to languages and category: Financial Texts (\textbf{T}), Financial Papers (\textbf{P}), Financial Reports (\textbf{R})}
\label{tab:category_results}
\begin{tabular}{@{}rrrrrrrrrrrrr@{}}
\toprule
Language & \multicolumn{3}{c}{\textbf{IT}} & \multicolumn{3}{c}{\textbf{EN}} & \multicolumn{3}{c}{\textbf{TR}} & \multicolumn{3}{c}{\textbf{ALL}} \\
\cmidrule(l){2-4} \cmidrule(l){5-7} \cmidrule(l){8-10} \cmidrule(l){11-13}
Category
& \textbf{T} & \textbf{P} & \textbf{R} 
& \textbf{T} & \textbf{P} & \textbf{R} 
& \textbf{T} & \textbf{P} & \textbf{R} 
& \textbf{T} & \textbf{P} & \textbf{R} \\ \midrule
AMSN~\cite{pfb_AMSN} & 0.89 & 0.93 & 0.92 & 0.88 & 0.92 & 0.87 & 0.88 & 0.90 & 0.86 & 0.88 & \textbf{0.92} & 0.89 \\
UNITOR~\cite{pfb_UNITOR} & 0.88 & 0.86 & 0.90 & 0.88 & 0.88 & 0.92 & 0.87 & 0.86 & 0.91 & 0.88 & 0.87 & \textbf{0.91} \\
Llama~\cite{grattafiori2024llama3herdmodels} & 0.40 & 0.43 & 0.27 & 0.45 & 0.48 & 0.37 & 0.32 & 0.35 & 0.25 & 0.39 & 0.42 & 0.29 \\
Qwen~\cite{yang2025qwen3technicalreport} & 0.33 & 0.36 & 0.29 & 0.24 & 0.34 & 0.22 & 0.31 & 0.37 & 0.19 & 0.29 & 0.36 & 0.23 \\
Similarity~\cite{reimers-2019-sentence-bert} & 0.20 & 0.19 & 0.23 & 0.20 & 0.22 & 0.22 & 0.17 & 0.22 & 0.15 & 0.19 & 0.21 & 0.20 \\
\bottomrule
\end{tabular}
\end{table}

AMSN is the best-performing system on both subtasks, with an accuracy of 0.91 on Subtask 1 and 0.90 on Subtask 2, as shown in Table~\ref{tab:subtasks}.
UNITOR performs slightly worse, with a score of 0.88 on both subtasks.
%
The baselines do not reach a satisfactory accuracy, with the best one, Llama, obtaining a score of 0.37.
The Similarity approach yields the worst result, with an accuracy of about 0.20, similar to random choice.
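
As a reference point, a guesser that picks uniformly among the five options is correct on each instance with probability $1/5$, so its expected accuracy is
\begin{equation}
\mathbb{E}\left[\mathrm{Acc}(\mathcal{D})\right] = \frac{1}{N}\sum_{i=1}^{N} P\left(\hat{y}_i = y_i\right) = \frac{1}{5} = 0.20,
\end{equation}
which is close to the score of the Similarity baseline. Because option E is never the gold label (Table~\ref{tab:correct_choices}), a guesser that picks uniformly among A--D only would instead reach an expected accuracy of $1/4 = 0.25$.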

Table~\ref{tab:detailed_results} reports detailed results by language and difficulty level.
The performance of the participants' systems is comparable across languages and varies only slightly across difficulty levels. For instance, on Italian data, somewhat counterintuitively, AMSN performs better on hard questions (0.92) than on easy ones (0.86), whereas UNITOR performs better on easy questions (0.89) than on hard ones (0.84). On Turkish data, both systems obtain similar accuracy across difficulty levels. Similar patterns are also observed in the baselines.



% While results for Llama and Qwen are comparable for their small size, the Similarity approach which uses a small encoder architecture, yields worse results, similar to a random choice (i.e. 1 in 5 is correct).

% Endof 1|

% Results across languages
% Analyzing the results from participants we can observe the usage of larger models compared to baselines that lead to better overall accuracy. The results from participants show no clear difference of performance among the three different languages.

% Results across difficulty
% Across difficulty, we can observe AMSN in Subtask 1 has a better score over hard questions than it has on easy ones, whereas UNITOR has the opposite. 


% Answer distribution
Analyzing the distribution of the answers across the five labels, we observe that the participants' systems do not exhibit a bias towards any specific label: the distribution of their answers follows that of the gold standard.

% Categories
Table~\ref{tab:category_results} shows results across languages and categories. The participants' systems perform comparably in every category, and no category is consistently the hardest: in the aggregated results (ALL), AMSN scores lowest on financial texts (T) and UNITOR on financial papers (P), whereas the Llama and Qwen baselines score lowest on financial reports (R).

%\section{Discussion}
%\label{sec:discussion}

In general, the difference between the results of the participants is small across all subtasks, languages and categories.
%
The size of the underlying models is likely to have a significant impact on performance, given the gap between the participants' approaches, which use models of about 20B parameters or more, and the baselines, which are based on models with 3B parameters or fewer. A remarkable result is that 20B models are competitive with much larger models.

%Across languages, the models do not show relevant variations in their performance.
%
%Across the languages and difficulty levels, we observe slight differences in the behaviour of the participants' approaches. . %Consequently, it is not possible to determine whether it is due to the underlying models or to the specifics of the approaches.

%In certain cases, these differences are counterintuitive: there are models that obtain a higher accuracy score on the difficult questions rather than the easy ones.

\section{Conclusions}
\label{sec:conclusions}

The \textit{Prometeia Financial Benchmark} (PFB) shared task was designed to benchmark language models on finance-domain multiple-choice question answering (MCQA) in three languages.
We received two full submissions from teams that explored encoder- and decoder-based systems of varying complexity, reaching an accuracy of about 90\%.
A comparison with the performance of our baselines highlights that PFB is challenging for models with 3B parameters or fewer, and that 20B models may be adequate for the task.
%
As future work, we want to investigate more thoroughly the relationship between the models' efficiency and their accuracy. Moreover, we would like to explore the possible benefit of exploiting the difficulty label, a feature that was not available to the participants but could be used in ensemble approaches.


%% The declaration on generative AI came into effect
%% in January 2025. See also
%% https://ceur-ws.org/GenAI/Policy.html
\section*{Declaration on Generative AI}
 During the preparation of this work, the authors used OpenAI ChatGPT 5.2 in order to: Paraphrase and reword, and Improve writing style. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.

%%
%% Define the bibliography file to be used
\bibliography{bibliography}

%%
%% If your work has an appendix, this is the place to put it.
\appendix



\end{document}

%%
%% End of file
