%% AAAI 2026 Submission Outline

\def\aaaianonymous{true}

\documentclass[letterpaper]{article} % DO NOT CHANGE THIS

% Conditional package loading
\ifdefined\aaaianonymous
    \usepackage[submission]{aaai2026}
\else
    \usepackage{aaai2026}
\fi

% Essential packages (do not remove)
\usepackage{times}
\usepackage{helvet}
\usepackage{courier}
\usepackage[hyphens]{url}
\usepackage{graphicx}
\urlstyle{rm}
\def\UrlFont{\rm}
\usepackage{natbib}
\usepackage{caption}
\usepackage{amsmath}
\usepackage{amssymb}
% Extra packages
\usepackage{booktabs}
\usepackage{array}
\usepackage{xcolor}
\frenchspacing
\setlength{\pdfpagewidth}{8.5in}
\setlength{\pdfpageheight}{11in}

% PDF metadata
\pdfinfo{
/TemplateVersion (2026.1)
/Title ()
/Author ()
}

% Title and author information (switch automatically)
\ifdefined\aaaianonymous
    \title{Anonymous Submission Title}
    \author{Anonymous Submission}
\else
    \title{<Paper Title>}
    \author{
        First Author\textsuperscript{\rm 1},
        Second Author\textsuperscript{\rm 2}
    }
    \affiliations{
        \textsuperscript{\rm 1}Affiliation One\\
        \textsuperscript{\rm 2}Affiliation Two\\
        \{email1,email2\}@example.com
    }
\fi

\newcommand{\tablesizeAAAI}{%
  % 9 pt text on 10.5 pt baselineskip = AAAI‑legal minimum
  \fontsize{9}{10.5}\selectfont
  % tighten inter‑column spacing (default is 6pt)
  \setlength{\tabcolsep}{3pt}%
}
\newcolumntype{Y}{>{\centering\arraybackslash}p{0.6cm}}
\newcommand{\isChecklistMainFile}{} 

% Disable section numbers (set to 1 if desired)
\setcounter{secnumdepth}{0}

\begin{document}

\maketitle

% -----------------------------
% OUTLINE STARTS HERE
% -----------------------------

\begin{abstract}
Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models and rerankers have challenged the dominance of NLI-based architectures. Existing evaluations, such as MTEB, often probe embedding models with supervised classifiers atop frozen embeddings, leaving true zero-shot capabilities underexplored. To address this, we introduce \textbf{BTZSC}, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across three major model families, NLI cross-encoders, embedding models, and rerankers, encompassing 31 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by \textit{Qwen3-Reranker-8B}, set a new state-of-the-art with macro F\textsubscript{1} = 0.72; (ii) strong embedding models such as \textit{GTE-large-en-v1.5} substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) NLI cross-encoders plateau even as backbone size increases; and (iv) scaling primarily benefits rerankers over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
\end{abstract}


\section{Introduction}

Text classification is a foundational problem in Natural Language Processing (NLP), finding broad applications across diverse domains, including topic categorization of news articles, intent detection in conversational agents, sentiment analysis of product reviews, and emotion recognition in mental health support systems \cite{Sebastiani2002TextCat,Kowsari2019Survey}. Formally, the task involves assigning one or more predefined labels to textual data based solely on the content of the text \cite{Sebastiani2002TextCat}. However, the supervised approach to text classification necessitates the creation of large-scale, high-quality annotated datasets, a process that is often prohibitively expensive, particularly in specialized domains requiring expert annotators \cite{Settles2012ActiveLearning}.

Text zero-shot classification (ZSC) addresses this challenge by enabling models to predict labels that have not been explicitly observed during training \cite{Yin2019ZSC}. The core principle underlying ZSC methods is the exploitation of semantic relationships between input texts and candidate labels. This relationship is typically captured using pretrained language models, which encode semantics based on extensive pretraining on large textual corpora \cite{Yin2019ZSC,Brown2020GPT3}. One straightforward approach involves prompting large autoregressive language models (LLMs) directly with textual inputs and candidate label descriptions. While effective, this method entails considerable computational cost and latency, limiting its feasibility in real-time deployment scenarios \cite{Brown2020GPT3}.

A widely adopted, more computationally efficient alternative involves fine-tuning pretrained encoder models on Natural Language Inference (NLI) datasets, reframing classification tasks as entailment problems. Specifically, the input text acts as a premise and each candidate label as a hypothesis sentence \cite{Yin2019ZSC,Bowman2015SNLI,Williams2018MNLI}. NLI datasets, including SNLI \cite{Bowman2015SNLI} and MultiNLI \cite{Williams2018MNLI}, contain sentence pairs annotated with labels indicating entailment, contradiction, or neutrality. By fine-tuning encoders on these corpora, models learn to discern semantic compatibility, thus enabling effective reuse in ZSC scenarios. Despite their success and lower computational demands relative to generative LLMs, improvements in NLI-based cross-encoder methods have plateaued in recent years.

Concurrent to this, significant advances have occurred in the domain of text-embedding models \cite{Reimers2019SBERT,Gao2021SimCSE,Muennighoff2023MTEB}. Embedding models learn mappings, \( f:\text{text}\rightarrow\mathbb{R}^d \), from textual inputs to dense vector representations, ensuring semantically related texts are closely situated in the embedding space. This characteristic facilitates efficient similarity-based retrieval, and in principle, supports ZSC through nearest-neighbor matching to candidate label embeddings \cite{Reimers2019SBERT,Gao2021SimCSE}. The Massive Text Embedding Benchmark (MTEB) systematically evaluates embedding models across various tasks, encompassing 58 datasets categorized into eight families \cite{Muennighoff2023MTEB}. However, classification performance within MTEB is primarily assessed through linear probes trained on labeled data atop frozen embeddings, thereby leaving the genuine zero-shot capabilities of embedding models untested \cite{Muennighoff2023MTEB}.

Another promising class of models, rerankers, originally cross-encoder or sequence-to-sequence architectures designed to refine the ranking of query-document pairs (e.g., MonoT5 \cite{Nogueira2020MonoT5}), can similarly be adapted for ZSC by treating textual inputs as queries and label descriptions as retrievable documents. However, the comparative performance and potential advantages of rerankers in zero-shot classification contexts remain underexplored.

Furthermore, the distinction between encoder-based and generative approaches is becoming increasingly blurred, as modern embedding models frequently leverage distilled or instruction-tuned variants of generative LLMs (e.g., Sentence-T5 \cite{Ni2021SentenceT5}, E5 \cite{Wang2024E5}). Given these rapid developments, a systematic comparison between NLI cross-encoders, contemporary embedding models, and reranker architectures, particularly in genuine zero-shot settings across diverse classification tasks, remains an open research question.

To address this gap, we present a comprehensive benchmark study evaluating a diverse selection of models, including NLI-based cross-encoders, embedding models, and rerankers, across 22 datasets that span four major classification categories (sentiment, topic, intent, and emotion). This benchmark systematically explores the relative strengths, limitations, and transferability of these approaches, offering a comparative analysis to guide future research directions in zero-shot text classification.


\section{Related Work}

To our knowledge, the proposed benchmark, \textbf{BTZSC}, is the first to comprehensively compare NLI cross-encoders, embedding-based models, and reranker architectures in a true ZSC setting. Previous benchmarks for ZSC have typically been limited in scope, often restricted to evaluating a single model family, a narrow task category, or a handful of datasets. For instance, \citet{Yin2019ZSC} introduced a foundational NLI-based ZSC benchmark but evaluated exclusively cross-encoder models on only three datasets. \citet{Chalkidis2020LMTC} examined zero-shot learning specifically within multi-label classification but confined their analysis to three hierarchical datasets. \citet{Gretz2023TTC} proposed TTC23, evaluating prompt-based methods solely for topic classification and omitting contemporary embedding and reranking models from their analysis. \citet{Lepagnol2024SmallLM} further explored the performance of smaller language models (100M--1B parameters) across 15 datasets, yet their work excluded comparisons with embedding and reranker architectures. The Massive Text Embedding Benchmark (MTEB), alongside its multilingual counterpart, has established a mature, broad-ranging evaluation platform covering numerous datasets. However, MTEB assesses classification performance via supervised linear probes trained atop frozen embeddings, thereby leaving unanswered the question of embedding models' genuine zero-shot capability \cite{Muennighoff2023MTEB,Enevoldsen2025MMTEB,Chung2025MaintainMTEB}. Consequently, this fragmented evaluation landscape has hindered a clear cross-family comparison of these model types.

\subsection{Zero-Shot Text Classification}

Zero-shot text classification fundamentally involves assigning labels unseen during training by assessing semantic compatibility between input texts and candidate labels, typically expressed in natural language. Unlike supervised approaches, ZSC methods avoid task-specific finetuning by leveraging pretrained models' semantic representations. A common parallel in vision tasks is zero-shot image recognition with language-aligned models like CLIP~\cite{Radford2021CLIP}, though textual classification benefits directly from the intrinsic expressivity and flexibility of natural language documents.

\textbf{NLI-based cross-encoders} represent one of the earliest and most prominent paradigms for zero-shot text classification. Such methods recast the classification problem into an entailment task, where each candidate label is paired with the input text as a hypothesis-premise pair scored by an NLI model \cite{Yin2019ZSC}. This approach has been operationalised effectively by public checkpoints like \texttt{facebook/bart-large-mnli} \cite{Lewis2020BART}, which powers the widely used zero-shot pipeline of HuggingFace Transformers \cite{HF2020Transformers}. More recent advances, including stronger encoder backbones like DeBERTa-v3 \cite{He2023DeBERTaV3} and improved label verbalization techniques, have incrementally enhanced performance. Nonetheless, these improvements have plateaued when compared with rapid advancements from increasingly large generative language models (LLMs).

\textbf{Text-embedding models} have subsequently emerged as a highly active research domain, evolving significantly from early sentence embedding techniques such as InferSent \cite{Conneau2017InferSent} and Google's Universal Sentence Encoder (USE) \cite{Cer2018USE}. Contemporary embedding frameworks, notably E5 \cite{Wang2024E5}, GTE \cite{Li2023GTE}, BGE \cite{Chen2024BGE}, and Qwen3-Embedding \cite{Zhang2025Qwen3Embedding}, have substantially raised performance standards. These models integrate sophisticated training strategies including billion-scale contrastive pretraining, multilingual supervision, multi-stage data scaling, and instruction fine-tuning. For example, E5 uses an instruction-tuned approach with massive-scale contrastive learning, GTE emphasizes data-scale expansion over parameter scale, and BGE combines dense, sparse, and multi-vector encoding techniques into a multilingual framework capable of handling extensive context lengths. Compared to foundational architectures such as SBERT \cite{Reimers2019SBERT}, these advancements have resulted in improvements on standard benchmarks such as MTEB, demonstrating enhanced performance in semantic representation tasks \cite{Muennighoff2023MTEB}. Additionally, embedding models increasingly incorporate distillation from or joint-training with large generative models, effectively blurring distinctions between encoder-based and generative paradigms.

\textbf{Reranker models}, originally developed for information retrieval tasks, represent another promising approach for ZSC. Early reranker architectures leveraged cross-encoder models like BERT \cite{Devlin2019BERT}, DPR’s combined bi-encoder and cross-encoder architecture \cite{Karpukhin2020DPR}, and late-interaction models such as ColBERT \cite{Khattab2020ColBERT}. These methods typically assign relevance scores to a set of candidate documents with respect to a given input query, enabling them to be ranked accordingly. Sequence-to-sequence reranker variants such as MonoT5 have further extended this paradigm by scoring pairs through generative token likelihood estimation, demonstrating effective transferability to new tasks \cite{Nogueira2020MonoT5}. Recent embedding model families like BGE now provide integrated reranker checkpoints, inheriting their multi-stage training procedures \cite{Chen2024BGE}.




\section{Benchmark for Textual Zero-Shot Classification (BTZSC)}
BTZSC is a task-balanced evaluation suite for zero-shot text classification, designed to benchmark diverse model architectures. Dataset selection follows four criteria intended to ensure robustness and real-world relevance. First, we ensure task diversity by including at least two datasets for each of sentiment, topic, intent, and emotion classification, mirroring the four most prominent application families. Second, to probe the impact of class granularity, BTZSC covers binary, medium-sized (such as \textit{agnews} with four labels), and high-cardinality settings (for instance, \textit{banking77} with more than 70 labels). Third, we prioritize domain diversity, drawing from sources spanning news, social media, product reviews, encyclopedic content, and political discourse to assess model robustness under domain shift. Fourth, we incorporate a wide spectrum of document lengths, from micro-texts (under 20 tokens) to longer articles (over 250 tokens). The benchmark is limited to English datasets; multilingual evaluation is left for future work. The datasets overlap to a large extent with those used by \citet{LaurerZSC2023} for transfer learning in zero-shot classification. Table A.1 in the technical appendix provides the source and further details for each dataset.

BTZSC comprises 22 English datasets encompassing the aforementioned task types. As summarized in Table~\ref{tab:tbl_01_dss_stats}, each dataset is characterized by its number of classes, average token length\footnote{Computed with the \texttt{answerdotai/ModernBERT} tokenizer.}, and domain label (such as news, review, or social media). To quantify lexical overlap and domain similarity between datasets, we follow \citet{BEIR2021} and compute the weighted Jaccard similarity between the token distributions of each dataset pair. The resulting $22 \times 22$ similarity matrix, shown in Figure~\ref{fig:fig_01_dss_jaq_sim}, highlights low overlap between different task types, reflecting strong lexical diversity across tasks. At the same time, datasets derived from similar sources tend to cluster together: all Wikipedia-based datasets form a distinct group, as do the biasframes-related datasets, demonstrating modest intra-source lexical similarity.
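
As a sketch of this similarity computation, and assuming the usual weighted Jaccard definition over normalized token frequencies (the symbols $V$, $p_A$, and $p_B$ are illustrative notation introduced here, not taken from the evaluation code), the similarity between two datasets $A$ and $B$ can be written as
\[
J(A,B) \;=\; \frac{\sum_{t \in V} \min\bigl(p_A(t),\, p_B(t)\bigr)}{\sum_{t \in V} \max\bigl(p_A(t),\, p_B(t)\bigr)},
\]
where $V$ is the union of the two vocabularies and $p_A(t)$ is the relative frequency of token $t$ in $A$ (zero if $t$ does not occur in $A$). Under this definition, $J(A,B)=1$ for identical token distributions and $J(A,B)=0$ for disjoint vocabularies.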

\begin{table*}[t]
\centering
\begin{tabular}{llrr}
\toprule
Domain & Dataset & Num Classes & Avg Token Count \\
\midrule
\multicolumn{4}{l}{\textbf{Emotion}}\\
dialogue      & empathetic            & 32 & 132 \\
social-media  & emotiondair           & 6  & 20  \\
\midrule
\multicolumn{4}{l}{\textbf{Intent}}\\
banking       & banking77             & 72 & 13  \\
social-media  & biasframes\_intent    & 2  & 27  \\
\midrule
\multicolumn{4}{l}{\textbf{Sentiment}}\\
apps          & appreviews            & 2  & 49  \\
e-commerce    & amazonpolarity        & 2  & 103 \\
finance       & financialphrasebank   & 3  & 29  \\
local-business& yelpreviews           & 2  & 164 \\
movies        & imdb                  & 2  & 293 \\
movies        & rottentomatoes        & 2  & 26  \\
\midrule
\multicolumn{4}{l}{\textbf{Topic}}\\
assistant     & massive               & 59 & 8   \\
education     & trueteacher           & 2  & 282 \\
news          & agnews                & 4  & 54  \\
politics      & capsotu               & 21 & 44  \\
politics      & manifesto             & 56 & 45  \\
qa-forum      & yahootopics           & 10 & 137 \\
social-media  & biasframes\_offensive & 2  & 27  \\
social-media  & biasframes\_sex       & 2  & 28  \\
wikipedia     & wikitoxic\_insult     & 2  & 93  \\
wikipedia     & wikitoxic\_obscene    & 2  & 91  \\
wikipedia     & wikitoxic\_threat     & 2  & 99  \\
wikipedia     & wikitoxic\_toxicaggregated & 2 & 86 \\
\bottomrule
\end{tabular}
\caption{Summary statistics of BTZSC datasets.}
\label{tab:tbl_01_dss_stats}
\end{table*}



\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{./figs/fig_01_dss_jaq_sim.pdf} 
    \caption{Pairwise weighted Jaccard similarity between datasets.}
    \label{fig:fig_01_dss_jaq_sim}
\end{figure}

\subsection{Evaluation Metrics}
To make results comparable across all BTZSC tasks and model families we adopt a \emph{single, task-agnostic primary metric}: \textbf{macro~F\textsubscript{1}}.  
Macro averaging gives equal weight to every class irrespective of its frequency, making it appropriate for both binary and multi-class datasets with varying label set cardinalities \cite{Sokolova2009}.  
We additionally report (micro) \textbf{accuracy}, since it remains the most common headline number in the classification literature and is straightforward to interpret.
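
For reference, and assuming the common convention of averaging per-class F\textsubscript{1} scores (the symbols $C$, $P_c$, and $R_c$ are illustrative notation introduced here), the primary metric for a dataset with label set $C$ can be written as
\[
\text{macro-F}_1 \;=\; \frac{1}{|C|} \sum_{c \in C} \frac{2\, P_c\, R_c}{P_c + R_c},
\]
where $P_c$ and $R_c$ denote the precision and recall of class $c$, and the per-class term is taken to be $0$ when $P_c + R_c = 0$. Every class contributes with weight $1/|C|$, so a rare class influences the score as much as a frequent one.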

Finally, to probe whether success on natural-language inference transfers to zero-shot classification, we evaluate each model on standard NLI benchmarks and report the \textbf{AUROC}. AUROC is threshold-free and does not require calibrated probabilities; because cosine-similarity scores lie in $[-1,1]$ rather than representing probabilities, AUROC lets us test whether entailment pairs consistently receive higher similarity than neutral/contradiction pairs.
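
One way to read this metric, assuming entailment pairs are treated as positives and neutral or contradiction pairs as negatives (the symbols $s$, $x^{+}$, and $x^{-}$ are illustrative notation introduced here), is through its standard probabilistic interpretation:
\[
\mathrm{AUROC} \;=\; \Pr\bigl[s(x^{+}) > s(x^{-})\bigr] \;+\; \tfrac{1}{2}\,\Pr\bigl[s(x^{+}) = s(x^{-})\bigr],
\]
where $s$ is the model's score for a sentence pair, $x^{+}$ is a randomly drawn entailment pair, and $x^{-}$ is a randomly drawn non-entailment pair. Because this quantity depends only on the ordering of the scores, any strictly increasing rescaling of $s$ leaves it unchanged, so uncalibrated cosine similarities and entailment probabilities can be compared on the same footing.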


\subsection{Model Types}

We categorize the models evaluated in this study according to their underlying architecture and training strategies.

\textbf{Transformer Base Models.}
As a baseline, we include transformer-based encoder models that have not been further fine-tuned for any specific downstream task. For these models, the final \textit{[CLS]} token representation is extracted and cosine similarity is used to compute the relevance between the input text and each candidate label. The base models considered in this category are the original BERT (\textit{bert-large-uncased} \cite{Devlin2019BERT}), the increasingly adopted ModernBERT (\textit{ModernBERT-large} \cite{ModernBERT2024}), and DeBERTa-v3 (\textit{deberta-v3-large} \cite{He2023DeBERTaV3}), a popular and robust modification of BERT that has demonstrated strong performance on a variety of NLP benchmarks.

\textbf{NLI-based Cross-Encoders.}
These models are trained on NLI datasets and perform classification by assessing the degree of entailment between an input text and each candidate label, formulated as a premise--hypothesis pair. \textit{BART-Large-MNLI} is included as the canonical representative, being the first widely used NLI-based cross-encoder for zero-shot classification. We also consider \textit{NLI-RoBERTa-base} as well as a set of custom-trained cross-encoders using \textit{BERT}, \textit{DeBERTa-v3}, and \textit{ModernBERT} backbones. Both base and large versions are evaluated to analyze the effect of model scale, and two loss variants are tested to assess the impact of training objectives. Full details of the training procedure are provided in the technical appendix. In total, 11 NLI-based cross-encoders are benchmarked, covering the most widely used configurations in the literature.

\textbf{Embedding Models.}
This category comprises models optimized to produce fixed-size vector representations of text for a range of downstream tasks, including classification. As a canonical embedding model, \textit{all-MiniLM-L6-v2} \cite{Reimers2019SBERT} serves as the baseline for this model family. Additionally, we evaluate both base and large variants of BGE, GTE, and E5, all of which use variations of transformer encoders as backbones. To provide contrast, we also include embedding models that leverage large language model architectures, such as Qwen3-Embedding and e5-mistral-7b-instruct; for Qwen3-Embedding, both 0.6B and 8B parameter variants are tested to study the effect of scale. Overall, the embedding model category comprises 11 distinct models.

\textbf{Rerankers.}
Reranker models are typically employed in information retrieval, where they re-score candidate documents for relevance to a given query. The \textit{ms-marco-MiniLM-L6-v2} model serves as the reranker counterpart to \textit{all-MiniLM-L6-v2} and is used as the baseline for this group. Similarly, \textit{gte-reranker-modernbert-base} and \textit{bge-reranker-base/large} serve as reranking counterparts to their respective embedding models. We further include \textit{Qwen3-Reranker}, which outputs a relevance score between a document and a query by prompting the model to decide if the document is relevant. The probability assigned to the ``yes'' token (computed with a softmax over the model's vocabulary logits in which all tokens except ``yes'' and ``no'' are masked out) is used as the final relevance score. Both the 0.6B and 8B variants of \textit{Qwen3-Reranker} are evaluated to analyze the impact of model size. In total, 6 reranker models are benchmarked.
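
Written out under illustrative notation introduced here, with $z_{\text{yes}}$ and $z_{\text{no}}$ denoting the next-token logits of ``yes'' and ``no'' for a query $q$ and document $d$, the masked softmax described above reduces to a two-way softmax:
\[
s(q,d) \;=\; \frac{\exp(z_{\text{yes}})}{\exp(z_{\text{yes}}) + \exp(z_{\text{no}})} \;=\; \sigma\bigl(z_{\text{yes}} - z_{\text{no}}\bigr),
\]
where $\sigma$ is the logistic sigmoid. The score therefore lies in $(0,1)$ and depends only on the logit difference between the two answer tokens.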

Table A.2 in the technical appendix summarizes the models included in the experiments, listing their architecture, training data, and parameter count. In total, the benchmark covers 31 models.





\section{Experimental Setup}
\label{sec:exp_setup}

For our custom NLI-based cross-encoders, we follow the methodology of \citet{LaurerZSC2023} and train models on a mixture of MNLI \cite{Williams2018MNLI}, ANLI \cite{Nie2020ANLI}, WANLI \cite{Liu2022WANLI}, FEVER-NLI \cite{Thorne2018FEVER}, and LingNLI \cite{Parrish2021LingNLI}, deliberately omitting SNLI due to concerns regarding data quality and label bias. Section A.3 of the technical appendix provides further details on the training procedure.

To facilitate zero-shot classification, each class label is verbalized as a short, semantically clear, and context-rich description. For example, in the Amazon Polarity dataset, the labels are verbalized with the template ``The overall sentiment within the Amazon product review is \{label\},'' where \{label\} is replaced by ``positive'' or ``negative'' to produce one description per candidate class.

For reranker models, the text to be classified serves as the query, while the verbalized label descriptions are treated as candidate “documents” to be reranked according to their predicted relevance.

For NLI-based cross-encoders, we take the entailment logit of each text--label pair and predict the label with the highest logit. For embedding models, we compute the cosine similarity between the text embedding and each label embedding, selecting the label with the highest similarity score as the predicted label. For rerankers, we predict the label whose description receives the highest relevance score.
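
Under notation assumed here for illustration, let $x$ be the input text, $C$ the candidate label set, and $d_c$ the verbalized description of label $c \in C$. The three decision rules above then share a single form,
\[
\hat{y}(x) \;=\; \operatorname*{arg\,max}_{c \in C} \; s(x, d_c),
\]
and differ only in the scoring function $s$. For NLI cross-encoders, $s(x,d_c)$ is the entailment logit with $x$ as premise and $d_c$ as hypothesis. For embedding models with encoder $f$, it is the cosine similarity
\[
s(x, d_c) \;=\; \frac{f(x)^{\top} f(d_c)}{\lVert f(x) \rVert \, \lVert f(d_c) \rVert}.
\]
For rerankers, it is the relevance score with $x$ as query and $d_c$ as document. Only the ranking of the labels under $s$ matters for the prediction, so scores need not be comparable across model families.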

% TODO <explain how we verbalize the label descriptions for rerankers>

\section{Results and Analysis}

In this section, we present and analyze the performance of all evaluated models on the BTZSC benchmark. Table~\ref{tab:tbl_main_benchmark} summarizes results across all datasets, grouped by task type, and reports (macro) F1 scores averaged within each task as well as overall, in addition to average (micro) accuracy. Standard deviations are included in parentheses to reflect variability across datasets.

\textbf{Base Transformer Encoders.}
Models that are not further fine-tuned or trained on specific semantic matching objectives perform poorly on zero-shot classification tasks. Their inability to align input texts with candidate label descriptions underscores the necessity of explicit training for semantic compatibility.

\textbf{NLI-based Cross-Encoders.}
Models fine-tuned on NLI data exhibit clear benefits over their off-the-shelf counterparts. Training on a diverse mixture of NLI datasets (MNLI, ANLI, WANLI, FEVER-NLI, and LingNLI) yields consistently stronger performance compared to models such as \textit{bart-large-mnli} and \textit{nli-roberta-base}, with multi-dataset models achieving an average improvement of +6 F1 points across all tasks. Scaling model size further enhances performance: large variants outperform their base counterparts by an average of +3.5 F1 points. Figure~\ref{fig:fig_02_nli_performance_comparison} highlights this difference on a more granular level. Task difficulty remains a dominant factor: sentiment classification is relatively easy (median F1 $\approx$ 0.88--0.90), topic and intent classification are of intermediate difficulty (F1 $\approx$ 0.40--0.55), and emotion detection proves most challenging (F1 $\approx$ 0.25--0.35). Larger models deliver the greatest benefit for more difficult tasks, with performance gains especially pronounced in topic and intent classification. The choice of loss function, whether binary cross-entropy with the neutral class collapsed or standard three-way cross-entropy, has minimal impact; the three-way cross-entropy variant closely tracks the regular large model with no consistent additional gain. Notably, within this family, \textit{deberta-v3-large-nli-triplet} achieves the highest overall performance, surpassing both the original BERT and ModernBERT variants and corroborating the finding of \citet{ModernBERT2024} that \textit{deberta-v3} remains a challenging baseline for various NLP tasks.

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{./figs/fig_02_nli_performance_comparison.pdf} 
    \caption{Performance of NLI-based cross-encoders on BTZSC. Points are individual datasets; diamonds mark task-wise medians. Comparison against model size (base vs.\ large) and loss type (binary vs.\ three-way cross-entropy).}
    \label{fig:fig_02_nli_performance_comparison}
\end{figure}

\textbf{Reranker Models.}
Among rerankers, the baseline \textit{ms-marco-MiniLM-L6-v2} does not match the performance of NLI cross-encoders (average F1: 0.42), consistent with the historical view that NLI fine-tuning is advantageous for zero-shot tasks. However, more recent rerankers close the gap substantially. For example, \textit{gte-reranker-modernbert-base} achieves an average F1 of 0.58, just two points below the best NLI cross-encoder (\textit{deberta-v3-large-nli-triplet}), and with lower variance. The strongest reranker, \textit{Qwen3-Reranker-8B}, achieves an average F1 of 0.72 and outperforms all other models, including NLI cross-encoders, by wide margins (+12 F1 and +15 accuracy points). This model is the top overall performer on the benchmark, ranking first in three out of four task categories and second in emotion classification. Note, however, that its size (8B parameters) far exceeds that of NLI cross-encoders (typically around 300M parameters). Importantly, even the much smaller \textit{Qwen3-Reranker-0.6B} delivers competitive results, surpassing NLI cross-encoders in accuracy and achieving the second-highest overall score (0.64), underscoring the strength of the reranker approach even at moderate scale.

\textbf{Embedding Models.}
The canonical embedding baseline, \textit{all-MiniLM-L6-v2}, attains an average F1 of 0.37, below its reranker counterpart \textit{ms-marco-MiniLM-L6-v2} (0.42). This is consistent with prior observations that rerankers generally outperform embedding models in retrieval, albeit at higher computational cost. However, newer embedding models such as \textit{e5-large-v2}, \textit{gte-modernbert-base}, and \textit{gte-large-en-v1.5} achieve substantially higher F1 scores (0.60, 0.58, and 0.62, respectively), placing them on par with or even surpassing the best NLI cross-encoders. Notably, these embedding models lack cross-attention between documents and labels yet still deliver strong results at similar model sizes. For instance, \textit{gte-large-en-v1.5} ranks as the second-best model overall, but its performance still lags behind the top-ranked \textit{Qwen3-Reranker-8B} by about 10 F1 points. Scaling up embedding models does not yield the same improvements seen in rerankers; for example, \textit{Qwen3-Embedding-8B} only slightly outperforms its 0.6B variant (F1: 0.59 vs.\ 0.58).

Figure~\ref{fig:fig_03_size_comparison} further elucidates scaling trends. Reranker models benefit substantially from larger scales, surpassing an F1 of 0.70 at the highest parameter count (8B), while embedding models plateau at approximately 0.60 beyond 300M parameters. In the 100--300M range, performance between families is similar and variance is high, but at larger scales, rerankers gain a decisive advantage. Thus, rerankers are preferable when computational resources permit, whereas embedding models remain attractive for lightweight or latency-sensitive applications.

Figure~\ref{fig:fig_04_performance_v_latency} plots model F1 score against normalized inference speed (1/wall time) on a standard test set. The upper right quadrant, bounded by the medians of both metrics, highlights models that best balance accuracy and efficiency. The majority of the models in this region are embedding models, indicating they offer the most favorable trade-off between performance and latency for practical deployments, with \textit{gte-reranker-modernbert-base} as the only reranker achieving comparable efficiency.


\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{./figs/fig_03_size_comparison.pdf} 
    \caption{Effect of scale on zero-shot performance. Macro-F$_1$ (BTZSC) versus parameter count (log scale). Error bands show 95\% confidence intervals.}
    \label{fig:fig_03_size_comparison}
\end{figure}

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{./figs/fig_04_performance_v_latency.pdf} 
    \caption{Trade-off between model performance and inference speed. Macro-F$_1$ score (BTZSC) is plotted against normalized inference throughput ($1/$wall time) on a standard test set. The upper right quadrant, defined by the medians of both metrics, highlights models with the best balance of accuracy and efficiency.}
    \label{fig:fig_04_performance_v_latency}
\end{figure}


\subsection{NLI Performance as a Proxy for Zero-Shot Classification}

We also examine whether NLI task performance predicts zero-shot classification effectiveness. As shown in Figure~\ref{fig:fig_05_nli_vs_clf}, for NLI-tuned cross-encoders, there is a strong linear relationship: improvements in NLI AUROC directly translate into higher F1 on BTZSC, reflecting the transfer of entailment supervision. Rerankers, despite not being fine-tuned on NLI, also show a positive trend, indicating that a robust relevance or semantic-matching mechanism supports zero-shot classification. Notably, some rerankers achieve strong classification despite moderate NLI performance, highlighting their ability to capture discriminative task signals not present in standard NLI benchmarks. Embedding models, on the other hand, show tightly clustered NLI scores but a wider spread in classification F1, suggesting that well-structured embedding spaces can capture fine-grained topical distinctions that traditional NLI metrics may miss.

\begin{figure}[t]
    \centering
    \includegraphics[width=\linewidth]{./figs/fig_05_nli_vs_clf.pdf} 
    \caption{Relationship between NLI ability and zero-shot classification. Each dot is one model; x-axis shows AUROC on standard NLI benchmarks, y-axis macro-F$_1$ on BTZSC.}    
    \label{fig:fig_05_nli_vs_clf}
\end{figure}


\begin{table*}[t]
\centering
\tablesizeAAAI               % 9 pt font, \tabcolsep=3 pt
\renewcommand{\arraystretch}{1.1}

\begin{tabular}{@{}l c c c c @{\hspace{1.5em}} c c@{}}
\toprule
\textbf{Model} & \textbf{Topic} & \textbf{Sentiment} & \textbf{Intent} & \textbf{Emotion} & \textbf{Avg F1} & \textbf{Avg Acc} \\
\midrule
\multicolumn{7}{l}{\textbf{Base encoders}}\\
\underline{bert-large-uncased}         & 0.30 (0.22) & 0.38 (0.07) & 0.22 (0.29) & 0.09 (0.12) & 0.30 (0.20) & 0.40 (0.26) \\
deberta-v3-large           & 0.28 (0.24) & 0.34 (0.02) & 0.23 (0.31) & 0.05 (0.07) & 0.27 (0.20) & 0.36 (0.26) \\
ModernBERT-large           & 0.28 (0.24) & 0.36 (0.05) & 0.20 (0.24) & 0.03 (0.03) & 0.27 (0.20) & 0.35 (0.24) \\
\midrule
\multicolumn{7}{l}{\textbf{NLI cross-encoders}}\\
bart-large-mnli            & 0.36 (0.21) & 0.84 (0.19) & 0.47 (0.23) & 0.40 (0.06) & 0.51 (0.28) & 0.53 (0.28) \\
nli-roberta-base           & 0.40 (0.24) & 0.79 (0.15) & 0.31 (0.31) & 0.32 (0.03) & 0.49 (0.28) & 0.51 (0.27) \\
bert-base-uncased-nli      & 0.43 (0.26) & 0.76 (0.17) & 0.30 (0.40) & 0.24 (0.16) & 0.49 (0.29) & 0.51 (0.28) \\
bert-large-uncased-nli     & 0.49 (0.26) & 0.79 (0.10) & 0.35 (0.38) & 0.28 (0.22) & 0.54 (0.27) & 0.58 (0.27) \\
bert-large-uncased-nli-triplet & 0.49 (0.25) & 0.78 (0.12) & 0.34 (0.40) & 0.25 (0.06) & 0.53 (0.27) & 0.56 (0.26) \\
deberta-v3-base-nli        & 0.48 (0.25) & 0.86 (0.10) & 0.30 (0.24) & 0.33 (0.08) & 0.55 (0.28) & 0.58 (0.26) \\
deberta-v3-large-nli       & 0.47 (0.25) & \underline{0.90 (0.07)} & 0.52 (0.23) & 0.43 (0.03) & 0.59 (0.27) & 0.62 (0.26) \\
\underline{deberta-v3-large-nli-triplet} & 0.50 (0.26) & \underline{0.90 (0.07)} & 0.48 (0.34) & 0.43 (0.03) & 0.60 (0.28) & 0.62 (0.26) \\
modernbert-base-nli        & 0.47 (0.26) & 0.83 (0.14) & 0.29 (0.25) & 0.27 (0.02) & 0.54 (0.29) & 0.56 (0.29) \\
modernbert-large-nli       & 0.47 (0.24) & 0.86 (0.16) & 0.43 (0.29) & 0.29 (0.02) & 0.56 (0.28) & 0.60 (0.27) \\
modernbert-large-nli-triplet & 0.45 (0.24) & 0.88 (0.12) & 0.40 (0.30) & 0.35 (0.04) & 0.55 (0.29) & 0.58 (0.27) \\
\midrule
\multicolumn{7}{l}{\textbf{Rerankers}}\\
ms-marco-MiniLM-L6-v2      & 0.38 (0.16) & 0.59 (0.16) & 0.42 (0.27) & 0.19 (0.01) & 0.42 (0.19) & 0.46 (0.21) \\
gte-reranker-modernbert-base & 0.51 (0.13) & 0.82 (0.17) & 0.49 (0.22) & 0.42 (0.07) & 0.58 (0.20) & 0.62 (0.19) \\
bge-reranker-base          & 0.42 (0.13) & 0.61 (0.15) & 0.49 (0.01) & 0.30 (0.02) & 0.47 (0.16) & 0.49 (0.14) \\
bge-reranker-large         & 0.44 (0.17) & 0.77 (0.15) & 0.58 (0.01) & 0.36 (0.05) & 0.53 (0.21) & 0.55 (0.20) \\
Qwen3-Reranker-0.6B        & \underline{0.54 (0.23)} & 0.80 (0.20) & 0.56 (0.11) & 0.46 (0.07) & 0.61 (0.23) & \underline{0.64 (0.21)} \\
\underline{Qwen3-Reranker-8B}          & \textbf{0.66 (0.17)} & \textbf{0.93 (0.06)} & \textbf{0.72 (0.04)} & \underline{0.49 (0.01)} & \textbf{0.72 (0.19)} & \textbf{0.77 (0.15)} \\
\midrule
\multicolumn{7}{l}{\textbf{Embedding models}}\\
all-MiniLM-L6-v2           & 0.41 (0.11) & 0.35 (0.04) & 0.45 (0.03) & 0.13 (0.02) & 0.37 (0.12) & 0.44 (0.14) \\
e5-base-v2                 & 0.50 (0.18) & 0.83 (0.19) & 0.61 (0.00) & 0.40 (0.05) & 0.59 (0.23) & 0.62 (0.21) \\
e5-large-v2                & 0.50 (0.16) & 0.86 (0.17) & 0.57 (0.00) & 0.41 (0.05) & 0.60 (0.22) & 0.62 (0.20) \\
e5-mistral-7b-instruct     & 0.43 (0.22) & 0.88 (0.13) & \underline{0.66 (0.03)} & \textbf{0.50 (0.00)} & 0.58 (0.26) & 0.62 (0.24) \\
bge-base-en-v1.5           & 0.46 (0.19) & 0.82 (0.20) & 0.62 (0.03) & 0.35 (0.09) & 0.57 (0.24) & 0.59 (0.23) \\
bge-large-en-v1.5          & 0.42 (0.19) & 0.84 (0.19) & 0.61 (0.08) & 0.40 (0.06) & 0.55 (0.25) & 0.59 (0.24) \\
gte-base-en-v1.5           & 0.49 (0.21) & 0.83 (0.18) & 0.64 (0.03) & 0.38 (0.07) & 0.58 (0.24) & 0.61 (0.22) \\
\underline{gte-large-en-v1.5}          & \underline{0.54 (0.20)} & 0.85 (0.18) & 0.62 (0.04) & 0.38 (0.03) & \underline{0.62 (0.23)} & \underline{0.64 (0.21)} \\
gte-modernbert-base        & 0.46 (0.20) & 0.87 (0.12) & 0.64 (0.01) & 0.42 (0.04) & 0.58 (0.24) & 0.61 (0.23) \\
Qwen3-Embedding-0.6B       & 0.49 (0.13) & 0.81 (0.17) & 0.55 (0.15) & 0.42 (0.09) & 0.58 (0.20) & 0.61 (0.18) \\
Qwen3-Embedding-8B         & 0.46 (0.16) & \underline{0.90 (0.09)} & 0.55 (0.24) & \underline{0.49 (0.08)} & 0.59 (0.24) & \underline{0.64 (0.20)} \\
\bottomrule
\end{tabular}

\caption{Zero-shot classification results on BTZSC. We report macro-averaged F1 for each task family and overall (Avg F1), together with micro accuracy (Avg Acc). Standard deviations across datasets are given in parentheses. Bold marks the best and underlining the second-best score in each column; an underlined model name marks the best model within its family.}
\label{tab:tbl_main_benchmark}
\end{table*}




\section{Conclusion and Future Work}

This work presents the first comprehensive evaluation of zero-shot text classification across NLI cross-encoders, embedding models, and rerankers, leveraging the BTZSC benchmark to examine the strengths and limitations of each paradigm. Our findings demonstrate that reranker models achieve the highest overall accuracy, while strong embedding models offer the most favorable balance between speed and accuracy. In contrast, NLI cross-encoders trail behind in both accuracy and efficiency. Scaling analyses reveal that rerankers uniquely benefit from larger model sizes, with significant gains observed beyond one billion parameters, whereas embedding models tend to plateau, suggesting different optimization ceilings across model families.

Looking forward, several directions remain open for the community. Extending BTZSC to multilingual domains would facilitate evaluation under broader distribution shifts. Further investigation into the interplay between label verbalization strategies and model architectures could yield insights into improving zero-shot classification. Finally, exploring the performance and limitations of models beyond the 8B parameter scale may provide a deeper understanding of scaling dynamics in this setting.

% References and end of paper
\bibliography{aaai2026}
\clearpage
\input{ReproducibilityChecklist.tex}
\end{document}
