# BTZSC: Benchmarking Zero-Shot Text Classification

**Supplementary Material for AAAI 2026 Submission**

This repository contains the implementation and experimental code for our paper "Revisiting Zero-Shot Classification: A Large-Scale Benchmark across Cross-Encoders, Rerankers and Embedding Models" submitted to AAAI 2026.

## Overview

This work presents **BTZSC** (Benchmark for Textual Zero-Shot Classification), a comprehensive evaluation suite for zero-shot text classification across three major model families:

- **NLI-based Cross-Encoders**: Models fine-tuned on Natural Language Inference datasets
- **Embedding Models**: Dense text representation models using similarity-based classification  
- **Reranker Models**: Information retrieval models adapted for classification tasks

We evaluate 31 models (including base transformer encoders used as baselines without task-specific fine-tuning) across 22 diverse datasets spanning sentiment, topic, intent, and emotion classification tasks.

## Key Contributions

1. **Comprehensive Benchmark**: First systematic comparison of NLI cross-encoders, embedding models, and rerankers in true zero-shot settings
2. **Diverse Evaluation**: 22 datasets covering multiple domains, class cardinalities, and document lengths
3. **Novel Insights**: Rerankers achieve the highest accuracy, embedding models offer the best speed-accuracy trade-off, and the benefits of scaling vary by model family
4. **Reproducible Research**: Complete codebase with training scripts, evaluation pipelines, and analysis tools

## Repository Structure

```
├── src/                    # Core implementation
│   ├── clcp/              # Zero-shot classification pipeline
│   └── ml_utils/          # ML utilities and tracking
├── scripts/               # Experiment scripts
│   ├── main.py           # Main training script
│   ├── eval/             # Evaluation scripts
│   ├── paper/            # Paper analysis scripts
│   └── onboarding/       # Data preparation scripts
├── models/               # Trained model checkpoints
├── hydra/               # Configuration management
│   └── configs/         # Experiment configurations
├── paper/               # LaTeX paper source
└── pyproject.toml       # Project dependencies
```

## Quick Start

### Installation

This project uses [uv](https://github.com/astral-sh/uv) for dependency management:

```bash
# Install uv if not already installed
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository
git clone <repository-url>
cd paper-aaai26

# Install dependencies
uv sync
```

### Requirements

- Python ≥ 3.11
- PyTorch ≥ 2.6.0
- Transformers ≥ 4.51.3
- Datasets library
- MLflow for experiment tracking

### Basic Usage

#### 1. Training Custom NLI Models

Train a custom NLI-based cross-encoder, selecting the backbone and architecture through Hydra command-line overrides:

```bash
uv run python scripts/main.py model.backbone=aarabil/bert-base-uncased model.arch=cross-encoder-nli-triplet
```

#### 2. Evaluating Models

Evaluate models on the BTZSC benchmark:

```bash
uv run python scripts/eval/eval.py
```

#### 3. Reproducing Paper Results

Run the complete evaluation pipeline:

```bash
# Generate benchmark results
uv run python scripts/paper/retrieve_eval_data.py

# Create analysis plots
uv run python scripts/analyse.py
```

## Model Families Evaluated

### NLI-based Cross-Encoders (11 models)
- BART-Large-MNLI
- NLI-RoBERTa-base  
- Custom models: BERT, DeBERTa-v3, ModernBERT (base & large variants)
- Training variants: standard NLI loss vs. triplet loss
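
As a rough illustration of how an NLI cross-encoder can act as a zero-shot classifier: each candidate label is turned into a hypothesis, the model scores how strongly the input text entails it, and the label with the highest entailment score is predicted. The sketch below is not the repository's pipeline. The hypothesis template, the `entail_logit` callable, and the toy keyword scorer are hypothetical stand-ins so the example runs without downloading a model.

```python
import math
from typing import Callable, Sequence

# Hypothetical template; the benchmark's actual label verbalizers may differ.
TEMPLATE = "This example is about {label}."


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_nli(
    text: str,
    labels: Sequence[str],
    entail_logit: Callable[[str, str], float],
) -> tuple[str, list[float]]:
    """Score one (premise, hypothesis) pair per label and return the best label."""
    hypotheses = [TEMPLATE.format(label=label) for label in labels]
    logits = [entail_logit(text, hyp) for hyp in hypotheses]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


def toy_entail_logit(premise: str, hypothesis: str) -> float:
    """Stand-in scorer (assumption): counts words shared by both sentences."""
    premise_words = set(premise.lower().replace(".", "").split())
    hypothesis_words = set(hypothesis.lower().replace(".", "").split())
    return float(len(premise_words & hypothesis_words))


label, probs = classify_nli(
    "The sports team won the final match.",
    ["sports", "politics", "finance"],
    toy_entail_logit,
)
print(label, [round(p, 3) for p in probs])
```

With a real model, `entail_logit` would wrap a forward pass of the cross-encoder and return its entailment logit. Note that this approach needs one forward pass per (text, label) pair, so inference cost grows with the number of classes.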

### Embedding Models (9 models)
- Sentence-Transformers: all-MiniLM-L6-v2
- E5 family: e5-base-v2, e5-large-v2, e5-mistral-7b-instruct
- BGE family: bge-base-en-v1.5, bge-large-en-v1.5
- GTE family: gte-base-en-v1.5, gte-large-en-v1.5, gte-modernbert-base
- Qwen3-Embedding: 0.6B, 8B variants
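
Similarity-based classification with an embedding model can be sketched as follows: the text and each label description are embedded, and the label whose vector is closest to the text vector (here by cosine similarity) is predicted. This is an illustrative sketch, not the repository's code. The toy three-dimensional vectors stand in for real model outputs, and `classify_by_similarity` is a hypothetical helper name.

```python
import math
from typing import Mapping, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def classify_by_similarity(
    text_vec: Sequence[float],
    label_vecs: Mapping[str, Sequence[float]],
) -> tuple[str, dict[str, float]]:
    """Return the label with the highest cosine similarity, plus all scores."""
    scores = {label: cosine(text_vec, vec) for label, vec in label_vecs.items()}
    return max(scores, key=scores.get), scores


# Toy "embeddings" (assumption); a real embedding model would produce these.
label_vecs = {
    "positive": [0.9, 0.1, 0.0],
    "negative": [-0.8, 0.2, 0.1],
}
text_vec = [0.7, 0.3, 0.1]

predicted, scores = classify_by_similarity(text_vec, label_vecs)
print(predicted)
```

Because label vectors can be computed once and reused for every document, each new text needs only a single encoder pass, whereas pairwise scorers need one pass per label.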

### Reranker Models (6 models)
- MS-MARCO-MiniLM-L6-v2
- BGE-reranker: base, large variants
- GTE-reranker-modernbert-base
- Qwen3-Reranker: 0.6B, 8B variants

### Base Transformer Encoders (5 models)
Used as baselines without task-specific fine-tuning:

- BERT-large-uncased
- DeBERTa-v3-large
- ModernBERT-large

## BTZSC Dataset Details

The benchmark comprises 22 English datasets across four task families:

### Sentiment Classification (6 datasets)
- Amazon Polarity, IMDB, Rotten Tomatoes, App Reviews, Financial Phrase Bank, Empathetic

### Topic Classification (6 datasets)  
- AGNews, Manifesto, Banking77, Massive, TrueTeacher, WikiToxic Identity Hate

### Intent Classification (5 datasets)
- BiasFrames (Intent, Offensive, Sex), Capsotu, WellFormedQuery

### Emotion Classification (5 datasets)
- EmoContext, EmotionDAIR, HateXplain, HateOffensive, Spam

**Dataset Statistics:**
- Number of classes: 2-77 labels per dataset
- Average document length: 8-280 tokens per dataset
- Domains: News, social media, reviews, encyclopedic, political


## Configuration

The project uses Hydra for configuration management. Key configuration files:

- `hydra/configs/config.yaml`: Main configuration for training models
- `hydra/configs/mode/eval.yaml`: Main configuration for evaluating models


### Example Configuration

```yaml
model:
  arch: cross-encoder-nli-triplet
  backbone: aarabil/bert-large-uncased

data:
  train: nli_triplet
  test: [nli_triplet]

pretraining:
  batch_size: 32
  epochs: 3
  lr_backbone: 8e-6
  lr_head: 4e-5
```
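
Command-line overrides such as `model.backbone=aarabil/bert-base-uncased` replace individual values in a nested configuration like the one above. The plain-Python sketch below mimics that merge to show the effect. It is not Hydra's implementation, and `apply_overrides` is a hypothetical helper introduced only for illustration.

```python
import copy


def apply_overrides(config: dict, overrides: list[str]) -> dict:
    """Return a copy of `config` with dotted `key=value` overrides applied."""
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = merged
        for part in parents:
            node = node.setdefault(part, {})
        # Values stay strings in this sketch; Hydra itself also infers types.
        node[leaf] = value
    return merged


base = {
    "model": {
        "arch": "cross-encoder-nli-triplet",
        "backbone": "aarabil/bert-large-uncased",
    },
    "pretraining": {"batch_size": 32, "epochs": 3},
}

cfg = apply_overrides(base, ["model.backbone=aarabil/bert-base-uncased"])
print(cfg["model"]["backbone"])
```

Only the overridden key changes; every other value keeps its default from the configuration file.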

## Experiment Tracking

This project uses MLflow for experiment tracking. All runs are logged with:

- Model hyperparameters
- Training metrics and loss curves  
- Evaluation results on all test datasets
- Model artifacts and checkpoints
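
A minimal sketch of how hyperparameters and metrics can be logged with MLflow's fluent API (`start_run`, `log_params`, `log_metric`). The parameter names, the metric key, and its value are placeholders rather than the repository's actual logging schema, and `InMemoryTracker` is a hypothetical stand-in that lets the example run without an MLflow installation.

```python
from contextlib import contextmanager


def log_run(params: dict, metrics: dict, tracker=None) -> None:
    """Log hyperparameters and final metrics inside a single tracked run."""
    if tracker is None:
        import mlflow as tracker  # real backend, imported lazily

    with tracker.start_run():
        tracker.log_params(params)
        for name, value in metrics.items():
            tracker.log_metric(name, value)


class InMemoryTracker:
    """Stand-in exposing the three MLflow calls used above, for dry runs."""

    def __init__(self):
        self.params, self.metrics = {}, {}

    @contextmanager
    def start_run(self):
        yield self

    def log_params(self, params):
        self.params.update(params)

    def log_metric(self, name, value):
        self.metrics[name] = value


tracker = InMemoryTracker()
log_run(
    params={"model.arch": "cross-encoder-nli-triplet", "pretraining.epochs": 3},
    metrics={"eval_accuracy": 0.5},  # placeholder value, not a paper result
    tracker=tracker,
)
print(tracker.params, tracker.metrics)
```

Calling `log_run(params, metrics)` without a `tracker` argument would send the same data to MLflow itself.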

## Reproducibility

### Data Preparation
All datasets are preprocessed and available through Hugging Face Datasets:
- Training data: `aarabil/clcp_nli_triplet`
- Evaluation datasets: Individual dataset configurations under `aarabil/clcp_*`

### Model Checkpoints
Trained models are saved in the `models/` directory with:
- Model weights (`head_state.pt`)
- Tokenizer configuration
- Training metadata (`meta.json`)
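
A small sketch for inspecting a saved checkpoint, using only the two file names listed above. The one-directory-per-model layout and the contents of `meta.json` are assumptions, and `read_checkpoint` is a hypothetical helper. Loading the weights themselves is left out.

```python
import json
import tempfile
from pathlib import Path


def read_checkpoint(checkpoint_dir: Path) -> dict:
    """Check that the expected files exist and return the parsed metadata."""
    head_path = checkpoint_dir / "head_state.pt"
    meta_path = checkpoint_dir / "meta.json"
    missing = [p.name for p in (head_path, meta_path) if not p.exists()]
    if missing:
        raise FileNotFoundError(f"{checkpoint_dir} is missing: {', '.join(missing)}")
    return json.loads(meta_path.read_text(encoding="utf-8"))


# Dry run against a throwaway directory with made-up metadata.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "example-model"  # hypothetical checkpoint name
    ckpt.mkdir()
    (ckpt / "head_state.pt").write_bytes(b"")  # empty placeholder, not real weights
    (ckpt / "meta.json").write_text(
        json.dumps({"arch": "cross-encoder-nli-triplet"}), encoding="utf-8"
    )
    meta = read_checkpoint(ckpt)

print(meta)
```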

### Hardware Requirements
- A GPU is recommended for training and inference (experiments were conducted on NVIDIA A100 hardware)
- ~16 GB of disk space for datasets and model checkpoints

