# ATAD: Agent-Centric Textual Anomaly Detection benchmark protocol

This repository contains the official implementation of ATAD, an Agent-Centric Textual Anomaly Detection benchmark protocol.

---

## 🔍 Overview

Traditional evaluation of large language models (LLMs) relies on static datasets, which are limited in scalability and fail to capture evolving reasoning abilities.

**ATAD** introduces a dynamic, agent-centric protocol in which:

- A **Teacher** agent generates problems.
- An **Orchestrator** validates them and ensures coherence and fairness.
- A **Student** agent attempts to solve them.

If the Student succeeds, the Orchestrator requests a harder version; if not, the problem is finalized. This process enables **difficulty scaling** without human curation.
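
The loop described above can be sketched roughly as follows. This is a minimal illustration, not the code in `atad_code/generation/`: the function `generate_item`, its three callables, and the `max_rounds` cap are hypothetical names introduced here. Retrying after a failed validation, and returning the last solved problem when the cap is reached, are assumptions that the description above does not specify.

```python
from typing import Callable, Optional


def generate_item(
    teacher: Callable[[Optional[str]], str],
    validate: Callable[[str], bool],
    student_solves: Callable[[str], bool],
    max_rounds: int = 3,
) -> Optional[str]:
    """Run one Teacher-Orchestrator-Student loop and return a problem.

    All three callables are hypothetical stand-ins for LLM-backed agents.
    """
    previous: Optional[str] = None  # last validated problem the Student solved
    for _ in range(max_rounds):
        # Teacher: write a first problem, or a harder version of `previous`.
        problem = teacher(previous)
        # Orchestrator: reject incoherent or unfair problems (assumed: retry).
        if not validate(problem):
            continue
        # Student failed, so the problem is finalized.
        if not student_solves(problem):
            return problem
        # Student succeeded, so the next round asks for a harder version.
        previous = problem
    # Assumption: when the round cap is hit, keep the hardest solved problem.
    return previous


if __name__ == "__main__":
    # Toy agents: difficulty is the string length; the Student solves short ones.
    toy_teacher = lambda prev: "p1" if prev is None else prev + "+"
    toy_validate = lambda problem: True
    toy_student = lambda problem: len(problem) < 4
    print(generate_item(toy_teacher, toy_validate, toy_student, max_rounds=5))
```

Passing the agents in as plain callables keeps the control flow separate from any particular model API, so the same loop can be exercised with toy functions as in the demo.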

The benchmark uses **text anomaly detection** tasks requiring logical reasoning across multiple sentences, revealing weaknesses missed by traditional benchmarks.

---

## 📁 Repository Structure

```bash
.
├── atad_code/
│   ├── generation/          # Benchmark generation code
│   ├── evaluation/          # Benchmark evaluation code
│   └── requirements.txt     # Required Python packages   
├── atad_dataset/            # Benchmark data used in the paper
└── README.md                # This file
```

---

## 🚀 Getting Started
### 1. Install dependencies
```bash
pip install -r atad_code/requirements.txt
```
### 2. Run example benchmark generation
```bash
python atad_code/generation/orchestrator_agentic_generator.py --config config.yaml
```
### 3. Evaluate model output
```bash
python atad_code/evaluation/eval_agentic_models.py --config eval_config.yaml --dataset 'path/your/data'
```
---
## 📊 Benchmark Data

The `atad_dataset` directory contains the official benchmark datasets used for the experiments in our paper, specifically those in Appendix B and Table 5. The main results reported in Table 1 are the average performance of each model across the four LLM-generated benchmark sets listed below.

The structure is as follows:

```bash
atad_dataset/
├── claude-3.5-sonnet/
│   ├── atad_claude-3.5-sonnet_base.jsonl
│   └── atad_claude-3.5-sonnet_final.jsonl
├── gemini-2.0-flash/
│   ├── atad_gemini-2.0-flash_base.jsonl
│   └── atad_gemini-2.0-flash_final.jsonl
├── gpt-4o/
│   ├── atad_gpt-4o_base.jsonl
│   └── atad_gpt-4o_final.jsonl
└── llama-3.3-70b/
    ├── atad_llama-3.3-70b_base.jsonl
    └── atad_llama-3.3-70b_final.jsonl
```

For each model:

- `*_base.jsonl`: Contains the base problems. These are initially generated by the Teacher agent and then validated by the Orchestrator, and represent the state before difficulty is escalated through interaction with the Student agent.

- `*_final.jsonl`: Contains the refined and validated problems after the full agentic generation process (Teacher-Orchestrator-Student interaction). These `_final` files collectively constitute the final ATAD benchmark.
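
As a rough sketch of how these files might be read, the helpers below load one split for every generator model, relying only on the directory layout and file naming shown above. The function names `load_jsonl` and `load_split` are hypothetical and not part of `atad_code`. The record schema is not documented here, so the code treats each line as an opaque JSON object and assumes no field names.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a .jsonl file."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_split(root, split="final"):
    """Map each generator-model directory name to its records for `split`.

    `split` is "base" or "final", matching the file suffixes described above.
    Directories without a matching file are skipped.
    """
    data = {}
    for model_dir in sorted(Path(root).iterdir()):
        if not model_dir.is_dir():
            continue
        path = model_dir / f"atad_{model_dir.name}_{split}.jsonl"
        if path.exists():
            data[model_dir.name] = load_jsonl(path)
    return data
```

For example, `load_split("atad_dataset", split="base")` would return a dictionary keyed by `claude-3.5-sonnet`, `gemini-2.0-flash`, `gpt-4o`, and `llama-3.3-70b`, and `len()` of each value gives the number of problems in that set.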

---
## 📦 Dataset Distribution

The full dataset and Croissant metadata will be made available publicly upon acceptance.

We adhere to standard dataset hosting guidelines, including:
- Validated Croissant metadata
- Publicly accessible storage
- Consistent versioning and permanence

A direct link to the dataset on a public repository (e.g., Hugging Face Hub) will be provided here upon acceptance.