# Description

This repository contains the code accompanying the NeurIPS 2025 submission "Answer Matching Outperforms Multiple Choice for LLM Evaluations".

# Guide

The folder structure is as follows:

lmeval/ - contains the files added to or changed in the lmeval harness. New tasks (such as free-form or multiple-choice verification) are placed in their respective folders for easier navigation and reproducibility of results.

annotations/ - contains human annotations for the dataset.

mcq_classifier/ - contains the code for the results demonstrating discriminative shortcuts.

src/ - contains the code for data filtering, evaluations, and analysis plots for the other results.
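As a quick sanity check after cloning, the layout above can be verified with a small shell function. This is a minimal sketch: `check_layout` is a hypothetical helper that is not part of the repository, and it only assumes the four top-level folder names listed above.

```shell
# Hypothetical helper (not part of the repository): report which of the
# top-level folders listed above are missing under a given root directory.
# Usage: check_layout /path/to/repo   (defaults to the current directory)
check_layout() {
  root="${1:-.}"
  missing=0
  for dir in lmeval annotations mcq_classifier src; do
    if [ ! -d "$root/$dir" ]; then
      echo "missing: $dir/"
      missing=1
    fi
  done
  # Exit status is 0 when every folder exists, 1 otherwise.
  return "$missing"
}
```

Running `check_layout .` from the repository root prints nothing and returns 0 when all four folders are present.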

### src code

The main folders are:

- `src/judge_w_gt/` - contains the implementation for running judges or matchers on model responses.
- `src/query_models/` - contains the implementation for querying models in both free-form and multiple-choice formats.

Extra folders:

- `src/visualize_resps/` - contains the implementation of the annotation interface.
- `src/cost_analysis/` - contains the implementation of the cost analysis for model outputs, matchers, and judges.
- `src/filtering/` - contains helper scripts for filtering the dataset.
- `src/format-analysis/` - contains plotting scripts.

---
## Installation

```
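# Create and activate a virtual environment named "qa"
uv venv qa
source qa/bin/activate
# Install PyTorch from the CUDA 12.1 wheel index
uv pip install torch --index-url https://download.pytorch.org/whl/cu121
# Install the modified lmeval harness in editable mode with its extras
cd lmeval
uv pip install -e ".[api,wandb,vllm,dev]"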
```
