
## Detecting ambiguity in context-free questions

This is the official implementation of the paper "CLARA: Clarification-Driven Measurement of Input Ambiguity in LLMs".

### Requirements

The dependency packages can be found in the `requirements.txt` file. You can use:

```sh
pip install -r requirements.txt
```

to configure the environment. We use Python 3.8 to run the experiments.

### Prepare the Data

Run the following script to prepare the data:

```sh
python tools/prepare_data.py
```

### Running the full pipeline

To execute the entire pipeline for `ambigqa` with five clarification batches, run the following commands:

```sh
for i in {0..4}
do
    python tools/generate_clarification.py \
        --dataset_name ambigqa \
        --output_path logs/clarification/ambigqa/ambigqa_${i}.json \
        --sample \
        --sample_n 1 \
        --model gpt-4o-mini-2024-07-18
done


python evaluate_CLARA_ambigqa.py \
    --log_path logs/clarification/ambigqa \
    --output_path logs/CLARA/ambigqa/ambigqa_clara.json

python tools/compute_CLARA_ambigqa.py
```

### Running the experiments

The overall pipeline is: generate clarifications → calculate CLARA scores → evaluate the performance (either mistake detection or ambiguity detection).
Before running the experiments, configure your API key and choose your model in the `src/common.py` file.

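1. Run the following command *N* times to generate *N* batches of clarifications, replacing `i` in the output filename with the batch index each time (for example `ambigqa_0.json` through `ambigqa_4.json` for *N* = 5):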

```sh
python tools/generate_clarification.py --dataset_name ambigqa --output_path logs/clarification/ambigqa/ambigqa_i.json --sample --sample_n 1 --model "gpt-4o-mini-2024-07-18"
```

* `model`: choices include `o4-mini-2025-04-16`, `gpt-4o-mini-2024-07-18`, `llama-4-scout-17b-16e-instruct`, `qwen3-32b`, and `magistral-small-2506`.
* `dataset_name`: choices include `ambigqa` and `ambig_inst`.

For `ambigqa`, you can choose between two prompts (diversified and regular) in `src/prompt_util.py`. We recommend the *diversified* prompt for best performance.
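If you prefer to drive step 1 from Python rather than re-running the command by hand, the sketch below builds one command per batch with a distinct output file. It only uses the flags shown above; the helper names `build_commands` and `run_all` and the `log_name` parameter are hypothetical and are not part of the repository.

```python
import subprocess
import sys


def build_commands(n_batches, dataset_name="ambigqa", log_name="ambigqa",
                   model="gpt-4o-mini-2024-07-18"):
    """Return one argv list per batch, each writing to its own JSON file."""
    commands = []
    for i in range(n_batches):
        # Mirrors the layout logs/clarification/<log_name>/<log_name>_<i>.json
        output_path = f"logs/clarification/{log_name}/{log_name}_{i}.json"
        commands.append([
            sys.executable, "tools/generate_clarification.py",
            "--dataset_name", dataset_name,
            "--output_path", output_path,
            "--sample",
            "--sample_n", "1",
            "--model", model,
        ])
    return commands


def run_all(n_batches, **kwargs):
    """Run the batches sequentially and stop at the first failing command."""
    for cmd in build_commands(n_batches, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_all(5)` from the repository root would launch five `ambigqa` batches. For `ambig_inst`, pass `dataset_name="ambig_inst", log_name="ambiginst"` so that the output directory matches the `logs/clarification/ambiginst` path used in step 2.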

2. Then, calculate the CLARA scores based on the generated clarifications:

```sh
python evaluate_CLARA_ambigqa.py --log_path logs/clarification/ambigqa --output_path logs/CLARA/ambigqa/ambigqa_clara.json
```

(for `ambigqa`)

or

```sh
python evaluate_CLARA_ambiginst.py --log_path logs/clarification/ambiginst --output_path logs/CLARA/ambiginst/ambiginst_clara.json
```

(for `ambig_inst`)

We calculate two scores:

* **CLARA**, which measures the sum of *antisimilarity* between clarifications and aggregates the results across the *N* batches;
* **CLARAOQ**, which weights these scores by the similarity between each clarification and the original question.

You can set the hyperparameter `MAX_CLARIFICATIONS` at the beginning of the script to limit the number of clarifications taken into account per question.
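To make the two scores more concrete, here is a minimal sketch of one possible reading of the description above. It is not the code in the evaluation scripts, and every specific choice is an assumption: clarifications and the question are represented as embedding vectors, antisimilarity is taken as one minus cosine similarity, the question weighting multiplies each pair by both clarifications' similarity to the question, and batches are aggregated by a plain mean. The value of `MAX_CLARIFICATIONS` is illustrative only.

```python
import math
from itertools import combinations

MAX_CLARIFICATIONS = 5  # illustrative value, not necessarily the script's default


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def clara_batch(clar_embs):
    """Sum of pairwise antisimilarity (assumed: 1 - cosine) within one batch."""
    embs = clar_embs[:MAX_CLARIFICATIONS]
    return sum(1.0 - cosine(a, b) for a, b in combinations(embs, 2))


def claraoq_batch(clar_embs, question_emb):
    """Same sum, with each pair weighted by both clarifications' similarity
    to the original question (an assumed weighting scheme)."""
    embs = clar_embs[:MAX_CLARIFICATIONS]
    sims = [cosine(e, question_emb) for e in embs]
    return sum(
        sims[i] * sims[j] * (1.0 - cosine(embs[i], embs[j]))
        for i, j in combinations(range(len(embs)), 2)
    )


def aggregate(batch_scores):
    """Combine the N per-batch scores (assumed: arithmetic mean)."""
    return sum(batch_scores) / len(batch_scores)
```

Under these assumptions, a question whose clarifications all say the same thing scores near zero, while clarifications that diverge from one another but stay close to the original question push both scores up.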

3. Then evaluate the performance using the evaluation scripts in the `tools/` directory, such as:

```sh
python tools/compute_CLARA_ambigqa.py
```

(for `ambigqa`)

or

```sh
python tools/compute_CLARA_ambiginst.py
```

(for `ambig_inst`)

**Note:** To generate clarifications with a model served through Groq or Mistral, you must add your `groq` or `mistral` API key and replace the response function with the one for that provider.
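As a rough illustration of how a model name could be tied to a provider and its key, the sketch below maps the model names listed in step 1 to a provider. The mapping, the environment variable names, and the helpers `provider_for` and `api_key_for` are all assumptions made for this example; the repository itself expects keys to be configured in `src/common.py`, and its actual response functions are not shown here.

```python
import os

# Assumed mapping from the listed model names to the API that serves them.
PROVIDERS = {
    "o4-mini-2025-04-16": "openai",
    "gpt-4o-mini-2024-07-18": "openai",
    "llama-4-scout-17b-16e-instruct": "groq",
    "qwen3-32b": "groq",
    "magistral-small-2506": "mistral",
}

# Conventional environment variable names (hypothetical for this repository).
KEY_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "groq": "GROQ_API_KEY",
    "mistral": "MISTRAL_API_KEY",
}


def provider_for(model):
    """Return the provider name for a model, or raise ValueError if unknown."""
    try:
        return PROVIDERS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model!r}") from None


def api_key_for(model, env=None):
    """Look up the API key for the model's provider in the given mapping."""
    env = os.environ if env is None else env
    var = KEY_ENV_VARS[provider_for(model)]
    key = env.get(var)
    if not key:
        raise RuntimeError(f"set {var} before using {model}")
    return key
```

A lookup like this fails early with a clear message when the key for the selected provider is missing, instead of failing in the middle of a generation run.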

Some parts of our experimental implementation are adapted from [https://github.com/UCSB-NLP-Chang/llm_uncertainty](https://github.com/UCSB-NLP-Chang/llm_uncertainty).

### Citation



