# PubHealthBench

This is a fork of the [MMLU-Pro](https://github.com/TIGER-AI-Lab/MMLU-Pro/tree/main) evaluation code, refactored for running PubHealthBench.

## Introduction

PubHealthBench is a benchmark designed to provide a broad assessment of LLM knowledge of current UK Government public health guidance. PubHealthBench contains over 8,000 questions relating to public, clinical, and professional guidance across 10 public health topic areas, sourced from 687 documents from the UK Government website (gov.uk) on 08/01/2025.

## Dataset Creation (benchmark_dataset folder)
We use a large corpus of 1,150 current UK Government guidance documents from the UK Government website (gov.uk) in HTML and PDF formats as the source material. To generate PubHealthBench, we developed an automated pipeline that extracts free text from documents, chunks it into sections, generates MCQA samples, and filters them to a high-quality subset. See the paper for full details.

IMPORTANT
Please note:

This dataset should not be used as a source of UK Government public health information. For all public health queries, please seek the up-to-date guidance directly from the relevant organisation.

1. UK Government guidance is updated regularly, so some information used in this benchmark may be out of date.
2. To generate PubHealthBench we extracted text from HTML and PDF documents. In some cases there may be errors in the extracted text, or parts of the text may be missing. Please refer to the official guidance documents if in doubt.
3. This dataset does not represent an exhaustive list of UK Government public health guidance.
4. Questions and answer options were generated by a Large Language Model. Please see the accompanying publication for details.
5. Some questions or answers may be erroneous. Please flag any questions containing potential problems on the issues page, quoting the 'question_id'.

### Dataset Details

Language(s) (NLP) - English

License - CC BY 4.0 (contains public sector information licensed under the Open Government Licence v3.0; see below)

### Licensing
The dataset as a whole is released under CC BY 4.0.

Rows in the 'source_chunk_text' and 'retrieved_context_for_judge' columns incorporate Crown copyright material from public sector documents. This material remains under the Open Government Licence v3.0.

## API Evaluation

To use the API for inference, set your API_KEY as an environment variable then run:

**MCQA Setup:**

```bash
python evaluate_from_api.py \
    --model_name gpt-4o-mini \
    --output_dir eval_results_reviewed_mcqa \
    --assigned_subjects all \
    --subset reviewed \
    --setup mcqa
```

**Free Form Setup:**

To use the API for inference, set your API_KEY and JUDGE_API_KEY as environment variables then run:

```bash
python evaluate_from_api.py \
    --model_name gpt-4o-mini \
    --output_dir eval_results_reviewed_freeform \
    --assigned_subjects all \
    --subset reviewed \
    --setup freeform \
    --judge_model_name gpt-4o-mini-2024-07-18
```
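Both commands above read their keys from the environment. As a minimal sketch, assuming a bash-compatible shell, the variables can be exported for the current session as follows. The key values shown are placeholders, not real credentials, and the early-failure checks are an optional addition rather than something the evaluation script requires.

```bash
# Placeholder values: substitute your own keys and keep real keys out of version control.
export API_KEY="your-api-key"
# Only needed for the free form setup, where a judge model scores the answers.
export JUDGE_API_KEY="your-judge-api-key"

# Optional: stop with a clear message if either variable is unset or empty.
: "${API_KEY:?API_KEY is not set}"
: "${JUDGE_API_KEY:?JUDGE_API_KEY is not set}"
```

Variables exported this way last only for the current shell session, so they need to be set again in each new terminal.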

### Overall Accuracy
```bash
python compute_accuracy.py eval_results_reviewed_mcqa/
```

## Local Evaluation

To use local inference, ensure you have vLLM installed and run:

**MCQA Setup:**

```bash
python evaluate_from_local.py \
    --model google/gemma-3-1b-it \
    --save_dir eval_results_reviewed_mcqa \
    --selected_subjects all \
    --subset reviewed
```
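Before starting a local run, it can help to confirm that vLLM is importable by the interpreter that will run the script. The check below is a sketch, assuming `python` on your PATH is that interpreter. The `vllm_status` variable is a hypothetical name used only for illustration, and `pip install vllm` is the standard PyPI install, which may need adjusting for your CUDA setup.

```bash
# Check whether vLLM can be imported by the interpreter that will run the script.
if python -c "import vllm" >/dev/null 2>&1; then
    vllm_status="installed"
else
    vllm_status="missing"
    echo "vLLM not found; install it with: pip install vllm" >&2
fi
echo "vLLM status: ${vllm_status}"
```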

Note that we do not currently provide a fully local implementation of the free form setup, as we use gpt-4o-mini-2024-07-18 as the judge.

