# On Calibration of LLM-based Guard Models for Reliable Content Moderation


## Overview

Large language models (LLMs) are exposed to significant risks due to their potential for malicious use. Existing studies have developed LLM-based guard models designed to moderate the input and output of threat LLMs, ensuring adherence to safety policies by blocking content that violates these protocols upon deployment. However, limited attention has been given to the reliability and calibration of such guard models. In this work, we empirically conduct comprehensive investigations of confidence calibration for 9 existing LLM-based guard models on 12 benchmarks in both user input and model output classification. Our findings reveal that current LLM-based guard models tend to 1) produce overconfident predictions, 2) exhibit significant miscalibration when subjected to jailbreak attacks, and 3) demonstrate limited robustness to the outputs generated by different types of response models. Additionally, we assess the effectiveness of post-hoc calibration methods to mitigate miscalibration. We demonstrate the efficacy of temperature scaling and, for the first time, highlight the benefits of contextual calibration for confidence calibration of guard models, particularly in the absence of validation sets. Our analysis and experiments underscore the limitations of current LLM-based guard models and provide valuable insights for the future development of well-calibrated guard models toward more reliable content moderation. We also advocate for incorporating reliability evaluation of confidence calibration when releasing future LLM-based guard models.
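To make "confidence calibration" concrete, the sketch below computes the expected calibration error (ECE), a common miscalibration metric, with equal-width bins. It is an illustration only and not necessarily the implementation used in this repository. The function name `expected_calibration_error`, the default of 10 bins, and the use of the predicted class probability as the confidence are all assumptions.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (hypothetical helper, not part of this repo).

    confidences: probability assigned to the predicted class, in [0, 1]
    correct:     1 if the prediction matched the label, else 0
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges; digitize maps each confidence to a bin index
    # in 0..n_bins-1, with bins of the form (lower, upper].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that fall in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

For example, four predictions that each have confidence 0.9, of which three are correct, give an ECE of 0.15: the model claims 90% confidence but is right only 75% of the time, which is the overconfidence pattern described above.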

## Quick Start


### Installation

```bash
cd reliable_content_moderation
pip install -r requirements.txt
```

### Running The Evaluation 
For general evaluations and experiments with improved calibration, run:

```bash
bash ./run_eval.sh
```

In the script `./run_eval.sh`, you can modify the following config options:

- `mode`: prompt or response classification
- `dataset`: the dataset to be examined
- `cls_path`: the guard model card to be examined
- `cal_method`: the calibration method (for `ts`, also provide a temperature value)
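The `ts` option presumably refers to temperature scaling, which the overview names as an effective post-hoc method. As a minimal sketch of the technique (not the repository's code), the snippet below divides the guard model's class logits by a temperature before the softmax. The function names and the two-class logit layout are assumptions made for illustration.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def temperature_scale(logits, temperature):
    """Return probabilities after dividing the logits by `temperature`.

    temperature > 1 softens the distribution (lower confidence),
    temperature < 1 sharpens it, and temperature == 1 leaves it unchanged.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    return softmax(np.asarray(logits, dtype=float) / temperature)


# Hypothetical logits for the two classes of a single input.
logits = np.array([[4.0, 0.0]])
raw = temperature_scale(logits, 1.0)
cooled = temperature_scale(logits, 2.0)
```

Temperature scaling never changes the predicted class, only the confidence attached to it, so accuracy-style metrics are unaffected while calibration can improve.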

For response model-dependent evaluations and experiments with improved calibration, run:

```bash
bash ./run_eval_model_dep.sh
```

The script `./run_eval_model_dep.sh` runs experiments only for **response classification** on **harmbench-adv-model** (subsets of the harmbench-adv set). Only two config options can be modified: `cls_path`, the guard model card to be examined, and `cal_method`, the calibration method (for `ts`, also provide a temperature value).
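The overview also highlights contextual calibration, which needs no validation set. The sketch below follows the general formulation of Zhao et al. (2021): the guard model's class probabilities on a content-free input estimate its bias, and predictions are rescaled by the inverse of that estimate before a softmax. Whether this repository uses exactly this form, and how it builds the content-free input, are assumptions; the function name is hypothetical.

```python
import numpy as np


def contextual_calibrate(probs, content_free_probs):
    """Rescale class probabilities by a content-free bias estimate.

    probs:              model probabilities, shape (..., n_classes)
    content_free_probs: model probabilities on a content-free input,
                        shape (n_classes,), all entries > 0
    """
    probs = np.asarray(probs, dtype=float)
    p_cf = np.asarray(content_free_probs, dtype=float)
    p_cf = p_cf / p_cf.sum()

    # W = diag(p_cf)^-1 and b = 0, so W @ p is an element-wise division.
    scores = probs / p_cf

    # Softmax over the rescaled scores (numerically stable form).
    z = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


# Hypothetical example: the model leans toward class 0 even on a
# content-free input, so its confidence in class 0 is pulled down.
biased = np.array([0.8, 0.2])
calibrated = contextual_calibrate(np.array([0.9, 0.1]), biased)
```

By construction, the content-free input itself is mapped to a uniform distribution, which is the sense in which the estimated bias is removed.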

 

## Acknowledgements 

This repository is based on [HarmBench](https://github.com/centerforaisafety/HarmBench) and [Verified Uncertainty Calibration](https://github.com/p-lambda/verified_calibration).
