# OE-Safety Classifiers

This module defines a standard interface for safety classification and implements models under that interface.

## Classifier Interface

The main interface is defined by the `SafetyClassifierBase` and `SafetyClassifierOutput` classes in `base.py`.

`SafetyClassifierOutput` consists of a superset of all the fields that can be returned by a safety classifier. It is typed such that output fields are optional (`type | None`). The `SafetyClassifierOutput.get_fields_and_types()` static method returns a mapping from each field name to its (non-`None`) type. The `asdict()` method converts from the object into a native python dictionary.

If you are creating a classifier with new output fields, you should add them to the `SafetyClassifierOutput` class.

### Key details for classifiers

All classifiers subclass from `SafetyClassifierBase`. Subclasses must provide the following implementations:
- `__init__`: Should provide a *default value for batch size* and call `super().__init__()` along with any other required initialization.
- `get_required_input_fields`: Should return a list of required input field names as strings.
- `get_optional_input_fields`: Optionally can override to return a list of optional field names; default is empty list.
- `get_output_fields`: Should return a list of all output fields as strings.
- `_classify_batch`: Define the core classification logic on a batch of inputs. Should return a list of `SafetyClassifierOutput` objects.

## Using Classifiers

The main entrypoint to use safety classifiers is `load_classifier_model` in `loader.py`. This factory function takes a model name and any necessary kwargs, and returns an initialized model ready to call `.classify`.

Check the class definition of the model you want to load for required and optional init arguments to pass to the loader; for example, `AI2SafetyRequestClassifierV0Base` models require `local_model_path`. All parameters should be passed as keyword arguments.

To quickly run a classifier on a dataset, you can use the dataflow syntax. Assuming your data is in standard JSONL format, you can run this command from the root of OE-Safety (note that the quotes are required):

```bash
python data_pipelines/dataflow/dataflow.py run "'{input_filepath}' >> ReadDataNode() >> ClassifierNode('{model_name}') >> WriteDataNode() >> '{output_filepath}'"
```

### Supported Classifiers

Currently, we support the following classifiers. Note that `loader.py` is the single source of truth for the exhaustive list of available classifiers.
- Local model classifiers (including LlamaGuard and AI2 custom trained classifiers)
- OpenAI model based classifiers
- API classifiers (currently OpenAI Moderation API)

## OpenAI Model Classifiers

Because OpenAI model based classifiers are particularly uniform in logic, we add an abstracted config for these, defined in `openai_model_safety_classifier_configs.py`. The `OpenAIModelSafetyClassifierConfig` schema defines a classifier based on the minimal parameters consisting of an instruction and output field descriptions, with optional parameters of batch size and OpenAI model. This class uses OpenAI's structured JSON output mode and works with `SafetyClassifierOutput` to infer the types of the output fields.

When loading an OpenAI classifier, you can provide the model name of a specific classifier in `load_classifier_model` to use a predefined config, or provide the model name as `OpenAIModelSafetyClassifier` with either `config_path` pointing to a config YAML or an `OpenAIModelSafetyClassifierConfig` object in the `config` keyword argument to instantiate one with custom behavior. Note that the `config_path` should be an absolute path, or it will be interpreted as a relative path from the `src/classifier_models/` directory.

Configs can be defined using YAML format (preferred) or in code directly. See the `config_openai/` folder for examples. A minimum required YAML config is as follows:

```yaml
instruction: "Classify this prompt: {prompt}\n{output_description}"
output_fields:
  prompt_harmfulness: "Is the prompt harmful?"
```

This will be formatted into the following complete prompt sent to the model:

```
Classify this prompt: {prompt}
prompt_description: Is the prompt harmful?

Respond directly in JSON format (without Markdown code blocks or other formatting).
The JSON should follow this schema:
{
  "prompt_harmfulness": "str (harmful/sensitive/unharmful)"
}
```

### Key details for OpenAI configs

- Any inputs (e.g. prompt and/or response) must be included as string format variables, like `"{prompt}"` and `"{response}"`. The `instruction` field must also contain the string `"{output_description}"` inside, which will be substituted with the JSON schema for the model to return. Do not directly include formatting instructions, as these will be added automatically.
- The keys in `output_fields` must correspond to fields in `SafetyClassifierOutput`, and the values should be plaintext descriptions of the output fields to instruct the model.
- When defining config in YAML, for the `instructions` string follow the format of the existing configs. Note that `\` before a newline in the string will prevent the parser from adding whitespace between lines.
