# Reproduction Package

To reproduce the results of our AutoPDL work, follow these steps:

1. Set up the environment and package
2. Set environment variables
3. Download and preprocess the datasets
4. Run experiments
5. Analyze results


## Environment and package

Set up a virtual environment with your tool of choice, e.g. `conda`, then install this package:
```bash
pip install -U --upgrade-strategy=eager '.[all]'
```

## Environment variables

In your `.bashrc` or at the top of the experiment `.sh` scripts, set the following environment variables:

```bash
export WATSONX_URL=https://us-south.ml.cloud.ibm.com  # region dependent
export WATSONX_API_KEY=     # set your watsonx key
export WATSONX_PROJECT_ID=  # set your watsonx project id
export OPENAI_API_KEY=      # set your OpenAI key
export EVALPLUS_MAX_MEMORY_BYTES=43980465111040
```
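Before launching a long run, it can help to confirm that the variables are visible to Python. The check below is a minimal sketch, not part of the package: the helper name `missing_env_vars` is hypothetical, and it assumes all five variables are needed, which may not hold if you only use one inference provider.

```python
import os

# Names copied from the export block above. Treating all of them as
# required is an assumption; trim the list if you use a single provider.
REQUIRED_VARS = [
    "WATSONX_URL",
    "WATSONX_API_KEY",
    "WATSONX_PROJECT_ID",
    "OPENAI_API_KEY",
    "EVALPLUS_MAX_MEMORY_BYTES",
]


def missing_env_vars(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_env_vars()` with no arguments inspects the current process environment and returns an empty list when everything is set.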


## Download & process datasets

We have uploaded the preprocessed datasets needed to run all experiments to Zenodo, available [here](https://zenodo.org/records/15115491/files/datasets.zip?download=1). Please extract the archive into a folder named `var/` at the root of this folder, so that the dataset folders are directly under `var/`.
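If you prefer to extract the archive from Python, the following is a minimal sketch using only the standard library. The function name `extract_datasets` is hypothetical, and the sketch assumes the archive stores the dataset folders at its top level; if it wraps them in a single parent folder, move the contents up one level afterwards.

```python
import zipfile
from pathlib import Path


def extract_datasets(archive_path, dest="var"):
    """Extract the archive into dest and return the top-level folder names."""
    dest_dir = Path(dest)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(dest_dir)
    # Listing the folders makes it easy to confirm the expected layout.
    return sorted(p.name for p in dest_dir.iterdir() if p.is_dir())
```

For example, `extract_datasets("datasets.zip")` run from the root of this folder returns the names of the folders that now sit directly under `var/`.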


Should you wish to run our preprocessing steps, please follow these instructions:

### Preprocessing
Download the following files:
```
# BIG-Bench FEVER
https://raw.githubusercontent.com/google/BIG-bench/refs/heads/main/bigbench/benchmark_tasks/fact_checker/fever/task.json

# FEVER
https://fever.ai/download/fever/shared_task_dev.jsonl
https://fever.ai/download/fever/shared_task_test.jsonl
https://fever.ai/download/fever/wiki-pages.zip
```
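The files can also be fetched with a short script. This is a sketch under assumptions: the target folder `var/raw` and the helper names are hypothetical, so check the preprocessing notebooks for the paths they actually expect. Note that `wiki-pages.zip` is a large download.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse

# URLs copied from the list above.
URLS = [
    "https://raw.githubusercontent.com/google/BIG-bench/refs/heads/main/bigbench/benchmark_tasks/fact_checker/fever/task.json",
    "https://fever.ai/download/fever/shared_task_dev.jsonl",
    "https://fever.ai/download/fever/shared_task_test.jsonl",
    "https://fever.ai/download/fever/wiki-pages.zip",
]


def target_path(url, raw_dir="var/raw"):
    """Map a URL to a local file named after the last URL path segment."""
    return Path(raw_dir) / Path(urlparse(url).path).name


def download_all(raw_dir="var/raw"):
    """Download every file in URLS, skipping files that already exist."""
    Path(raw_dir).mkdir(parents=True, exist_ok=True)
    for url in URLS:
        dest = target_path(url, raw_dir)
        if not dest.exists():
            urllib.request.urlretrieve(url, dest)
```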

The remaining datasets are downloaded using HuggingFace Datasets.

Run `process_gsm8k.py` and `process_mbpp.py`. For FEVER and GSM-Hard, run the notebooks `fever_json.ipynb` and `gsmhard.ipynb`.

## Run experiments

To run the experiments, run `run_all_experiments.sh`.

## Analyze results

The notebook `plotting_exp.ipynb` processes the resulting data to produce the tables and figures.

# Running a scaled-down experiment
As the full optimization runs require a lot of resources, we have prepared a scaled-down experiment to test the executability of our reproduction package. To run this example:

```bash
python -m pdl.optimize.optimize --config exp_configs/gsm8k/granite_3_8b_instruct_gsm8k_opt_small.yml examples/prompt_library/exp/gsm8k/general.pdl
```

We recommend using Ollama as described below if watsonx inference is not a suitable option.

# Optimizer/experiment configurations

The PDL prompt pattern library is located in `examples/prompt_library`, and the programs for each dataset can be found in the `/exp` subfolder. The optimized PDL program is written to `optimized_<input file>.pdl`. The resulting programs from our experiments can be found in the `optimized_pdl` folder at the root of this folder.

See `exp_configs` for the YAML files. They can be modified as needed and rerun following the commands in `run_all_experiments.sh`, e.g.:
```bash
python -m pdl.optimize.optimize --config <experiment_configuration.yml> <input_pdl_program.pdl>

python -m pdl.optimize.optimize --config exp_configs/gsmhard/granite_3_8b_instruct_gsmhard_opt.yml examples/prompt_library/exp/gsm8k/general.pdl
```

For example, you may wish to avoid a hosted inference service and instead use Ollama locally. To do so, first install [Ollama](https://ollama.com/) and run the model, e.g. `ollama run granite3-dense:8b`. Then modify the `model` variable in the YAML experiment/optimizer configuration (see the LiteLLM [docs](https://docs.litellm.ai/docs/providers/ollama#using-ollama-apichat) for more details), e.g. from:
```yaml
variables:
  model:
  - watsonx/ibm/granite-3-8b-instruct
```

to

```yaml
variables:
  model:
  - ollama_chat/granite3-dense:8b
```
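If you need to switch many configurations at once, the same change can be applied programmatically. The helper below is a hypothetical sketch that operates on an already parsed configuration (a plain dictionary); it assumes only the `variables` / `model` layout shown above and leaves every other key untouched.

```python
import copy


def with_model(config, model):
    """Return a copy of a parsed optimizer config that uses a single model."""
    updated = copy.deepcopy(config)  # leave the caller's config unchanged
    updated.setdefault("variables", {})["model"] = [model]
    return updated
```

With PyYAML installed, you could load a file with `yaml.safe_load`, pass the result through `with_model(config, "ollama_chat/granite3-dense:8b")`, and write it back with `yaml.safe_dump`.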

# Generating Trajectories
Excerpt from supplemental material:
## Agent Trajectory Construction

We create a basic agentic trajectory `traj_i` for each training example ⟨xᵢ, yᵢ⟩, following a rule-based transformation outlined below.

### GSM8K

To demonstrate tool use in ReAct, we derive a trajectory `traj` as follows. We exploit the fact that there is at most one expression per reasoning step by iterating through the steps. At each step, we append a *thought* to the trajectory, consisting of the text leading up to the math expression, concatenated with a reflection: "I need to calculate". We append a calculator tool call with the expression, and an *observation*—i.e., the result of the expression. Finally, we append a thought "The answer is ...", containing the ground truth answer, followed by the *finish* action with the answer.

We follow the same procedure to create ReWOO trajectories, except we use slightly different wording (e.g., "Calculate xyz" in place of "I need to calculate xyz") and omit the final thought and action. Additionally, we use string substitution to replace any assumed expression results in the trajectory with the corresponding variable.
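The ReAct variant of this transformation can be sketched as follows. This is an illustration, not the code used in our experiments. It relies on the GSM8K convention of annotating calculations as `<<expression=result>>` and ending solutions with `#### answer`. The step representation, the `calculator:` and `finish:` labels, and the handling of steps without a calculation (kept as plain thoughts) are assumptions made for the example.

```python
import re

# GSM8K marks each calculation as <<expression=result>>.
CALC = re.compile(r"<<([^=<>]+)=([^<>]+)>>")


def react_trajectory(solution):
    """Turn one GSM8K reference solution into a list of (kind, text) steps."""
    body, _, answer = solution.partition("####")
    answer = answer.strip()
    steps = []
    for line in body.strip().splitlines():
        match = CALC.search(line)
        if match is None:
            # Assumption: a step without a calculation becomes a plain thought.
            steps.append(("thought", line.strip()))
            continue
        expression, result = match.group(1), match.group(2)
        lead_in = line[: match.start()].strip()
        steps.append(("thought", f"{lead_in} I need to calculate {expression}".strip()))
        steps.append(("action", f"calculator: {expression}"))
        steps.append(("observation", result))
    steps.append(("thought", f"The answer is {answer}"))
    steps.append(("action", f"finish: {answer}"))
    return steps
```

A ReWOO variant would change the thought wording, drop the final two steps, and replace assumed results with variables, as described above.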

### FEVER

To produce agent trajectories, we iterate over each article associated with a claim. For each article, we append a thought "I need to search for ...", followed by the search action, an observation containing the article summary, and finally a thought containing all the relevant sentences associated with that article for that claim.

This procedure is not ideal as there is no inherent order to the articles or sentences, even though there may be a natural ordering following the annotator's Wikipedia navigation. Finally, we append a thought "The claim is true/false" and the *finish* action, both with the ground truth answer. For chain-of-thought, we perform the same procedure except we only include the concatenated evidence sentences, as there is no tool use.
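The FEVER procedure can be sketched in the same style. The input layout (a list of `(title, summary, sentences)` tuples per claim), the boolean label, and the `search:` and `finish:` labels are assumptions made for this example; they do not necessarily match the preprocessed data format.

```python
def fever_trajectory(label, articles):
    """Build a trajectory from (title, summary, sentences) tuples for one claim."""
    steps = []
    for title, summary, sentences in articles:
        steps.append(("thought", f"I need to search for {title}"))
        steps.append(("action", f"search: {title}"))
        steps.append(("observation", summary))
        # All evidence sentences for this article form a single thought.
        steps.append(("thought", " ".join(sentences)))
    verdict = "true" if label else "false"
    steps.append(("thought", f"The claim is {verdict}"))
    steps.append(("action", f"finish: {verdict}"))
    return steps


def cot_evidence(articles):
    """Chain-of-thought variant: only the concatenated evidence sentences."""
    return " ".join(s for _, _, sentences in articles for s in sentences)
```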

### MBPP+

To generate sample agent trajectories from the training set, we follow the in-context examples for the agent pattern (without feedback) by Wang et al. (2024). Each example consists of the problem `x`, a thought such as "The intersection is elements that are in both lists", and an *execute* action that contains the proposed code *and* an assertion that calls the proposed method with the test case input from the prompt and checks its output. This is followed by an *observation* containing the execution result, i.e., either `"Executed Successfully with No Output"` or a stack traceback. This allows the agent to iterate on solutions (up to five times in our implementation).
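The *execute* action and its observation can be illustrated with a small sketch. The function name `execute_action` is hypothetical, and this version runs code in-process with `exec`, without sandboxing or output capture, so it is only suitable for trusted code and does not reflect how the package itself executes solutions.

```python
import traceback

SUCCESS = "Executed Successfully with No Output"


def execute_action(code):
    """Run proposed code plus its assertion and return the observation text."""
    namespace = {}
    try:
        exec(code, namespace)  # unsandboxed: only run trusted code this way
    except Exception:
        # The observation is the stack traceback, including the assert message.
        return traceback.format_exc()
    return SUCCESS
```

Passing a correct solution followed by its assertion returns the success string; a failing assertion returns a traceback that ends with the `AssertionError` message.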

We use the full MBPP train set of 374 problems as `D_train`, and split the MBPP+ dataset into `D_valid` and `D_test` based on problem ID membership in MBPP, leaving 39 and 224 validation and test problems respectively.
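One way to read this split is sketched below. It assumes that an MBPP+ problem goes to `D_valid` when its ID belongs to the MBPP validation split and to `D_test` when it belongs to the MBPP test split, with all other MBPP+ problems left out. The function name and arguments are hypothetical.

```python
def split_mbpp_plus(plus_ids, mbpp_valid_ids, mbpp_test_ids):
    """Split MBPP+ problem IDs by their membership in the MBPP splits."""
    valid_ids, test_ids = set(mbpp_valid_ids), set(mbpp_test_ids)
    d_valid = [i for i in plus_ids if i in valid_ids]
    d_test = [i for i in plus_ids if i in test_ids]
    return d_valid, d_test
```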

To generate synthetic trajectories from the training set, we start with the natural language specification and single test case (the prompt) and append the thought "I should run a solution on the test case before proposing a solution." We then append the ground truth solution, substituting in the prompt test case according to the pattern:

```python
[solution]res = ...; assert res == ..., "Expected ... but got {}".format(res)