# Offline Policy Evaluation Using the Command Line

```{admonition} Credit
Thank you [@maxpagels](https://github.com/maxpagels) for [contributing](https://github.com/VowpalWabbit/vowpalwabbit.github.io/pull/193) this tutorial.
```


Offline policy evaluation (OPE) is an active area of research in reinforcement learning. The aim, in a contextual bandit setting, is to take bandit data generated by some policy (let's call it the _production policy_) and estimate the value of a new _candidate policy_ offline. The use case is clear: before you deploy a policy, you want to estimate its performance, and compare to what's already deployed. This tutorial walks you through how to evaluate new policies in a variety of scenarios: in batch learning settings, where deployed policies stay the same until replaced; and in online settings, where you plan to deploy a policy that learns in real time as new data comes in.

This tutorial is text-heavy, but we encourage you to read through it carefully to make the most of OPE.

## Introduction & Motivation

In supervised learning settings, the standard approach to offline evaluation is to train on a train set and estimate generalisation performance on a holdout set. In online learning settings, one typically uses progressive validation. In contextual bandit settings, neither is directly possible, because like all reinforcement learning, there is a partial information problem: you never get to see rewards of actions you didn't take. Your only source of information is the bandit data generated by your production policy, which might make entirely different choices than your candidate policy.

It doesn't, then, seem possible to reliably evaluate contextual bandit policies offline. But it is! The key is to use estimators that fill in fake rewards for actions that weren't taken, thereby creating a "fake" supervised learning dataset, against which you can estimate performance, either using progressive validation (more on that later) or a holdout set.

VW implements several estimators to reduce policy evaluation to supervised learning-type evaluation. The simplest method, the direct method (DM), simply trains a regression model that estimates the cost (negative reward) of an (action, context) pair. As you might suspect, this method is generally biased, because the partial information problem means you typically see many more rewards for good actions than bad ones (assuming your production policy is working normally). Biased estimators should not be used for offline policy evaluation, but VW implements provably unbiased estimators like inverse propensity weighting (IPS) and doubly robust (DR) that can be used for this purpose.

Finally, before we get into how to run offline policy evaluation in VW, note that in this tutorial, by policies we mean contextual bandit models, not the exploration layer (e.g. epsilon-greedy) that is usually part of a contextual bandit system to tackle the explore-exploit tradeoff.

For now, If you wish to evaluate the performance of the entire loop (model + exploration), please refer to the documentation for [`--explore_eval`](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Explore-Eval). It is useful if you want to understand how different types of exploration might lead to better future rewards in an online learning bandit system.

## Batch scenario: policy evaluation with a pre-trained VW policy, `cb`-format data

Let's say you have collected the following bandit data from your production policy, and that the data is ordered such that the oldest data is first (don't worry about the actual numbers in this toy example):

    1:1:0.5 | user_age:25
    2:0:0.5 | user_age:25
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    1:1:0.5 | user_age:27
    2:0:0.5 | user_age:21
    2:0:0.5 | user_age:23
    2:0:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:36
    1:1:0.5 | user_age:25
    2:0:0.5 | user_age:25
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    1:1:0.5 | user_age:27
    2:0:0.5 | user_age:21
    2:0:0.5 | user_age:23
    2:0:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:36

In order to do OPE, it is useful to think carefully about what you wish to evaluate. In a batch setting, you are interested in training a candidate policy and evaluating its performance on unseen data that is fresher than what you trained on. So, let's split our data into two files. Starting from the oldest data first, we do e.g. and 70%/30% split, and save the results as `train.dat` and `test.dat`:

`train.dat`:

    1:1:0.5 | user_age:25
    2:0:0.5 | user_age:25
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    1:1:0.5 | user_age:27
    2:0:0.5 | user_age:21
    2:0:0.5 | user_age:23
    2:0:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:36
    1:1:0.5 | user_age:25
    2:0:0.5 | user_age:25
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56

`test.dat`:

    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    1:1:0.5 | user_age:27
    2:0:0.5 | user_age:21
    2:0:0.5 | user_age:23
    2:0:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:36

Before continuing, it is worth understanding that policy value estimators such as IPS, DM and DR aren't only useful for policy value estimation. Since they provide us a way to fill in fake rewards for untaken actions, they allow us to reduce bandit learning to supervised learning, and used to _train_ policies. For example, say you have a (biased) DM estimator. For each untaken action per round, you can predict a reward, thus forming a supervised learning example where the loss of each action is known (estimated). You can then train an importance-weighted classification model, or even a regression model that estimates costs of arms given contexts, and use these models as policies. This is, in fact, what VW does: estimators serve a dual purpose and are used not only for evaluation, but also optimisation/training.

Now that we know that the same estimators that are used for OPE can also be used to train policies, let's train a new candidate policy on the train set using e.g. IPS, and save the model as `candidate-model.vw`. In this instance we have two arms, so the command will look something like:

`vw --cb 2 --cb_type ips -d train.dat -f candidate-model.vw`

Feel free to add other options, including training over multiple passes with `--passes`. If you run the command as is above, you get the following result (actual results may vary based on VW version and the seed):

    § vw --cb 2 --cb_type ips -d train.dat -f candidate-model.vw
    final_regressor = candidate-model.vw
    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    using no cache
    Reading datafile = train.dat
    num sources = 1
    Enabled reductions: gd, scorer, csoaa, cb
    average  since         example        example  current  current  current
    loss     last          counter         weight    label  predict features
    2.000000 2.000000            1            1.0    known        1        2
    1.000000 0.000000            2            2.0    known        2        2
    1.000000 1.000000            4            4.0    known        1        2
    0.750000 0.500000            8            8.0    known        1        2
    0.500000 0.250000           16           16.0    known        1        2

    finished run
    number of examples = 16
    weighted example sum = 16.000000
    weighted label sum = 0.000000
    average loss = 0.500000
    total feature number = 32

The `average loss` above is of less interest to us in this scenario since we have a separate test set. Let's load `candidate-model.vw` using `-i` and test against our test set. Remember to use the `-t` flag to disable learning, and to use the same `cb_type` as you did when training:

    $ vw -i candidate-model.vw --cb_type ips -t -d test.dat
    only testing
    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    using no cache
    Reading datafile = test.dat
    num sources = 1
    Enabled reductions: gd, scorer, csoaa, cb
    average  since         example        example  current  current  current
    loss     last          counter         weight    label  predict features
    0.000000 0.000000            1            1.0    known        1        2
    0.000000 0.000000            2            2.0    known        1        2
    0.500000 1.000000            4            4.0    known        1        2
    0.250000 0.000000            8            8.0    known        1        2

    finished run
    number of examples = 8
    weighted example sum = 8.000000
    weighted label sum = 0.000000
    average loss = 0.250000
    total feature number = 16

The `average loss` reported is the OPE estimate for this policy, and in this case, calculated against our test set. Since we specified `--cb_type ips`, the IPS estimator is used, which is unbiased. Feel free to use `dr`, too, but note that although VW will allow it, *the use of `dm` is discouraged for OPE since it is biased*. If you are unsure, we suggest using `dr`. Note that VW will complain if you train using one `cb_type` but test using another; mixing estimators in training and evaluation is currently not supported.

Now that you have an OPE estimate of `0.250000`, how does it compare to the production policy in production? This comparison is easy to make since, for the same time period, we have the ground truth in the `test.dat` file. If we sum the costs in that file and divide by the number of examples, we get `0.625`. Generally, our toy example has far too little data with which to perform reliable estimates, but the principle applied: lower is better, so your candidate policy is estimated to perform better than the production policy in production. The exact definition of OPE is important: in this case, it means that had you deployed the candidate policy, with no exploration, instead of the production policy you could have expected to see the average cost reduce from `0.625` to `0.250000` _for the period of time covered in the test set_.

Feel free to gridsearch several candidate policies using the same setup, to determine a combination of hyperparameters that work well for your use case. Note however that mixing different `cb_type` options when gridsearching is discouraged, even if all specified estimators are unbiased. Choose either IPS or DR beforehand.

## Batch scenario: policy evaluation with a pre-trained VW policy, `cb_adf`-format data

The `cb_adf` format is especially useful if you have rich features associated with an arm, or a variable number of arms per round. If you have `adf`-format data, the same procedure as above applies – just change `cb` to `cb_adf` in the corresponding commands.

## Online scenario: policy evaluation with an incrementally trained VW policy, `cb`-format data

In the online scenario, when you deploy a new policy behind e.g. a REST endpoint, that policy will continue to update itself from second to second, as and when new examples come in. Learning is incremental: you don't iterate over the same training examples more than once.

Online learning is particularly useful in settings where you need to react to changes in the world as fast as possible. From the point of view of OPE, the objective is the same: determining if a candidate policy is better than the one currently in production. But the setup differs slightly: since any policy deployed will continue to learn online, the key question is how well it will generalise to new examples coming in, _considering_ that the policy is constantly evolving. So in this case, our candidate policy is in fact not a fixed policy, but one that changes constantly. It may help to think of it as a set of hyperparameters instead. The aim is to find out which set of these hyperparameters learns best in an online fashion.

To answer this question, we leverage _progressive validation_ (PV), a validation process implemented in VW. PV is explained in detail elsewhere (see e.g. [John Langford's talk on real world interactive learning](https://vimeo.com/240429210)), but for this tutorial, it is enough to know that for one-pass learning, the loss reported by progressive validation deviates like a test set yet allows you to train on all of your data. It's a good indicator of the generalisation performance for online learning.

VW reports PV loss automatically if you only iterate once over your data when training, i.e. `--passes` is `1` (the default). In this case, obtaining an OPE estimate is simply a matter of taking the bandit data from the production policy, again ordered oldest example first, and training as normal – no train/test split is needed.

First, save your bandit data to a file, e.g `data.dat`:

    1:1:0.5 | user_age:25
    2:0:0.5 | user_age:25
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    1:1:0.5 | user_age:27
    2:0:0.5 | user_age:21
    2:0:0.5 | user_age:23
    2:0:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:36
    1:1:0.5 | user_age:25
    2:0:0.5 | user_age:25
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:56
    1:1:0.5 | user_age:27
    2:0:0.5 | user_age:21
    2:0:0.5 | user_age:23
    2:0:0.5 | user_age:56
    2:1:0.5 | user_age:55
    2:1:0.5 | user_age:36

Then, train a policy incrementally. For the above data, with 2 possible actions and an IPS estimator, the command would be `vw --cb 2 --cb_type ips -d data.dat` (plus any additional options).

The result should look something like this:

    § vw --cb 2 --cb_type ips -d data.dat
    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    using no cache
    Reading datafile = full.dat
    num sources = 1
    Enabled reductions: gd, scorer, csoaa, cb
    average  since         example        example  current  current  current
    loss     last          counter         weight    label  predict features
    2.000000 2.000000            1            1.0    known        1        2
    1.000000 0.000000            2            2.0    known        2        2
    1.000000 1.000000            4            4.0    known        1        2
    0.750000 0.500000            8            8.0    known        1        2
    0.500000 0.250000           16           16.0    known        1        2

    finished run
    number of examples = 24
    weighted example sum = 24.000000
    weighted label sum = 0.000000
    average loss = 0.416667
    total feature number = 48

The `average loss` here is the OPE estimate, calculated using progressive validation, with the `cb_type` estimator you specified. The same caveats as before apply: if the specified `cb_type` is biased, it is generally no recommended to use the average loss as an OPE estimate. Again, if you are unsure, we recommend using the default by omitting `cb_type` altogether.

To compare the OPE estimate of the candidate policy to the production policy, calculate the realised average cost in the `data.dat` file. In this case, it is `0.6667`, and since our candidate policy's `0.416667` is lower, we have found a set of hyperparameters estimated to perform better than our production online learner were it deployed at the time covered by the data.

## Online scenario: policy evaluation with an incrementally trained VW policy, `cb_adf`-format data

If you have `adf`-format data, the same procedure as above applies – just change `cb` to `cb_adf` in the corresponding commands.

## Legacy: policy evaluation with `cb`-format data, using a pre-trained policy

If your production policy produces bandit data in the standard `cb` format, and you already have a candidate policy even one trained outside VW, you can use the legacy `--eval` option to perform OPE. It is not recommended to use `--eval` if you are able to use any of the other methods described in this tutorial.

First, create a new file, e.g. `eval.dat`. Then, for each instance of your production policy's bandit data, write it to `eval.dat` but prepend the line with the action your candidate policy would have chosen given the same context. For example, if your current instance is `1:2:0.5 | feature_a feature_b` and your candidate policy chooses action 2 instead given the same context `feature_a feature_b`, write the line `2 1:2:0.5 | feature_a feature_b` (note the space!).

After you've written your data file, it might look something like this:
```
2 1:2:0.5 | feature_a feature_b
2 2:2:0.4 | feature_a feature_c
1 1:2:0.1 | feature_b feature_c
```

In the toy example above, the candidate agreed with the production policy for the second and third instances, but disagreed on the first instance.

You are now ready to run policy evaluation using the command `vw --cb <number_of_arms> --eval -d <dataset>`. In our example, we have two possible actions, so the command is `vw --cb 2 --eval -d eval.dat`. This produced the following output (your results might differ based on VW version, or the seed):

    Num weight bits = 18
    learning rate = 0.5
    initial_t = 0
    power_t = 0.5
    using no cache
    Reading datafile = eval.dat
    num sources = 1
    Enabled reductions: gd, scorer, csoaa, cb
    average since     example    example current current current
    loss   last     counter     weight  label predict features
    0.000000 0.000000      1      1.0  known    2    3
    2.500000 5.000000      2      2.0  known    2    3

    finished run
    number of examples = 3
    weighted example sum = 3.000000
    weighted label sum = 0.000000
    average loss = 6.501957
    total feature number = 9

Again, `average loss` is the OPE estimate, and can be compared against the production policy's realised average loss to determine if your candidate policy is estimated to work better than the policy in production.

Note what happens if we try to run `--eval` with an estimator we know is biased, `vw --cb 2 --eval -d eval.dat --cb_type dm`. You will end up with an error, to prevent you from making a mistake:

    Error: direct method can not be used for evaluation --- it is biased.

    finished run
    number of examples = 0
    weighted example sum = 0.000000
    weighted label sum = 0.000000
    average loss = n.a.
    total feature number = 0
    direct method can not be used for evaluation --- it is biased.
    vw (cb_algs.cc:161): direct method can not be used for evaluation --- it is biased.


## More to explore

- Explore more Vowpal Wabbit [Tutorials](https://vowpalwabbit.org/tutorials.html)
- Browse [examples on the GitHub wiki](https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Examples)
