# Introduction

## Abstract

This README accompanies the code for the paper *Satisficing Exploration in Bandit Optimization*.

The code plots the satisficing regret and the standard regret of four methods (`SELECT`, `SATUCB`, `SATUCB+`, and the oracle method used inside `SELECT`) on three common bandit problems (K-armed bandit, Lipschitz bandit, and concave bandit) as the time budget varies. The bandit settings, the time budgets, and the number of repetitions can all be modified.



# Installation Requirement

- Python 3
- Necessary packages:
  - `numpy`, `matplotlib` (third-party)
  - `random`, `math` (part of the Python standard library)



# Algorithm Code

## Code Structure

- **main.py**: The main script to run the project. This script ties together the functionalities provided by the bandit classes.
- **K_Arm_Bandit.py**: The implementation of the `K_Arm_Bandit` class.
- **Lipschitz_Bandit.py**: The implementation of the `Lipschitz_Bandit` class. Some of its functions rely on the `K_Arm_Bandit` class, because a Lipschitz bandit is essentially solved by discretizing it into a K-armed bandit.
- **Convex_Bandit.py**: The implementation of the `Convex_Bandit` class.
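
The discretization idea mentioned for `Lipschitz_Bandit.py` can be illustrated with a minimal sketch. It is not the code in this repository: the helper names `discretize_unit_square`, `arm_means`, and `toy_reward` are hypothetical, and the cone-shaped reward is chosen only for illustration. Each grid cell of $[0,1]^2$ becomes one arm whose mean is the reward at the cell center.

```python
def discretize_unit_square(m):
    """Return the centers of an m x m uniform grid over [0, 1]^2."""
    step = 1.0 / m
    return [((i + 0.5) * step, (j + 0.5) * step)
            for i in range(m) for j in range(m)]


def arm_means(reward_fn, centers):
    """Treat every cell center as one arm and evaluate its mean reward."""
    return [reward_fn(x, y) for (x, y) in centers]


def toy_reward(x, y):
    """Hypothetical Lipschitz reward peaking at (0.5, 0.7); illustration only."""
    return 1.0 - abs(x - 0.5) - abs(y - 0.7)


centers = discretize_unit_square(10)   # 100 arms
means = arm_means(toy_reward, centers)
best_index = max(range(len(means)), key=means.__getitem__)
print(len(centers), centers[best_index], means[best_index])
```

A finer grid (larger `m`) places an arm closer to the true peak, at the cost of more arms to explore.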

## Code running

In lines 270-273 of `main.py`, set the satisficing flag and the number of times the experiment is repeated.

```python
##### Choose satisficing and repeat number #####
SAT = True  # True for satisficing regret, False for standard regret
round_num = 200  # Number of rounds that experiment repeats to average
################################################
```

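In lines 275-284 of `main.py`, choose the bandit you want to test. All three calls assign to the same variables, so keep only the call for the bandit you want active: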

```python
##### Choose the bandit to run #####
# K Arm Bandit
regret_SELECT, regret_SatUCB, regret_SatUCB_plus, regret_Oracle = Test_K_Arm(round_num=round_num, SAT=SAT)

# Lipschitz Bandit
regret_SELECT, regret_SatUCB, regret_SatUCB_plus, regret_Oracle = Test_Lipschitz_Bandit_2d(round_num=round_num, SAT=SAT)

# Concave Bandit
regret_SELECT, regret_SatUCB, regret_SatUCB_plus, regret_Oracle = Test_Concave_Bandit(round_num=round_num, SAT=SAT)
#####################################
```

A test is then run for all four methods (`SELECT`, `SATUCB`, `SATUCB+`, and the basic oracle used in `SELECT`) with time budgets ranging from $500$ to $5000$ in steps of $500$.
Note that only `SELECT` is an anytime algorithm. The other three methods are run separately for each budget $T$, so each value in their `regret_*` arrays comes from its own run, whereas `regret_SELECT` contains the results of a single run with $T$ ranging from $1$ to $5000$.
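
The difference between the two kinds of result arrays can be illustrated with a toy sketch. The uniformly random policy below is a placeholder and is not one of the four methods; `run_toy_policy`, `regret_anytime`, and `regret_fixed` are hypothetical names used only here.

```python
import random


def run_toy_policy(true_mean, horizon, rng):
    """Pull uniformly random arms; return cumulative standard regret per step."""
    best = max(true_mean)
    total, curve = 0.0, []
    for _ in range(horizon):
        arm = rng.randrange(len(true_mean))
        total += best - true_mean[arm]
        curve.append(total)
    return curve


true_mean = [0.6, 0.7, 0.8, 1.0]
time_budget = [i * 500 for i in range(1, 11)]  # 500, 1000, ..., 5000
rng = random.Random(0)

# Anytime style: a single run of length 5000, read off at every budget.
curve = run_toy_policy(true_mean, time_budget[-1], rng)
regret_anytime = [curve[T - 1] for T in time_budget]

# Fixed-budget style: a fresh run for every budget T.
regret_fixed = [run_toy_policy(true_mean, T, rng)[-1] for T in time_budget]
```

In the anytime layout the entries are points on one regret curve, so they are nondecreasing by construction. In the fixed-budget layout each entry comes from an independent run.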

## Setting configuration

### K Arm Bandit

In lines 12-19 of `main.py`, you can change the number of arms `K`, the tested time budgets `time_budget`, the expected rewards `true_mean`, the satisficing level `S`, and the sampling noise variance `noise_var`.

```python
############################
# Change the settings here
K, Total_Budget = 4, 5000
noise_var = 1.0
true_mean = np.array([0.6, 0.7, 0.8, 1])  # Note that K == len(true_mean)
time_budget = [i * 500 for i in range(1, int(Total_Budget / 500) + 1)]
S = 0.93  # 1.5 for unrealizable case
############################
```
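
To see how `S` interacts with `true_mean`, the sketch below computes per-arm gaps. It assumes the satisficing regret charges $\max\{S - \mu_k, 0\}$ for each pull of arm $k$; this definition is not read from `main.py`. The helper names are hypothetical.

```python
def is_realizable(true_mean, S):
    """The satisficing level is realizable if some arm's mean reaches S."""
    return max(true_mean) >= S


def satisficing_gaps(true_mean, S):
    """Per-pull satisficing regret of each arm (assumed definition)."""
    return [max(S - mu, 0.0) for mu in true_mean]


def standard_gaps(true_mean):
    """Per-pull standard regret of each arm relative to the best arm."""
    best = max(true_mean)
    return [best - mu for mu in true_mean]


true_mean = [0.6, 0.7, 0.8, 1.0]
print(is_realizable(true_mean, 0.93))   # True: the last arm has mean 1 >= 0.93
print(is_realizable(true_mean, 1.5))    # False: no arm reaches 1.5
print(satisficing_gaps(true_mean, 0.93))
print(standard_gaps(true_mean))
```

With `S = 0.93`, only the arm with mean `1` incurs zero satisficing regret under this definition. With `S = 1.5`, no arm is satisficing, which matches the unrealizable case noted in the settings comment.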

### Concave Bandit

In lines 72-101 of `main.py`, you can change the tested time budgets `time_budget`, the quadratic reward coefficients `coeff` and `best_arm`, the satisficing level `S`, and the sampling noise variance `noise_var`.

```python
############################
# Change the settings here
K, Total_Budget = 10000, 5000  # K here does not matter, a larger K only means
                               # a more accurate interpolation of lipschitz reward distribution
time_budget = [i * 500 for i in range(1, int(Total_Budget / 500) + 1)]
noise_var = 1.0
coeff, best_arm = 16, 0.25  # reward = 1 - coeff * (x - best_arm)^2
S = 0.3  # -0.5 for unrealizable cases
############################
S = 1 - S  # this is because we implement the algorithm based on convex bandit
```

Note that a larger `K` only gives a more accurate interpolation of the Lipschitz reward function. Also note that `SATUCB` and `SATUCB+` are assumed to know the Lipschitz coefficient, whereas `SELECT` is not.

The true expected reward is
$$
r(x) = 1 - \text{coeff}\cdot(x - \text{best\_arm})^2, \;x\in[0,1]
$$

which is a quadratic reward function with maximum reward $1$ (if $\text{best\_arm}\in[0,1]$).
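
As a worked example of the formula, the sketch below evaluates the reward and computes the set of points whose reward is at least a given threshold. The parameter `threshold` is on the reward scale and is a hypothetical name. Because `main.py` applies `S = 1 - S`, check how `S` maps to this threshold against the code. The helpers are illustrative and not part of the repository.

```python
import math


def concave_reward(x, coeff=16, best_arm=0.25):
    """Quadratic reward r(x) = 1 - coeff * (x - best_arm)^2."""
    return 1 - coeff * (x - best_arm) ** 2


def satisficing_interval(threshold, coeff=16, best_arm=0.25):
    """Return (lo, hi) within [0, 1] where r(x) >= threshold, or None if empty."""
    if threshold > 1:
        return None  # the reward never exceeds 1
    # r(x) >= threshold  <=>  |x - best_arm| <= sqrt((1 - threshold) / coeff)
    half_width = math.sqrt((1 - threshold) / coeff)
    lo = max(0.0, best_arm - half_width)
    hi = min(1.0, best_arm + half_width)
    return (lo, hi) if lo <= hi else None


print(concave_reward(0.25))        # peak reward
print(satisficing_interval(0.7))   # interval around best_arm
print(satisficing_interval(1.5))   # None: threshold above the maximum reward
```

For a reward threshold of `0.7` and the default coefficients, the half-width is $\sqrt{0.3/16}\approx 0.137$, so the satisficing region is roughly $[0.113, 0.387]$.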

### Lipschitz Bandit

In lines 181-193 of `main.py`, you can change the tested time budgets `time_budget`, the peak centers `centers`, the coefficient `coeff`, the satisficing level `S`, and the sampling noise variance `noise_var`.

```python
############################
# Change the settings here
Total_Budget = 5000
time_budget = [i * 500 for i in range(1, int(Total_Budget / 500) + 1)]
S = 0.5  # 1.5 for unrealizable cases
centers = [(0.5, 0.7)]  # should be a list of tuples
coeff = 3
x, y = np.linspace(0, 1, 400), np.linspace(0, 1, 400)
X, Y = np.meshgrid(x, y)
Z = 0
for center in centers:
    Z += coeff * np.exp(-100 * (X - center[0]) ** 2 - 100 * (Y - center[1]) ** 2)
############################
```

The true expected reward is
$$
r(x,y) = \min\{1, \text{coeff}\cdot\sum_{(a,b)\in\text{centers}}\exp(-100((x-a)^2+(y-b)^2))\},\;(x,y)\in[0,1]^2
$$
which is a 2-dimensional multi-peak reward function with maximum reward $1$ (if $\text{coeff}\geq1$).
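
A pointwise version of this formula can be sketched as follows. The function name `lipschitz_reward` is hypothetical. The `min` clipping follows the formula above, whereas the settings snippet itself only builds the unclipped sum `Z` on a grid.

```python
import math


def lipschitz_reward(x, y, centers=((0.5, 0.7),), coeff=3):
    """Clipped sum of Gaussian-shaped peaks, following the formula above."""
    total = sum(math.exp(-100 * ((x - a) ** 2 + (y - b) ** 2))
                for a, b in centers)
    return min(1.0, coeff * total)


print(lipschitz_reward(0.5, 0.7))    # at the peak: clipped to 1.0
print(lipschitz_reward(0.63, 0.7))   # 0.13 away from the peak
print(lipschitz_reward(0.64, 0.7))   # 0.14 away from the peak
```

With a single center and `coeff = 3`, a point at distance $d$ from the center has reward at least $0.5$ exactly when $d \le \sqrt{\ln 6}/10 \approx 0.134$, so the satisficing region for $S = 0.5$ is a small disc around the peak.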
