Adaptive vs Non-adaptive adversary:
-----------------------------------

Note that the rewards are defined adaptively by our policy. However, the regret bound of Putta and Agarwal holds specifically
for a non-adaptive (oblivious) adversary. The key to getting around this subtlety is to note that we only require a pseudo-regret bound, in which
the benchmark (x^*) is fixed in advance and does not depend on the realized rewards. In this case, we can take expectations on both sides of the
regret bound (treating x^* as a constant), and the bound holds in expectation - precisely what we require.
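As a sketch of the expectation argument (using generic notation r_t for the reward functions and B_T for the bound, not tied to the specific statement in Putta and Agarwal): suppose the realized regret against any fixed comparator x^* satisfies, for every realization of the rewards,

\[
\sum_{t=1}^{T} r_t(x^*) \;-\; \sum_{t=1}^{T} r_t(x_t) \;\le\; B_T .
\]

Since x^* is a constant that does not depend on the realized rewards, we may take expectations of both sides over the randomness of the rewards and the policy, yielding the pseudo-regret bound

\[
\mathbb{E}\!\left[\sum_{t=1}^{T} r_t(x^*)\right] \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t(x_t)\right] \;\le\; B_T .
\]

Note that this step relies on x^* being fixed: if the benchmark were allowed to depend on the realized rewards (as in true adversarial regret), the expectation could not simply be pushed inside.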

