*** Code for the paper "Policy Smoothing for Provably Robust Reinforcement Learning" ***

This directory contains the code to reproduce the results of the paper "Policy Smoothing for Provably Robust Reinforcement Learning".


Usage:

For each environment (cartpole_simple, cartpole_multiframe, freeway, mountain_car, pong_1r, and pong_full), the following commands are available (shown here for cartpole_simple; substitute the appropriate environment name):

	Train: 

	python3 cartpole_simple_train.py  --sigma 0.2  # Trains a model with policy smoothing (std. dev 0.2)
	python3 cartpole_simple_train.py  --sigma 0.0  # Trains an undefended model 

	Test a model on clean (unattacked) inputs, for generating certificates:

	python3 cartpole_simple_test.py  --sigma 0.2 --checkpoint cartpole_simple_sigma_0.2/best_model.zip

	Attack undefended model:

	python3 cartpole_simple_attack.py --checkpoint  cartpole_simple_sigma_0.0/best_model.zip

	

For environments in the main text (cartpole_multiframe, freeway, mountain_car, pong_1r):

	We can attack the smoothed agents (not very successfully):

	python3 pong_1r_attack_smooth.py --sigma 0.2  --checkpoint  pong_1r_sigma_0.2/best_model.zip

	Once all models have been tested and attacked, we can produce the figures from the main text:

	python3 cartpole_1r_plot_certs.py
	python3 cartpole_1r_plot_attacks.py


Some trained models have been included (as space allows).

Caveats:
	
	- For plotting to work correctly, use the "--store_all_rewards" flag when testing or attacking Atari environments.
	- For Atari environments, the command-line arguments are inconsistent in whether pixel values are scaled to [0,1] or [0,255]. In particular:
		For _train.py and _test.py, the --sigma argument is on the [0,1] pixel scale; however, the checkpoint name will have sigma scaled to [0,255].
		For _attack.py, --attack_eps and --attack_step are scaled to [0,255].
		For _attack_smooth.py, --sigma, --attack_eps, and --attack_step are all scaled to [0,255].
	- As discussed in the paper, each Freeway model was trained 5 times to deal with training instability, and the best of the 5 runs was selected based on the validation performance reported during training. This selection was done manually from the validation logs, using the final reported "New best reward" value of each run. The Freeway checkpoints included here are these best models.
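As a concrete example of the pixel-scale mismatch noted in the caveats above: a sketch of the conversion, assuming sigma simply scales linearly between the two pixel ranges (the helper names below are illustrative, not part of the codebase):

```python
# Sketch: converting a noise std. dev. between the [0,1] and [0,255]
# pixel scales, per the caveat above. Assumption: the conversion is a
# linear rescaling by 255.

def sigma_to_255(sigma_01: float) -> float:
    """Convert a std. dev. given on the [0,1] pixel scale to [0,255]."""
    return sigma_01 * 255.0

def sigma_to_01(sigma_255: float) -> float:
    """Convert a std. dev. given on the [0,255] pixel scale to [0,1]."""
    return sigma_255 / 255.0

# e.g., "--sigma 0.2" passed to an Atari _train.py script would correspond
# to 51.0 in the checkpoint name and in _attack_smooth.py arguments.
print(sigma_to_255(0.2))  # 51.0
```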
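The manual Freeway model selection described above could be scripted roughly as follows. This is a sketch only: the exact format of the "New best reward" log lines is an assumption, and the function name is hypothetical.

```python
import re

def best_freeway_run(log_texts):
    """Return (index, reward) of the training run with the highest final
    "New best reward" value, given each run's validation-log text.

    Assumes (hypothetically) that log lines look like: New best reward: 21.3
    """
    best_idx, best_reward = None, float("-inf")
    for idx, text in enumerate(log_texts):
        rewards = re.findall(r"New best reward[:\s]+([-\d.]+)", text)
        if not rewards:
            continue  # this run never reported an improvement; skip it
        final = float(rewards[-1])  # the last "New best reward" is the run's best
        if final > best_reward:
            best_idx, best_reward = idx, final
    return best_idx, best_reward
```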