# rl_trojan_detection
A Bayesian trojan detection method for trojaned reinforcement learning (RL) agents.

## Code structure

The proposed explanation models are in `src`.
- `xstep_feat.py`: Run step-level and feature-level explanation at the same time. 
- `xstep.py`: Define the step-level explanation models (Deep Gaussian model and Deep Gaussian process model) and train the step explanation model.
- `xfeat.py`: Define and train the feature-level explanation.
- `utils.py`: utility functions.

The step and step_feat classes have the following functions:
- `train()`: train the reward prediction model and mask parameters.
- `test()`: test the trained reward prediction model.
- `get_explanations()`: get the step (regression weight) and feature (mask) explanations.
- `save()`: save the trained model.
- `load()`: load a trained model.

Key parameters (most parameters are documented in the inline comments):
- `encoder_type`: 'CNN' or 'MLP'. Use 'CNN' if the observations are environment frame snapshots (images); a CNN then transforms the input observations ([n_traj, seq_len, input_channels, input_dim, input_dim], torch.float32) into observation encodings ([n_traj, seq_len, encode_dim]).
- `likelhood_type`: 'classification' or 'regression'. Use 'classification' if the final rewards are discrete, otherwise use 'regression'.
- `hiddens`: the MLP structure, or the RNN hidden dim in the CNN+RNN. We suggest using the policy network structure and keeping it the same for all the explainers.
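
The shape contract described for `encoder_type='CNN'` can be illustrated with a minimal sketch. The `FrameEncoder` class below is hypothetical and is not the encoder defined in `src`; its layer sizes are assumptions chosen only to show how a [n_traj, seq_len, input_channels, input_dim, input_dim] tensor can be mapped to [n_traj, seq_len, encode_dim].

```python
import torch
import torch.nn as nn


class FrameEncoder(nn.Module):
    """Hypothetical CNN encoder used only to illustrate the tensor shapes."""

    def __init__(self, input_channels: int, encode_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(input_channels, 16, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # -> [batch, 16, 1, 1] for any input_dim
        )
        self.fc = nn.Linear(16, encode_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        n_traj, seq_len = obs.shape[:2]
        # Merge the trajectory and time axes so the CNN sees a batch of frames.
        frames = obs.reshape(n_traj * seq_len, *obs.shape[2:])
        feats = self.conv(frames).flatten(1)  # [n_traj * seq_len, 16]
        # Split the batch axis back into [n_traj, seq_len].
        return self.fc(feats).reshape(n_traj, seq_len, -1)


# 2 trajectories, 5 steps each, 3-channel 32x32 frames.
obs = torch.zeros(2, 5, 3, 32, 32, dtype=torch.float32)
enc = FrameEncoder(input_channels=3, encode_dim=8)(obs)
print(tuple(enc.shape))  # (n_traj, seq_len, encode_dim)
```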

## Detection workflow
- Step 1: make a new folder for the game you are working on (we keep one game, or one type of game, per folder) with the following subfolders: `agents`, `trajs`, and `exp_model_results` (or name them in your own style).
- Step 2: set up the game env, load the pretrained agent, and collect trajectories by running the agent in the environment.
  - Note 1: run the agent and save the trajectories.
  - Note 2: each trajectory corresponds to one game round (the agent wins, loses, or ties, and the game env restarts). Do not split the trajectories directly on the `done` flag given by the game env: in some games, the agent has multiple lives or the game runs multiple rounds before returning `done`. Double-check this and set up a game-specific splitting flag for such games.
  - Note 3: save both the original observations and the preprocessed ones used as policy network inputs (the originals are kept for better visualization).
  - Note 4: the trajectories have varied lengths, so pad them to the same length. Pad at the end, and pad with a meaningless number (be careful with `0`, `-1`, and `1`, which can be confused with rewards and categorical actions).
  - Note 5: control the trajectory length with a parameter such as `max_ep_len`, and discard the trajectories that run beyond the maximum length.
  - Note 6: save each trajectory as its own `.npz` file to avoid out-of-memory issues.
  - Note 7: the shapes of the saved items:
    - observations: [max_seq_len, input_channel, input_dim, input_dim] or [max_seq_len, input_dim].
    - states (preprocessed observations): [max_seq_len, input_channel, input_dim, input_dim] or [max_seq_len, input_dim].
    - actions: [max_seq_len] or [max_seq_len, act_dim].
    - h: [max_seq_len, 1, rnn_hidden_dim] (RNN hidden states in the policy network).
    - c: [max_seq_len, 1, rnn_hidden_dim] (RNN hidden states in the policy network).
    - rewards: [max_seq_len] (instant rewards). 
    - value_function outputs: [max_seq_len].
    - final_rewards: [1].
    - max_ep_length: actual maximum traj length.
    - traj_count: total number of collected trajs.
- Step 3: load and preprocess the trajectories. 
  - Note 1: replace the padded values in obs with `0` and preprocess the observations into states using the policy network preprocessing method.
  - Note 2: obtain the final rewards. For discrete final rewards, convert them to class labels if they include negative values (e.g., [-1, 0, 1] -> [0, 1, 2]), and record the number of classes.
  - Note 3: prepare the training and testing sets. Be careful with the data types: torch.float32 for continuous variables, torch.long for integers (discrete variables).
- Step 4: Select the explanation workflow and explanation model.
  - If you train the step and feature explanations separately:
    - Train the step explanation first using the DeepGaussian or DeepGaussianProcess model.
    - Get the globally most important steps.
    - Load the observations and actions at those steps in all the trajectories.
    - Train the feature prediction model and get the explanation mask.
  - If you train the step and feature explanations together:
    - Train the step and feature explanations using the DeepGaussian+Mask or DeepGaussianProcess+Mask model.
    - Get the explanation mask.
- Step 5: given an explanation mask, identify the trigger from the mask and do a fidelity test.
  - Fidelity test: compare the difference between the detected trigger and the true trigger.
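
The data handling in Steps 2 and 3 (pad at the end, discard over-long trajectories, save one `.npz` per trajectory, then zero the padding and build class labels) can be sketched as follows. This is a minimal illustration with NumPy, not the repository's code: `PAD_VALUE`, `MAX_EP_LEN`, `pad_to`, `save_traj`, and `load_dataset` are hypothetical names, and only a subset of the saved items is shown.

```python
import os
import tempfile

import numpy as np

PAD_VALUE = -10  # assumed padding value: must never occur as a real obs, reward, or action
MAX_EP_LEN = 6   # assumed cap, playing the role of `max_ep_len`


def pad_to(arr, max_len, pad_value):
    """Pad `arr` at the end of the time axis (axis 0) up to `max_len`."""
    pad_width = [(0, max_len - arr.shape[0])] + [(0, 0)] * (arr.ndim - 1)
    return np.pad(arr, pad_width, mode="constant", constant_values=pad_value)


def save_traj(path, obs, actions, rewards, final_reward):
    """Save one padded trajectory; return False if it is too long and was discarded."""
    if obs.shape[0] > MAX_EP_LEN:
        return False
    np.savez_compressed(
        path,
        observations=pad_to(obs, MAX_EP_LEN, PAD_VALUE),
        actions=pad_to(actions, MAX_EP_LEN, PAD_VALUE),
        rewards=pad_to(rewards, MAX_EP_LEN, PAD_VALUE),
        final_rewards=np.array([final_reward]),
    )
    return True


def load_dataset(paths):
    """Load trajectories, zero the padding in obs, and map final rewards to labels."""
    obs, acts, final = [], [], []
    for p in paths:
        with np.load(p) as traj:
            obs.append(traj["observations"])
            acts.append(traj["actions"])
            final.append(traj["final_rewards"])
    obs = np.stack(obs).astype(np.float32)
    acts = np.stack(acts).astype(np.int64)
    final = np.concatenate(final)
    obs[obs == PAD_VALUE] = 0.0  # relies on PAD_VALUE never being a real observation
    classes = np.unique(final)   # sorted unique rewards, e.g. [-1, 0, 1]
    labels = np.searchsorted(classes, final).astype(np.int64)  # -> [0, 1, 2]
    return obs, acts, labels, len(classes)


# Toy data: four trajectories as (length, final reward); the last exceeds MAX_EP_LEN.
rng = np.random.default_rng(0)
tmp_dir = tempfile.mkdtemp()
paths = []
for i, (length, final_reward) in enumerate([(3, -1), (5, 1), (4, 0), (9, 1)]):
    path = os.path.join(tmp_dir, f"traj_{i}.npz")
    kept = save_traj(
        path,
        rng.random((length, 4), dtype=np.float32),
        rng.integers(0, 3, size=length),
        rng.random(length),
        final_reward,
    )
    if kept:
        paths.append(path)

obs, acts, labels, n_classes = load_dataset(paths)
print(obs.shape, acts.shape, labels, n_classes)
```

`torch.from_numpy` keeps these dtypes, so the float32 arrays become torch.float32 tensors and the int64 arrays become torch.long tensors, which matches the data types listed in Step 3.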

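One possible way to run the fidelity test in Step 5 is sketched below. The README does not specify how the trigger is extracted from the mask or which metric is used, so both choices here are assumptions: the mask is binarized by keeping its largest entries, and the detected trigger is compared with the true trigger using intersection over union (IoU). The helper names `mask_to_trigger` and `trigger_iou` are hypothetical.

```python
import numpy as np


def mask_to_trigger(mask, keep_ratio=0.05):
    """Binarize an explanation mask by keeping its largest entries.

    Ties at the threshold are all kept, so slightly more than
    `keep_ratio` of the entries may be selected.
    """
    k = max(1, int(round(keep_ratio * mask.size)))
    threshold = np.sort(mask, axis=None)[-k]  # k-th largest value
    return mask >= threshold


def trigger_iou(detected, true_trigger):
    """Intersection over union between two boolean trigger masks."""
    inter = np.logical_and(detected, true_trigger).sum()
    union = np.logical_or(detected, true_trigger).sum()
    return float(inter) / float(union) if union else 1.0


# Toy 8x8 mask whose top-left 2x2 patch carries most of the weight.
mask = np.full((8, 8), 0.1)
mask[:2, :2] = 0.9

# Assumed ground truth: the true trigger is a 2x3 patch in the same corner.
true_trigger = np.zeros((8, 8), dtype=bool)
true_trigger[:2, :3] = True

detected = mask_to_trigger(mask, keep_ratio=4 / 64)
iou = trigger_iou(detected, true_trigger)
print(int(detected.sum()), iou)
```

An IoU of 1.0 means the detected trigger matches the true trigger exactly; lower values indicate missed or spurious trigger locations.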

## Trojan elimination workflow
See `run.sh` for the concrete commands to run our method and baselines.

