# Revisiting Discrete Soft Actor-Critic
This repository contains the implementation of our optimized soft actor-critic (SAC) algorithm for discrete action spaces. It is built on the open-source [tianshou](https://github.com/thu-ml/tianshou) codebase.

## Requirements
- python: 3.6+
- gym>=0.23.1
- torch>=1.4.0
- numba>=0.51.0
- tensorboard>=2.5.0
- atari_py
- tqdm

## Project Structure
```
.
├── README.md
├── requirements.txt
└── src
    ├── examples
    │   └── atari
    │       ├── atari_network.py
    │       ├── atari_sac.py      ## main program
    │       └── atari_wrapper.py
    ├── libs                      ## tianshou code modified for the alternative discrete SAC designs
    └── tianshou                  ## tianshou library code, version 0.4.9
```


## Usage

1. Run the base discrete SAC on Pong for 10M steps:
```shell
cd src
python3 examples/atari/atari_sac.py --task PongNoFrameskip-v4 --epoch 200  --step-per-epoch 50000
```

2. Run discrete SAC with the entropy penalty on Pong for 10M steps:
```shell
cd src
python3 examples/atari/atari_sac.py --entropy-penalty --task PongNoFrameskip-v4 --epoch 200  --step-per-epoch 50000
```

3. Run discrete SAC with double average Q-learning (avg-q) and Q-clip on Pong for 10M steps:
```shell
cd src
python3 examples/atari/atari_sac.py --avg-q --clip-q  --task PongNoFrameskip-v4  --epoch 200  --step-per-epoch 50000
```

4. Run discrete SAC with both alternative designs on Pong for 10M steps:
```shell
cd src
python3 examples/atari/atari_sac.py --avg-q --clip-q --entropy-penalty --task PongNoFrameskip-v4  --epoch 200  --step-per-epoch 50000
```
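
The four commands above differ only in their feature flags, and the "10M steps" budget is the product of `--epoch` and `--step-per-epoch` (200 × 50000). The following is a minimal sketch that rebuilds those command lines and checks the arithmetic. The flags and script path are copied from this README; the helper names (`VARIANTS`, `total_steps`, `build_command`) are hypothetical and are not part of the repository.

```python
import shlex

# Script path and flags are taken from the commands in this README.
# The helpers below are illustrative only; they do not exist in the repository.
SCRIPT = "examples/atari/atari_sac.py"

# Feature flags for each of the four runs listed under "Usage".
VARIANTS = {
    "base": [],
    "entropy_penalty": ["--entropy-penalty"],
    "avg_q": ["--avg-q", "--clip-q"],
    "both": ["--avg-q", "--clip-q", "--entropy-penalty"],
}


def total_steps(epoch, step_per_epoch):
    """Training budget implied by --epoch and --step-per-epoch."""
    return epoch * step_per_epoch


def build_command(variant, task="PongNoFrameskip-v4", epoch=200, step_per_epoch=50000):
    """Return the argument list for one variant (to be run from the src directory)."""
    return [
        "python3", SCRIPT, *VARIANTS[variant],
        "--task", task,
        "--epoch", str(epoch),
        "--step-per-epoch", str(step_per_epoch),
    ]


if __name__ == "__main__":
    print("total steps:", total_steps(200, 50000))
    for name in VARIANTS:
        print(name, "->", " ".join(shlex.quote(arg) for arg in build_command(name)))
```

To train for a different budget or on another task, change the `epoch`, `step_per_epoch`, or `task` arguments and pass the resulting list to `subprocess.run(..., cwd="src")`.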

## Recording Resource
We provide gameplay recordings of agents trained with two algorithms (SDSAC and DSAC+avg-q). The two policies differ clearly. The agent that combines both techniques (`Kangaroo_entropy+avgq.gif`) has strong avoidance abilities and can also score by hitting bullets. The agent that applies only avg-q (`Kangaroo_avgq.gif`) has weaker avoidance skills, but its more aggressive strategy lets it score significantly faster by hitting monkeys directly.

We also provide a video of one of the matches between SDSAC-48h and DSAC-24h, `dsac24h_battle_sdsac48h.mov`, as an attachment. In the video, the red agent is SDSAC and the blue agent is DSAC. At the recorded point the red agent has a K/D ratio of 3/1, and it bypasses the range of the blue agent's defensive tower to eliminate the blue agent, which indicates that SDSAC-48h plays significantly better than DSAC-24h.

Another GIF, `dsac48h_battle_sdsac48h.gif`, shows a match between the blue agent (SDSAC-48h) and the red agent (DSAC-48h). In this battle, the blue agent has a higher skill hit rate than the red agent.
