
## Code for ICLR 2023 Submission 'Relaxed Attention for Transformer Models'

This directory contains the code used to run the experiments in the submission. Each subfolder covers one of the tasks investigated in the paper and uses its own toolkit and environment. Please follow the `README.md` instructions in each subfolder to train models with **relaxed attention** for the respective task.

The main change to standard multi-head attention needed for relaxed attention, equation (1) in the paper, is short and simple. It is shown in the following Python-style pseudocode, an extract of the multi-head attention function:

```
## Extract of the multi-head attention function
...
    # Apply Softmax
    attn_weights = utils.softmax(scaled_dot_product, dim=-1)
                            
    # Relaxed Attention
    if (self.training or self.relaxation_matched_inference) and self.relaxed_attention_gamma > 0:
        relaxation_tensor = torch.ones_like(attn_weights)
        relaxation_tensor.fill_(self.relaxed_attention_gamma)     
            
        attn_weights = attn_weights \
            .mul(1 - relaxation_tensor) \
            .add(relaxation_tensor * (1 / src_len))  
            
    # Apply attention dropout           
    attn_probs = self.dropout_module(attn_weights)
    
...
```
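To try the relaxation step outside of any toolkit, the following is a minimal, framework-free sketch of the same blend: each attention row `a` over `src_len` source positions becomes `(1 - gamma) * a + gamma / src_len`. The helper names `softmax` and `relax_attention`, the list-of-rows input format, and the range check on `gamma` are assumptions made for illustration; they are not part of the code in this repository.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relax_attention(attn_weights, gamma):
    """Blend each attention row with a uniform distribution.

    attn_weights: list of rows, each a distribution over source positions.
    gamma: relaxation coefficient; 0 leaves the weights unchanged.
    """
    # Assumption for this sketch: gamma is restricted to [0, 1].
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    relaxed = []
    for row in attn_weights:
        src_len = len(row)
        relaxed.append([(1.0 - gamma) * w + gamma / src_len for w in row])
    return relaxed


# Toy scores for two query positions attending over four source positions.
scores = [[2.0, 0.5, -1.0, 0.0], [0.0, 0.0, 3.0, 1.0]]
attn = [softmax(row) for row in scores]
relaxed = relax_attention(attn, gamma=0.1)

for before, after in zip(attn, relaxed):
    print([round(w, 3) for w in before], "->", [round(w, 3) for w in after])
```

Because the blend is a convex combination of two distributions, every relaxed row still sums to 1, and the largest weight in each row can only shrink toward `1 / src_len`.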

Note that for relaxed self-attention or cross-attention, the respective multi-head attention layers need to be initialized with the corresponding `relaxed_attention_gamma` coefficient. See, for example, [machine-translation/fairseq/fairseq/modules/multihead_attention.py](machine-translation/fairseq/fairseq/modules/multihead_attention.py).

