# CONSTITUENCY TREE REPRESENTATION FOR ARGUMENT UNIT RECOGNITION

Last update: October 5, 2021

This folder contains the code associated with the paper "CONSTITUENCY TREE REPRESENTATION FOR ARGUMENT UNIT RECOGNITION", under review at ICLR 2022.

We release both the Python scripts and the Jupyter notebooks (they run the same code) to keep each part of the project clear and reproducible.

## Setup

The "Data Download" and "Data Preparation" steps come from https://github.com/trtm/AURC. They are mandatory in order to download and use the AURC dataset.

### Configuration requirements

We tested this code on both CPU and GPU.
Running the code requires 30 GB of CPU or GPU memory.
We developed this project on a Linux Ubuntu server.

### Create a Python virtual environment

Please follow the tutorial at https://docs.python.org/3/tutorial/venv.html to create your virtual environment.

### Data Download

The following script installs the libraries required to run the project and downloads the AURC dataset.

```
sh download.sh
```

If you are on Windows, you can open the "download.sh" file and run each command separately.

### Data Preparation

This is a preprocessing step that converts the AURC dataset to .json format.

```
python3 utils/preparation_AURC.py
```

### PyTorch Geometric Dataset

The folder "Construct_dataset" contains 3 notebooks and 3 Python scripts that create the PyTorch Geometric dataset.
Before you can run the PyTorch Geometric models, you need to run either the Jupyter notebooks or the Python scripts.

Before running these files, you need to create a folder where the dataset will be saved.
We suggest creating the folder data/aurc/bert/end_2_end_2_Cross containing the three folders Dev/processed, Test/processed and Train/processed.
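
The suggested layout can be created by hand, or with a short helper such as the sketch below. The function name `create_dataset_dirs` is hypothetical and not part of this repository; the default path is the one suggested above, and we assume the script is run from the repository root.

```python
from pathlib import Path

# Split folders suggested above.
SPLITS = ("Dev", "Test", "Train")


def create_dataset_dirs(root="data/aurc/bert/end_2_end_2_Cross"):
    """Create <root>/<split>/processed for each split and return the paths.

    Hypothetical helper, not part of the repository.
    """
    created = []
    for split in SPLITS:
        processed = Path(root) / split / "processed"
        # parents=True creates missing intermediate folders;
        # exist_ok=True makes the call safe to repeat.
        processed.mkdir(parents=True, exist_ok=True)
        created.append(processed)
    return created


if __name__ == "__main__":
    for path in create_dataset_dirs():
        print(path)
```

If you save the dataset somewhere else, pass that location as `root` and make sure the dataset creation scripts point to the same folder.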


To run the Python scripts:
```
python3 models/Construct_dataset/dataset_creation_end2end_depth2.py
python3 models/Construct_dataset/dataset_creation_end2end_depth3.py
python3 models/Construct_dataset/dataset_creation_end2end_depth4.py
```

For more information about PyTorch Geometric, please see https://pytorch-geometric.readthedocs.io/en/latest/

## Models 

The models are based on the Hugging Face Transformers library: https://huggingface.co/transformers/model_doc/bert.html#bertfortokenclassification
The CRF part of the model is based on the pytorch-crf library: https://pytorch-crf.readthedocs.io/en/stable/

Five models are available:

- BERT-GNN-CRF with BERT fine-tuned
- BERT-GNN-CRF with BERT not fine-tuned
- BERT-GNN with BERT fine-tuned
- BERT-CRF with BERT fine-tuned
- BERT with BERT fine-tuned

### Hyper-parameter optimizations

The Optuna folder contains the scripts that run the hyperparameter optimization.
It is based on the Optuna library: https://optuna.org/. The results of the hyperparameter exploration are saved in the folder models/optuna_db.

There are three scripts:

"script_optuna_GNN.ipynb" or "script_optuna_GNN.py": finds the hyperparameters for the BERT-GNN model when the BERT model is NOT fine-tuned. You can change the maximum depth allowed for the constituency tree, as well as the In-Domain or Cross-Domain setting, by editing the datadir path inside the script.

"script_optuna_GNN_CRF.ipynb" or "script_optuna_GNN_CRF.py": finds the hyperparameters for the BERT-CRF-GNN model when the BERT model is NOT fine-tuned. You can change the maximum depth allowed for the constituency tree, as well as the In-Domain or Cross-Domain setting, by editing the datadir path inside the script.

"script_optuna_GNN_CRF_end2end.ipynb" or "script_optuna_GNN_CRF_end2end.py": finds the hyperparameters for the BERT-CRF-GNN model when the BERT model is fine-tuned. You can change the maximum depth allowed for the constituency tree, as well as the In-Domain or Cross-Domain setting, by changing the actual_number_tested parameter of the main function.

To run the Python scripts:
```
python3 models/Optuna/script_optuna_GNN.py
python3 models/Optuna/script_optuna_GNN_CRF.py
python3 models/Optuna/script_optuna_GNN_CRF_end2end.py
```

The main function contains many Optuna parameters that you can change to suit your needs.

### Count Relation Node Leaf

The file "Count_relation_node_leaf.ipynb" is the Jupyter notebook in which we compute the statistics presented in the table of Section 3.2 of the paper.

### Interpretability 

The "interpretability.ipynb" file contains the two interpretability methods tested in the paper.

First, it trains and tests the BERT-CRF-GNN model.
Then, it runs the interpretability methods.

The Integrated Gradients method is based on code from the Captum library: https://captum.ai/

### Utility functions

The "utils" folder contains important functions used to preprocess the dataset.
Some of them were developed by the AURC team and copied from https://github.com/trtm/AURC/blob/master/src/utils.py.

## Citation 
Not available yet.



The code will be published on GitHub if the paper is accepted at ICLR.