## values-classification

Experiments with text classification for measuring values in text.

### Requirements

The following conda packages are required:

- tqdm (4.59.0)
- spacy (2.3.5)
- numpy (1.20.2)
- pandas (1.2.4)
- seaborn (0.11.1)
- requests (2.25.1)
- matplotlib (3.3.4)
- scikit-learn (0.24.2)
- beautifulsoup4 (4.9.3)
- python-wget (3.2) (available from conda-forge)

Install using the following commands:

- `conda install tqdm spacy numpy pandas seaborn scikit-learn requests matplotlib beautifulsoup4`

- `conda install -c conda-forge python-wget`

You will also need to download the spacy model, using:

- `python -m spacy download en_core_web_sm`
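
As an optional sanity check, the installed versions can be compared against the versions listed above. The sketch below is not part of this repo: the helper name `check_versions` is hypothetical, and it assumes the installed distribution names match the conda package names (python-wget is left out because its installed distribution name may differ).

```python
from importlib.metadata import PackageNotFoundError, version

# Versions copied from the Requirements list above.
PINNED = {
    "tqdm": "4.59.0",
    "spacy": "2.3.5",
    "numpy": "1.20.2",
    "pandas": "1.2.4",
    "seaborn": "0.11.1",
    "requests": "2.25.1",
    "matplotlib": "3.3.4",
    "scikit-learn": "0.24.2",
    "beautifulsoup4": "4.9.3",
}


def check_versions(pinned):
    """Return {name: (expected, installed)} for packages that are missing or differ."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_versions(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

Newer versions may well work; a mismatch reported here is only a hint about where to look if something breaks.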


### Data download

The annotations are included with this repo, but the pdfs for all conferences need to be downloaded. For NeurIPS, a list of pdfs is downloaded for each year, whereas for ICML the links to pdfs have been manually scraped from the individual websites (except for 2008, which can be downloaded as a single compressed file).

For ICML, the following commands will download ICML pdfs (2008-2020), convert them to text, and parse them with spacy, saving the output to `data/icml`

- `python -m download.icml.download_pdfs`
- `python -m download.icml.convert_pdfs_to_text`
- `python -m download.icml.parse_papers`

For NeurIPS, the following commands will download NeurIPS pdfs (1987-2020), convert them to text, and parse them with spacy, saving the output to `data/neurips`

- `python -m download.neurips.download_index`
- `python -m download.neurips.download_papers`
- `python -m download.neurips.convert_pdfs_to_text`
- `python -m download.neurips.parse_papers`
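
The download steps for each conference have to run in order, so it can be convenient to chain them and stop at the first failure. The following is a minimal sketch of such a wrapper, not a script shipped with this repo; the function names `build_commands` and `run_pipeline` and the file name `run_download.py` are hypothetical, while the module names are taken from the commands listed above.

```python
import subprocess
import sys

# Module names copied from the commands listed above, in the order they must run.
STEPS = {
    "icml": [
        "download.icml.download_pdfs",
        "download.icml.convert_pdfs_to_text",
        "download.icml.parse_papers",
    ],
    "neurips": [
        "download.neurips.download_index",
        "download.neurips.download_papers",
        "download.neurips.convert_pdfs_to_text",
        "download.neurips.parse_papers",
    ],
}


def build_commands(dataset):
    """Return one argv list per step for the given dataset, in order."""
    return [[sys.executable, "-m", module] for module in STEPS[dataset]]


def run_pipeline(dataset):
    """Run each step in order; check=True raises CalledProcessError on a non-zero exit."""
    for argv in build_commands(dataset):
        print("Running:", " ".join(argv))
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical usage, from the repo root: python run_download.py icml neurips
    for dataset in sys.argv[1:]:
        run_pipeline(dataset)
```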

### Classification

To run the classification experiments, first export the annotated data to `data/classification/` using:

- `python -m classification.export_training_data`

Then tokenize the text using:

- `python -m classification.tokenize`

Then select a random test set (controlled by a random seed):

- `python -m classification.create_partitions`
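
How `create_partitions` selects the test set is not documented here, so the following is only a sketch of one common way to make a random split reproducible with a seed. The function name `split_train_test`, the `test_frac` parameter, and the default seed are all assumptions made for illustration.

```python
import random


def split_train_test(item_ids, test_frac=0.2, seed=42):
    """Shuffle ids with a seeded RNG and hold out a fraction as the test set."""
    rng = random.Random(seed)      # local RNG, so the global random state is untouched
    shuffled = sorted(item_ids)    # sort first so the split does not depend on input order
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]


# The same seed always yields the same partition.
train_ids, test_ids = split_train_test(range(100), test_frac=0.2, seed=42)
```

Using a dedicated `random.Random` instance rather than `random.seed` keeps the split reproducible even if other code draws random numbers in between.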

Train a simple linear unigram classifier for each value using:

- `python -m classification.run`
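
For readers unfamiliar with the term, a linear unigram classifier uses single-word counts as features and fits one linear model per label. The sketch below illustrates the idea with scikit-learn; it is not the code in `classification.run`, and the choice of logistic regression, the function name `train_unigram_classifiers`, the value names, and the toy sentences are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy sentences and labels, invented purely for illustration.
texts = [
    "our method achieves state of the art accuracy",
    "we improve accuracy on all benchmarks",
    "the algorithm runs in linear time",
    "training is fast and memory efficient",
]
labels = {
    "performance": [1, 1, 0, 0],
    "efficiency": [0, 0, 1, 1],
}


def train_unigram_classifiers(texts, labels_by_value):
    """Fit one binary unigram bag-of-words classifier per value."""
    models = {}
    for value, y in labels_by_value.items():
        model = make_pipeline(
            CountVectorizer(ngram_range=(1, 1)),  # unigram counts only
            LogisticRegression(max_iter=1000),    # linear model over those counts
        )
        model.fit(texts, y)
        models[value] = model
    return models


models = train_unigram_classifiers(texts, labels)
# Probability that a new sentence expresses each value.
probs = {v: m.predict_proba(["a fast algorithm"])[0, 1] for v, m in models.items()}
```

Training a separate binary model per value lets a single sentence be tagged with several values at once.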

Finally, make predictions on the full ICML and NeurIPS corpora using:

- `python -m classification.do_prediction --dataset icml`
- `python -m classification.do_prediction --dataset neurips`

To make the plots (written to a `plots` directory by default), run:

- `python -m classification.make_plots`