## Detecting and Mitigating Indirect Stereotypes in Word Embeddings

Disclaimer: This project contains original C code, modified from GloVe.  Although no errors are known to the authors, the new code has not been thoroughly tested for security and safety.  Use at your own risk.

### References

The code for this paper was modified from:

* GloVe (https://github.com/stanfordnlp/GloVe)
* word2vec on Wikipedia (https://github.com/jind11/word2vec-on-wikipedia)
* Word Embedding Benchmarks (https://github.com/kudkudak/word-embeddings-benchmarks)

### Initial setup

1. Compile the modified GloVe code by running "make all" in the GloVe-Bias directory

*To replicate experiments from the paper:*

2. Download StanfordCoreNLP from https://stanfordnlp.github.io/CoreNLP/download.html (if you do not already have it)
3. Update CORENLP_PATH in umbc_corpus_prepare.py to the StanfordCoreNLP directory (path should be absolute or relative to data directory)
4. Download Counterfactual Data Substitution from https://github.com/rowanhm/counterfactual-data-substitution
5. Update CDS_PATH in umbc_cds.py to the Counterfactual Data Substitution directory

### Reproducing experiments from the paper

1. Run prepare-umbc.sh
2. Run "python umbc_cds.py" (optional, only for CDS experiments)
3. Run "python create_embeddings.py" (comment out last line if no CDS)
4. Open "bias-and-semenatic-tests.ipynb" and run the tests in the file

### New experiments using the proposed bias mitigation method

There are two options:

A. Run embeddings.py from the command line with the following syntax:

python embeddings.py [corpus_file] [output_file_root]

Run "python embeddings.py --help" for a list of command line arguments.

B. Import embeddings.py into Python code and run embeddings.glove_mitigated(...), using the same parameters as the command line call.  You can also use .glove(...) to run standard GloVe, or .glove_both(...) to run both methods on the same corpus at the same time (which can skip a redundant call to cooccur.c).

You can also import embeddings_aux.py or run the C code directly if you would like to run only one part of GloVe.  See glove_all(...) in embeddings.py for an example.
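The command line call from option A can also be scripted.  The following is a minimal sketch, not part of this repository: the helper names `build_command` and `run_embeddings` and the file names in the usage comment are hypothetical, and only the two positional arguments shown in the syntax above are passed.

```python
import subprocess
import sys


def build_command(corpus_file, output_file_root, script="embeddings.py"):
    """Build the argument list for:
    python embeddings.py [corpus_file] [output_file_root]
    """
    # sys.executable is the interpreter running this script, so the call
    # uses the same Python environment.
    return [sys.executable, script, str(corpus_file), str(output_file_root)]


def run_embeddings(corpus_file, output_file_root):
    """Run embeddings.py and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(build_command(corpus_file, output_file_root), check=True)


# Hypothetical usage, run from the directory that contains embeddings.py:
# run_embeddings("corpus.txt", "output/vectors")
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when a path contains spaces.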

The code in this repository has the following quirks:
* Standard GloVe cannot be called from the command line with embeddings.py, although it can be run with the provided GloVe files.
* X_words and Y_words are both passed in as comma-separated lists in a command line argument.  For this reason, there cannot be a comma in any string in either of these two lists.  Spaces won't work either, although GloVe's cooccur.c doesn't count multiple-word tokens anyway.
* For the same reason, X_words + Y_words as a comma-separated string CANNOT be more than 1000 characters (including the null terminator) or a buffer overflow will occur.  The Python code checks for this, but the C code does not.  This limitation on the parameters is a bug in GloVe.
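
If you call the C code directly, you may want to check the word lists yourself first.  The sketch below is an illustration, not the check used in embeddings.py: the function name `validate_word_lists` is hypothetical, and it assumes that X_words and Y_words are joined into a single comma-separated string and that the 1000-character limit is counted in UTF-8 bytes plus one null terminator.

```python
MAX_ARG_CHARS = 1000  # limit stated above, including the null terminator


def validate_word_lists(x_words, y_words, max_chars=MAX_ARG_CHARS):
    """Return X_words + Y_words as one comma-separated string.

    Raises ValueError if a word contains a comma or whitespace, or if the
    joined string plus a null terminator would not fit in max_chars bytes.
    """
    words = list(x_words) + list(y_words)
    for word in words:
        if "," in word or any(ch.isspace() for ch in word):
            raise ValueError(f"word {word!r} contains a comma or whitespace")
    joined = ",".join(words)
    # Assumption: the C buffer holds bytes, so count the encoded length
    # and add 1 for the trailing null terminator.
    needed = len(joined.encode("utf-8")) + 1
    if needed > max_chars:
        raise ValueError(f"word lists need {needed} bytes, limit is {max_chars}")
    return joined


# Hypothetical usage:
# arg = validate_word_lists(["he", "him"], ["she", "her"])
```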

### New experiments using the proposed tests

Load weat.py and call weat.weat() with tests=('in_ja', 'in_ma', 'in_sa', 'in_ca').  See the included Jupyter notebook for more details.

### Licenses

GloVe is licensed under an Apache License; word2vec on Wikipedia and Word Embedding Benchmarks are licensed under an MIT license.  All modifications to GloVe are in cooccur.c and are labeled with comments.  Word Embedding Benchmarks is in the "web" folder, and word2vec on Wikipedia is the basis for "umbc_corpus_prepare.py".  All new code is licensed under an MIT license.  Both licenses appear in the licenses folder.
