This document provides all code and instructions necessary to reproduce the experiments and results described in the paper. The workflow covers dataset preparation, embedding generation, and evaluation on classification, clustering, and similarity tasks.

---

## Table of Contents

1. [Setup Instructions](#setup-instructions)
2. [Workflow Overview](#workflow-overview)
    - [1. Prepare Corpus](#1-prepare-corpus)
    - [2. Generate Embeddings](#2-generate-embeddings)
    - [3. Classification Task](#3-classification-task)
    - [4. Clustering Task](#4-clustering-task)
    - [5. Similarity Task](#5-similarity-task)
3. [Notes for Clustering Task](#notes-for-clustering-task)
4. [Notes for Classification Task (ELMo)](#notes-for-classification-task-elmo)
5. [Notes for Classification Task (BERT)](#notes-for-classification-task-bert)
6. [Pretrained Models and Datasets](#pretrained-models-and-datasets)
7. [Folder Structure](#folder-structure)

---

## Setup Instructions

1. **Open the repository and enter the `paper` directory:**
   ```bash
   cd paper
   ```

2. **Install dependencies:**
   The required Python packages are listed in `requirements.txt`.
   ```bash
   pip3 install -r requirements.txt
   ```

---

## Workflow Overview

You can run the entire pipeline with the provided script:
```bash
./run_all.sh
```
Or, execute each step individually as described below.
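
The individual steps can also be chained from Python. The following is a minimal sketch, assuming the script paths and the `--dataset_name` flag shown in this README; `build_pipeline` and `run_pipeline` are hypothetical helper names that are not part of the repository, and the sketch is not necessarily what `run_all.sh` does. Note that the embedding step prompts interactively for a model, so a real run still needs a terminal.

```python
import subprocess
import sys

# Step scripts as listed in this README, in pipeline order.
STEPS = [
    "prepare_corpus.py",
    "embedding-generator/main.py",
    "classification/main.py",
    "clustering/main.py",
    "similarities/main.py",
]

VALID_DATASETS = ("imdb", "1billion")


def build_pipeline(dataset_name, python=sys.executable):
    """Return one argv list per step for the given dataset (hypothetical helper)."""
    if dataset_name not in VALID_DATASETS:
        raise ValueError(f"dataset_name must be one of {VALID_DATASETS}")
    return [[python, script, "--dataset_name", dataset_name] for script in STEPS]


def run_pipeline(dataset_name, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in build_pipeline(dataset_name):
        print(" ".join(argv))
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_pipeline("imdb")  # dry run: only prints the commands
```

Keeping `dry_run=True` as the default makes it easy to inspect the commands before launching long-running jobs.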

---

### 1. Prepare Corpus

Prepare the dataset corpus (IMDB or 1 Billion Word) for embedding generation.

- **IMDB**: Downloaded automatically from Keras.
- **1 Billion Word**: Download `train_v2.txt` from [Download Link](https://drive.google.com/drive/folders/1yUE-wWSvQFQzimdBM4aBJOcjtGgIgUrV?usp=sharing) and place it in the appropriate folder.

This step converts the raw dataset into:
- `X.pickle`
- `vectorizer_X.pickle`
- `tokenized_sentences.pickle`

All output files are stored in the `data` folder.

**Run:**
```bash
python3 prepare_corpus.py --dataset_name $DATASET
```
Where `$DATASET` is either `imdb` or `1billion`.
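
To confirm that the preparation step produced all three files, they can be loaded back in Python. This is a minimal sketch, assuming the files are ordinary `pickle` dumps; `load_corpus_artifacts` is a hypothetical helper, and the exact subfolder under `data` is left to the caller. Unpickling the vectorizer will likely require the library that created it to be installed, and pickle files should only be loaded from trusted sources.

```python
import pickle
from pathlib import Path

# File names taken from the list above.
EXPECTED_FILES = ("X.pickle", "vectorizer_X.pickle", "tokenized_sentences.pickle")


def load_corpus_artifacts(folder):
    """Load the three prepared-corpus pickles from `folder` into a dict keyed by file name."""
    folder = Path(folder)
    missing = [name for name in EXPECTED_FILES if not (folder / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing in {folder}: {', '.join(missing)}")

    artifacts = {}
    for name in EXPECTED_FILES:
        with open(folder / name, "rb") as fh:
            artifacts[name] = pickle.load(fh)
    return artifacts
```

For example, `load_corpus_artifacts("data/imdb")` would return the three objects if that is where the script wrote them (the path here is an assumption).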

---

### 2. Generate Embeddings

Generate embeddings for the selected model using the prepared corpus.

- Configure hyperparameters in `embedding-generator/config.py` if needed.

**Run:**
```bash
python3 embedding-generator/main.py --dataset_name $DATASET
```

You will be prompted to select the embedding model:

1. Word2Vec
2. FastText
3. GloVe
4. Omni TM-AE

The output embedding files are stored in `data/$DATASET/`, where `$DATASET` is the selected dataset (`imdb` or `1billion`).

Pretrained embedding files are also available at:  
[Download Link](https://drive.google.com/drive/folders/1yUE-wWSvQFQzimdBM4aBJOcjtGgIgUrV?usp=sharing)

- Word2Vec: `word2vec.model`
- FastText: `fasttext.model`, `fasttext.model.wv.vectors_ngrams.npy`
- GloVe: `glove.model`
- Omni TM-AE: `omnitm.model`, `omnitm_alternatives_cache.pickle` (for classification)

The `omnitm.model` file is available at:
[Omni Download Link](https://drive.google.com/file/d/1V2b-zySkwTd0CECIkIACuNEj3zlTLgCN/view?usp=sharing)

---

### 3. Classification Task

Evaluate embeddings on sentiment analysis using various classifiers.

**Run:**
```bash
python3 classification/main.py --dataset_name $DATASET
```

- Embedding models evaluated: `["word2vec", "fasttext", "glove", "omnitm", "bert", "elmo"]`
- Classifiers used:
    - RandomForestClassifier
    - LogisticRegression
    - MultinomialNB
    - LinearSVC
    - MLPClassifier
    - TM Classifiers

Results are saved as CSV files containing accuracy scores for each classifier and embedding model.
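
Once the CSV files exist, a short script can summarize them. The sketch below picks the best classifier per embedding model. The column names `embedding`, `classifier`, and `accuracy` are assumptions, since the actual CSV layout is not documented here, so adjust them to match the generated files.

```python
import csv
import io


def best_classifier_per_embedding(rows):
    """Map each embedding to the (classifier, accuracy) pair with the highest accuracy."""
    best = {}
    for row in rows:
        embedding = row["embedding"]        # assumed column name
        accuracy = float(row["accuracy"])   # assumed column name
        if embedding not in best or accuracy > best[embedding][1]:
            best[embedding] = (row["classifier"], accuracy)
    return best


if __name__ == "__main__":
    # Placeholder rows with dummy values, NOT results from the paper.
    sample = io.StringIO(
        "embedding,classifier,accuracy\n"
        "word2vec,LogisticRegression,0.5\n"
        "word2vec,LinearSVC,0.6\n"
        "omnitm,LogisticRegression,0.7\n"
    )
    print(best_classifier_per_embedding(csv.DictReader(sample)))
```

To use it on a real file, pass `csv.DictReader(open(path, newline=""))` in place of the in-memory sample.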

---

### 4. Clustering Task

Evaluate embeddings on document clustering tasks.

**Run:**
```bash
python3 clustering/main.py --dataset_name $DATASET
```

- Embedding models evaluated: `["word2vec", "fasttext", "glove", "omnitm"]`
- Datasets: `["20newsgroups", "reuters", "yelp", "amazon", "ag_news"]`

Dataset files can be found at:  
[Download Link](https://drive.google.com/drive/folders/1yUE-wWSvQFQzimdBM4aBJOcjtGgIgUrV?usp=sharing)

Results are saved as CSV files with NMI and ARI scores.
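
For readers unfamiliar with the two metrics, the following sketch computes them from scratch with the standard library. It is an illustration, not the repository's implementation; scikit-learn provides `adjusted_rand_score` and `normalized_mutual_info_score`, and whether the clustering script uses them is an assumption. The NMI here normalizes by the arithmetic mean of the two entropies, which is one common convention.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index: pair-counting agreement corrected for chance."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label lists must have the same length")
    if n < 2:
        return 1.0
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one cluster
    return (sum_cells - expected) / (max_index - expected)


def normalized_mutual_info(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("label lists must have the same length")
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mutual_info = sum(
        (c / n) * log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    entropy_true = -sum((c / n) * log(c / n) for c in count_true.values())
    entropy_pred = -sum((c / n) * log(c / n) for c in count_pred.values())
    mean_entropy = (entropy_true + entropy_pred) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0
```

Both scores are invariant to how cluster IDs are named, so a clustering that matches the ground truth up to a relabeling scores 1.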

---

### 5. Similarity Task

Evaluate embeddings on word similarity datasets annotated by humans.

**Run:**
```bash
python3 similarities/main.py --dataset_name $DATASET
```

- Datasets: `["rg-65", "wordsim353-sim", "mturk-287", "mturk-771", "men", "simlex999"]`
- Embedding models evaluated: `["word2vec", "fasttext", "glove", "omnitm"]`

To control whether pretrained embeddings are used or the models are retrained on each iteration, set:
```python
use_pretrained = False
```

Results are saved as CSV files with Spearman and Kendall correlation metrics.
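
Both metrics compare the ranking of word pairs by model similarity with the ranking by human scores. The sketch below implements them with the standard library for illustration only; SciPy offers `scipy.stats.spearmanr` and `scipy.stats.kendalltau`, and whether the repository uses them is an assumption. Inputs are two equal-length lists of scores, one entry per word pair.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b via an O(n^2) scan over all pairs, with tie correction."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from every count
        if dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom
```

The quadratic Kendall loop is fine for benchmark sizes of a few thousand pairs; for larger inputs a library implementation is the better choice.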

---

## Notes for Clustering Task

To run the clustering task efficiently on a GPU, set up RAPIDS cuML as follows:

1. **Install Required System Packages**
   ```bash
   sudo apt install python3.10-venv
   ```

2. **Create and Activate a Virtual Environment**
   ```bash
   python3.10 -m venv my-rapids-env
   source my-rapids-env/bin/activate
   ```

3. **Install RAPIDS cuML (CUDA 12)**
   ```bash
   pip install cuml-cu12 --extra-index-url=https://pypi.nvidia.com
   ```

4. **Install Additional Python Libraries**
   ```bash
   pip install scikit-learn nltk gensim
   ```

**Run the clustering script:**
```bash
python3 clustering/main.py
```
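
Before launching a long run, it can help to check which clustering library the active environment actually provides. This is a minimal sketch using only the standard library; `first_available` is a hypothetical helper, and treating scikit-learn as a CPU fallback is an assumption about how one might use the result, not a description of what `clustering/main.py` does.

```python
import importlib.util


def first_available(module_names):
    """Return the first importable top-level module name, or None if none is installed."""
    for name in module_names:
        # find_spec locates the module without importing it, so this check is cheap.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    backend = first_available(["cuml", "sklearn"])
    if backend is None:
        print("Neither cuML nor scikit-learn is installed in this environment.")
    else:
        print(f"Clustering backend available: {backend}")
```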

---

## Notes for Classification Task (ELMo)

Some ELMo dependencies require a specific setup.

1. **Create a Virtual Environment**
   ```bash
   cd /workspace
   python3 -m venv elmo_env
   ```

2. **Activate the Environment**
   ```bash
   source elmo_env/bin/activate
   ```

3. **Install Dependencies**
   - Upgrade pip:
     ```bash
     pip install --upgrade pip
     ```
   - Install TensorFlow and TensorFlow Hub:
     ```bash
     pip install tensorflow==2.12.0 tensorflow-hub
     ```
   - Install other dependencies:
     ```bash
     pip install pycuda numpy tqdm scikit-learn fasttext torch transformers pandas tmu
     ```

4. **Run the code using the new environment for ELMo tasks.**

---

## Notes for Classification Task (BERT)

To run BERT-based classification tasks and avoid the `ImportError` for `AdamW`, use the following environment setup:

1. **Create a new Python virtual environment:**
   ```bash
   python3 -m venv bert_env
   source bert_env/bin/activate
   ```

2. **Upgrade pip:**
   ```bash
   pip install --upgrade pip
   ```

3. **Install compatible versions of torch and transformers:**
   ```bash
   pip install torch transformers==4.28.0
   ```

4. **Install other required libraries as needed:**
   ```bash
   pip install pandas scikit-learn tmu tensorflow
   ```
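
Because the ELMo and BERT setups depend on pinned versions, a quick check of the active environment can save a failed run. The sketch below uses `importlib.metadata` from the standard library; the pins are the ones given in this README (`tensorflow==2.12.0` for ELMo, `transformers==4.28.0` for BERT), and `check_pins` is a hypothetical helper.

```python
from importlib import metadata

# Version pins taken from the ELMo and BERT notes in this README.
ELMO_PINS = {"tensorflow": "2.12.0"}
BERT_PINS = {"transformers": "4.28.0"}


def check_pins(pins):
    """Return {package: (installed_version_or_None, matches_pin)} for each pinned package."""
    report = {}
    for package, wanted in pins.items():
        try:
            found = metadata.version(package)
        except metadata.PackageNotFoundError:
            found = None  # package is not installed in this environment
        report[package] = (found, found == wanted)
    return report


if __name__ == "__main__":
    for package, (found, ok) in check_pins(BERT_PINS).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{package}: installed={found} [{status}]")
```

Run it inside the activated virtual environment, swapping in `ELMO_PINS` when checking the ELMo setup.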

---

## Pretrained Models and Datasets

All required datasets and pretrained embedding files can be downloaded from:  
[Download Link](https://drive.google.com/drive/folders/1yUE-wWSvQFQzimdBM4aBJOcjtGgIgUrV?usp=sharing)

---

## Folder Structure

```
paper/
├── README                # This document
├── requirements.txt      # Python dependencies
├── run_all.sh            # Script to run the full pipeline
├── prepare_corpus.py     # Script for dataset preparation
├── data/                 # Folder for processed data and models
│   ├── imdb              # IMDB folder for models and dataset corpus
│   ├── 1billion          # 1 Billion Word folder for models and dataset corpus
│   └── ...               # Other dataset files (CSV and JSON files)
├── embedding-generator/
│   ├── main.py           # Embedding generation entry point
│   ├── config.py         # Configuration for embedding generation
│   └── ...               # Other folders and files
├── classification/
│   ├── main.py           # Classification task entry point
│   └── ...               # Other folders and files
├── clustering/
│   ├── main.py           # Clustering task entry point
│   └── ...               # Other files
├── similarities/
│   ├── main.py           # Similarity task entry point
│   ├── config.py         # Configuration for embedding generation
│   └── ...               # Other folders and files
└── ...                   # Other auto-generated result CSV files
```

---