# **Embedding Similarity and Data Augmentation**

This repository contains tools and workflows for embedding similarity and data augmentation. It includes both **Deep Learning (DL)** and **Tsetlin Machine (TM)** approaches for embedding generation, knowledge collection, and sentiment analysis augmentation.

---

## **Dependencies**

Ensure the following dependencies are installed:

1. **System-Level Requirements**:
   ```bash
   sudo apt install libffi-dev
   ```

2. **Python Libraries**:
   - Required libraries vary by use case (see individual sections below).
   - Install additional libraries as needed with `pip`.

---

## **Setup Instructions**

### **Tsetlin Machine Unified (TMU)**
Install TMU (Tsetlin Machine Unified) from the local `tmu` directory:
   ```bash
   cd tmu
   pip install .
   ```
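
To confirm the installation worked before opening the notebooks, a quick import check can help. The sketch below uses only the standard library; `missing_packages` is a hypothetical helper name, not something shipped with this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # `tmu` is the import name of the package installed above.
    missing = missing_packages(["tmu"])
    if missing:
        print("Not importable:", ", ".join(missing))
    else:
        print("tmu is importable")
```

The same helper can be given the other library names used later (for example `numpy`) to check a whole environment at once.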

---

## **Embedding Similarity for Tsetlin Machines (TM)**

### **1. Training Data Preparation**
Use the `Training.ipynb` notebook to prepare the "One Billion Word" dataset:
- Convert the `train_v2.txt` file into `X.pickle` and `vectorizer_X.pickle`.
- During this process, set the vocabulary and ensure all target words are included.
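
The vocabulary step above can be sketched as follows. This is a minimal standard-library illustration and not the code in `Training.ipynb`: the function names, the whitespace tokenizer, the binary encoding, and the choice to store a plain vocabulary dictionary in `vectorizer_X.pickle` are all assumptions made for the example. The point it shows is how to cap the vocabulary by frequency while still guaranteeing that every target word has an index.

```python
import os
import pickle
import tempfile
from collections import Counter


def build_vocabulary(sentences, target_words, max_size):
    """Keep the `max_size` most frequent tokens, then force every target word in."""
    counts = Counter(token for s in sentences for token in s.lower().split())
    vocab = [word for word, _ in counts.most_common(max_size)]
    seen = set(vocab)
    for word in target_words:
        if word not in seen:
            vocab.append(word)
            seen.add(word)
    return {word: index for index, word in enumerate(vocab)}


def binarize(sentences, vocabulary):
    """Encode each sentence as the sorted vocabulary indices it contains."""
    rows = []
    for s in sentences:
        indices = {vocabulary[t] for t in s.lower().split() if t in vocabulary}
        rows.append(sorted(indices))
    return rows


if __name__ == "__main__":
    # Toy stand-in for lines read from train_v2.txt.
    sentences = ["The cat sat on the mat", "A dog chased the cat"]
    targets = ["cat", "dog", "bird"]

    vocabulary = build_vocabulary(sentences, targets, max_size=3)
    X = binarize(sentences, vocabulary)

    # Written to a temporary directory here; the file names mirror the README.
    with tempfile.TemporaryDirectory() as tmp:
        for name, obj in (("X.pickle", X), ("vectorizer_X.pickle", vocabulary)):
            with open(os.path.join(tmp, name), "wb") as handle:
                pickle.dump(obj, handle)

    print("vocabulary size:", len(vocabulary))
```

Target words that never occur in the corpus (such as `bird` in the toy data) still receive an index, so later lookups for them do not fail.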

### **2. Knowledge File Collection**
Run the `Phase1.ipynb` notebook:
- Use the first cell to collect knowledge files for a specific dataset or the second cell for the entire dataset.

### **3. Embedding Collection**
Run the `Phase2.ipynb` notebook:
- Collect embeddings for the target words.
- For large target word datasets, use the batched cell to improve efficiency.
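
The idea behind batching can be sketched as below. This is not the batched cell from `Phase2.ipynb`; `batched`, `collect_in_batches`, and the `collect_fn` callback are hypothetical names, and the callback stands in for whatever routine actually produces the embeddings for a list of words.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def collect_in_batches(target_words, batch_size, collect_fn):
    """Call `collect_fn` once per batch and merge the per-word results."""
    results = {}
    for batch in batched(target_words, batch_size):
        results.update(collect_fn(batch))
    return results


if __name__ == "__main__":
    words = ["cat", "dog", "bird", "fish", "horse"]

    # Placeholder collector: maps each word to its length instead of an embedding.
    def fake_collect(batch):
        return {word: len(word) for word in batch}

    print(collect_in_batches(words, batch_size=2, collect_fn=fake_collect))
```

Processing a fixed number of words per call keeps memory use bounded and makes it easy to resume after an interruption.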

---

## **Embedding Similarity for Deep Learning (DL)**

For each model, use the corresponding notebook to collect embeddings for target words:
1. **Word2Vec**: `Word2Vec.ipynb`
2. **FastText**: `FastText.ipynb`
3. **GloVe**: `GloVe.ipynb`

Follow the instructions in the notebook to extract embeddings for the desired dataset.
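
Once embeddings are extracted, comparing target words usually comes down to a vector similarity measure. The README does not say which measure the notebooks use; the sketch below assumes cosine similarity, a common choice, and uses made-up two-dimensional vectors in place of real Word2Vec, FastText, or GloVe embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(word, embeddings, top_k=3):
    """Rank every other word in `embeddings` by cosine similarity to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine_similarity(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy vectors for illustration only.
    toy = {"good": [1.0, 0.0], "great": [0.9, 0.1], "bad": [-1.0, 0.0]}
    for neighbor, score in most_similar("good", toy):
        print(f"{neighbor}: {score:.3f}")
```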

---

## **Data Augmentation for Sentiment Analysis**

### **Step 1: Knowledge File Collection**
Navigate to the `IMDB` directory and run the following commands to prepare the data:
1. Install required libraries:
   ```bash
   pip install numpy scikit-learn tensorflow
   ```
2. Prepare the dataset:
   ```bash
   python prepare.py
   ```
3. Install Tsetlin Machine Unified:
   ```bash
   pip install git+https://github.com/cair/tmu.git
   ```
4. Collect embeddings:
   ```bash
   python collect.py
   ```

The resulting embedding files will be stored in the `IMDbKnowledge` directory.

### **Step 2: Data Augmentation**
Inside the `Augmentation/IMDB` directory:
- Each notebook (`BERT.ipynb`, `GloVe.ipynb`, etc.) contains cells for:
  1. Preparing the data.
  2. Building the augmented data.
  3. Classification and scoring.

Follow the notebook-specific steps for augmentation and sentiment analysis.
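
As a rough illustration of how embedding neighbors can drive augmentation, the sketch below replaces words with similar words at random. This is an assumption about the general technique, not a description of what the notebooks do: `augment_sentence`, the `neighbors` mapping, and `replace_prob` are hypothetical names chosen for the example.

```python
import random


def augment_sentence(sentence, neighbors, replace_prob=0.3, rng=None):
    """Return a copy of `sentence` where some words are swapped for a listed neighbor.

    `neighbors` maps a lowercase word to a list of similar words, for example
    the nearest neighbors of that word in an embedding space.
    """
    rng = rng or random.Random()
    output = []
    for token in sentence.split():
        candidates = neighbors.get(token.lower())
        if candidates and rng.random() < replace_prob:
            output.append(rng.choice(candidates))
        else:
            output.append(token)
    return " ".join(output)


if __name__ == "__main__":
    # Toy neighbor lists; real ones would come from the collected embeddings.
    neighbors = {"good": ["great", "fine"], "movie": ["film"]}
    rng = random.Random(0)  # fixed seed so repeated runs give the same output
    review = "A good movie with a good cast"
    for _ in range(3):
        print(augment_sentence(review, neighbors, replace_prob=0.5, rng=rng))
```

Each augmented sentence keeps the label of the original review, so the classifier sees more varied wording for the same sentiment.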

---

## **Directory Structure**

```plaintext
.
├── Augmentation
│   └── IMDB
│       ├── BERT.ipynb
│       ├── directories.py
│       ├── ELMO.ipynb
│       ├── FastText.ipynb
│       ├── GloVe.ipynb
│       ├── TM.ipynb
│       ├── tools.py
│       └── Word2Vec.ipynb
└── EmbeddingSimilarity
    ├── DL
    │   └── ...
    └── TM
        ├── datasets
        ├── result
        ├── db.py
        ├── DirectoriesUtil.py
        ├── Evaluation.py
        ├── Knowledge.py
        ├── Phase1.ipynb
        ├── Phase2.ipynb
        ├── Tools.py
        └── Training.ipynb
```