# MDIR: Matrix-Driven Detection and Reconstruction of LLM Weight Homology

This repository contains the code and figures for the article **"Matrix-Driven Detection and Reconstruction of LLM Weight Homology."** It provides tools to compare large language models (LLMs) by detecting and reconstructing weight homologies between their layers.

## Table of Contents
- [MDIR: Matrix-Driven Detection and Reconstruction of LLM Weight Homology](#mdir-matrix-driven-detection-and-reconstruction-of-llm-weight-homology)
  - [Table of Contents](#table-of-contents)
  - [Getting Started](#getting-started)
  - [Main Usages](#main-usages)
    - [1. Compare Two Models](#1-compare-two-models)
      - [Basic Usage](#basic-usage)
      - [Supported Model Formats](#supported-model-formats)
      - [Options](#options)
      - [Example Command](#example-command)
    - [2. Reproduce the Figures in the Article](#2-reproduce-the-figures-in-the-article)
      - [Prerequisites](#prerequisites)
      - [Steps](#steps)
  - [Additional Notes](#additional-notes)
  - [Hardware Requirements](#hardware-requirements)

---

## Getting Started

To set up the environment, install the required dependencies using the following command:

```bash
pip install -r requirements.txt
```

Ensure you have Python 3.10 or higher installed.

---

## Main Usages

### 1. Compare Two Models

Use the script `main_mdir.py` to compare two models and analyze their weight homologies.

#### Basic Usage
```bash
python main_mdir.py --model_A_dir <your_model_A> --model_B_dir <your_model_B> --num_layers 10
```

#### Supported Model Formats
The script supports models in the **HuggingFace format**. Replace `<your_model_A>` and `<your_model_B>` with the actual directories of your models.

#### Options
| Option                  | Description                                                                 |
|-------------------------|-----------------------------------------------------------------------------|
| `--model_A_dir`         | Directory for model A                                                      |
| `--model_B_dir`         | Directory for model B                                                      |
| `--output_dir`          | Directory to store output figures (default: `./output`)                    |
| `--solve_mlp`           | Solve weight relations between MLP layers (default: `1`)                   |
| `--solve_attn`          | Solve weight relations between attention layers (default: `1`)             |
| `--num_layers`          | Number of layers to analyze                                                |
| `--plot_all`            | Plot all figures (default: `1`)                                            |
| `--plot_full`           | Generate full-sized figures (default: `0`). Note: This takes more time and disk space. |
| `--head_size`           | Attention head size (default: `128`)                                       |
| `--mlp_heuristic`       | Use a heuristic to speed up solving linear assignments for MLP layers (default: `1`) |
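
If you are unsure which values to pass for `--num_layers` and `--head_size`, they can usually be read from the model's `config.json`. The sketch below is not part of this repository. It assumes a Llama-style config with the keys `num_hidden_layers`, `hidden_size`, `num_attention_heads`, and optionally `head_dim`; other architectures may use different key names. The helper name `suggest_cli_values` is hypothetical.

```python
import json
from pathlib import Path


def suggest_cli_values(model_dir):
    """Suggest --num_layers and --head_size from a HuggingFace-style config.json.

    Assumes Llama-style key names; adjust them for other architectures.
    """
    config = json.loads((Path(model_dir) / "config.json").read_text())
    num_layers = config["num_hidden_layers"]
    # Some configs store the head size explicitly; otherwise derive it
    # from the hidden size and the number of attention heads.
    head_size = config.get("head_dim") or (
        config["hidden_size"] // config["num_attention_heads"]
    )
    return {"num_layers": num_layers, "head_size": head_size}


# Example (path is a placeholder):
# print(suggest_cli_values("./models/model_a"))
```

Note that `num_layers` here is the total layer count of the model; you may want to pass a smaller value to `--num_layers` to analyze only a subset.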

#### Example Command
```bash
python main_mdir.py \
    --model_A_dir ./models/model_a \
    --model_B_dir ./models/model_b \
    --num_layers 10
```
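
To compare several model pairs in a row, the command line can be assembled programmatically. The following is a minimal sketch, not something shipped with this repository: `build_mdir_command` is a hypothetical helper, and the paths are placeholders. It only uses flags listed in the options table above.

```python
import shlex


def build_mdir_command(model_a_dir, model_b_dir, num_layers, **options):
    """Return the argument list for one main_mdir.py run (hypothetical helper).

    Extra keyword arguments are turned into `--name value` pairs, so they
    must match flags from the options table (e.g. output_dir, mlp_heuristic).
    """
    cmd = [
        "python", "main_mdir.py",
        "--model_A_dir", str(model_a_dir),
        "--model_B_dir", str(model_b_dir),
        "--num_layers", str(num_layers),
    ]
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_mdir_command(
    "./models/model_a", "./models/model_b", 10,
    output_dir="./output/a_vs_b", mlp_heuristic=0,
)
# Print the command in a copy-pasteable, shell-quoted form.
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to launch the comparison.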

---

### 2. Reproduce the Figures in the Article

To reproduce the figures presented in the article, follow these steps:

#### Prerequisites
1. **Download Required Models**:  
   You **don't** need to download the full model weights. Only the following files must be present:
   - Configuration files (`config.json`)
   - Tokenizer files (`tokenizer.json`, etc.)
   - `model.safetensors.index.json`
   - The specific shard containing `"model.embed_tokens.weight"` (usually the first shard, e.g. `model-00001-of-xxxxx.safetensors`).

   This approach minimizes disk space usage.

2. **Set Paths**:  
   Update the paths in the notebooks to point to your downloaded model directories.
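
Rather than guessing which shard holds the embedding matrix, you can look it up in `model.safetensors.index.json`, whose `weight_map` entry maps each tensor name to the shard file that stores it. The sketch below is a small standalone helper, not part of the notebooks; the function name `find_shard` is hypothetical.

```python
import json
from pathlib import Path


def find_shard(model_dir, tensor_name="model.embed_tokens.weight"):
    """Return the name of the shard file that stores `tensor_name`."""
    index_path = Path(model_dir) / "model.safetensors.index.json"
    index = json.loads(index_path.read_text())
    # "weight_map" maps every tensor name to the shard file containing it.
    return index["weight_map"][tensor_name]


# Example (path is a placeholder):
# print(find_shard("./models/model_a"))
```

Download the index file first, run the helper, and then fetch only the shard it reports.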

#### Steps
- Run `draw_trace_p.ipynb` to reproduce **Figure 3**.
- Run `layer_match.ipynb` to reproduce **Figure 8**.

---

## Additional Notes

- **Performance Considerations**:  
  Generating full-sized figures (`--plot_full 1`) can be computationally expensive and may require significant disk space. Use this option only if necessary.

- **Heuristic Solution**:  
  By default, the script uses a heuristic to speed up solving the `linear_assignment` problem for MLP layers. If you encounter issues or want exact results, disable this feature using `--mlp_heuristic 0`.
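
To illustrate why a heuristic and an exact solver can disagree on a linear assignment problem, here is a toy sketch. It is an assumption-laden illustration only: the greedy rule below is **not** the heuristic implemented in `main_mdir.py`, the similarity matrix is made up, and the brute-force solver is viable only for tiny inputs (a practical exact solver would typically be something like `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def total(score, perm):
    """Sum of the scores selected by assigning row i to column perm[i]."""
    return sum(score[i][j] for i, j in enumerate(perm))


def exact_assignment(score):
    """Brute-force the permutation with the highest total score (tiny n only)."""
    n = len(score)
    return max(permutations(range(n)), key=lambda perm: total(score, perm))


def greedy_assignment(score):
    """Illustrative heuristic: each row takes its best still-unused column."""
    used, perm = set(), []
    for row in score:
        best = max((c for c in range(len(row)) if c not in used),
                   key=lambda c: row[c])
        used.add(best)
        perm.append(best)
    return tuple(perm)


# Made-up similarity between 3 units of model A (rows) and model B (columns).
score = [
    [0.90, 0.80, 0.10],
    [0.95, 0.20, 0.10],
    [0.10, 0.30, 0.70],
]

exact = exact_assignment(score)
greedy = greedy_assignment(score)
print("exact :", exact, round(total(score, exact), 2))
print("greedy:", greedy, round(total(score, greedy), 2))
```

In this toy case the greedy rule commits row 0 to column 0 too early and ends up with a lower total than the exact solution, which is the kind of discrepancy `--mlp_heuristic 0` is meant to rule out.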

---

## Hardware Requirements
The code has been tested on the following hardware configuration:

- CPU: AMD Ryzen 9 7950X
- GPU: NVIDIA RTX 4090 (24GB VRAM)
- Memory: 128GB DDR5 RAM
- Disk: Fanxiang 4TB SSD

The code may also run on less powerful hardware. 