## Efficient Large Language Model Inference with Neural Block Linearization

## Quick Start

#### Installation

```bash
conda create -n llm-drop python=3.10
conda activate llm-drop

# For NBL:
cd ./NBL
pip install -r requirements.txt
```

## Run NBL

#### Attn NBL on Llama-3.1-8B
```bash
bash scripts/apply_nbl/layer_nbl_llama.sh
```

#### Attn NBL on Mistral-7B
```bash
bash scripts/apply_nbl/layer_nbl_mistral.sh
```

These bash scripts compute importance scores for the blocks/layers, determine which blocks/layers to retain, and create new model configuration files that list the dropped modules. The computed model weights and CCA bound values are saved under `/llm_variables`. Once the scripts finish, the compressed models are saved under the `../results_prune/cache/` directory. These model paths can then be passed to the evaluation scripts described below.
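
To make the scoring step more concrete, the following is a minimal NumPy sketch of fitting a linear replacement for a block from calibration activations and scoring it with canonical correlations. It is not the repository's implementation: the function names, the ridge term `eps`, and the score `sum(1 - rho_i^2)` are assumptions chosen for illustration, and the exact CCA bound used by the scripts may differ.

```python
import numpy as np


def fit_linear_block(X, Y, eps=1e-6):
    """Least-squares fit of Y ~ X @ W.T + b (hypothetical helper).

    X: (n_samples, d_in) block inputs, Y: (n_samples, d_out) block outputs.
    """
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    C_xx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])  # small ridge for stability
    C_yx = Yc.T @ Xc / n
    W = np.linalg.solve(C_xx, C_yx.T).T  # W = C_yx C_xx^{-1}
    b = mu_y - W @ mu_x
    return W, b


def canonical_correlations(X, Y, eps=1e-6):
    """Canonical correlations between X and Y via whitening and an SVD."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    C_xx = Xc.T @ Xc / n + eps * np.eye(X.shape[1])
    C_yy = Yc.T @ Yc / n + eps * np.eye(Y.shape[1])
    C_xy = Xc.T @ Yc / n
    Lx, Ly = np.linalg.cholesky(C_xx), np.linalg.cholesky(C_yy)
    # M = Lx^{-1} C_xy Ly^{-T}; its singular values are the correlations.
    M = np.linalg.solve(Ly, np.linalg.solve(Lx, C_xy).T).T
    rho = np.linalg.svd(M, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)


def linearization_score(X, Y):
    """Assumed score: lower means the block is closer to a linear map."""
    rho = canonical_correlations(X, Y)
    return float(np.sum(1.0 - rho ** 2))
```

Under this sketch, blocks with the lowest score would be the first candidates for replacement by the fitted `W` and `b`.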

## Benchmarks
#### Performance
Evaluate the model on specific tasks after some modules have been dropped:
```bash
bash scripts/benchmark/benchmark_lm_eval_llama.sh
```
```bash
bash scripts/benchmark/benchmark_lm_eval_mistral.sh
```

The code for NBL is adapted from [CASE-Lab-UMD/LLM-Drop](https://github.com/CASE-Lab-UMD/LLM-Drop).

The evaluation code is based on [EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness). To fully reproduce our results, please use [this version](https://github.com/s1ghhh/lm-evaluation-harness). It selects few-shot examples based on each sample's index, so results do not vary with the number of processes used for data-parallel inference.
Remember to load the Mistral and Llama models with the modeling files in `src/llmtuner/model`. That directory contains the configuration and modeling files; the modeling files have been updated to include the NBL linear weights.
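
The idea behind index-based few-shot sampling can be sketched as follows. This is an illustration only, not code from the linked fork: `fewshot_indices`, `pool_size`, and `base_seed` are hypothetical names. The point is that the random generator is seeded from the sample's own index, so the chosen examples do not depend on which process handles the sample or in what order.

```python
import random


def fewshot_indices(doc_index, pool_size, k, base_seed=1234):
    """Pick k few-shot example indices for one evaluation sample.

    Seeding from doc_index (rather than sharing one generator across a
    process) keeps the choice identical for any number of workers.
    """
    rng = random.Random(base_seed + doc_index)
    return rng.sample(range(pool_size), k)


# The same sample always gets the same few-shot examples,
# regardless of how many other samples were processed first.
first = fewshot_indices(doc_index=7, pool_size=100, k=5)
_ = fewshot_indices(doc_index=3, pool_size=100, k=5)
again = fewshot_indices(doc_index=7, pool_size=100, k=5)
```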

#### Speedup
Measure the speedup ratio of the model after some modules have been dropped:
```bash
bash scripts/benchmark/benchmark_speed.sh
```
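
For reference, a speedup ratio is typically the baseline latency divided by the compressed model's latency. The sketch below shows one way to measure it; the helper names are hypothetical and this is not what `benchmark_speed.sh` necessarily does.

```python
import statistics
import time


def median_latency(fn, warmup=2, repeats=5):
    """Median wall-clock time of fn() in seconds, after warmup calls.

    On a GPU, fn should synchronize the device before returning,
    otherwise only the kernel launch time is measured.
    """
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)


def speedup_ratio(baseline_seconds, compressed_seconds):
    """Values above 1.0 mean the compressed model is faster."""
    if compressed_seconds <= 0:
        raise ValueError("compressed_seconds must be positive")
    return baseline_seconds / compressed_seconds
```

Using the median rather than the mean makes the estimate less sensitive to occasional slow runs.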

#### Quantization
Please refer to [AutoAWQ](https://github.com/casper-hansen/AutoAWQ), and make sure the packages you install match your CUDA version.
To quantize the model, run the following script:
```bash
python quantize.py
```
This creates the quantized Llama-3.1-8B model presented in the paper. The NBL code will be better adapted to quantized models after the review process.


#### Speculative Decoding
The directory "Speculative/Eagle/" contains the speculative decoding code with NBL added. To run speculative decoding + NBL, use the following script:

```bash
bash Speculative/EAGLE/run_speculative_mt_bench.sh
```

#### LoRA Fine-Tuning on the NBL Linear Weights
The directory "LoRA" contains the code to reproduce the fine-tuning experiments. First, run LoRA fine-tuning to obtain the tuned weights:

```bash
python lora.py
```
Then, fuse the tuned layers into the NBL-applied model:
```bash
python lora_save.py
```
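
In the standard LoRA formulation, fusing means adding the scaled low-rank update to the base weight, `W + (alpha / r) * B @ A`. The sketch below shows that operation on a single linear weight; `fuse_lora` is a hypothetical helper, and whether `lora_save.py` performs the merge exactly this way is an assumption.

```python
import numpy as np


def fuse_lora(W, A, B, alpha):
    """Merge a LoRA update into a linear weight.

    W: (d_out, d_in) base weight, e.g. an NBL linear layer.
    A: (r, d_in) and B: (d_out, r) are the LoRA factors.
    """
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)


# After fusing, one matmul reproduces base path + adapter path.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))
A = rng.normal(size=(2, 6))
B = rng.normal(size=(4, 2))
x = rng.normal(size=6)
fused_out = fuse_lora(W, A, B, alpha=16.0) @ x
split_out = W @ x + (16.0 / 2) * (B @ (A @ x))
```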

#### Calibration Runtimes

To check the GPU-based NBL implementation, go to the directory "Calibration Runtime" and run the script:

```bash
python calc.py
```
