# Dynamic-Precision LLM
## Installation
First, install Any-Precision LLM.
```bash
cd any-precision-llm
pip install -r requirements.txt
cd any_precision/modules/kernels
pip install .
```
Then, to enable asynchronous estimation, overwrite `modeling_llama.py` in the installed `transformers` package with the modified version.
```bash
cp any-precision-llm/modeling/modeling_llama.py /transformers_installation_path/models/llama/modeling_llama.py
```
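The `/transformers_installation_path` placeholder above depends on your environment. As a minimal sketch (the helper name `package_dir` is ours, not part of this repository), the standard library can locate the package directory without importing `transformers` itself:

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return list(spec.submodule_search_locations)[0]


if __name__ == "__main__":
    root = package_dir("transformers")
    if root is None:
        print("transformers is not installed in this environment")
    else:
        # Target file that the cp command above overwrites
        print(os.path.join(root, "models", "llama", "modeling_llama.py"))
```

Consider backing up the original `modeling_llama.py` before overwriting it, so the stock `transformers` behavior can be restored later.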
The `lm-evaluation-harness` package prints model info at the end of evaluation. With our current implementation of DP-LLM, this step can raise a maximum recursion depth error. To avoid it, comment out the following code in `evaluator.py` of `lm-evaluation-harness`:
```python
if isinstance(lm, lm_eval.models.huggingface.HFLM):
    results["config"].update(lm.get_model_info())
``` 
## Running DP-LLM
### Step 1. Find layer-wise maximum precision
Run `find_maxmem.py` to find layer-wise maximum precision.
```bash
python find_maxmem.py --maxmem 5.0 --targ_bits 3.5 4.0 4.5 --hessian_path path/to/hessian --w_path path/to/weights --save_dir /save/path/for/maxmem
```
### Step 2. Fine-tune to find layer-wise average precision
Run `fine_tune.py` to find layer-wise average precision.
```bash
python fine_tune.py --maxmem 5.0 --targ_bits 3.5 --model_path /path/to/any/precision/model --maxmem_dir /path/to/saved/maxmem/dir --save_path /save/path/for/average/precision
```
### Step 3. Save estimator parameters and threshold
Run `save_estimator.py` to create random-projection-based error estimators.
```bash
python save_estimator.py --arr_path /path/to/average/precision --x_path /path/to/saved/activations/ --w_path /path/to/weights --err_dir /save/path/for/estimator --yerr_dir /save/path/for/cached/intermediate/results
```
Then, run `save_th.py` to translate average precisions to thresholds for each layer.
```bash
python save_th.py --arr_path /path/to/average/precision --x_path /path/to/saved/activations/ --w_path /path/to/weights --err_dir /save/path/for/estimator --yerr_dir /save/path/for/cached/intermediate/results
```
### Step 4. Use DP-LLM
In your Python code, create a DP-LLM model with `AnyPrecisionForCausalLM_3456`.
```python
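from any_precision import AnyPrecisionForCausalLM_3456

# model_path: path to the quantized Any-Precision model
# path_dict: dictionary of file paths, described below
model = AnyPrecisionForCausalLM_3456.from_quantized(
    model_path,
    precisions=[3, 4, 5, 6],
    mode="jl",
    path_dict=path_dict,
)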
```
The `path_dict` variable should be a dictionary containing the file paths that the model needs.
```python
path_dict = {
    "corr_arr_path": args.corr_arr_path,
    "max_mem_dict_path": args.max_mem_dict_path,
    # Each *_fn entry maps an indexable key x to a file path
    "targ_path_fn": lambda x: f"{args.targ_path}/{x[0]}_{x[1]}_targ.pt",
    "jl_path_fn": lambda x: f"{args.jl_path}/{x[0]}_{x[1]}_jl.pt",
}
```
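To illustrate how the two `*_fn` entries resolve to file names, here is a small sketch. The directory names and the key `(0, "q_proj")` are hypothetical: we assume `x` is a pair such as a layer index and a module name, which the snippet above does not confirm.

```python
from types import SimpleNamespace

# Hypothetical stand-in for parsed command-line arguments
args = SimpleNamespace(targ_path="estimators/targ", jl_path="estimators/jl")

# Same path functions as in path_dict above
targ_path_fn = lambda x: f"{args.targ_path}/{x[0]}_{x[1]}_targ.pt"
jl_path_fn = lambda x: f"{args.jl_path}/{x[0]}_{x[1]}_jl.pt"

# Assumed key format: (layer index, module name)
key = (0, "q_proj")
print(targ_path_fn(key))  # estimators/targ/0_q_proj_targ.pt
print(jl_path_fn(key))    # estimators/jl/0_q_proj_jl.pt
```

Whatever the actual key format is, the files on disk must follow the `{x[0]}_{x[1]}_targ.pt` and `{x[0]}_{x[1]}_jl.pt` naming patterns for the lookups to succeed.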
`test_gsm8k.py` is an example script that evaluates a DP-LLM model on the GSM8K dataset.

## Latency evaluation for DP-LLM
### Step 1. Install kernels for latency evaluation
```bash
cd dp-llm/gpt-fast/kernels
pip install .
```
### Step 2. Copy configurations
```bash
cd dp-llm/gpt-fast
mkdir config
```
Copy the following files into the `config` directory:
- `llama-3-8b_corr_arr_0.9.pt`: correlation information generated by `save_estimator.py`
- `llama-3-8b_th_arr.pt`: threshold values generated by `save_th.py`
- `llama-3-8b_th_arr_layerbits.pt`: binary array indicating whether to use high or low precision in order to match the target latency. The `.pt` file should contain an array of 0s and 1s, where 0 selects low precision and 1 selects high precision.
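
As a rough sketch of how such a 0/1 array could be written, the helpers below build the flags in plain Python and save them with `torch.save`. The helper names, the number of entries, the flat 1-D shape, and the use of a tensor rather than a list are all assumptions; check what `generate_predef.py` actually loads before relying on this.

```python
def make_layer_bits(num_entries, high_precision_indices):
    """Return a list of 0/1 flags: 1 = high precision, 0 = low precision."""
    high = set(high_precision_indices)
    out_of_range = sorted(i for i in high if not 0 <= i < num_entries)
    if out_of_range:
        raise ValueError(f"indices out of range: {out_of_range}")
    return [1 if i in high else 0 for i in range(num_entries)]


def save_layer_bits(bits, path):
    """Save the flags as a 1-D tensor (assumed format, not confirmed)."""
    import torch  # imported lazily so make_layer_bits works without PyTorch

    torch.save(torch.tensor(bits), path)


if __name__ == "__main__":
    # Hypothetical example: 8 entries, entries 0 and 5 use high precision
    bits = make_layer_bits(8, [0, 5])
    print(bits)  # [1, 0, 0, 0, 0, 1, 0, 0]
```

Calling `save_layer_bits(bits, "config/llama-3-8b_th_arr_layerbits.pt")` would then write the file under the name listed above.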
### Step 3. Run latency evaluation
```bash
python generate_predef.py --compile 2 --num_samples 10 \
        --prompt "hi" --model_name "llama3" --bitwidth 6 \
        --dtype "float16" --quant_algo sqllm \
        --max_new_tokens 100 --n_tb 8 --k 0 0 0 0 --print_result
```