## 🛠️ Installation

To install from source:
```shell
cd VLAC
pip install -e .
```
Running Environment:

|              | Range        | Recommended | Notes                                     |
| ------------ |--------------| ----------- | ----------------------------------------- |
| python       | >=3.9        | 3.10        |                                           |
| cuda         |              | cuda12      | No need to install if using CPU, NPU, MPS |
| torch        | >=2.0        |             |                                           |
| transformers | >=4.51       | 4.51.3      |                                           |
| peft | >=0.15.2       |      |                                           |
| ms-swift |        | 3.3      |                                           |


## 🚀 Quick Start

```python
from evo_vlac import GAC_model
from evo_vlac.utils.video_tool import compress_video
import os
#Consistent with the web interface, the value and citic rewards of video input can be evaluated.


#assign local model path
model_path="set to your local model path"

#assign video path and task description
test_video='evo_vlac/examples/videos/pick-bowl-test.mp4'
ref_video='evo_vlac/examples/videos/pick-bowl-ref.mov'
task_description='Put up the bowl and place it back in the white storage box.'

#init model
Critic=GAC_model(tag='critic')
Critic.init_model(model_path=model_path,model_type='internvl2',device_map=f'cuda:0')
Critic.temperature=0.5
Critic.top_k=1
Critic.set_config()
Critic.set_system_prompt()

# transform video
test_video_compressed = os.path.join(os.path.dirname(test_video),"test.mp4")
_,output_fps=compress_video(test_video, test_video_compressed,fps=5)
reference_video_compressed = None
if ref_video:
    reference_video_compressed = os.path.join(os.path.dirname(ref_video),"ref.mp4")
    compress_video(ref_video, reference_video_compressed,fps=5)


# generate Critic results
result_path,value_list,critic_list,done_list = Critic.web_trajectory_critic(
    task_description=task_description,
    main_video_path=test_video_compressed,
    reference_video_path=reference_video_compressed,#if None means no reference video, only use task_description to indicate the task
    batch_num=5,#batch number
    ref_num=6,#image number used in reference video
    think=False,# whether to CoT
    skip=5,#pair-wise step
    rich=False,#whether to output decimal value
    reverse_eval=False,#whether to reverse the evaluation(for VROC evaluation)
    output_path="results",
    fps=float(output_fps),
    frame_skip=True,#whether to skip frames(if false, each frame while be evaluated, cost more time)
    done_flag=False,#whether to out put done value
    in_context_done=False,#whether use reference video to generate done value
    done_threshold=0.9,#done threshold
    video_output=True#whether to output video
)


print("=" * 100)
print(">>>>>>>>>Critic results<<<<<<<<<<")
print(" ")

print(f"result path: {result_path}")
print(f"task description: {task_description}")
print("=" * 50)

print("value_list:")
print(value_list)
print("=" * 50)

print("critic_list:")
print(critic_list)
print("=" * 50)

print("done_list:")
print(done_list)
print("=" * 100)
```

Due to the large size of the model, the model weight, corresponding test video and image data will be released public once our paper is accepted.

If the GPU memory is insufficient, please reduce the number of "batch_num".

More examples  of 

• pair-wise image inputs critic. Please check [this example](evo_vlac/examples/image_pair-wise_critic_example.py)

• vla action generation. Please check [this example](evo_vlac/examples/vla_example.py)

• data refinement. Please check [this example](evo_vlac/examples/data_filtering_example.py)


For training
```
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 swift sft \
    --model $model_path \
    --model_type internvl2 \
    --dataset $file_name \
    --val_dataset $file_name \
    --num_train_epochs 3 \
    --train_type full \
    --learning_rate 2e-5 \
    --weight_decay 0.1 \
    --adam_beta2 0.95 \
    --warmup_ratio 0.05 \
    --max_grad_norm 1 \
    --eval_step 200 \
    --save_total_limit 5 \
    --save_steps 200 \
    --seed 42 \
    --dataloader_num_workers 4 \
    --truncation_strategy delete \
    --logging_steps 5 \
    --max_length 10240 \
    --lr_scheduler_type cosine  \
    --torch_dtype bfloat16 \
    --output_dir $model_name \
    --system "You are a visual-language assistant designed to interpret spatial and task-related information from images and text. Provide precise, context-aware responses and actionable guidance to assist in achieving task objectives." \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --deepspeed zero3
```

