# GPT4Video

**GPT4Video: A Unified Multimodal Large Language Model for Instruction-Followed Understanding and Safety-Aware Generation**


## Framework
![image-20230924124604776](__assets__/GPT4Video_framework_v2.png)
**Video Encoding stage:** The video encoding module uses a frozen ViT-L/14 model to extract raw video features. The video abstraction module then uses a transformer-based cross-attention layer and two novel learnable tokens to condense that information along dual axes.
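
To make the abstraction step concrete, here is a minimal pure-Python sketch of single-query attention pooling along two axes. It is an illustration under stated assumptions, not the repository's implementation: it assumes the two axes are the temporal (frame) and spatial (patch) axes, omits the key/value projections and multi-head structure of a real cross-attention layer, and the names `attend`, `abstract_video`, `temporal_token`, and `spatial_token` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Scaled dot-product attention with one query: a weighted average of `features`."""
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]


def abstract_video(video, temporal_token, spatial_token):
    """Condense video[t][p] (a D-dim feature for frame t, patch p) along two axes.

    Assumption: the spatial token pools over patches (one vector per frame) and
    the temporal token pools over frames (one vector per patch position).
    """
    num_frames, num_patches = len(video), len(video[0])
    per_frame = [attend(spatial_token, video[t]) for t in range(num_frames)]
    per_patch = [
        attend(temporal_token, [video[t][p] for t in range(num_frames)])
        for p in range(num_patches)
    ]
    return per_frame, per_patch


# Toy input: 2 frames, 3 patches per frame, 2-dim features.
video = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [0.0, 0.0], [1.0, 1.0]],
]
# Zero-valued tokens give uniform attention, i.e. plain averaging.
per_frame, per_patch = abstract_video(video, [0.0, 0.0], [0.0, 0.0])
print(len(per_frame), len(per_patch))  # 2 3
```

In this sketch, T × P patch features are reduced to T + P condensed vectors. In a trained model the tokens would be learned parameters rather than zeros.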

**LLM reasoning:** The core of GPT4Video is powered by a frozen LLaMA model, efficiently fine-tuned via LoRA. The LLM is trained with custom video-centric and safety-aligned data, enabling it to comprehend videos and generate appropriate video prompts (_indicated by underlined text_).
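
For readers unfamiliar with LoRA, the standard formulation shows how a model can stay frozen while still being fine-tuned. The pretrained weight $W_0$ is left untouched and only a low-rank update is trained. The rank $r$ and scaling factor $\alpha$ are generic LoRA hyperparameters; the values used by GPT4Video are not stated here.

$$
h = W_0 x + \frac{\alpha}{r} B A x, \qquad W_0 \in \mathbb{R}^{d \times k},\; B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)
$$

Only $A$ and $B$ receive gradients, so the number of trainable parameters per adapted matrix drops from $dk$ to $r(d + k)$.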

**Video Generation:** The prompts generated by the LLM are then used as text inputs for the models in the Text-to-Video Model Gallery to create videos. We use ZeroScope as our video generation model in this work.

## Training
First, install the requirements.
```shell
pip install -r requestments.txt
```

Then train the model on two GPUs for 10 epochs.
```shell
python train.py --devices 2 --max_epochs 10
```