# mllm_videos

This repository contains scripts and resources for analyzing video and audio with multimodal large language models (MLLMs), with a focus on aligning model representations with brain responses to multimodal stimuli.

## Table of Contents
- [Installation](#installation)
- [Project Structure](#project-structure)
- [Dataset Organization](#dataset-organization)
- [Usage](#usage)
- [Available Models](#model-information-and-hugging-face-links)
- [Results and Visualization](#results-and-visualization)

## Installation
1. Install dependencies:
```bash
pip install -r requirements.txt
```

2. Set up environment variables (optional):
```bash
export HUGGINGFACE_CACHE_DIR="./"
export HF_HOME="./"
export HF_DATASETS_CACHE="./"
export TRANSFORMERS_CACHE="./"
```
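
The same variables can also be set from Python, provided this happens before `transformers` or `datasets` is imported, because the Hugging Face libraries read their cache locations at import time. The following is a minimal sketch, not code from this repository: the helper name `configure_hf_cache` and the `./hf_cache` directory in the usage comment are assumptions for illustration.

```python
import os
from pathlib import Path

# Variable names copied from the shell snippet above.
CACHE_VARS = (
    "HUGGINGFACE_CACHE_DIR",
    "HF_HOME",
    "HF_DATASETS_CACHE",
    "TRANSFORMERS_CACHE",
)


def configure_hf_cache(cache_dir):
    """Create cache_dir if needed and point every cache variable at it.

    Hypothetical helper: call it before importing transformers or datasets.
    """
    path = Path(cache_dir).expanduser().resolve()
    path.mkdir(parents=True, exist_ok=True)
    for var in CACHE_VARS:
        os.environ[var] = str(path)
    return path


# Example (the directory name is an assumption):
# configure_hf_cache("./hf_cache")
# import transformers  # import only after the variables are set
```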

## Project Structure
```
mllm_videos/
├── brain_predictions/        # All Python scripts
│   ├── extract_embeddings/   # Scripts for embedding extraction
│   └── align_embeddings/     # Scripts for brain alignment
├── dataset/                  # Dataset organization
│   ├── wolf_videos/          # Wolf movie segments
│   ├── bourne_videos/        # Bourne movie segments
│   └── life_videos/          # Life documentary segments
└── results/                  # Generated results and visualizations
```

## Dataset Organization

The project uses four types of video datasets:
- **Wolf Videos**: 17 folders containing segments from wolf-related content
- **Bourne Videos**: 10 folders from the Bourne movie series
- **Life Videos**: 5 folders from nature documentaries
- **Hidden Videos**: 12 folders containing figure videos

Each video segment is processed by multiple MLLMs to generate embeddings and textual descriptions.
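
To check that the segment folders are in place before running any model, a short script can enumerate them. This is a sketch under assumptions: it uses only the three folder names shown in the project structure (no folder name is given there for the hidden videos), it assumes each dataset directory holds one subfolder per segment group, and `list_segment_folders` is a hypothetical helper, not a function in this repository.

```python
from pathlib import Path

# Folder names as listed in the project structure.
DATASET_FOLDERS = ("wolf_videos", "bourne_videos", "life_videos")


def list_segment_folders(dataset_root):
    """Map each dataset folder name to its sorted list of subfolders.

    Datasets that are missing on disk map to an empty list.
    """
    root = Path(dataset_root)
    folders = {}
    for name in DATASET_FOLDERS:
        dataset_dir = root / name
        if dataset_dir.is_dir():
            folders[name] = sorted(p for p in dataset_dir.iterdir() if p.is_dir())
        else:
            folders[name] = []
    return folders


if __name__ == "__main__":
    for name, subfolders in list_segment_folders("dataset").items():
        print(f"{name}: {len(subfolders)} folders")
```

Comparing the printed counts against the numbers listed above (17, 10, and 5) is a quick way to spot an incomplete download.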

## Usage
1. Go to `brain_predictions/extract_embeddings` for the video and audio model scripts that extract embeddings and generate sentences.
2. Go to `brain_predictions/align_embeddings` for the voxelwise encoding model scripts that perform brain encoding.
3. Run the scripts you need for your analysis or processing task.

## Requirements
- Python 3.x
- See individual scripts for specific dependencies.

## Description
This project includes tools for:
- Running instruction-tuned video and audio MLLMs and aligning their representations with brain responses to multimodal stimuli
- Model evaluation and analysis
- Data extraction and summarization

For more details, refer to comments within each script or contact the repository maintainer.

## How to Extract Embeddings from Instruction-Tuned Video and Audio MLLMs and Generate Output Tokens

All Python scripts are in the `brain_predictions/extract_embeddings` folder. You can run any script from the command line using:

```bash
python brain_predictions/extract_embeddings/<script_name>.py [arguments]
```

Below are examples for running key scripts:

### Batch Processing Wolf Videos
```bash
python brain_predictions/extract_embeddings/process_wolf_videos.py --video_base_dir <input_folder> --output_base_dir <output_folder> [other options]
```

### Video Processing with VideoChat-R1 (Qwen2.5-VL-based)
```bash
python brain_predictions/extract_embeddings/videochat_r1.py -v <video_dir> -b <batch_size> -d <output_dir> -p <prompt_number>
```

### Video Processing with LLaVA-Next-Video
```bash
python brain_predictions/extract_embeddings/llava_next_video.py -v <video_dir> -b <batch_size> -d <output_dir> -p <prompt_number>
```

### Video Processing with InstructBLIP
```bash
python brain_predictions/extract_embeddings/instruct_blip_video.py -v <video_dir> -b <batch_size> -d <output_dir> -p <prompt_number>
```

### Video Processing with VideoLLaMA3
```bash
python brain_predictions/extract_embeddings/video-llama.py -v <video_dir> -b <batch_size> -d <output_dir> -p <prompt_number>
```

### Audio Processing with Qwen2-Audio
```bash
python brain_predictions/extract_embeddings/qwen2_vl_audio.py -v <audio_dir> -b <batch_size> -d <output_dir> -p <prompt_number>
```

### Model Alignment
```bash
python brain_predictions/align_embeddings/align_model_videos.py -s <subject> -d <base_dir> -m <model_id> -p <prompt_number> -v <videos_folder>
```
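
This README does not describe the regression used inside `align_model_videos.py`. A common choice for voxelwise encoding is ridge regression from stimulus embeddings to voxel responses, scored by the Pearson correlation between predicted and measured responses on held-out data. The sketch below illustrates that general idea on synthetic data only; the function names, the `alpha` value, and the array shapes are assumptions and may differ from what the script actually does.

```python
import numpy as np


def fit_ridge(X_train, Y_train, alpha=1.0):
    """Closed-form ridge weights mapping features (n, d) to voxels (n, v)."""
    n_features = X_train.shape[1]
    gram = X_train.T @ X_train + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X_train.T @ Y_train)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between measured and predicted responses, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Constant voxels get a score of 0 instead of a division-by-zero warning.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, np.inf, denom)


# Synthetic stand-ins for model embeddings (X) and fMRI responses (Y).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 16))
Y = X @ rng.standard_normal((16, 5)) + 0.1 * rng.standard_normal((200, 5))

weights = fit_ridge(X[:150], Y[:150], alpha=1.0)   # fit on the first 150 samples
scores = voxelwise_pearson(Y[150:], X[150:] @ weights)  # score on the held-out 50
print(scores.shape)  # (5,)
```

In practice the regularization strength is usually chosen per voxel by cross-validation rather than fixed in advance.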

### Summarize Results
```bash
python brain_predictions/align_embeddings/summarize_results.py -d <base_dir> -s <subject> -m <model_name>
```

### Generate All Sentences (batch run)
```bash
python brain_predictions/align_embeddings/generate_all_sentences.py
```

---

You can also create a Bash script to run multiple commands. Example:

```bash
#!/bin/bash
python brain_predictions/extract_embeddings/process_wolf_videos.py --video_base_dir ...
python brain_predictions/extract_embeddings/videochat_r1.py -v ...
# Add more commands as needed
```

Replace the placeholders in angle brackets (`<...>`) with your own paths or values. For more details, see the comments in each script.
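
A batch run can also be driven from Python, which makes it easier to loop over models or prompts. The sketch below is an illustration rather than part of the repository: it assumes the extraction scripts accept the `-v`, `-b`, `-d`, and `-p` flags shown in the examples above, and the directory names, batch size, and prompt number are made-up values.

```python
import subprocess
import sys

SCRIPT_DIR = "brain_predictions/extract_embeddings"


def build_command(script, video_dir, batch_size, output_dir, prompt_number):
    """Build the argument list for one extraction script."""
    return [
        sys.executable, f"{SCRIPT_DIR}/{script}",
        "-v", str(video_dir),
        "-b", str(batch_size),
        "-d", str(output_dir),
        "-p", str(prompt_number),
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the batch at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    scripts = ["videochat_r1.py", "llava_next_video.py"]
    # All argument values below are hypothetical placeholders.
    commands = [
        build_command(s, "dataset/wolf_videos", 4, "results", 1) for s in scripts
    ]
    run_all(commands, dry_run=True)
```

Keeping `dry_run=True` prints the commands without launching anything, which is useful for checking paths before starting a long run.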

## Model Information and Hugging Face Links

The table below lists the main video, audio, and multimodal models referenced in the codebase, with their Hugging Face model names and links:

| Model Name (Code)                | Hugging Face Model Name                        | Link                                                                 |
|----------------------------------|-----------------------------------------------|----------------------------------------------------------------------|
| Qwen2.5-VL-7B-Instruct           | Qwen/Qwen2.5-VL-7B-Instruct                   | https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct                   |
| Qwen2-Audio-7B-Instruct          | Qwen/Qwen2-Audio-7B-Instruct                  | https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct                  |
| VideoChat-R1_7B                  | OpenGVLab/VideoChat-R1_7B                     | https://huggingface.co/OpenGVLab/VideoChat-R1_7B                     |
| LLaVA-NeXT-Video-7B-hf           | llava-hf/LLaVA-NeXT-Video-7B-hf               | https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf               |
| InstructBLIP-Vicuna-7B           | Salesforce/instructblip-vicuna-7b             | https://huggingface.co/Salesforce/instructblip-vicuna-7b             |
| VideoLLaMA3-2B                   | DAMO-NLP-SG/VideoLLaMA3-2B                    | https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-2B                    |
| VideoLLaMA3-7B                   | DAMO-NLP-SG/VideoLLaMA3-7B                    | https://huggingface.co/DAMO-NLP-SG/VideoLLaMA3-7B                    |
| LLaVA-OneVision                  | llava-hf/llava-onevision-qwen2-7b-ov-hf       | https://huggingface.co/llava-hf/llava-onevision-qwen2-7b-ov-hf       |

For more details on each model, see the Hugging Face model card linked above or the comments in the relevant script.

## Results and Visualization

The project generates several types of results:
- Brain alignment visualizations (`*.pdf` files)
- Model performance metrics (`*.npy` files)
- Generated text descriptions (stored in pickle files)
- Comparative analysis between different models and brain regions
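
The metric arrays and generated descriptions listed above can be read back with NumPy and `pickle`. The sketch below is illustrative only: the `.pkl` extension, the flat-or-nested layout under `results/`, and the helper name `load_results` are assumptions, since the actual file names are set by the individual scripts.

```python
import pickle
from pathlib import Path

import numpy as np


def load_results(results_dir):
    """Collect .npy arrays and pickled objects found under results_dir.

    Returns two dicts keyed by file stem. Only unpickle files you trust.
    """
    root = Path(results_dir)
    metrics = {p.stem: np.load(p) for p in sorted(root.rglob("*.npy"))}
    texts = {}
    for p in sorted(root.rglob("*.pkl")):  # assumed extension
        with p.open("rb") as f:
            texts[p.stem] = pickle.load(f)
    return metrics, texts


if __name__ == "__main__":
    metrics, texts = load_results("results")
    for name, array in metrics.items():
        print(name, array.shape)
```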

Key visualization files include:
- Bar plots of normalized brain alignment for different regions
- Language-narrative linking diagrams
- Model performance comparisons
