# Detached-Skiplink & R-Probe

This repository contains the implementation of **Detached Skiplink** and **R-Probe** for Multimodal Large Language Models (MLLMs), built upon the InternVL architecture.

## Overview

This project explores two main architectural improvements for MLLMs:
1.  **Detached Skiplink**: Enhances the visual representation by fusing features from intermediate vision-encoder layers with the final embedding. The skip path can be detached from the gradient computation, which makes it possible to study information flow and feature reusability without affecting backbone updates.
2.  **R-Probe**: An OCR reconstruction probe that adds an auxiliary reconstruction task, used to analyze and improve the model's understanding of fine-grained visual text.

## Repository Structure

The repository is organized as follows:

*   **`internvl/`**: Core source code for models and training.
    *   **`model/`**: Model definitions and architectural modifications.
        *   **Reference Implementation**: The core logic for Detached Skiplink and R-Probe is implemented in:
            *   `internvl/model/detached_skiplink/modeling_llama_pe_chat.py`
            *   `internvl/model/r_probe/modeling_llama_pe_chat.py`
        *   **`ablation/`**: Contains backbone implementations (AIMv2, SigLIP2, InternViT).
    *   **`train/`**: Training infrastructure.
        *   `trainer.py`: Contains logic for processing and logging gradient statistics.

*   **`bin/`**: Execution scripts and demonstrations.
    *   `r-probe/`: Scripts for R-Probe experiments.
    *   `skiplink-detach/`: Scripts for Detached Skiplink.

*   **`eval/`**: Evaluation and benchmarking.
    *   Currently provides inference code.
    *   For benchmarks, we recommend using VLMEvalKit.

*   **`assets/`**: Visualizations and results demonstrating R-Probe effectiveness.

*   **`config/`**: Configuration files.

## Key Features

### Detached Skiplink
The Detached Skiplink architecture gives the MLLM access to intermediate features from the vision encoder. Fusing these multi-level features with the final embedding exposes low-level visual details that are often lost in the final semantic embedding.
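
As a rough illustration of the idea, the sketch below fuses intermediate features with the final embedding and optionally detaches the skip path. It is not the code in `modeling_llama_pe_chat.py`: the class name `DetachedSkipFusion`, the concatenate-then-project fusion, and all tensor shapes are assumptions made for this example.

```python
import torch
import torch.nn as nn


class DetachedSkipFusion(nn.Module):
    """Hypothetical fusion module (illustration only, not the repository's code)."""

    def __init__(self, hidden_dim: int, num_skips: int, detach: bool = True):
        super().__init__()
        self.detach = detach
        # Assumed fusion: concatenate along the channel axis, then project back.
        self.proj = nn.Linear(hidden_dim * (num_skips + 1), hidden_dim)

    def forward(self, intermediates, final):
        # intermediates: list of (batch, tokens, hidden_dim) tensors
        # final:         (batch, tokens, hidden_dim) tensor
        if self.detach:
            # detach() cuts the skip path out of the autograd graph, so no
            # gradient flows back through the intermediate features.
            intermediates = [h.detach() for h in intermediates]
        fused = torch.cat([*intermediates, final], dim=-1)
        return self.proj(fused)


# Toy check: leaf tensors stand in for encoder activations.
inter = [torch.randn(2, 4, 8, requires_grad=True) for _ in range(2)]
final = torch.randn(2, 4, 8, requires_grad=True)

fusion = DetachedSkipFusion(hidden_dim=8, num_skips=2, detach=True)
fusion(inter, final).sum().backward()

print(final.grad is not None)             # final path still receives gradients
print(all(h.grad is None for h in inter))  # skip path receives none
```

With `detach=False` the same call populates `.grad` on the intermediate tensors as well, which is the comparison the detached and non-detached variants are meant to expose.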

### R-Probe (Reconstruction Probe)
R-Probe is an auxiliary head that reconstructs image patches, focusing on local regions such as text, from the model's hidden states. It serves as both a diagnostic probe and a regularizer that encourages the vision encoder and projector to retain sufficient fine-grained information.
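
A minimal sketch of such an auxiliary head is shown below, assuming a single linear layer that maps each hidden state to a flattened pixel patch and a mean-squared-error objective. The class name `ReconstructionProbe`, the optional `region_mask` for restricting the loss to text regions, and the choice of MSE are assumptions for illustration; the actual loss lives in the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReconstructionProbe(nn.Module):
    """Hypothetical reconstruction head (illustration only)."""

    def __init__(self, hidden_dim: int, patch_size: int, channels: int = 3):
        super().__init__()
        # One flattened pixel patch is predicted per visual token.
        self.head = nn.Linear(hidden_dim, channels * patch_size * patch_size)

    def forward(self, hidden_states, target_patches, region_mask=None):
        # hidden_states:  (batch, tokens, hidden_dim)
        # target_patches: (batch, tokens, channels * patch_size ** 2)
        # region_mask:    optional (batch, tokens), 1 where the loss applies
        pred = self.head(hidden_states)
        per_token = F.mse_loss(pred, target_patches, reduction="none").mean(dim=-1)
        if region_mask is None:
            return per_token.mean()
        mask = region_mask.to(per_token.dtype)
        # Average over selected tokens only; clamp avoids division by zero.
        return (per_token * mask).sum() / mask.sum().clamp(min=1.0)


probe = ReconstructionProbe(hidden_dim=16, patch_size=2)
hidden = torch.randn(2, 5, 16, requires_grad=True)
target = torch.randn(2, 5, 3 * 2 * 2)
text_mask = torch.zeros(2, 5)
text_mask[:, :2] = 1.0  # pretend the first two tokens cover text

recon_loss = probe(hidden, target, text_mask)
recon_loss.backward()  # gradients reach the hidden states
```

In training, a loss like this would typically be added to the language-modeling loss with a small weight; that weighting is likewise an assumption here rather than a documented setting.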

## Usage

### Training & Ablation
Sample scripts are provided in the `bin/` directory.

To run Skiplink experiments:
```bash
sh bin/skiplink-detach/S1_baseline.sh
```

To run R-Probe profiling:
```bash
sh bin/r-probe/grad_profile.sh
```

## Implementation Details

*   **Core Logic**: See the `extract_feature` and `forward` methods in the two `modeling_llama_pe_chat.py` files listed under Repository Structure for the exact implementation of the feature fusion and the reconstruction loss.
*   **Gradient Statistics**: See `internvl/train/trainer.py` for how gradient statistics are collected, processed, and logged during training.
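
For readers unfamiliar with gradient profiling, the sketch below shows one common way to gather per-parameter gradient statistics after a backward pass. The helper `collect_grad_stats` and the statistics it reports are hypothetical and are not taken from `trainer.py`.

```python
import torch
import torch.nn as nn


def collect_grad_stats(model: nn.Module) -> dict:
    """Hypothetical helper: summarize gradients of every parameter that has one."""
    stats = {}
    for name, param in model.named_parameters():
        if param.grad is None:
            # Frozen or unused parameters have no gradient to report.
            continue
        grad = param.grad.detach().float()
        stats[name] = {
            "norm": grad.norm().item(),          # L2 norm of the gradient
            "mean_abs": grad.abs().mean().item(),
            "max_abs": grad.abs().max().item(),
        }
    return stats


# Toy model with one frozen weight to show that it is skipped.
model = nn.Sequential(nn.Linear(4, 3), nn.Linear(3, 1))
model[0].weight.requires_grad_(False)

model(torch.randn(5, 4)).sum().backward()
for name, values in collect_grad_stats(model).items():
    print(name, values)
```

Calling such a helper right after `loss.backward()` and before the optimizer step keeps the numbers tied to a single batch.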


