Note: This repository contains the code for the paper:

**Seeing Realism from Simulation: Efficient Video Transfer for Vision-Language-Action Data Augmentation**

# Project Setup and Usage for coreselection

## Environment

```bash
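# Requires Python 3.10. If the environment does not exist yet, create it first:
#   conda create -n Corset_Sampling python=3.10
conda activate Corset_Sampling
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
pip install packaging==24.0
pip install ninja
# Verify that ninja works; the exit code printed by `echo $?` should be 0
ninja --version; echo $?
pip install flash-attn==2.7.2.post1 --no-build-isolation
pip install -r requirements.txt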
```

This guide outlines the three-stage pipeline: generating video embeddings, computing task difficulty, and performing coreset pruning.

## 1. Cosmos Video Embedding

To generate video embeddings using the Cosmos model, follow these steps:

### 1.1 Download Weights

Download the `nvidia/Cosmos-Embed1-448p` model weights and place them into the `cosmos_embedding/model_weight/` directory.
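One possible way to fetch the weights is through the `huggingface_hub` Python package. The sketch below is not part of this repository: the helper name `download_cosmos_embed` is hypothetical, and it assumes that `gen_embedding.py` expects the checkpoint files directly inside `cosmos_embedding/model_weight/`.

```python
from pathlib import Path

# Repo id and target directory are taken from the step above.
REPO_ID = "nvidia/Cosmos-Embed1-448p"
TARGET_DIR = Path("cosmos_embedding/model_weight")


def download_cosmos_embed(target_dir: Path = TARGET_DIR) -> Path:
    """Download the checkpoint files into target_dir and return that path.

    ASSUMPTION: the files should sit directly in target_dir (no sub-folder).
    """
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    target_dir.mkdir(parents=True, exist_ok=True)
    snapshot_download(repo_id=REPO_ID, local_dir=str(target_dir))
    return target_dir
```

Calling `download_cosmos_embed()` from the repository root would then populate the directory. The model may require accepting its license on Hugging Face and logging in first.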

### 1.2 Run Inference

Run the inference script `cosmos_embedding/gen_embedding.py`; refer to the argument definitions in the script for the available options.

**Directory Structure:**

```text
cosmos_embedding/
├── model_weight/       # Place nvidia/Cosmos-Embed1-448p files here
└── gen_embedding.py
```

---

## 2. Difficulty Computation (RDT)

This step computes the difficulty score (loss) for the dataset using the RDT model.

### 2.1 Preparation

1.  **Download Weights:**
    Download the following models and place them in `difficuty_compute/RDT/weights/`:
    *   `google/t5-v1_1-xxl`
    *   `google/siglip-so400m-patch14-384`
    *   `rdt-1b` (Robotics Diffusion Transformer)

2.  **Configuration:**
    Ensure that the `model_config` and `training_data` directories located in `difficuty_compute/RDT/` are set up strictly following the requirements of the **RoboTwin 2.0** repository.

**Directory Structure:**

```text
difficuty_compute/
└── RDT/
    ├── weights/
    │   ├── t5-v1_1-xxl/
    │   ├── siglip-so400m-patch14-384/
    │   └── rdt-1b/
    ├── model_config/   # Must follow RoboTwin 2.0 specs
    └── training_data/  # Must follow RoboTwin 2.0 specs
```
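Before launching the evaluation, it can help to confirm that the layout above is in place. The following is a minimal sketch, not a script shipped with this repository; the helper name `missing_dirs` is hypothetical, and it only checks that the directories exist, not that their contents follow the RoboTwin 2.0 specification.

```python
from pathlib import Path

# Sub-directories from the structure above, relative to difficuty_compute/RDT.
REQUIRED_DIRS = [
    "weights/t5-v1_1-xxl",
    "weights/siglip-so400m-patch14-384",
    "weights/rdt-1b",
    "model_config",
    "training_data",
]


def missing_dirs(rdt_root):
    """Return the required sub-directories that do not exist under rdt_root."""
    root = Path(rdt_root)
    return [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
```

For example, `missing_dirs("difficuty_compute/RDT")` returns an empty list when every expected directory is present.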

### 2.2 Execution

Navigate to the RDT directory and run the evaluation script:

```bash
cd difficuty_compute/RDT
bash run_loss_eval_all_tasks.sh
```

**Output:** This will generate the file `difficuty_compute/RDT/rdt_1b_finetune_loss.json`.

---

## 3. Coreset Pruning (D2Pruning)

The final step merges the embeddings and difficulty scores to sample the coreset.

### 3.1 Data Merging

Before running the sampling script, merge the outputs of Step 1 and Step 2 into a single input file:

*   **Source 1:** `cosmos_embedding/video_embeddings.json`
*   **Source 2:** `difficuty_compute/RDT/rdt_1b_finetune_loss.json`
*   **Target:** `d2pruning/core/data/merged_dataset_input.json`
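The merge can be sketched as a join on a shared sample key. The file schemas are not documented here, so everything about the layout below is an assumption: both JSON files are taken to be dictionaries keyed by the same sample identifier, and the output field names (`id`, `embedding`, `difficulty`) are hypothetical. Check what `coreset_sampling.py` actually reads before relying on it.

```python
import json
from pathlib import Path


def merge_records(embeddings: dict, losses: dict) -> list:
    """Join embeddings and losses on their shared keys.

    ASSUMPTION: both inputs map a sample key to a value
    (embedding vector / scalar loss). The real files may differ.
    """
    shared = sorted(set(embeddings) & set(losses))
    return [
        {"id": key, "embedding": embeddings[key], "difficulty": losses[key]}
        for key in shared
    ]


def merge_files(embedding_path, loss_path, output_path) -> int:
    """Read both JSON files, write the merged list, and return its length."""
    embeddings = json.loads(Path(embedding_path).read_text())
    losses = json.loads(Path(loss_path).read_text())
    merged = merge_records(embeddings, losses)
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(merged))
    return len(merged)
```

Samples that appear in only one of the two sources are dropped, since the sampler needs both an embedding and a difficulty score for each entry.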

### 3.2 Run Sampling

Navigate to the `d2pruning` directory and execute the sampling script:

```bash
cd d2pruning

python ./core/data/coreset_sampling.py \
  --input_path "./core/data/merged_dataset_input.json" \
  --output_path "./core/data/coreset.json" \
  --subset_size 2000 \
  --stratas 50 \
  --budget_mode confidence
```

**Arguments:**

*   `--subset_size`: The number of samples to select (e.g., 2000).
*   `--stratas`: Number of difficulty intervals for stratified sampling.
*   `--budget_mode`: Allocation strategy (`confidence` assigns more budget to harder samples).
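To illustrate how `--subset_size` and `--stratas` interact, here is a simplified sketch of stratified sampling over difficulty scores. It is not the code in `coreset_sampling.py` and does not reproduce the `confidence` budget mode: it splits the budget evenly across non-empty strata and rolls unused budget over to larger ones. The function name and signature are hypothetical.

```python
import random


def stratified_sample(scores, subset_size, n_strata, seed=0):
    """Pick subset_size indices spread across equal-width score intervals."""
    rng = random.Random(seed)
    lo, hi = min(scores), max(scores)
    width = (hi - lo) / n_strata or 1.0  # avoid zero width when all scores match
    strata = [[] for _ in range(n_strata)]
    for idx, s in enumerate(scores):
        # The maximum score would land in bin n_strata, so clamp it.
        b = min(int((s - lo) / width), n_strata - 1)
        strata[b].append(idx)

    # Visit the smallest strata first so unused budget rolls over to larger ones.
    non_empty = sorted((s for s in strata if s), key=len)
    chosen, remaining = [], min(subset_size, len(scores))
    for i, members in enumerate(non_empty):
        share = remaining // (len(non_empty) - i)
        take = min(share, len(members))
        chosen.extend(rng.sample(members, take))
        remaining -= take
    return sorted(chosen)
```

With 100 uniformly spread scores, `subset_size=20`, and `n_strata=5`, each interval contributes 4 samples. A mode that favors harder samples would instead give high-loss strata a larger share.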

# Project Setup and Usage for efficient transfer

## Environment Setup

First, follow `setup.md` in the `docs` directory of the NVIDIA Cosmos repository (we use Cosmos Transfer 2.5, version 1.3.0, as the base DiT).

### Cosmos System Requirements

* NVIDIA GPUs with Ampere architecture (RTX 30 Series, A100) or newer
* NVIDIA driver >=570.124.06 compatible with CUDA 12.8.1
* Linux x86-64
* glibc>=2.31 (e.g., Ubuntu >=22.04)
* Python 3.10

### Cosmos Installation

Clone the repository:

```bash
git clone git@github.com:nvidia-cosmos/cosmos-transfer2.5.git
cd cosmos-transfer2.5
```

Install system dependencies:

[uv](https://docs.astral.sh/uv/getting-started/installation/)

```shell
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

Install the package into a new environment:

```shell
uv sync
source .venv/bin/activate
```

Or, install the package into the active environment (e.g. conda):

```shell
uv sync --active --inexact
```

### Downloading Cosmos Checkpoints

1. Get a Hugging Face Access Token with `Read` permission.
2. Install the Hugging Face CLI: `uv tool install -U "huggingface_hub[cli]"`
3. Log in: `hf auth login`

## Preprocess

We provide example code in `affixilary_real` and `affixilary_robotwin`. Set the directory paths in the scripts before processing:

- First: use `sample.py` to collect and list the videos.

- Second: use `reset_video.py` to reset the videos' pixel resolution.

## Run

Use `depth_video_loop.sh` in `shell_script` to run the full process. We found that the following runs correctly on an A800 80GB GPU.

Parameters:

- `--begin`: ID of the video at which processing begins
- `--end`: ID of the video at which processing ends
- `--gpu_a`: GPU to use
- `--master_port`: port to listen on
- `--out_dir`: output directory

For example:

```bash
bash ./shell_script/depth_video_loop.sh --begin 74 --end 90 --gpu_a 7 --master_port 12305 --out_dir out_sampled
```
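When several GPUs are available, the video ID range can be split into one command per GPU, each with its own port. The sketch below only builds the command strings from the flags shown above. The helper name `plan_commands` is hypothetical, and it assumes that `--end` is inclusive, which should be confirmed against `depth_video_loop.sh`.

```python
def plan_commands(begin, end, gpus, base_port=12305, out_dir="out_sampled"):
    """Split [begin, end] into contiguous chunks, one command per GPU.

    ASSUMPTION: --end is treated as inclusive; check depth_video_loop.sh.
    """
    total = end - begin + 1
    chunk, extra = divmod(total, len(gpus))
    commands, start = [], begin
    for i, gpu in enumerate(gpus):
        # The first `extra` GPUs take one additional video each.
        size = chunk + (1 if i < extra else 0)
        if size == 0:
            continue  # more GPUs than videos
        stop = start + size - 1
        commands.append(
            f"bash ./shell_script/depth_video_loop.sh --begin {start} --end {stop} "
            f"--gpu_a {gpu} --master_port {base_port + i} --out_dir {out_dir}"
        )
        start = stop + 1
    return commands
```

For instance, `plan_commands(74, 90, [6, 7])` yields one command covering videos 74 to 82 on GPU 6 and another covering 83 to 90 on GPU 7, using ports 12305 and 12306.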

When running large batch jobs, we suggest recording the processing logs in an xlsx spreadsheet. Additionally, when reconstructing LeRobot datasets from the augmented videos, the HDF5 or Parquet files need to be rebuilt accordingly.
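As a lightweight way to keep such a processing log, the sketch below appends one row per processed video to a CSV file, which spreadsheet tools can open. It writes CSV rather than xlsx to avoid an extra dependency. The column names and the helper `append_log` are hypothetical and not part of this repository.

```python
import csv
from datetime import datetime
from pathlib import Path

# Hypothetical column layout for the processing log.
FIELDS = ["timestamp", "video_id", "gpu", "status", "note"]


def append_log(log_path, video_id, gpu, status, note=""):
    """Append one row to the CSV log, writing the header on first use."""
    path = Path(log_path)
    is_new = not path.exists()
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "video_id": video_id,
            "gpu": gpu,
            "status": status,
            "note": note,
        })
```

A call such as `append_log("processing_log.csv", 74, 7, "done")` after each video gives a record of which IDs finished and which need to be rerun.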

## Examples

We provide two generated examples in the `examples_generated` directory.



# Evaluation Notes

1. RoboTwin 2.0 has strict environment requirements. We suggest using an NVIDIA RTX 4090 with more than 24GB of memory. The platform should support Vulkan along with SAPIEN. You can find help on RoboTwin's official website.
2. When evaluating on LIBERO-Plus and LIBERO, we suggest using at least two A800 80GB GPUs to meet the memory requirements.



We also provide records from real-world experiments in the `real_experiment_records` folder.
