
**Paper Title:** Polysemous Language Gaussian Splatting via Matching-based Mask Lifting  
**Method Name:** MUSplat  
**Submission ID:** 2414

This repository provides the official implementation for **MUSplat**, a training-free paradigm for open-vocabulary understanding of 3D scenes represented by 3D Gaussian Splatting (3DGS).

Our pipeline takes a pre-trained 3DGS model and its corresponding image sequence as input. It operates in three main stages:

1.  **Object-level Grouping**: Lifts multi-view 2D masks to the 3D scene by estimating a foreground probability for each Gaussian point.
2.  **Neutral Point Processing**: Refines object boundaries by identifying and filtering semantically ambiguous points using a combination of cross-view semantic entropy and geometric opacity.
3.  **Instance Feature Extraction**: Leverages a Vision-Language Model (VLM) to generate robust textual features for each object group, enabling precise open-vocabulary querying.
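
As a rough illustration of the idea behind stage 2, the sketch below flags a Gaussian as "neutral" when the object labels it receives across views disagree (high entropy) and its opacity is low. This is not the repository's implementation: the function names, the thresholds, and the rule that combines entropy with opacity are assumptions made for illustration only.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the labels one Gaussian received across views."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_neutral(labels, opacity, entropy_thresh=0.9, opacity_thresh=0.1):
    """Hypothetical rule: ambiguous across views AND nearly transparent."""
    return label_entropy(labels) > entropy_thresh and opacity < opacity_thresh


# Toy data: labels observed for each Gaussian in four views, plus its opacity.
gaussians = {
    0: (["cup", "cup", "cup", "cup"], 0.95),      # consistent and opaque
    1: (["cup", "table", "cup", "table"], 0.05),  # ambiguous and faint
}
kept = [i for i, (labels, alpha) in gaussians.items() if not is_neutral(labels, alpha)]
```

With these toy values only Gaussian `0` is kept. Other ways of combining the two signals, such as a weighted score, would fit the description equally well.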

### Highlights ✨

  - **Training-Free & Plug-and-Play**: Operates directly on any pre-trained 3DGS scene without costly retraining or feature optimization.
  - **Polysemous Representation**: Capable of associating a single 3D Gaussian with multiple semantic concepts.
  - **High Precision**: A novel Neutral Point Processing module resolves boundary ambiguities, leading to sharper and more accurate segmentation.
  - **Open-Vocabulary Ready**: VLM-based feature extraction provides robust, view-invariant textual features for flexible language-based queries.

-----

## Environment Setup

We recommend creating a fresh conda environment using the provided specification.

```bash
conda env create -f environment.yml
conda activate musplat
```

If you are using a different environment manager, please ensure the package versions are consistent with `environment.yml`.


-----

## Getting Started: A 3-Step Guide

### Step 1: Data Preparation

Before running the pipeline, ensure you have:

  - A directory containing a pre-trained 3DGS model (`--model_path`).
  - The corresponding dataset directory with the original input images (`--source_path`).

Next, generate multi-view, multi-granularity object masks for the image sequence. The masks must be stored in the following format:

```
{source_path}/masks/
  info.json                # Metadata about the masks
  {object_id}_{image_name}.png
```

  - `info.json`: A JSON file containing metadata. It must include a key `"total_objects": N`, where `N` is the total number of distinct object instances. Example:
    ```json
    {
        "total_objects": 53,
        "images_per_object": 187.06,
        "image_resolution": {
            "width": 985,
            "height": 725
        }
    }
    ```
  - **Mask Filename**: `{object_id}_{image_name}.png`, where `object_id` is a unique identifier for each object instance.
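
Before running the pipeline it can help to check that the mask directory follows this layout. The helper below is a minimal sketch, not part of the repository: `index_masks` is a hypothetical name, and the filename pattern assumes that `object_id` contains no underscore, so the first underscore separates it from `image_name`.

```python
import json
import re
from collections import defaultdict
from pathlib import Path

# Assumption: object_id has no underscore; image_name may contain underscores.
MASK_NAME = re.compile(r"^(?P<object_id>[^_]+)_(?P<image_name>.+)\.png$")


def index_masks(masks_dir):
    """Return (total_objects, {object_id: [image_name, ...]}) for a masks directory."""
    masks_dir = Path(masks_dir)
    info = json.loads((masks_dir / "info.json").read_text())
    total_objects = info["total_objects"]  # the one key the format requires

    per_object = defaultdict(list)
    for path in sorted(masks_dir.glob("*.png")):
        match = MASK_NAME.match(path.name)
        if match is None:
            raise ValueError(f"unexpected mask filename: {path.name}")
        per_object[match["object_id"]].append(match["image_name"])
    return total_objects, dict(per_object)
```

Comparing `total_objects` with `len(per_object)` afterwards shows whether every object listed in `info.json` has at least one mask on disk.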

### Step 2: Object Grouping & Neutral Point Processing

This step links 2D masks to 3D Gaussians and refines the resulting object groups.

```bash
python flexirun.py \
  --model_path <path_to_3dgs_model> \
  --source_path <path_to_dataset> \
  --output_path <path_for_results> \
  --iteration <checkpoint_iteration>
```

  - **Outputs**: The script will generate `.pth` files under `<output_path>`, each containing the set of foreground Gaussian indices for a distinct object.
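
Because the masks are multi-granularity, the per-object index sets can overlap, which is what makes the representation polysemous. The sketch below inverts such sets to list the objects each Gaussian belongs to. It uses plain Python sets as stand-ins for the contents of the `.pth` files; loading them is omitted because the exact file layout is not specified here, and `objects_per_gaussian` is a hypothetical helper name.

```python
from collections import defaultdict


def objects_per_gaussian(object_groups):
    """Invert {object_id: set of Gaussian indices} into {Gaussian index: set of object_ids}."""
    membership = defaultdict(set)
    for object_id, indices in object_groups.items():
        for idx in indices:
            membership[idx].add(object_id)
    return dict(membership)


# Toy groups at two granularities: a whole mug and its handle.
groups = {"mug": {10, 11, 12, 13}, "handle": {12, 13}}
membership = objects_per_gaussian(groups)

# Gaussians that carry more than one semantic concept.
shared = sorted(i for i, objs in membership.items() if len(objs) > 1)
```

In this toy case Gaussians `12` and `13` belong to both `mug` and `handle`.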

### Step 3: Instance Feature Extraction for Open-Vocabulary Querying

Finally, extract robust textual features for each object group to enable language-based interaction. This is a multi-step process:

1.  **Generate Masked Images**: Run `get_masks.py` to select the top-N largest masks for each object and prepare the corresponding images for VLM input.

    ```bash
    python get_masks.py <path_to_dataset>
    ```

2.  **Generate Textual Candidates with a VLM**: Due to double-blind review constraints, we do not provide API keys. Please use your own VLM API access. The following prompt was used in our evaluation:

    > In the images, identify the object that is enclosed by a bright green outline. Provide five distinct and appropriate nouns to describe ONLY that specific object. Return ONLY the five nouns separated by slashes (e.g., car/automobile/vehicle/motorcar/transport). Do not add any other explanatory text, titles, or formatting.

    Organize the output into a text file (`<vlm_output_file>.txt`) with the following format:

    ```
    0 Tea/Glass/Cup/Beverage/Drink
    1 Apple/Fruit/Food/Produce/Snack
    ...
    ```

3.  **Encode Text Features with CLIP**: Convert the generated text into CLIP feature vectors.

    ```bash
    python clip/extract_clip_features.py
    ```

4.  **(Example) Match Queries on the LERF dataset**: Use the generated features to perform open-vocabulary object selection.

    ```bash
    python render_lerf_by_text.py \
      --model_path <path_to_3dgs_model> \
      --stats_counts_path <path_to_grouping_results> \
      --source_path <path_to_dataset> \
      --scene_name <scene_name_e.g.,_figurines> \
      --total_categories <num_objects> \
      --iteration <checkpoint_iteration>
    ```
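
Since the VLM output file is assembled by hand or by a custom script, a quick format check before CLIP encoding can catch malformed lines. The parser below is a minimal sketch and not part of the repository: `parse_vlm_candidates` is a hypothetical name, and it assumes the object ID is separated from the nouns by the first space on each line.

```python
def parse_vlm_candidates(text, expected_nouns=5):
    """Parse lines such as '0 Tea/Glass/Cup/Beverage/Drink' into {object_id: [nouns]}."""
    candidates = {}
    for line_no, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines
        # Assumption: the first space separates the ID from the slash-separated nouns.
        object_id, _, nouns_field = line.partition(" ")
        nouns = [n.strip() for n in nouns_field.split("/") if n.strip()]
        if len(nouns) != expected_nouns:
            raise ValueError(
                f"line {line_no}: expected {expected_nouns} nouns, got {len(nouns)}"
            )
        candidates[object_id] = nouns
    return candidates


sample = "0 Tea/Glass/Cup/Beverage/Drink\n1 Apple/Fruit/Food/Produce/Snack\n"
candidates = parse_vlm_candidates(sample)
```

Object IDs are kept as strings so that they can be compared directly with the `object_id` part of the mask filenames.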

-----

## Visualization & Scene Editing

We provide scripts for visualization and scene editing.

  - **Stylized Rendering**: Render an object with a specific style (e.g., vibrant colors).

    ```bash
    python style_render_per_class.py \
      --source_path <path_to_dataset> \
      --model_path <path_to_3dgs_model> \
      --iteration <iteration> \
      --output_dir <output_dir> \
      --color_style <style_e.g.,_vibrant>
    ```

  - **Scene Editing (Transformation)**: Translate, scale, or remove an object from the scene.

    ```bash
    python scene_editor.py \
      --source_path <path_to_dataset> \
      --model_path <path_to_3dgs_model> \
      --iteration <iteration> \
      --output_dir <edited_scene_dir> \
      --class_id <object_id_to_edit> \
      --translation <X_val> <Y_val> <Z_val> \
      --scale <X_val> <Y_val> <Z_val>
    ```
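
To make the effect of `--translation` and `--scale` concrete, the sketch below applies a per-axis scale and a translation to the centers of one object's Gaussians. It is an illustration under stated assumptions, not the logic of `scene_editor.py`: the function name, the choice of the object centroid as the scaling pivot, and the scale-then-translate order are all assumed. A full 3DGS edit would also need to rescale each Gaussian's own scale parameters, which is not shown.

```python
def transform_points(points, indices, translation=(0.0, 0.0, 0.0), scale=(1.0, 1.0, 1.0)):
    """Scale the selected points about their centroid, then translate them.

    points: list of (x, y, z) tuples; indices: positions of the object's points.
    Returns a new list; unselected points are left unchanged.
    """
    selected = sorted(set(indices))
    if not selected:
        return list(points)
    # Assumed pivot: the centroid of the selected points.
    centroid = [sum(points[i][a] for i in selected) / len(selected) for a in range(3)]
    result = list(points)
    for i in selected:
        result[i] = tuple(
            centroid[a] + (points[i][a] - centroid[a]) * scale[a] + translation[a]
            for a in range(3)
        )
    return result


points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
moved = transform_points(points, [0, 1], translation=(0.0, 1.0, 0.0), scale=(2.0, 1.0, 1.0))
```

Here the first two points are stretched along X about their shared centroid and lifted by one unit along Y, while the third point, which is not part of the object, stays in place.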

