
## Efficient Vision-and-Language Navigation

This repository contains the code for reproducing the results of our paper:

- Harnessing Input-adaptive Inference for Efficient Vision-and-Language Navigation

&nbsp;

----

### TL;DR

We present a novel input-adaptive inference method for efficient vision-and-language navigation.

### Abstract

An emerging paradigm in vision-and-language navigation (VLN) is the use of
history-aware multi-modal transformer models. Given a language instruction, these
models take observation and history as input and predict the most appropriate
action for an agent. While employing these models has significantly improved
performance, the scale of these models can be a bottleneck in practical settings
where computational resources are limited (e.g., in robots). In this work, we present
a novel input-adaptive navigation method for efficient VLN. We first characterize
the overthinking problem in VLN and show that none of the existing input-adaptive
mechanisms successfully reduce overthinking without causing significant
performance degradation. Our method addresses this problem by developing three
adaptive algorithms deployed at different levels: (1) We develop an adaptive
approach that improves spatial efficiency; we only process a subset of panoramic
views at each observation of an agent. (2) We also achieve model-level efficiency
by developing adaptive thresholding for the early-exit method we employ, based
on the importance of each view in navigation. (3) To achieve temporal efficiency,
we design a caching mechanism to avoid processing views that an agent has seen
before. In evaluations with six VLN benchmark tasks, we demonstrate over a 2X reduction in computation with approximately 10% performance degradation.

&nbsp;

----

## Prerequisites

We build our method on the original implementations of HAMT and DUET.


#### HAMT

Please follow the installation instructions provided in the [HAMT repository](https://github.com/cshizhe/VLN-HAMT/tree/c8b9ee12125f9fe36c51d2ab928fde38f7d846bd) to set up the required environment.

Create `panoimages.lmdb` by running the script located at `Efficient_VLN/HAMT/preprocess/build_image_lmdb.py`.

Install `thop` for GFLOPs calculation:

```
    $ pip install thop
```

Then, modify the profile function in `thop/profile.py` as follows:

```
def profile(
    model: nn.Module,
    inputs,
    custom_ops=None,
    verbose=True,
    ret_layer_info=False,
    report_missing=False,
):
    handler_collection = {}
    types_collection = set()
    if custom_ops is None:
        custom_ops = {}
    if report_missing:
        # overwrite `verbose` option when enable report_missing
        verbose = True

    def add_hooks(m: nn.Module):
        m.register_buffer("total_ops", torch.zeros(1, dtype=torch.float64))
        m.register_buffer("total_params", torch.zeros(1, dtype=torch.float64))

        # for p in m.parameters():
        #     m.total_params += torch.DoubleTensor([p.numel()])

        m_type = type(m)

        fn = None
        if m_type in custom_ops:
            # if defined both op maps, use custom_ops to overwrite.
            fn = custom_ops[m_type]
            if m_type not in types_collection and verbose:
                print("[INFO] Customize rule %s() %s." % (fn.__qualname__, m_type))
        elif m_type in register_hooks:
            fn = register_hooks[m_type]
            if m_type not in types_collection and verbose:
                print("[INFO] Register %s() for %s." % (fn.__qualname__, m_type))
        else:
            if m_type not in types_collection and report_missing:
                prRed(
                    "[WARN] Cannot find rule for %s. Treat it as zero Macs and zero Params."
                    % m_type
                )

        if fn is not None:
            handler_collection[m] = (
                m.register_forward_hook(fn),
                m.register_forward_hook(count_parameters),
            )
        types_collection.add(m_type)

    prev_training_status = model.training

    model.eval()
    model.apply(add_hooks)

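    with torch.no_grad():
        # Keep the forward-pass output so the caller can reuse it.
        # A tuple is passed as positional arguments; anything else is
        # expected to be a dict and is passed as keyword arguments.
        if isinstance(inputs, tuple):
            output = model(*inputs)
        else:
            output = model(**inputs)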

    def dfs_count(module: nn.Module, prefix="\t") -> (int, int):
        total_ops, total_params = module.total_ops.item(), 0
        ret_dict = {}
        for n, m in module.named_children():
            # if not hasattr(m, "total_ops") and not hasattr(m, "total_params"):  # and len(list(m.children())) > 0:
            #     m_ops, m_params = dfs_count(m, prefix=prefix + "\t")
            # else:
            #     m_ops, m_params = m.total_ops, m.total_params
            next_dict = {}
            if m in handler_collection and not isinstance(
                m, (nn.Sequential, nn.ModuleList)
            ):
                m_ops, m_params = m.total_ops.item(), m.total_params.item()
            else:
                m_ops, m_params, next_dict = dfs_count(m, prefix=prefix + "\t")
            ret_dict[n] = (m_ops, m_params, next_dict)
            total_ops += m_ops
            total_params += m_params
        # print(prefix, module._get_name(), (total_ops, total_params))
        return total_ops, total_params, ret_dict

    total_ops, total_params, ret_dict = dfs_count(model)

    # reset model to original status
    model.train(prev_training_status)
    for m, (op_handler, params_handler) in handler_collection.items():
        op_handler.remove()
        params_handler.remove()
        m._buffers.pop("total_ops")
        m._buffers.pop("total_params")

    if ret_layer_info:
        return total_ops, total_params, ret_dict
    # Default path: return the model output along with the two counts.
    return output, total_ops, total_params

```
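
The listing above calls the model with `*inputs` when `inputs` is a tuple and with `**inputs` otherwise, and by default returns the model output together with the two counts. The sketch below reproduces only that dispatch on a plain Python function, so it runs without `torch` or `thop`. The names `call_model`, `to_giga`, and `toy_model` (and its argument names) are hypothetical and not part of this repository.

```python
def call_model(model, inputs):
    """Mirror the dispatch in the listing: tuple -> positional, otherwise keyword."""
    if isinstance(inputs, tuple):
        return model(*inputs)
    return model(**inputs)


def to_giga(count):
    """Express a raw operation count in units of 1e9."""
    return count / 1e9


def toy_model(txt_ids, img_feats):
    # Stand-in for a real forward pass (hypothetical argument names).
    return len(txt_ids) + len(img_feats)


# Keyword-style inputs, as a dict.
out_kw = call_model(toy_model, {"txt_ids": [1, 2], "img_feats": [0.1, 0.2, 0.3]})
# Positional-style inputs, as a tuple.
out_pos = call_model(toy_model, ([1, 2], [0.1, 0.2, 0.3]))

print(out_kw, out_pos)
print(to_giga(4.5e9))
```

With the patched function and the default `ret_layer_info=False`, a call would then look like `output, total_ops, total_params = profile(model, inputs)`. A single positional tensor has to be wrapped in a one-element tuple, because any non-tuple value is unpacked as keyword arguments. Whether the reported GFLOPs figure is the raw count divided by 1e9 or a multiple of it is not stated here, so `to_giga` only rescales.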

#### DUET

We followed the [HM3DAutoVLN feature-extraction script](https://github.com/cshizhe/HM3DAutoVLN/blob/main/hm3d_data_gen/step02_extract_view_features.py)
when implementing the Vision Transformer for DUET.
After installing `timm`, replace the `vision_transformer.py` file inside the installed `timm` package
with the file located at `Efficient_VLN/HAMT/preprocess/ViT/vision_transformer.py`.

&nbsp;

----

## Run Our Input-adaptive Inference Method

To run our input-adaptive inference method, execute the following command:

```
    $ sh finetune_src/scripts/run_r2r.sh
```

You also need to specify the following flags:
* `--img_db_file {path to your lmdb}`: Path to your LMDB.
* `--batch_size 1`: Always set the batch size to 1.
* `--cache {False/True}`: Set to True to use cached image features. Set to False to process images through the ViT (our method).
* `--mode {baseline/efficient}`: Set to `efficient` to run our input-adaptive inference.
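
As a quick way to sanity-check a flag combination before launching a run, the sketch below parses the four flags listed above with `argparse`. It is an illustration only: the actual argument parser in this repository may differ, and `str2bool`, `build_parser`, and `check_args` are hypothetical helpers. The explicit boolean converter is there because `bool("False")` evaluates to `True` in Python.

```python
import argparse


def str2bool(value):
    """Parse 'True'/'False' text explicitly; bool('False') would be True."""
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser():
    # Hypothetical parser covering only the flags named in this README.
    parser = argparse.ArgumentParser(description="Flag sanity-check sketch")
    parser.add_argument("--img_db_file", required=True)
    parser.add_argument("--batch_size", type=int, default=1)
    parser.add_argument("--cache", type=str2bool, required=True)
    parser.add_argument("--mode", choices=["baseline", "efficient"], required=True)
    return parser


def check_args(args):
    """Enforce the README's constraint that the batch size is always 1."""
    if args.batch_size != 1:
        raise ValueError("batch_size must be 1")
    return args


args = check_args(build_parser().parse_args([
    "--img_db_file", "/path/to/panoimages.lmdb",  # placeholder path
    "--batch_size", "1",
    "--cache", "False",
    "--mode", "efficient",
]))
print(args)
```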
  
&nbsp;

----

## Run Our Method Under Various Visual Corruptions

First, create a corrupted `panoimages.lmdb` by running the following command:

```
    $ python Efficient_VLN/HAMT/preprocess/visual_corruption/build_image_lmdb_degradation_script.py \
    --output_dir={path} \
    --visual_degradation={choose from: 'lighting', 'motion_blur', 'speckle_noise', 'spatter', 'defocus_blur'}
```

Then, update the LMDB path in your `run_r2r.sh` script to point to the resulting `panoimages_{degradation_type}.lmdb` so that the run uses the corrupted data.
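
To build all five corrupted databases in one go, the command above can be generated per corruption type. The following is a minimal sketch that only assembles and prints the commands. `build_command` and `lmdb_name` are hypothetical helpers, `/path/to/output` is a placeholder, and the file name pattern is taken from the sentence above rather than from the script itself.

```python
import shlex

SCRIPT = "Efficient_VLN/HAMT/preprocess/visual_corruption/build_image_lmdb_degradation_script.py"
CORRUPTIONS = ["lighting", "motion_blur", "speckle_noise", "spatter", "defocus_blur"]


def build_command(output_dir, corruption):
    """Return the argument list for one corruption type."""
    if corruption not in CORRUPTIONS:
        raise ValueError(f"unknown corruption: {corruption!r}")
    return [
        "python",
        SCRIPT,
        f"--output_dir={output_dir}",
        f"--visual_degradation={corruption}",
    ]


def lmdb_name(corruption):
    """File name pattern as described in this README (assumed, not verified)."""
    return f"panoimages_{corruption}.lmdb"


for name in CORRUPTIONS:
    cmd = build_command("/path/to/output", name)  # placeholder output directory
    print(shlex.join(cmd), "->", lmdb_name(name))
```

To actually launch each build, the argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root.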

