# De-fine: Decomposing and Refining Visual Programs with Auto-Feedback

## De-fine Overview

De-fine is a training-free framework that decomposes intricate tasks into executable program blocks by modeling the logic structure of relevant tasks, and that automatically refines the program based on multifaceted feedback from its execution.

<img src="pic/pic1.png"  width="100%">

De-fine is a programming-based framework that decomposes tasks and refines the resulting program in four steps: (1) De-fine first constructs an abstract logical prompt. (2) De-fine generates the program and executes it. (3) During execution, De-fine automatically generates multifaceted feedback for optimizing the program. (4) De-fine keeps the well-optimized code selected on the basis of that feedback and adds it to the codebase for future use.
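The four steps above can be read as a generate, execute, feedback, retry loop. The following is a minimal sketch of that control flow only, not the actual De-fine implementation: `Feedback`, `RefineResult`, `refine_loop`, and the toy callables are hypothetical names introduced for illustration, and the acceptance rule and round limit are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Feedback:
    """Hypothetical container for the three feedback types (not De-fine's real class)."""
    visual: str
    textual: str
    compile: str


@dataclass
class RefineResult:
    program: str
    rounds: int
    accepted: bool


def refine_loop(
    query: str,
    generate: Callable[[str, List[Feedback], Dict[str, str]], str],
    execute: Callable[[str], Feedback],
    accept: Callable[[Feedback], bool],
    codebase: Dict[str, str],
    max_rounds: int = 3,  # assumption: the README does not state a round limit
) -> RefineResult:
    """Generate, execute, collect feedback, and retry until accepted or out of rounds."""
    history: List[Feedback] = []
    program = ""
    for round_idx in range(1, max_rounds + 1):
        # Steps (1)-(2): build the prompt (inside `generate`) and produce a program
        program = generate(query, history, codebase)
        # Step (3): run the program and gather feedback
        feedback = execute(program)
        if accept(feedback):
            # Step (4): keep the accepted program for future queries
            codebase[query] = program
            return RefineResult(program, round_idx, True)
        history.append(feedback)
    return RefineResult(program, max_rounds, False)


# Toy stand-ins so the sketch runs without any model
def toy_generate(query, history, codebase):
    return f"program_v{len(history)}"


def toy_execute(program):
    status = "ok" if program == "program_v1" else "error"
    return Feedback(visual="", textual="", compile=status)


codebase: Dict[str, str] = {}
result = refine_loop("count the kites", toy_generate, toy_execute,
                     lambda fb: fb.compile == "ok", codebase)
print(result)
```

Passing the generator, executor, and acceptance check in as callables keeps the loop independent of any particular language model or vision tool, which is a design choice of this sketch rather than something the README specifies.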

## Main Results

In our experiments, we demonstrated the compositional reasoning and spatial understanding capabilities of De-fine, as reflected in its high scores on visual grounding and logic-intensive VQA datasets, particularly in counting and multi-image tasks. This indicates that De-fine can generate task-specific code based on feedback, effectively addressing the tasks.

### Visual grounding task results

<table>
  <tr>
    <td align="center" rowspan="2">Model</td>
    <td align="center" colspan="2">IoU (%)</td>
  </tr>
  <tr>
    <td align="center">RefCOCO</td>
    <td align="center">RefCOCO+</td>
  </tr>
  <tr>
    <td>GLIP</td>
    <td>55.0</td>
    <td>52.2</td>
  </tr>
  <tr>
    <td>ReCLIP</td>
    <td>58.6</td>
    <td>60.5</td>
  </tr>
  <tr>
    <td>GENOME</td>
    <td>69.2</td>
    <td>-</td>
  </tr>
  <tr>
    <td><strong>De-fine</strong></td>
    <td><strong>75.2</strong></td>
    <td><strong>70.0</strong></td>
  </tr>
</table>

### VQA Results

<table>
  <tr>
    <td align="center" rowspan="2">Model</td>
    <td align="center" colspan="4">Accuracy (%)</td>
  </tr>
  <tr>
    <td align="center">GQA</td>
    <td align="center">OK-VQA</td>
    <td align="center">TallyQA</td>
    <td align="center">NLVRv2</td>
  </tr>
  <tr>
    <td>VISPROG</td>
    <td>50.5</td>
    <td>52.6</td>
    <td>68.1</td>
    <td>62.4</td>
  </tr>
  <tr>
    <td>BLIP-2</td>
    <td>44.7</td>
    <td>45.9</td>    
    <td>48.4</td>
    <td>-</td>
  </tr>
  <tr>
    <td>GENOME</td>
    <td>45.6</td>
    <td>-</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>Flamingo</td>
    <td>-</td>
    <td>50.6</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>ViperGPT</td>
    <td>49.7</td>
    <td>52.5</td> 
    <td>70.2</td>  
    <td>62.9</td>
  </tr>
  <tr>
    <td>ViperGPT + AL</td>
    <td>52.2</td>
    <td>54.8</td>
    <td>71.7</td> 
    <td>64.0</td>  
  </tr>
  <tr>
    <td><strong>De-fine</strong></td>
    <td><strong>55.3</strong></td>
    <td><strong>57.1</strong></td>
    <td><strong>73.2</strong></td>
    <td><strong>67.3</strong></td>
  </tr>
</table>

## Case Studies

We showcase some examples of how De-fine dynamically refines programs using systematic feedback across various modalities.

<img src="pic/pic2.png"  width="100%">
<img src="pic/pic3.png"  width="100%">
<img src="pic/pic4.png"  width="100%">
<img src="pic/pic5.png"  width="100%">


## Feedback Example & Error Cases

<img src="pic/n172618.jpg"  width="50%">

- On the GQA task, for the image named n172618.jpg, with the question “What is the height of the girl that is holding the kite?”, De-fine generated the following code:

```python
def execute_command(image):
    #Step1: Find the patch containing the girl
    girl_patches = ImagePatch(image).find("girl")
    #Step2: Find the girl holding the kite
    kite_patches = ImagePatch(image).find("kite")
    kite_patch = kite_patches[0]
    girl_holding_kite_patches = []
    for girl_patch in girl_patches:
        if girl_patch.overlaps_with(kite_patch.left, kite_patch.lower, kite_patch.right, kite_patch.upper):
            girl_holding_kite_patches.append(girl_patch)
    #Step3: Calculate the height of the girl holding the kite
    girl_holding_kite_patch = girl_holding_kite_patches[0]
    height = girl_holding_kite_patch.upper - girl_holding_kite_patch.lower
    return str(height) + "pixels"
```

De-fine generated the following feedback:

- Visual Feedback: The young girl on the right in vibrant outdoor clothing holds onto a colorful kite, preparing to catch the wind on a lively day in the field. The young adventurer on the left in a pink sweater encounters the wonders of nature and the images you’ve provided do seem like they could be the result of an operation that extracts a specific area or "patch" from a larger image.
- Textual Feedback: There is a kite. There is a child holding the kite in her hand. The height of the child holding the kite is 355 pixels. Those sub-steps could be part of a process to estimate the height of the girl holding the kite.
- Compile Feedback: Execution succeed!

Although the code executed correctly and the logic was sound, the result was incorrect when compared to the expected answer “Tall”. This is because the question could not be adequately transformed into a classification problem of tall or short.
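The bounding-box arithmetic that the generated program relies on (an overlap test followed by `upper - lower`) can be sketched as below. This is an illustration under stated assumptions: `Box` and `first_overlapping` are hypothetical stand-ins rather than the `ImagePatch` class used by the generated code, the overlap rule is an assumed axis-aligned intersection test, and the coordinates are made up rather than taken from n172618.jpg.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Box:
    """Hypothetical stand-in for an image patch; assumes left < right and lower < upper."""
    left: int
    lower: int
    right: int
    upper: int

    def overlaps_with(self, left: int, lower: int, right: int, upper: int) -> bool:
        # Assumed rule: two axis-aligned boxes overlap if they intersect on both axes
        return (self.left <= right and self.right >= left
                and self.lower <= upper and self.upper >= lower)

    @property
    def height(self) -> int:
        # Pixel height, mirroring `upper - lower` in the generated program
        return self.upper - self.lower


def first_overlapping(candidates: List[Box], target: Box) -> Optional[Box]:
    """Return the first candidate that overlaps the target, or None if there is none."""
    for box in candidates:
        if box.overlaps_with(target.left, target.lower, target.right, target.upper):
            return box
    return None


# Made-up coordinates, for illustration only
girls = [Box(10, 40, 90, 240), Box(300, 50, 380, 260)]
kite = Box(340, 200, 460, 330)

holder = first_overlapping(girls, kite)
answer = f"{holder.height} pixels" if holder else "no match"
print(answer)
```

Unlike the generated program, which indexes `[0]` directly, this sketch returns `None` when no box overlaps, so an empty detection result does not raise an `IndexError`. Either way the output is a pixel measurement, which illustrates why it cannot match a categorical label such as “Tall”.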

<img src="pic/COCO_val2014_000000029393.jpg"  width="50%">

- On the OK-VQA task, regarding the image COCO_val2014_000000029393.jpg with the question, “What type of dog is this?”, De-fine produced the following code:

```python
def execute_command(image):
    #Step1: Create an ImagePatch object from the image
    image_patch = ImagePatch(image)
    #Step2: Use simple_query method to get the answer
    dog_breed = image_patch.simple_query("What type of dog is this?")
    #Step3: Return the answer
    return dog_breed
```

De-fine generated the following feedback:

- Visual Feedback: A vigilant dog stands guard on a sunlit garden deck, with a lemon tree backdrop. From the context provided, it seems like these images could indeed be part of such a picture.
- Textual Feedback: The breed of the dog is dachshund. But those steps alone are insufficient to determine the breed of a dog from an image. A specialized image recognition model would be required.
- Compile Feedback: Execution succeed!


The error in this instance stems from the BLIP tool used inside the `simple_query` function, which returned the answer "a dachshund". When we replaced the function called by `simple_query` with GPT-4V, the answer became "Rhodesian Ridgeback", although the label in the dataset is "hound".
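Swapping the model behind `simple_query` can be pictured as dependency injection. The sketch below is an assumption about how such a swap could be structured, not how the repository does it: `PatchWithBackend`, `stub_blip`, and `stub_gpt4v` are hypothetical, and the stubs only return the answers quoted above instead of calling any real model.

```python
from typing import Any, Callable

# A backend takes an image and a question and returns a text answer
QueryBackend = Callable[[Any, str], str]


class PatchWithBackend:
    """Hypothetical patch wrapper whose question-answering model is pluggable."""

    def __init__(self, image: Any, backend: QueryBackend) -> None:
        self.image = image
        self.backend = backend

    def simple_query(self, question: str) -> str:
        # Delegate to whichever backend was injected
        return self.backend(self.image, question)


def stub_blip(image: Any, question: str) -> str:
    # Stand-in only: returns the answer reported for the BLIP tool
    return "a dachshund"


def stub_gpt4v(image: Any, question: str) -> str:
    # Stand-in only: returns the answer reported for GPT-4V
    return "Rhodesian Ridgeback"


question = "What type of dog is this?"
answers = {
    name: PatchWithBackend(image=None, backend=fn).simple_query(question)
    for name, fn in (("blip", stub_blip), ("gpt4v", stub_gpt4v))
}
print(answers)
```

With this structure the generated program stays unchanged while the underlying model varies, which makes it easier to tell tool errors apart from program errors.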

## Acknowledgment
We thank the following open-source projects:
+ [CLIP](https://github.com/openai/CLIP)
+ [GLIP](https://github.com/microsoft/GLIP)
+ [BLIP2](https://huggingface.co/Salesforce/blip2-opt-2.7b-coco)
+ [LAVIS](https://github.com/salesforce/LAVIS)
+ [Visual Programming](https://github.com/allenai/visprog)
+ [ViperGPT](https://github.com/cvlab-columbia/viper)
