# Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales

## Fact Overview
Fact is a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs.

The reasoning processes of Multimodal Large Language Models (MLLMs) remain opaque, which makes the models hard to interpret and prone to hallucination. Their ability to carry out intricate compositional reasoning is also limited, which stalls their learning progress. In this work, we introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs. Fact uses verifiable visual programming to generate executable code, which guarantees faithfulness and precision. A series of operations, including pruning, merging, and bridging, then makes each rationale more concise. Finally, we keep only the rationales that transfer from the programming paradigm to the end-to-end paradigm, which guarantees transferability.
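The "verifiable" part of this idea can be illustrated with a minimal sketch: run each generated program and keep it only if its output matches the expected answer. Everything here is an assumption made for illustration, not the Fact implementation: the entry-point name `execute_command`, the dictionary used as a stand-in for an image, and the case-insensitive string comparison are all hypothetical.

```python
from typing import Any, Dict, List


def run_program(source: str, image: Any) -> Any:
    """Execute a generated program and return its answer.

    Assumes (hypothetically) that the program defines `execute_command(image)`.
    """
    namespace: Dict[str, Any] = {}
    # NOTE: exec runs arbitrary code; real generated code should be sandboxed.
    exec(source, namespace)
    return namespace["execute_command"](image)


def keep_verified(samples: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only samples whose program runs and reproduces the expected answer."""
    kept = []
    for sample in samples:
        try:
            answer = run_program(sample["code"], sample["image"])
        except Exception:
            continue  # programs that crash are discarded
        if str(answer).strip().lower() == str(sample["answer"]).strip().lower():
            kept.append(sample)
    return kept


# Toy data: a dict stands in for an image with detected objects.
toy_image = {"objects": ["dog", "dog", "cat"]}
samples = [
    {"image": toy_image, "answer": "2",
     "code": "def execute_command(image):\n    return image['objects'].count('dog')\n"},
    {"image": toy_image, "answer": "2",
     "code": "def execute_command(image):\n    return len(image['objects'])\n"},
    {"image": toy_image, "answer": "2",
     "code": "def execute_command(image):\n    return undefined_name\n"},
]
verified = keep_verified(samples)
```

In this toy run, the second program returns the wrong count and the third raises an error, so only the first sample survives the filter.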

<img src="pic/pic1.png"  width="100%">

The pipeline of Fact: 1) Generate executable code from an image and a query using a code generation engine, and retain only the code whose reasoning yields the expected answer. 2) Simplify the code into natural language by pruning irrelevant AST nodes, merging duplicates in symbolic traces, and filling logical gaps to form a coherent CoT. 3) Evaluate and filter the CoTs for end-to-end model feasibility. 4) Distill the refined, accurate CoTs into MLLMs for enhanced adaptability.
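Step 2 of the pipeline can be sketched with Python's standard `ast` module. The exact pruning and merging rules used by Fact are not spelled out here, so this sketch substitutes two simple stand-ins: removing top-level assignments whose result is never read, and dropping repeated steps from a trace. The function names, the `keep` parameter, and the toy program (including its `find` helper, which is never executed) are hypothetical. `ast.unparse` requires Python 3.9 or later.

```python
import ast
from typing import Iterable, List


def prune_unused_assignments(source: str, keep: Iterable[str] = ("answer",)) -> str:
    """Remove top-level `name = ...` statements whose name is never read.

    Names listed in `keep` (e.g. the final answer) are always retained.
    Assumes the right-hand sides have no side effects worth preserving.
    """
    protected = set(keep)
    tree = ast.parse(source)
    while True:
        # Collect every name that is read somewhere in the remaining program.
        used = {
            node.id
            for node in ast.walk(tree)
            if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
        }
        kept = []
        for stmt in tree.body:
            is_dead = (
                isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)
                and stmt.targets[0].id not in used
                and stmt.targets[0].id not in protected
            )
            if not is_dead:
                kept.append(stmt)
        if len(kept) == len(tree.body):
            break  # fixed point: nothing more to prune
        tree.body = kept
    return ast.unparse(tree)


def merge_duplicate_steps(trace: List[str]) -> List[str]:
    """Drop repeated trace steps while preserving first-occurrence order."""
    seen = set()
    merged = []
    for step in trace:
        if step not in seen:
            seen.add(step)
            merged.append(step)
    return merged


program = (
    "dogs = find(image, 'dog')\n"
    "cats = find(image, 'cat')\n"
    "sky = find(image, 'sky')\n"
    "answer = len(dogs) + len(cats)\n"
)
pruned = prune_unused_assignments(program)
merged = merge_duplicate_steps(["found dog", "found dog", "found cat"])
```

Here the `sky` lookup is pruned because nothing depends on it, and the duplicated "found dog" step collapses into one. Bridging, which fills logical gaps between the remaining steps, is not covered by this sketch.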

## Main Results

We evaluate two models of different parameter scales, MiniGPT4 (equipped with Vicuna 7B) and OpenFlamingo 3B, after rationale training, and compare them with other pre-trained models.

<table>
  <tr>
    <td align="center" rowspan="2">Model</td>
    <td align="center" rowspan="2">Language Model</td>
    <td align="center" rowspan="2">COCO</td>
    <td align="center" rowspan="2">Flickr 30K</td>
    <td align="center" rowspan="2">VQAv2</td>
    <td align="center" rowspan="2">GQA</td>
    <td align="center" rowspan="2">OK-VQA</td>
    <td align="center" colspan="2">TallyQA</td>
  </tr>
  <tr>
    <td align="center">Simple</td>
    <td align="center">Complex</td>
  </tr>
  <tr>
  <td>CosMo (2B)</td>
    <td>OPT-IML-1.8B</td>
    <td>79.9</td>
    <td>51.3</td>
    <td>46.7</td>
    <td>-</td>
    <td>28.3</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
  <td>Flamingo (3B)</td>
    <td>Chinchilla-1.4B</td>
    <td>73.0</td>
    <td>60.6</td>
    <td>49.2</td>
    <td>-</td>
    <td>41.2</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>OpenFlamingo (3B)</td>
    <td>MPT-1B </td>
    <td>74.9</td>
    <td>52.3</td>
    <td>44.6 </td>
    <td>30.1</td>
    <td>28.2</td>
    <td>64.4</td>
    <td>59.3</td>
  </tr>
  <tr>
    <td>OpenFlamingo-Instruct (3B) generalist </td>
    <td>MPT-1B</td>
    <td>79.7 </td>
    <td>53.8</td>
    <td>45.9</td>
    <td>30.9</td>
    <td>30.3</td>
    <td>65.9</td>
    <td>61.8</td>
  </tr>
  <tr>
    <td>OpenFlamingo-Fact (3B) generalist</td>
    <td>MPT-1B</td>
    <td><strong>85.3</strong></td>
    <td><strong>56.6</strong></td>
    <td><strong>49.2</strong></td>
    <td><strong>32.4</strong></td>
    <td><strong>31.8</strong></td>
    <td><strong>70.1</strong></td>
    <td><strong>65.7</strong></td>
  </tr>
  <tr>
    <td>OpenFlamingo-Instruct (3B) specialist</td>
    <td>MPT-1B</td>
    <td>-</td>
    <td>-</td>
    <td>47.7</td>
    <td>32.6</td>
    <td>31.7</td>
    <td>70.1</td>
    <td>66.5</td>
  </tr>
  <tr>
    <td>OpenFlamingo-Fact (3B) specialist </td>
    <td>MPT-1B</td>
    <td>-</td>
    <td>-</td>
    <td><strong>51.0</strong></td>
    <td><strong>35.5</strong></td>
    <td><strong>35.7</strong></td>
    <td><strong>77.6</strong></td>
    <td><strong>68.0</strong></td>
  </tr>
  <tr>
    <td>VL-GPT (7B)</td>
    <td>LLaMA-7B</td>
    <td>116.4</td>
    <td>-</td>
    <td>51.7</td>
    <td>34.6</td>
    <td>35.8</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>BLIP-2 (7B) </td>
    <td>Vicuna-7B</td>
    <td>-</td>
    <td>74.9</td>
    <td>65.0</td>
    <td>41.0</td>
    <td>45.9</td>
    <td>-</td>
    <td>-</td>
  </tr>
  <tr>
    <td>MiniGPT4 (7B)</td>
    <td>Vicuna-7B</td>
    <td>99.6</td>
    <td>76.3</td>
    <td>46.9</td>
    <td>34.5</td>
    <td>35.1</td>
    <td>69.5</td>
    <td>60.5</td>
  </tr>
  <tr>
    <td>MiniGPT4-Instruct (7B) generalist</td>
    <td>Vicuna-7B</td>
    <td>105.5</td>
    <td>78.5</td>
    <td>48.2</td>
    <td>34.9</td>
    <td>36.9 </td>
    <td>71.3</td>
    <td>63.8</td>
  </tr>
  <tr>
    <td>MiniGPT4-Fact (7B) generalist</td>
    <td>Vicuna-7B</td>
    <td><strong>116.8</strong></td>
    <td><strong>83.7</strong></td>
    <td><strong>50.8</strong></td>
    <td><strong>36.6</strong></td>
    <td><strong>38.3</strong></td>
    <td><strong>74.4</strong></td>
    <td><strong>66.9</strong></td>
  </tr>
  <tr>
    <td>MiniGPT4-Instruct (7B) specialist</td>
    <td>Vicuna-7B</td>
    <td>-</td>
    <td>-</td>
    <td>51.1</td>
    <td>37.2</td>
    <td>40.6</td>
    <td>75.2</td>
    <td>67.2</td>
  </tr>
  <tr>
    <td>MiniGPT4-Fact (7B) specialist</td>
    <td>Vicuna-7B</td>
    <td>-</td>
    <td>-</td>
    <td><strong>54.2</strong></td>
    <td><strong>39.8</strong></td>
    <td><strong>42.0</strong></td>
    <td><strong>80.7</strong></td>
    <td><strong>71.3</strong></td>
  </tr>

</table>


For the generalist models, overall performance improves after CoT rationale distillation, which supports the hypothesis that MLLMs benefit substantially from such distillation. For the specialist models, on tasks that require compositional reasoning (GQA) and counting (TallyQA), MiniGPT4-Fact outperforms MiniGPT4-Instruct by 2.6, 5.5, and 4.1 points on GQA, TallyQA Simple, and TallyQA Complex, respectively. These results indicate a significant improvement in the model's understanding of counting and its mastery of logic. We attribute these capabilities largely to the spatial understanding and tool integration abilities provided by high-quality rationales.
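The specialist gains can be recomputed directly from the table above. The short script below copies the specialist rows for both backbones and prints the Fact-minus-Instruct difference per benchmark; the dictionary layout and the helper name `gains` are illustrative choices, while the scores are taken from the table.

```python
# Specialist scores copied from the results table (GQA, TallyQA Simple, TallyQA Complex).
specialist = {
    "OpenFlamingo": {
        "Instruct": {"GQA": 32.6, "TallyQA Simple": 70.1, "TallyQA Complex": 66.5},
        "Fact": {"GQA": 35.5, "TallyQA Simple": 77.6, "TallyQA Complex": 68.0},
    },
    "MiniGPT4": {
        "Instruct": {"GQA": 37.2, "TallyQA Simple": 75.2, "TallyQA Complex": 67.2},
        "Fact": {"GQA": 39.8, "TallyQA Simple": 80.7, "TallyQA Complex": 71.3},
    },
}


def gains(backbone: str) -> dict:
    """Return Fact minus Instruct per benchmark, rounded to one decimal."""
    rows = specialist[backbone]
    return {
        bench: round(rows["Fact"][bench] - rows["Instruct"][bench], 1)
        for bench in rows["Fact"]
    }


for backbone in specialist:
    print(backbone, gains(backbone))
```

The MiniGPT4 differences come out to 2.6, 5.5, and 4.1 points, and the OpenFlamingo differences to 2.9, 7.5, and 1.5 points.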


## Comparison on the MME and POPE Benchmarks

<table>
  <tr>
    <td align="center" rowspan="2">Model</td>
    <td align="center" rowspan="2">MME</td>
    <td align="center" colspan="3">POPE</td>
  </tr>
  <tr>
    <td align="center">Random</td>
    <td align="center">Popular</td>
    <td align="center">Adversarial</td>
  </tr>
  <tr>
    <td>OpenFlamingo</td>
    <td>668.2</td>
    <td>52.6</td>
    <td>67.2</td>
    <td>56.0</td>
  </tr>
  <tr>
    <td>OpenFlamingo-Instruct generalist</td>
    <td>847.3</td>
    <td>69.5</td>    
    <td>73.1</td>
    <td>68.4</td>
  </tr>
  <tr>
    <td>OpenFlamingo-Fact generalist</td>
    <td>912.2</td>
    <td>73.0</td>
    <td>75.6</td>
    <td>71.5</td>
  </tr>
  <tr>
    <td>MiniGPT4</td>
    <td>581.7 </td>
    <td>43.3</td>
    <td>50.8</td>
    <td>47.9</td>
  </tr>
  <tr>
    <td>MiniGPT4-Instruct generalist</td>
    <td>864.9</td>
    <td>68.3</td> 
    <td>74.4</td>  
    <td>71.2</td>
  </tr>
  <tr>
    <td>MiniGPT4-Fact generalist</td>
    <td>1034.7</td>
    <td>78.7 </td>
    <td>83.7</td> 
    <td>79.1</td>  
  </tr>
</table>

These additional benchmarks show that Fact enhances the perception and cognition capabilities of MLLMs and reduces hallucinations by refining rationales to include only the objects relevant to the question. Such rationales tighten the correspondence between text and image, directing the model's attention to pertinent details and thereby increasing accuracy. The results also underscore the role of tailored rationale design in obtaining precise model responses.


## Case Studies
We show several examples of how CoT rationales are generated for distillation.

<img src="pic/pic2.png"  width="100%">
<img src="pic/pic3.png"  width="100%">
