\appendix
\newpage

\section*{Appendix}

\section{Literature Review on MLLM Evaluation and Rationale for Selected Evaluation Aspects} \label{rationale-for-selected-EAs}

Evaluating MLLMs is a critical task that determines their capabilities, limitations, and applicability across various domains. Prior studies \citep{li2024llava, wu2025controlmllm, yu2023mmvet} have employed a wide range of evaluation aspects depending on their focus, whether it be perceptual understanding, compositional reasoning, multimodal alignment, or task-specific problem-solving. Unlike traditional unimodal models, MLLMs require evaluation frameworks that consider their ability to process, integrate, and generate outputs from different modalities, such as text, images, and, in some cases, audio and video. A review of existing studies \citep{nwae403} shows that evaluations are primarily categorized into general multimodal understanding, task-specific assessments, and trustworthiness metrics. These evaluations are particularly important for ensuring that MLLMs can transition from general-purpose reasoning to task-specific applications, making them more reliable for real-world deployments. This section discusses the evaluation aspects commonly used in MLLM research, which motivated our selection of the aspects with the widest impact across MLLM applications.

Various studies \citep{cai2024vip, chen2024internvl, li2024llava, wu2025controlmllm, yu2023mmvet} have approached evaluation through distinct lenses, focusing on foundational multimodal capabilities and task-specific applications. One of the primary evaluation aspects in existing research is perception and object recognition, where models are tested on their ability to identify objects, classify images, and interpret visual attributes. Benchmarks such as VQA v2, COCO, and ImageNet have been widely used to assess how well models recognize and describe elements within an image, ensuring that they correctly map visual features to textual outputs \citep{fu2024mme}. Similarly, scene understanding is another crucial dimension that determines whether models can accurately describe complex environments, capturing spatial relationships and multiple interacting objects, as tested in datasets such as GQA and VizWiz \citep{nwae403}.

Another key area of evaluation in existing studies is multimodal reasoning and alignment, which assesses how well MLLMs integrate information across different modalities to derive meaningful conclusions \citep{wang2024exploring}. The ability to perform multimodal chain-of-thought (CoT) reasoning has gained increasing attention, as models must generate structured responses by logically sequencing multiple inputs from different modalities \citep{nwae403}. Studies \citep{tong2024eyes, zhou2025aretheythesame} have found that current MLLMs, despite their strong textual reasoning abilities, still struggle to connect visual cues with textual prompts in a coherent manner, limiting their reliability in domains requiring step-by-step logical processing. Additionally, multimodal reasoning is closely tied to vision-language alignment, which evaluates how effectively models fuse textual and visual data to generate accurate and context-aware outputs. Research \citep{fu2024mme} indicates that poor alignment often leads to failures in high-level reasoning tasks, as models may incorrectly interpret visual information when translating it into text-based responses.

Beyond fundamental perception and reasoning, some studies have investigated task-specific and real-world applications of MLLMs \citep{li2024surveying, wang2024comprehensive}. One of the most prominent applications is medical image analysis, where models are tested on their ability to assist in diagnosing conditions from X-rays, MRIs, and other medical scans \citep{zhang2024generalist, royer2024multimedeval, li2023llava}. In this domain, specialized benchmarks such as RadBench and PMC-VQA have been used to evaluate the accuracy of MLLM-generated medical insights \citep{nwae403}. Another critical application is autonomous driving and environmental scene interpretation \citep{shi2023exploring, sima2024drivelm}, where models process real-world driving scenarios by integrating road signs, obstacles, and pedestrian movements into decision-making processes. Datasets such as NuScenes-QA \citep{qian2024nuscenes} and BDD-X \citep{kim2018textual} serve as standard benchmarks in this space, ensuring that models can reliably analyze traffic conditions and provide accurate assessments of dynamic environments \citep{nwae403}. 

Some studies have explored multi-round QA and instruction following \citep{fu2024mme}, where models are tested on how well they can retain contextual information across conversational turns. Research indicates that existing MLLMs often exhibit context drift, failing to maintain coherence when responding to sequential queries, which is a key challenge in developing effective AI-powered assistants \citep{nwae403}. Furthermore, bias and fairness have also been examined in some studies \citep{zhang2024benchmarking, li2024red}, as concerns grow about MLLMs inheriting and amplifying societal biases from their training data. The models reported in \citep{adler2024gpt, anthropic2024claude} stand out as the most ethically aligned, showing high accuracy and a strong ability to refuse ethically questionable prompts. Fairness evaluation often involves testing for gender, racial, and regional biases, using datasets such as Multi-Trust \citep{zhang2024benchmarking} and VLBiasBench \citep{wang2024vlbiasbench} to assess whether models generate discriminatory or skewed responses \citep{nwae403}.

Another critical aspect in evaluating MLLMs is their robustness and reliability \citep{dang2024exploring}, particularly in minimizing hallucinations \citep{bai2024hallucination}, a phenomenon where models generate factually incorrect or fabricated outputs. Studies assessing hallucination rates have shown that MLLMs, particularly in complex reasoning tasks, tend to produce confident but incorrect statements, making them unreliable for high-stakes applications such as medical diagnosis and financial forecasting \citep{bai2024hallucination}. Efforts to measure and mitigate hallucinations have led to the development of evaluation frameworks that focus on trustworthiness and factual consistency, ensuring that models generate responses that are not only contextually relevant but also grounded in accurate information \citep{yin2024woodpecker}.

Overall, the breadth of evaluation aspects considered in MLLM studies reflects the increasing complexity of these models and the growing need for rigorous assessment frameworks. While some studies focus on fundamental perception and reasoning abilities, others prioritize task-specific applications, conversational capabilities, and fairness concerns. However, despite the diverse range of evaluation methodologies, challenges remain in developing standardized benchmarks that comprehensively measure MLLM performance across all modalities and tasks. This highlights the importance of selecting well-defined evaluation aspects that balance theoretical rigor and real-world applicability, which is a key motivation behind the selection of four core evaluation aspects in our study.

Despite these advancements, researchers still face challenges in defining standardized, reproducible evaluation aspects. Many existing methods rely on task-specific datasets or closed-set evaluations, limiting the generalizability of their findings. There is an increasing need for comprehensive, multi-domain evaluation frameworks that can assess models across perception, reasoning, task execution, and ethical considerations, ensuring that MLLMs are robust, interpretable, and trustworthy for real-world applications \citep{fu2024mme}.

While audio-visual evaluations are critical for advancing MLLMs, current models still exhibit substantial limitations in handling high-dimensional temporal data, making standardized evaluations difficult \citep{fu2024mme}. Audio- and video-based evaluations are therefore not considered in our study; this choice is driven by the need for a controlled evaluation framework that minimizes confounding factors related to temporal dependencies, data preprocessing, and model specialization \citep{nwae403}.

Given the wide range of evaluation aspects discussed in existing research, our study selects four core aspects (see~\ref{modelEAs}): Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. We chose them for their direct relevance to assessing task-specific performance and their applicability to a wide range of real-world MLLM use cases.

\section{Detailed Review of Key MLLMs}\label{apdx_review_mllm}

This section provides a comprehensive review of recent advancements in Multimodal Large Language Models (MLLMs), covering both proprietary and open-source developments. The discussion highlights their architectural designs, multimodal processing capabilities, and evaluation suitability, which informed our selection of models for this study.

\subsection{Proprietary MLLMs}

Proprietary models such as OpenAI's GPT-4o and GPT-4.5 Preview \citep{OpenAI2025Models}, Anthropic’s Claude 3 (including variants like Claude 3 Opus, Sonnet, and Haiku) \citep{anthropic2024claude}, and Google's Gemini series \citep{DeepMind2025Gemini} have set high standards in multimodal reasoning. For example, GPT-4 introduced multimodal inputs that enable the processing of both text and images, thereby improving instruction-following and contextual understanding. Similarly, the Claude 3 series is optimized for text–image reasoning tasks, leveraging large-scale datasets and fine-tuned architectures to achieve superior generalization and task flexibility. However, the closed-source nature of these models limits customization, transparency, and independent research adaptability.

\subsection{Open-Source MLLMs}

In response to the limitations of proprietary systems, the open-source ecosystem has rapidly evolved, providing models with enhanced accessibility and community-driven improvements. Several high-performance models have emerged:

\paragraph{ChatRex \citep{jiang2024chatrex}}  
ChatRex is designed as a decoupled MLLM that bridges perception and understanding in vision-language tasks. By employing a retrieval-based approach with a Universal Proposal Network (UPN), it improves visual grounding for tasks like object detection and region-based question answering. While it excels in object-level perception and spatial awareness, its specialization limits its adaptability for abstract multimodal reasoning tasks.

\paragraph{VITA \citep{fu2024vita}}  
VITA is a multimodal interactive model that handles video, image, text, and audio modalities in real time. It extends Mixtral 8×7B to enhance bilingual capabilities and integrates specialized encoders for vision and audio. Its duplex interaction pipeline enables continuous environmental monitoring and multi-turn conversations, though it is less optimized for structured vision-language reasoning such as code synthesis.

\paragraph{Long-VITA \citep{shen2025long}}  
Long-VITA targets long-context visual-language understanding, capable of processing up to 1 million tokens. It uses a multi-stage training process that includes long-sequence fine-tuning to maintain coherence over extended interactions. While it excels in video comprehension and document-level understanding, its lack of strict data filtering can impact response consistency.

\paragraph{Meteor \citep{lee2025meteor}}  
Meteor employs a rationale traversal approach to enhance multimodal reasoning without significantly increasing model size. By leveraging multifaceted rationale embeddings, it improves vision-language understanding and step-by-step reasoning. However, its performance is sensitive to the availability of rationale-enhanced pretraining datasets and may struggle with tasks that require high-resolution image processing.

\paragraph{MoAI \citep{lee2024moai}}  
MoAI utilizes a Mixture of Experts (MoE) approach by integrating external computer vision models for panoptic segmentation, object detection, and OCR. Its two-stage pipeline, comprising a MoAI-Compressor and a MoAI-Mixer, allows for detailed object detection and structured text reasoning. While effective in high-precision tasks, the dependency on external modules increases computational overhead.

\paragraph{ViP-LLaVA \citep{cai2024vip}}  
ViP-LLaVA enhances visual reasoning by overlaying visual markers (e.g., bounding boxes, arrows) directly onto input images, facilitating region-specific interactions. Built on a Vicuna v1.5 language backbone and CLIP-336px vision encoder, it excels in object localization and context-aware reasoning. Its training process involves multiple stages, including BLIP-captioned image-text pairs and GPT-4V instruction data.

\paragraph{Falcon2-11B \citep{malartic2024falcon2}}  
Falcon2-11B is an efficient foundation model with 11 billion parameters, optimized for text-based tasks and extended to multimodal functions via a CLIP-based vision encoder. Although it exhibits strong long-context reasoning, its vision-language variant may be less competitive in tasks requiring fine-grained visual comprehension.

\paragraph{Cambrian-1 \citep{tong2025cambrian}}  
Cambrian-1 is designed as a vision-centric MLLM using multiple vision encoders (e.g., OpenAI CLIP ViT-L/14@336, SigLIP ViT-SO400M, DINOv2 ViT-L/14) integrated via a Spatial Vision Aggregator (SVA). Available in various parameter scales (8B, 13B, 34B), it excels in OCR and chart-based tasks while maintaining efficiency through reduced visual token usage.

\paragraph{MiniCPM \citep{yao2024minicpm}}  
MiniCPM is a lightweight model focused on efficient processing of vision-language tasks. Despite its compact architecture, it performs competitively on fundamental multimodal reasoning tasks, though its capacity for complex, long-form content is limited.

\paragraph{mPLUG-Owl3 \citep{ye2024mplug}}  
mPLUG-Owl3 specializes in long image-sequence understanding and video-based multimodal processing using Hyper Attention Transformer Blocks (HATB). It is tailored for tasks requiring temporal coherence but is less suited for structured code generation and fine-tuned knowledge retrieval.

\paragraph{Molmo \citep{deitke2024molmo}}  
Molmo emphasizes transparency by being trained on the open PixMo dataset. It excels in fine-grained vision-language understanding and visual grounding tasks, though it is not optimized for structured tasks such as code execution or prompt engineering.

\paragraph{Additional Open-Source Models}  
Other notable models include InternVL2-1B \citep{chen2024internvl}, Qwen2-VL-2B-Instruct \citep{yang2024qwen2}, MiniMonkey \citep{huang2024mini}, Paligemma-3B-mix-448 \citep{beyer2024paligemma}, Phi-3.5 VLM \citep{abdin2024phi}, LLaVA OneVision-7B \citep{li2024llava}, Ovis 1.5-Llama 3-8B \citep{lu2024ovis}, GLM-4v-9B \citep{glm2024chatglm}, Ovis1.6 \citep{lu2024ovis}, Llama3.2-Vision \citep{meta2024llama32}, Pixtral \citep{agrawal2024pixtral}, OmChat-V2 \citep{omchat2024v2}, and InternVL2-26B \citep{chen2024far, internvl2024}. Each of these models is characterized by distinct combinations of language and vision encoders (e.g., Qwen, Gemma, Llama, GLM; ViT, SigLIP, CLIP, EVA) and varying parameter counts, offering a wide spectrum of capabilities from lightweight, efficient inference to high-precision multimodal reasoning. This comprehensive review highlights the diverse strategies in multimodal model design and evaluation, which underpin the criteria for our model selection. 

\section{Detailed Review of Prompt Engineering Methods}
\label{apdx_prompt_methods}

This appendix provides an in-depth review of the seven prompting techniques employed in this study. For each method, we discuss the underlying principles and key features, and provide example templates along with usage scenarios in both unimodal and multimodal contexts.

\subsection{Zero-Shot Prompting}
Zero-shot prompting involves providing the model with only the task description, relying entirely on its pre-trained knowledge \citep{radford2019language}. This method simplifies prompt design and reduces computational overhead, making it ideal for general-purpose tasks.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth, angle=0]{Figures_apdx/ZS.png}
\caption{Zero-shot Prompting Syntax}
\label{fig_zsp_app}
\end{figure}

\subsection{One-Shot Prompting}
One-shot prompting includes a single example alongside the task description to direct the model toward the desired output \citep{mann2020language}. This method provides minimal contextual guidance, balancing efficiency and accuracy for moderately complex tasks.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth, angle=0]{Figures_apdx/OS.png}
\vspace{-1mm}
\caption{One-shot Prompting Syntax}
\vspace{-4mm}
\label{fig_osp_app}
\end{figure}
\vspace{-5mm}
\subsection{Few-Shot Prompting}
\vspace{-2mm}
Few-shot prompting incorporates multiple examples to establish clear input-output patterns, which is particularly useful for tasks that require structured responses \citep{mann2020language}.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth, angle=0]{Figures_apdx/FS.png}
\vspace{-1mm}
\caption{Few-shot Prompting Syntax}
\vspace{-4mm}
\label{fig_fsp_app}
\end{figure}
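The three prompting styles above differ only in how many worked examples precede the query. As a minimal sketch, the following Python helper assembles all three from one template. The function name \texttt{build\_prompt}, the field labels (\texttt{Task}, \texttt{Input}, \texttt{Output}), and the sentiment examples are illustrative assumptions and are not the exact templates shown in the figures.

```python
from typing import Sequence, Tuple


def build_prompt(
    task: str,
    query: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a zero-, one-, or few-shot prompt.

    The number of (input, output) pairs in `examples` decides the style:
    0 pairs -> zero-shot, 1 pair -> one-shot, 2+ pairs -> few-shot.
    The field labels are hypothetical, chosen only for illustration.
    """
    parts = [f"Task: {task}"]
    for i, (example_input, example_output) in enumerate(examples, start=1):
        parts.append(
            f"Example {i}:\nInput: {example_input}\nOutput: {example_output}"
        )
    # The query is always last, with an open "Output:" slot for the model.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


task = "Classify the sentiment as positive or negative."
zero_shot = build_prompt(task, "The film was dull.")
one_shot = build_prompt(task, "The film was dull.", [("I loved it.", "positive")])
few_shot = build_prompt(
    task,
    "The film was dull.",
    [("I loved it.", "positive"), ("Terrible service.", "negative")],
)
```

In a multimodal setting the same text would accompany the image input; how the image is attached depends on the model interface and is outside this sketch.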
\vspace{-5mm}
\subsection{Chain-of-Thought (CoT) Prompting}
\vspace{-2mm}
Chain-of-Thought prompting encourages models to decompose problems into intermediate reasoning steps, thereby improving logical progression and accuracy in complex tasks \citep{wei2022chain, kojima2022large, zhang2022automatic}.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth, angle=0]{Figures_apdx/CoT.png}
\caption{Chain-of-Thought (CoT) Prompting Syntax}
\label{fig_cot_app}
\end{figure}
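As a hedged illustration of how such a prompt can be assembled, the sketch below appends a reasoning trigger to the question, optionally preceded by one worked example with explicit intermediate steps. The trigger phrase follows the zero-shot variant of \citet{kojima2022large}; the helper name \texttt{build\_cot\_prompt} and the arithmetic example are assumptions made for illustration.

```python
from typing import Optional

# Trigger phrase used in zero-shot chain-of-thought prompting.
COT_TRIGGER = "Let's think step by step."


def build_cot_prompt(question: str, worked_example: Optional[str] = None) -> str:
    """Build a chain-of-thought prompt.

    With no worked example, this is the zero-shot variant: the question
    followed by the trigger phrase. With a worked example, the example's
    explicit reasoning steps are shown first (few-shot style).
    """
    parts = []
    if worked_example is not None:
        parts.append(worked_example)
    parts.append(f"Q: {question}\nA: {COT_TRIGGER}")
    return "\n\n".join(parts)


# Hypothetical worked example with visible intermediate steps.
example = (
    "Q: A shop sells 3 pens for 6 dollars. How much do 5 pens cost?\n"
    "A: One pen costs 6 / 3 = 2 dollars. Five pens cost 5 * 2 = 10 dollars. "
    "The answer is 10."
)
cot_prompt = build_cot_prompt(
    "A train travels 60 km in 1.5 hours. What is its average speed?",
    worked_example=example,
)
```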

\subsection{Analogical Prompting}
\vspace{-3mm}
Analogical prompting utilizes analogous examples that closely align with the task's requirements to foster indirect reasoning and creativity. This method enables models to transfer knowledge based on structural similarities between scenarios \citep{yasunaga2023large, lu2021fantastically, wu2022self, guo2024can}.
\vspace{-3mm}
\begin{figure}[H]
\centering
\includegraphics[width=0.44\textwidth, angle=0]{Figures_apdx/Anl.png}
\vspace{-2mm}
\caption{Analogical Prompting Syntax}
\label{fig_ana_app}
\end{figure}
\vspace{-5mm}
\subsection{Generated Knowledge Prompting}
\vspace{-3mm}
Generated Knowledge Prompting involves prompting the model to generate additional task-relevant background knowledge, which is then used to improve reasoning and decision-making \citep{liu2021generated, liu2023pre}. This technique enriches the input context, leading to improved output accuracy.
\vspace{-3mm}
\begin{figure}[H]
\centering
\includegraphics[width=0.44\textwidth, angle=0]{Figures_apdx/GK.png}
\vspace{-2mm}
\caption{Generated Knowledge Prompting Syntax}
\label{fig_genk_app}
\end{figure}
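The two stages described above (knowledge generation, then knowledge-conditioned answering) can be sketched as follows. This is a minimal illustration under stated assumptions: \texttt{model} is any callable that maps a prompt string to a response string, the prompt wording is hypothetical, and \texttt{stub\_model} is a stand-in used only so that the example runs without a real MLLM.

```python
from typing import Callable, List


def generated_knowledge_answer(
    question: str,
    model: Callable[[str], str],
    num_facts: int = 2,
) -> str:
    """Two-stage generated knowledge prompting.

    Stage 1: ask the model for `num_facts` pieces of background knowledge.
    Stage 2: prepend that knowledge to the question and ask for the answer.
    """
    knowledge: List[str] = []
    for _ in range(num_facts):
        fact_prompt = (
            "Generate one fact relevant to the question.\n"
            f"Question: {question}\nFact:"
        )
        knowledge.append(model(fact_prompt))

    context = "\n".join(f"- {fact}" for fact in knowledge)
    final_prompt = f"Knowledge:\n{context}\n\nQuestion: {question}\nAnswer:"
    return model(final_prompt)


# Stand-in model (an assumption for this sketch): it records every prompt
# it receives and returns a numbered placeholder response.
calls: List[str] = []


def stub_model(prompt: str) -> str:
    calls.append(prompt)
    return f"stub response {len(calls)}"


answer = generated_knowledge_answer("Why is the sky blue?", stub_model)
```

With a real model, the knowledge-generation calls would typically be sampled with some randomness so that the generated facts differ from one another.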


\subsection{Tree-of-Thought (ToT) Prompting}
Tree-of-Thought prompting extends the Chain-of-Thought framework by organizing reasoning into a decision tree. This structure allows the model to explore multiple reasoning paths before converging on a final solution \citep{yao2024tree}, making it particularly effective for exploratory and decision-making tasks.
\begin{figure}[H]
\centering
\includegraphics[width=0.5\textwidth, angle=0]{Figures_apdx/ToT.png}
\caption{Tree-of-Thought Prompting Syntax}
\label{fig_tot_app}
\end{figure}
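The exploration of multiple reasoning paths can be illustrated with a small breadth-first search that keeps only the most promising partial paths at each depth. In the sketch below, \texttt{propose}, \texttt{score}, and \texttt{is\_solution} are deterministic stand-ins on a toy arithmetic problem; in an actual Tree-of-Thought setup the proposal and scoring steps would be model calls. All names and the toy problem are assumptions for illustration, not the procedure used in our experiments.

```python
from typing import Callable, List, Optional


def tree_of_thought(
    root: int,
    propose: Callable[[int], List[int]],
    score: Callable[[int], float],
    is_solution: Callable[[int], bool],
    beam_width: int = 2,
    max_depth: int = 5,
) -> Optional[List[int]]:
    """Breadth-first search over reasoning paths with pruning.

    At each depth, every path on the frontier is extended with all proposed
    next states; the `beam_width` best-scoring paths are kept.
    """
    frontier: List[List[int]] = [[root]]
    for _ in range(max_depth):
        candidates = [
            path + [nxt] for path in frontier for nxt in propose(path[-1])
        ]
        for path in candidates:
            if is_solution(path[-1]):
                return path
        # Keep only the most promising partial paths.
        candidates.sort(key=lambda p: score(p[-1]), reverse=True)
        frontier = candidates[:beam_width]
    return None


# Toy problem: starting from 1, reach 14 using the moves "+3" and "*2".
TARGET = 14
path = tree_of_thought(
    root=1,
    propose=lambda s: [s + 3, s * 2],
    score=lambda s: -abs(TARGET - s),
    is_solution=lambda s: s == TARGET,
)
```

Here the search returns the path 1, 4, 7, 14 after pruning weaker branches at each level.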

\section{Task Design Methodology for Evaluation Aspects}
This appendix section provides a detailed account of the task design methodology for each of the four Evaluation Aspects (EA1--EA4). Each subsection outlines the specific objectives, rationale, and design principles used to create the tasks, with a focus on probing different multimodal capabilities of MLLMs. EA1 focuses on reasoning and compositionality, EA2 on multimodal understanding and alignment, EA3 on code generation from visual inputs, and EA4 on integrating contextual knowledge with multimodal cues. Together, these subsections offer a comprehensive view of how the tasks were systematically designed to simulate real-world multimodal scenarios and evaluate model performance across diverse cognitive and functional dimensions.


\subsection{EA1 Task Design Methodology}
\label{apdx_ea1}

The tasks in the study were systematically crafted to evaluate the reasoning and compositionality abilities of MLLMs. Each task was designed with a specific objective and level of complexity, and the accompanying visuals were tailored to challenge the models' multimodal reasoning capabilities. Below, we outline the thought process behind designing the tasks in Evaluation Aspect 1 (Reasoning and Compositionality) and their corresponding visual aids. 

The images for each task were crafted to align with the task objectives while enhancing multimodal reasoning. They provide sufficient context for reasoning without overloading the model with unnecessary details. Every visual element (such as shapes, colors, or character actions) directly contributes to the reasoning challenge. Additionally, the visual style across tasks maintains a professional yet engaging appearance, aiding comprehension and focus.
The design of these tasks reflects a deliberate effort to test different dimensions of reasoning and compositionality in MLLMs. By combining well-structured multimodal inputs with incrementally complex challenges, the study ensures that the evaluation methodology is robust, comprehensive, and replicable.


For Task 1 (Pattern Recognition in Visual Sequences), our objective was to assess the model's ability to identify and extrapolate logical patterns across multiple modalities including numbers, shapes, and colors. The task integrates visual and textual elements, with images depicting sequences of numbers, geometric shapes, and colors. The pattern increases in complexity by combining arithmetic progression, geometric reasoning, and ambiguous color sequences. The sequences provide sufficient data points for models to deduce the logical rule while introducing ambiguity in color patterns to test reasoning flexibility. A carefully designed image illustrates the sequence, ensuring clarity and challenge.
In Task 2 (Logical Deduction from Text and Simplified Diagram), we aimed to evaluate deductive reasoning by combining textual descriptions and a simplified diagram of a real-world scenario. The task requires the model to synthesize text and visual clues (such as a broken window and a football near the window) to deduce the most likely event. A relatable scenario involving siblings and hobbies was chosen to simulate real-life reasoning challenges. Conflicting clues were introduced to evaluate the model's ability to prioritize relevant evidence. A clear, engaging image illustrates the suspects and the scene, making the task visually intuitive.
Task 3 (Mathematical Puzzle with Visual Data) was designed to test numerical reasoning and trend identification by combining tabular and graphical representations of data. Sales figures across four quarters provide a structured data source, while a bar chart complements the table, allowing for cross-referencing between modalities. The task requires models to calculate totals, identify trends, and predict future performance based on observed data. The bar chart and table are designed for easy interpretation while presenting enough complexity to challenge reasoning skills.
Finally, Task 4 (Story Synthesis from Text and Image) evaluates narrative synthesis by requiring models to generate coherent continuations from textual and visual inputs. The task integrates a textual scenario about a science fair and an image of a student with a renewable energy project to provide rich input. The visual representation of characters and their environment helps models establish a narrative context. The task encourages the model to infer plausible events and outcomes based on the given details. The image highlights key elements of the narrative, ensuring alignment with the text.


\subsection{EA2 Task Design Methodology}
\label{apdx_ea2}
The tasks in EA2 (Multimodal Understanding and Alignment) were developed to assess how well MLLMs integrate and align information from diverse modalities, such as images, text, and charts. Each task was designed to evaluate a distinct dimension of multimodal reasoning, such as matching, inference, translation, and consistency detection.
The design approach for EA2 tasks involves real-world multimodal challenges inspired by scenarios where visual and textual data coexist, such as articles with accompanying images and graphs with interpretations. Tasks progress from simpler matching exercises to more abstract reasoning and detailed cross-modal verification, incorporating incremental complexity. Each task requires the model to go beyond basic alignment by explaining its reasoning, testing both understanding and coherence.
The images for EA2 tasks were carefully created to enhance multimodal alignment, providing clear and meaningful data and ensuring that every detail contributes to the task's objective. While maintaining visual richness, the images avoid unnecessary complexity, allowing the focus to remain on reasoning and interpretation. The visuals simulate realistic situations, such as kayaking in nature, abstract art, and unemployment graphs, to make tasks relatable and engaging. Each visual is paired with textual or graphical content in a way that highlights dependencies between modalities.

The objective of Task 1 (Image-Text Matching and Explanation) is to evaluate the model's ability to match visual scenes with corresponding textual descriptions and provide reasoning for each match. The task integrates visual and textual modalities, requiring the model to align content based on key characteristics such as kayaking, manufacturing, and cooking. The model must discern details in each image and accurately match them to abstract textual concepts, testing interpretive and alignment skills. By providing explanations for each match, the task ensures the model demonstrates understanding rather than guessing. Clear and detailed images are used to depict diverse scenes (adventure, industry, and culinary arts), making the task visually intuitive while challenging. Explanation prompts ensure the reasoning process is transparent and logical. Task 2 (Inferring Context from Combined Modalities) aims to assess the model's ability to infer additional context by synthesizing textual and visual information. The task combines a descriptive paragraph (e.g., Maria walking on an empty street at dusk) with an image (a clock tower in a quiet city square) to test the model's ability to infer time, setting, and motivations. The model must analyze textual hints (e.g., streetlights and vendors closing) and visual cues (e.g., clock showing 7:30 PM) to arrive at accurate conclusions. The task introduces subtle contextual clues, requiring the model to resolve ambiguity using logical inference. The image is designed to evoke a specific time and atmosphere, enhancing the model's ability to integrate visual and textual data. 
The objective of Task 3 (Cross-Modal Translation) is to evaluate the model's ability to interpret abstract visual art and translate it into coherent literary themes or narratives. The abstract painting (depicting turbulent seas under sunlight) challenges the model to interpret artistic elements (colors, patterns, and mood) and map them to poetic themes. The task emphasizes translating visual impressions into descriptive language, requiring creative reasoning. By interpreting the painting's mood (e.g., chaotic yet hopeful), the model's ability to infer emotional tones from visuals is tested. The abstract painting is detailed and expressive, providing ample cues for interpretation while leaving room for subjective reasoning.
Task 4 (Aligning Data from Charts and Text) tests the model's capability to detect inconsistencies between data presented in a visual chart and textual descriptions. The task involves analyzing a line graph (e.g., unemployment rates) and cross-referencing it with a textual description to identify mismatches. It evaluates the model's ability to spot errors or inconsistencies (e.g., the text claims a decrease to 2\% while the chart shows an increase to 7\% in 2020). The task requires detailed attention to both visual and textual details, ensuring the model's output is precise and evidence-backed. A simple, clean line chart is used to focus on key data points while minimizing distractions.
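
The consistency check that Task 4 expects can be made concrete with a small sketch that compares percentage claims in a description against the values read from a chart. Only the 2020 figures (a claimed 2\% against a charted 7\%) come from the task description above; the 2019 value, the function name \texttt{find\_mismatches}, and the claim pattern ``X\% in YEAR'' are assumptions made for illustration.

```python
import re
from typing import Dict, List, Tuple


def find_mismatches(
    chart_values: Dict[int, float],
    description: str,
    tolerance: float = 0.5,
) -> List[Tuple[int, float, float]]:
    """Return (year, claimed, charted) for every claim that disagrees.

    Only claims of the form "<number>% in <year>" are recognised; this is
    a deliberately narrow pattern chosen for the sketch.
    """
    pattern = re.compile(r"(\d+(?:\.\d+)?)\s*%\s+in\s+(\d{4})")
    mismatches = []
    for value_text, year_text in pattern.findall(description):
        claimed = float(value_text)
        year = int(year_text)
        charted = chart_values.get(year)
        if charted is not None and abs(charted - claimed) > tolerance:
            mismatches.append((year, claimed, charted))
    return mismatches


# The 2019 value is hypothetical; the 2020 values follow the task example.
chart = {2019: 5.0, 2020: 7.0}
text = "The unemployment rate decreased to 2% in 2020."
issues = find_mismatches(chart, text)
```

For this input, \texttt{issues} contains a single entry for 2020 with the claimed value 2.0 and the charted value 7.0.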




\subsection{EA3 Task Design Methodology}
\label{apdx_ea3}
The tasks in EA3 are designed to evaluate the ability of MLLMs to generate accurate and functional code from multimodal inputs. Each task assesses the model's capacity to interpret visual instructions, generate context-specific code, and execute logical steps. These include interpreting tables, flowcharts, and images that describe problems or provide structured data inputs. The model is expected to produce code that performs well-defined operations such as data visualization, arithmetic computation, or sequence generation, transforming abstract prompts into executable logic. The tasks are carefully crafted to reflect a progression in complexity, ranging from basic operations, such as extracting numerical data from an image, to more intricate challenges, such as generating code from flowchart-based instructions. This incremental design facilitates the evaluation of a model’s ability to reason and scale its performance in increasingly demanding coding scenarios.

To support this, the visual components of EA3 tasks were created with clarity, relevance, and realism in mind. Images were designed to be well-organized and unambiguous, directly tied to the task objectives without introducing extraneous elements. They simulate realistic programming scenarios, including the interpretation of structured data or the automation of repetitive tasks. As complexity increases across tasks, so too does the visual intricacy, providing a robust benchmark for assessing the model’s capacity for abstraction and stepwise code generation.

Task 1 tests the model's ability to interpret tabular data in an image and generate Python code to visualize it as a bar chart. The task involves extracting structured data (e.g., sales over four quarters) from an image, and the model must generate correct visualization code, including the required library imports (e.g., Matplotlib). The structured table provides a clear yet challenging input for code generation, and the task simulates a common data science workflow.
Task 2 assesses the model's ability to generate Python code (e.g., Turtle graphics) that draws a shape depicted in an image. The task requires interpreting an image of a geometric shape (e.g., a star or hexagon) and writing accurate code that replicates the shape using a specific library. It evaluates the model's attention to dimensions and proportions in the visual input and mirrors beginner-friendly educational coding exercises.
Task 3 tests the model's ability to extract numerical information from text in an image (e.g., a shopping receipt) and compute a sum. The task combines OCR (optical character recognition) capabilities with mathematical operations and mimics practical use cases, such as expense calculation from invoices.
Task 4 evaluates the model's ability to convert chart data into a structured Python dictionary. The task requires extracting information from a bar chart or pie chart and organizing it into key-value pairs, which tests whether the structured output matches the visual input. This mirrors tasks in data engineering or ETL (Extract, Transform, Load) workflows.
Task 5 tests the model's ability to parse text in an image (e.g., an itemized receipt) and perform basic arithmetic. The model must identify the item prices in the image and sum them while handling inconsistent or unclear inputs (e.g., smudged text). The task simulates expense tracking and financial analysis workflows. Task 6 assesses the model's ability to interpret a CSV-like structure in an image and convert it into a Python-compatible format. The model must parse tabular data with headers and rows and generate Python code that represents the data as a list of dictionaries or a Pandas DataFrame. The task combines OCR capabilities with data engineering skills and reflects real-world data preprocessing scenarios. Task 7 evaluates the model's ability to interpret step-by-step algorithm instructions in an image and generate functional code. The model must translate a flowchart or textual steps into Python code, following a structured logical process (e.g., loops, recursion) to compute the sequence. This task emphasizes algorithm design and educational coding scenarios. Task 8 tests the model's ability to follow the decision-making logic of a flowchart and implement it as a Python function. The model must interpret the decision nodes and paths in the flowchart and translate them into executable code, which evaluates whether it can adhere to predefined logical structures. Flowchart-based programming tasks are widely used in both educational and industrial contexts.



\subsection{EA4 Task Design Methodology}
\label{apdx_ea4}
The EA4 task set simulates a wide range of real-world challenges that demand the integration of multimodal inputs with contextual knowledge. Designed to rigorously test MLLMs' ability to reason, retrieve, and synthesize across domains, these tasks combine clear visual inputs with meaningful textual prompts to evaluate model performance in complex knowledge-driven scenarios. Each task in EA4 requires the model to retrieve relevant contextual knowledge, either from external sources or embedded within the input, and to make sense of the given scenario. This involves combining textual and visual content, such as maps, historical images, scientific charts, or cultural references, to generate informed outputs. The tasks are crafted to reflect authentic applications in domains such as history, science, and fact-checking, often requiring the model to resolve ambiguity by reasoning through incomplete or conflicting information.

To support this, the visuals accompanying EA4 tasks are carefully designed for contextual relevance. Each image aligns closely with the task narrative, whether depicting a historical landmark, cultural artifact, or scientific figure, and provides just enough detail to encourage reasoning without introducing unnecessary visual complexity. These visuals are grounded in real-world scenarios, helping simulate tasks such as identifying locations, analyzing diagrams, or verifying facts. Moreover, they are deliberately constructed to complement the associated textual inputs, promoting effective multimodal interaction and integrated reasoning across modalities.

Task 1 evaluates the model's ability to identify a historical monument and explain its significance using external knowledge. The model must identify a landmark (e.g., a clock tower) from its image and retrieve relevant historical or cultural information about it, which tests its ability to synthesize visual data and external knowledge into a coherent explanation. Task 2 tests the model's ability to analyze scientific data from a visual chart and a textual explanation, integrating both to draw conclusions. The model must interpret trends and relationships in a graph (e.g., unemployment rates) and combine these visual insights with the textual explanation to infer implications. This task mimics scenarios in research or data journalism. Task 3 assesses the model's ability to analyze medical images (e.g., X-rays) and integrate domain-specific knowledge to provide recommendations. The model must identify abnormalities or patterns in a medical image and draw on medical knowledge to recommend next steps (e.g., tests, treatments). The task introduces subtle visual clues that require careful analysis, reflecting applications in AI-powered healthcare diagnostics. Task 4 evaluates the model's ability to interpret cultural artifacts using visual and textual context. The model must identify a cultural artifact (e.g., a sculpture or painting) from its image and explain its significance, historical background, and cultural relevance, which encourages nuanced reasoning and interpretation. Task 5 tests the model's ability to analyze a map and a textual description in order to discuss historical events. The model must interpret map data (e.g., trade routes, battlefields) and combine geographical insights with historical narratives.
It tests the ability to hypothesize from multimodal inputs, reflecting challenges in historical research or geographic analysis. Task 6 assesses the model's ability to cross-reference data from a chart (e.g., energy trends) with textual information in an article, combining quantitative data with qualitative reasoning. The model must verify claims in the article against the chart data, which tests its ability to detect inconsistencies or validate arguments and reflects tasks in journalism or policy analysis.
Task 7 evaluates the model's ability to verify claims from headlines using multimodal inputs (e.g., images, encyclopedia excerpts). This requires cross-referencing headlines with visual and textual evidence: the model must retrieve and synthesize external information, and the task examines its critical thinking and consistency across modalities. Task 8 assesses the model's ability to analyze artwork in relation to its historical and cultural background. The model must identify key elements of the artwork and explain their historical significance by combining visual understanding with contextual knowledge. This design ensures that EA4 assesses the knowledge-intensive multimodal capabilities of modern large models, moving beyond surface-level understanding toward deeper, multi-faceted reasoning.

Detailed task descriptions and corresponding expected outputs for each evaluation aspect are provided in the supplementary material accompanying this paper.

\vspace{-5mm}
\section{Detailed Evaluation Criteria}
\label{apdx_evalcriteria}
\vspace{-5mm}
\begin{table}[H]
\caption{Detailed evaluation criteria for assessing model outputs. These criteria supplement the empirical thresholds provided in Table~\ref{tab:evaluation_thresholds} and ensure a nuanced and consistent assessment of model performance.}
\label{tab:detailed_evalcriteria}
\centering
\renewcommand{\arraystretch}{1.1}
\setlength{\tabcolsep}{8pt}
\begin{tabularx}{\textwidth}{l l X}
\toprule
\textbf{Criterion} & \textbf{Category} & \textbf{Description} \\
\midrule
\textbf{Accuracy} & Correct & The response addresses all components of the task correctly. \\
                  & Partially Correct & The response addresses only some parts of the task. \\
                  & Incorrect & The response fails to correctly address the task; any error in multi-part tasks renders it incorrect. \\
\midrule
\textbf{Relevancy} & Relevant & The response is fully aligned with the task and context. \\
                  & Partially Relevant & The response contains some relevant information but is incomplete. \\
                  & Irrelevant & The response is unrelated to the task or context. \\
\midrule
\textbf{Conciseness} & Under-Explained & The response lacks sufficient detail to explain its reasoning. \\
                   & To the Point & The response is clear, concise, and appropriately detailed. \\
                   & Over-Explained & The response includes redundant or unnecessary elaboration. \\
\midrule
\textbf{Hallucination} & Yes & The response includes irrelevant, repetitive, or random content not pertinent to the task. \\
                     & No & The response remains focused and free of irrelevant content. \\
\bottomrule
\end{tabularx}
\end{table}

For transparency and reproducibility, this section provides the detailed criteria used to evaluate model outputs across the four key metrics: Accuracy, Relevancy, Conciseness, and Hallucination.

To ensure evaluation consistency, a two-phase annotation strategy was adopted. First, one expert annotator scored model responses for each Evaluation Aspect (EA) using predefined rubrics. A second expert subsequently reviewed these annotations and flagged responses requiring further discussion. Disagreements were resolved via consensus. Although this process does not allow for inter-rater agreement statistics such as Cohen’s Kappa, it ensured that each response was reviewed twice and adjusted where needed. This peer-review protocol is aligned with best practices for high-effort manual evaluations in large-scale model analysis.
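
For reference, a minimal sketch of the statistic mentioned above, written in standard notation that is not part of the original protocol: for two annotators who label the same responses independently, Cohen's Kappa is conventionally defined as
\[
\kappa = \frac{p_o - p_e}{1 - p_e},
\]
where \(p_o\) is the observed proportion of responses on which the two annotators agree and \(p_e\) is the proportion of agreement expected by chance given each annotator's label frequencies. Estimating \(p_o\) and \(p_e\) requires two independently produced label sets. As we read the protocol described here, the second expert reviewed the first expert's labels instead of scoring the responses separately, so only one label set exists before consensus and \(\kappa\) cannot be computed.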


\section{Results in Detail}
\label{appendixresultsindetail}
In this section, we present the detailed numerical tables corresponding to the evaluation results discussed in the main text. While the main text highlights the key findings through radar charts for clearer visual comparison across models, prompting strategies, and tasks, the full tabulated results are provided here for completeness. Figs.~\ref{fig:EA1perfmetricfig}--\ref{fig:EA4perfmetricfig} visualize these tables.
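
Several rows in these tables are complements of one another, which offers a quick sanity check when reading them. As a sketch, assuming that every response receives exactly one label per criterion (as in Table~\ref{tab:detailed_evalcriteria}) and writing \(n_c\) for the number of responses labelled \(c\) among the \(N\) annotated responses in a cell (notation introduced here for illustration only), each entry is a percentage share, and the paired rows should sum to 100 up to rounding:
\[
s_c = 100 \cdot \frac{n_c}{N}, \qquad
s_{\mathrm{F+P}} + s_{\mathrm{Irrelevant}} = 100, \qquad
s_{\mathrm{UE}} + s_{\mathrm{TP+OE}} = 100 .
\]
For example, the zero-shot S-MLLM entries in Table~\ref{tab:reasoning_results_summary} give \(75 + 25 = 100\) for relevance and \(25 + 75 = 100\) for conciseness.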


\begin{table}[ht]
\centering
\caption{
EA1: Reasoning and Compositionality Tasks Results Summary. This table presents the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models on reasoning tasks. Performance metrics include Accuracy, Hallucination, Relevance (F + P = Fully and Partially Relevant), Irrelevance, and Conciseness (UE = Under-Explained, TP = To the Point, OE = Over-Explained). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{sec:summaryresults} for detailed discussion and interpretation.
}
\label{tab:reasoning_results_summary}

\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|p{3.5cm}|p{1.7cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
\hline
\textbf{Metrics} & \textbf{Size} & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
Accuracy 
  & S-MLLMs & 31.25 & 18.75 & 31.25 & 18.75 & 25 & 6.25 & 6.25 \\ \cline{2-9}
  & M-MLLMs & 31.25 & 37.5 & 31.25 & 18.75 & 37.5 & 25 & 18.75 \\ \cline{2-9}
  & L-MLLMs & 30 & 30 & 45 & 25 & 25 & 20 & 35 \\ \hline
Hallucination 
  & S-MLLMs & 37.5 & 50 & 37.5 & 31.25 & 37.5 & 50 & 75 \\ \cline{2-9}
  & M-MLLMs & 6.25 & 12.5 & 0 & 12.5 & 0 & 12.5 & 43.75 \\ \cline{2-9}
  & L-MLLMs & 0 & 0 & 5 & 15 & 5 & 15 & 30 \\ \hline
Relevance (F + P)
  & S-MLLMs & 75 & 56.25 & 62.5 & 75 & 68.75 & 56.25 & 56.25 \\ \cline{2-9}
  & M-MLLMs & 93.75 & 87.5 & 81.25 & 100 & 100 & 100 & 75 \\ \cline{2-9}
  & L-MLLMs & 95 & 90 & 90 & 90 & 95 & 90 & 95 \\ \hline
Irrelevance 
  & S-MLLMs & 25 & 43.75 & 37.5 & 25 & 31.25 & 43.75 & 43.75 \\ \cline{2-9}
  & M-MLLMs & 6.25 & 12.5 & 18.75 & 0 & 0 & 0 & 25 \\ \cline{2-9}
  & L-MLLMs & 5 & 10 & 10 & 10 & 5 & 10 & 5 \\ \hline
Conciseness (UE)
  & S-MLLMs & 25 & 12.5 & 18.75 & 25 & 31.25 & 6.25 & 0 \\ \cline{2-9}
  & M-MLLMs & 50 & 50 & 43.75 & 43.75 & 62.5 & 43.75 & 43.75 \\ \cline{2-9}
  & L-MLLMs & 50 & 75 & 70 & 50 & 60 & 45 & 50 \\ \hline
Conciseness (TP+OE)
  & S-MLLMs & 75 & 87.5 & 81.25 & 75 & 68.75 & 93.75 & 100 \\ \cline{2-9}
  & M-MLLMs & 50 & 50 & 56.25 & 56.25 & 37.5 & 56.25 & 56.25 \\ \cline{2-9}
  & L-MLLMs & 50 & 25 & 30 & 50 & 40 & 55 & 50 \\ \hline
\end{tabular}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{table}[ht]
\centering
\caption{
EA2: Multimodal Understanding and Alignment Tasks Results Summary. This table displays the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models on tasks requiring multimodal understanding. Performance metrics include Accuracy, Hallucination, Relevance (F + P = Fully and Partially Relevant), Irrelevance, and Conciseness (UE = Under-Explained, TP = To the Point, OE = Over-Explained). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See the EA2 discussion in Section~\ref{sec:summaryresults}.
\label{tab:model_understanding_tasks_summary_results}
}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|p{3.5cm}|p{1.7cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
\hline
\textbf{Metrics} & \textbf{Size} & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
Accuracy 
  & S-MLLMs & 31.25 & 6.25 & 12.5 & 12.5 & 18.75 & 6.25 & 0 \\ \cline{2-9}
  & M-MLLMs & 43.75 & 37.5 & 37.5 & 37.5 & 56.25 & 43.75 & 43.75 \\ \cline{2-9}
  & L-MLLMs & 56.25 & 43.75 & 37.5 & 43.75 & 37.5 & 37.5 & 43.75 \\ \hline
Hallucination 
  & S-MLLMs & 12.5 & 25 & 37.5 & 25 & 37.5 & 43.75 & 50 \\ \cline{2-9}
  & M-MLLMs & 0 & 0 & 0 & 0 & 6.25 & 6.25 & 6.25 \\ \cline{2-9}
  & L-MLLMs & 0 & 6.25 & 0 & 0 & 25 & 6.25 & 12.5 \\ \hline
Relevance (F + P)
  & S-MLLMs & 87.5 & 75 & 75 & 81.25 & 75 & 81.25 & 62.5 \\ \cline{2-9}
  & M-MLLMs & 100 & 100 & 100 & 100 & 100 & 100 & 100 \\ \cline{2-9}
  & L-MLLMs & 100 & 100 & 100 & 100 & 93.75 & 93.75 & 93.75 \\ \hline
Irrelevance 
  & S-MLLMs & 12.5 & 25 & 25 & 18.75 & 25 & 18.75 & 37.5 \\ \cline{2-9}
  & M-MLLMs & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \cline{2-9}
  & L-MLLMs & 0 & 0 & 0 & 0 & 6.25 & 6.25 & 6.25 \\ \hline
Conciseness (UE)
  & S-MLLMs & 50 & 50 & 50 & 37.5 & 37.5 & 31.25 & 31.25 \\ \cline{2-9}
  & M-MLLMs & 81.25 & 93.75 & 87.5 & 75 & 56.25 & 50 & 56.25 \\ \cline{2-9}
  & L-MLLMs & 68.75 & 68.75 & 75 & 56.25 & 50 & 56.25 & 68.75 \\ \hline
Conciseness (TP+OE)
  & S-MLLMs & 50 & 50 & 50 & 62.5 & 62.5 & 68.75 & 68.75 \\ \cline{2-9}
  & M-MLLMs & 18.75 & 6.25 & 12.5 & 25 & 43.75 & 50 & 43.75 \\ \cline{2-9}
  & L-MLLMs & 31.25 & 31.25 & 25 & 43.75 & 50 & 43.75 & 31.25 \\ \hline
\end{tabular}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



\begin{table}[ht]
\centering
\caption{
EA3: Complex Code Generation and Execution Tasks Results Summary. This table reports the average performance (in \%) of Small, Medium, and Large MLLMs on code generation tasks. Performance metrics include Accuracy, Hallucination, Relevance (F + P = Fully and Partially Relevant), Irrelevance, and Conciseness (UE = Under-Explained, TP = To the Point, OE = Over-Explained). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See the EA3 discussion in Section~\ref{sec:summaryresults}.}
\label{tab:code_generation_tasks_summary_results}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|p{3.5cm}|p{1.7cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
\hline
\textbf{Metrics} & \textbf{Size} & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
Accuracy 
  & S-MLLMs & 46.88 & 56.25 & 53.12 & 50 & 25 & 40.62 & 31.25 \\ \cline{2-9}
  & M-MLLMs & 78.12 & 87.50 & 90.62 & 84.38 & 65.62 & 78.12 & 78.12 \\ \cline{2-9}
  & L-MLLMs & 84.38 & 87.50 & 96.88 & 93.75 & 78.12 & 78.12 & 84.38 \\ \hline
Hallucination 
  & S-MLLMs & 37.5 & 28.12 & 31.25 & 31.25 & 40.62 & 37.5 & 46.88 \\ \cline{2-9}
  & M-MLLMs & 3.12 & 0 & 0 & 0 & 12.5 & 6.25 & 3.12 \\ \cline{2-9}
  & L-MLLMs & 0 & 0 & 0 & 0 & 15.62 & 6.25 & 6.25 \\ \hline
Relevance (F + P)
  & S-MLLMs & 81.25 & 78.12 & 84.37 & 87.50 & 71.88 & 71.88 & 75 \\ \cline{2-9}
  & M-MLLMs & 100 & 100 & 100 & 100 & 100 & 100 & 100 \\ \cline{2-9}
  & L-MLLMs & 100 & 100 & 100 & 100 & 96.88 & 100 & 100 \\ \hline
Irrelevance 
  & S-MLLMs & 18.75 & 21.88 & 15.63 & 12.5 & 28.12 & 28.12 & 25 \\ \cline{2-9}
  & M-MLLMs & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \cline{2-9}
  & L-MLLMs & 0 & 0 & 0 & 0 & 3.12 & 0 & 0 \\ \hline
Conciseness (UE)
  & S-MLLMs & 31.25 & 56.25 & 37.5 & 37.5 & 21.88 & 34.38 & 21.88 \\ \cline{2-9}
  & M-MLLMs & 21.88 & 31.25 & 31.25 & 18.75 & 3.12 & 12.5 & 3.12 \\ \cline{2-9}
  & L-MLLMs & 9.38 & 15.62 & 21.88 & 12.5 & 6.25 & 12.5 & 3.12 \\ \hline
Conciseness (TP+OE)
  & S-MLLMs & 68.75 & 43.75 & 62.5 & 62.5 & 78.12 & 65.62 & 78.12 \\ \cline{2-9}
  & M-MLLMs & 78.12 & 68.75 & 68.75 & 81.25 & 96.88 & 87.5 & 96.88 \\ \cline{2-9}
  & L-MLLMs & 90.62 & 84.38 & 78.13 & 87.5 & 93.75 & 87.5 & 96.88 \\ \hline
\end{tabular}
\end{table}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\begin{table}[ht]
\centering
\caption{
EA4: Knowledge Retrieval and Integration Tasks Results Summary. This table presents the average performance (in \%) of Small, Medium, and Large MLLMs on knowledge retrieval tasks. Performance metrics include Accuracy, Hallucination, Relevance (F + P = Fully and Partially Relevant), Irrelevance, and Conciseness (UE = Under-Explained, TP = To the Point, OE = Over-Explained). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See the EA4 discussion in Section~\ref{sec:summaryresults} for in-depth commentary on performance patterns and hallucination.
}
\label{tab:knowledge_retrieval_results_summary}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|p{3.3cm}|p{1.7cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|}
\hline
\textbf{Metrics} & \textbf{Size} & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
Accuracy 
  & S-MLLMs & 43.75 & 34.38 & 37.5 & 37.5 & 21.88 & 28.12 & 32.26 \\ \cline{2-9}
  & M-MLLMs & 71.88 & 59.38 & 62.5 & 62.5 & 53.12 & 53.12 & 62.5 \\ \cline{2-9}
  & L-MLLMs & 87.5 & 77.5 & 75 & 77.5 & 75 & 53.12 & 64.1 \\ \hline
Hallucination 
  & S-MLLMs & 25 & 31.25 & 34.38 & 25 & 40.62 & 43.75 & 51.61 \\ \cline{2-9}
  & M-MLLMs & 0 & 0 & 3.12 & 0 & 6.25 & 0 & 6.25 \\ \cline{2-9}
  & L-MLLMs & 0 & 0 & 0 & 0 & 7.5 & 0 & 2.56 \\ \hline
Relevance (F + P)
  & S-MLLMs & 81.25 & 78.12 & 71.88 & 78.12 & 68.75 & 68.75 & 67.74 \\ \cline{2-9}
  & M-MLLMs & 100 & 100 & 96.88 & 100 & 100 & 96.88 & 93.75 \\ \cline{2-9}
  & L-MLLMs & 97.5 & 100 & 100 & 100 & 100 & 96.88 & 94.87 \\ \hline
Irrelevance 
  & S-MLLMs & 18.75 & 21.88 & 28.12 & 21.88 & 31.25 & 31.25 & 32.26 \\ \cline{2-9}
  & M-MLLMs & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \cline{2-9}
  & L-MLLMs & 2.5 & 0 & 0 & 0 & 0 & 3.12 & 5.13 \\ \hline
Conciseness (UE)
  & S-MLLMs & 53.12 & 31.25 & 37.5 & 28.12 & 15.62 & 21.88 & 19.35 \\ \cline{2-9}
  & M-MLLMs & 28.12 & 62.5 & 62.5 & 43.75 & 53.12 & 46.88 & 34.38 \\ \cline{2-9}
  & L-MLLMs & 82.5 & 87.5 & 82.5 & 55 & 55 & 46.88 & 41.03 \\ \hline
Conciseness (TP+OE)
  & S-MLLMs & 46.88 & 68.75 & 62.5 & 71.88 & 84.38 & 78.12 & 80.65 \\ \cline{2-9}
  & M-MLLMs & 71.88 & 37.5 & 37.5 & 56.25 & 46.88 & 53.12 & 65.62 \\ \cline{2-9}
  & L-MLLMs & 17.5 & 12.5 & 17.5 & 45 & 45 & 53.12 & 58.97 \\ \hline
\end{tabular}
\end{table}

\section{Case Study}
This section presents a qualitative case study illustrating how different prompting styles influence the performance of a single MLLM on the same task. We selected the Qwen/Qwen2-VL-2B-Instruct model and task EA1 T3 (Reasoning and Compositionality -- Sales Data Analysis) as a representative example. Table~\ref{tab:casestudy} compares the model's outputs across three prompting techniques: Zero-Shot, Few-Shot, and Chain-of-Thought. For each style, we report the model's output alongside the evaluators' observation, which highlights correctness, relevance, reasoning quality, and notable error patterns. This example demonstrates the qualitative impact of prompt design on model behaviour, complementing the aggregate quantitative results reported in the main text. By linking directly to the prompting styles introduced in Section 2.3 and the expected correct output presented in the supplementary materials (S1.3), this case study provides concrete evidence of how those methodologies translate into practical model performance.

\captionsetup{justification=centering, singlelinecheck=false}
\begin{longtable}{p{1.5cm} p{10.5cm} p{2.5cm}}
\caption{Case study of Qwen/Qwen2-VL-2B-Instruct performance on EA1\_T3 (Sales Data Analysis) across three different prompting styles.}
\label{tab:casestudy} \\

\hline
\textbf{Prompting Style} & \textbf{Model Output} & \textbf{Observation} \\ 
\hline
\endfirsthead

\multicolumn{3}{c}{{\tablename\ \thetable{} -- continued from previous page}} \\
\hline
\textbf{Prompting Style} & \textbf{Model Output} & \textbf{Observation} \\ 
\hline
\endhead

\hline \multicolumn{3}{r}{{Continued on next page}} \\ \hline
\endfoot

\hline
\endlastfoot

Zero-Shot & ['Q.1. The product with the highest total sales over the year is Product Z. Q.2. Trends and patterns in sales for Product X, Product Y, and Product Z across the four quarters are as follows:Product X: The sales for Product X have been relatively stable, with a slight increase in sales in Q4 compared to Q3. - Product Y: The sales for Product Y have been relatively stable, with a slight increase in sales in Q4 compared to Q3. Product Z: The sales for Product Z have been consistently high, with a slight increase in sales in Q4 compared to Q3.Q.3. Based on the data, Product Z is likely to perform best in Q1 of the next year. Reasoning:  Product Z has consistently high sales across all quarters, with the highest sales in Q4. Product Z has the highest sales in Q4 compared to Product X and Product Y. Product Z has the highest sales in Q4 compared to Product X and Product Y. Product Z has the highest sales in Q4 compared to Product X and Product Y.] (the sentence ``Product Z has the highest sales in Q4 compared to Product X and Product Y'' repeats 16 times)  &  Correct; partially relevant; under-explained; tendency to repeat phrases; no explicit intermediate calculations.  \\ \hline

Few-Shot & ['Q.1. The product with the highest total sales over the year is Product Z. The total sales for Product Z in Q1, Q2, Q3, and Q4 are \$35,000, \$30,000, \$25,000, and \$20,000, respectively. Therefore, the total sales for Product Z are the highest among the three products. Q.2. Trends and patterns in sales for Product X, Product Y, and Product Z across the four quarters are as follows:- Product X: The sales for Product X show a steady increase each quarter. This indicates a consistent demand for Product X over the year. Product Y: The sales for Product Y show a fluctuating pattern with a drop in Q2 and a rise in Q3 and Q4. This suggests that Product Y may experience a temporary dip in demand in Q2 but recovers in Q3 and Q4.- Product Z: The sales for Product Z show a significant rise in Q4. This indicates that Product Z is likely to perform well in Q1 of the next year. Q.3. Based on the data, Product Z is likely to perform best in Q1 of the next year. This is because Product Z had the highest total sales over the year and the sales for Product Z in Q1, Q2, Q3, and Q4 are \$35,000, \$30,000, \$25,000, and \$20,000, respectively. Therefore, Product Z is expected to have the highest sales in Q1 of the next year.'] & Correct; relevant; concise reasoning; omitted explicit calculations but followed example patterns well. \\ \hline

Chain-of-Thought & ['Q.1: Which product had the highest total sales over the year and why? Chain of Thought 1: First, calculate the total sales for each product by summing the sales from all four quarters.Product X showed the highest total sales, followed by Product Y and then Product Z.Product X consistently had the highest sales throughout the year, suggesting it is the most profitable product. Q.2: Identify trends or patterns in the sales for each product and provide the reason behind those trends or patterns. Chain of Thought 2: Product X consistently had the highest sales, indicating it is the most profitable product. Product Y showed a steady increase in sales, suggesting it is a growing product. Product Z showed a significant increase in Q4, indicating it may be a seasonal product. Q.3: Predict which product will likely perform best in Q1 of the next year, and explain your reasoning step-by-step. Chain of Thought 3:Product X consistently had the highest sales, indicating it is the most profitable product. Product Y showed a steady increase in sales, suggesting it is a growing product. Product Z showed a significant increase in Q4, indicating it may be a seasonal product. Based on the trend analysis, Product X is likely to continue to perform well in Q1 of the next year, as it has consistently shown the highest sales over the year.'] & Incorrect; irrelevant; under-explained; OCR/object-detection errors leading to misread numbers; faulty intermediate steps. \\
\hline
\end{longtable}




\section{Experimental Setup}
\label{experimental_setup}

The experiments were conducted on two separate servers with distinct hardware specifications. The Connect1 server ran the Python-based implementations of the MLLM workloads. The Connect3 server hosted the Llama3.2 Vision model through Ollama; this model was queried from the Connect1 server via API calls.

\begin{table}[H]
\centering
\caption{Connect1 Server Specification}
\begin{tabular}{|ll|}
\hline
\multicolumn{2}{|c|}{\textbf{CPU Information}}                                                    \\ \hline
\multicolumn{1}{|l|}{\textbf{Technical Specification}} & \textbf{Intel Cascadelake SP processor}  \\ \hline
\multicolumn{1}{|l|}{Processor}                        & Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz \\ \hline
\multicolumn{1}{|l|}{OS}                               & Ubuntu 23.04                             \\ \hline
\multicolumn{1}{|l|}{Micro-architecture}               & Cascadelake                              \\ \hline
\multicolumn{1}{|l|}{Thread(s) per core}               & 2                                        \\ \hline
\multicolumn{1}{|l|}{Cores per socket}                 & 24                                       \\ \hline
\multicolumn{1}{|l|}{Socket(s)}                        & 2                                        \\ \hline
\multicolumn{1}{|l|}{NUMA node(s)}                     & 2                                        \\ \hline
\multicolumn{1}{|l|}{L1d cache}                        & 1.5 MiB                                  \\ \hline
\multicolumn{1}{|l|}{L1i cache}                        & 1.5 MiB                                  \\ \hline
\multicolumn{1}{|l|}{L2 cache}                         & 48 MiB                                   \\ \hline
\multicolumn{1}{|l|}{L3 cache}                         & 71.5 MiB                                 \\ \hline
\multicolumn{1}{|l|}{Main memory}                      & 256 GB                                   \\ \hline
\multicolumn{2}{|c|}{\textbf{GPU Information}}                                                    \\ \hline
\multicolumn{1}{|l|}{GPU Model}                        & NVIDIA RTX A6000                         \\ \hline
\multicolumn{1}{|l|}{Memory}                           & 48 GB                                    \\ \hline
\multicolumn{1}{|l|}{Compute Capability}               & 8.6                                      \\ \hline
\end{tabular}
\end{table}

\begin{table}[H]
\centering
\caption{Connect3 Server Specification}
\begin{tabular}{|ll|}
\hline
\multicolumn{2}{|c|}{\textbf{CPU Information}}                                                     \\ \hline
\multicolumn{1}{|l|}{\textbf{Technical Specification}} & \textbf{Intel Cascadelake SP processor}   \\ \hline
\multicolumn{1}{|l|}{Processor}                        & Intel(R) Xeon(R) Gold 5220R CPU @ 2.20GHz \\ \hline
\multicolumn{1}{|l|}{OS}                               & Ubuntu 23.04                              \\ \hline
\multicolumn{1}{|l|}{Micro-architecture}               & Cascadelake                               \\ \hline
\multicolumn{1}{|l|}{Thread(s) per core}               & 2                                         \\ \hline
\multicolumn{1}{|l|}{Cores per socket}                 & 24                                        \\ \hline
\multicolumn{1}{|l|}{Socket(s)}                        & 2                                         \\ \hline
\multicolumn{1}{|l|}{NUMA node(s)}                     & 2                                         \\ \hline
\multicolumn{1}{|l|}{L1d cache}                        & 1.5 MiB                                   \\ \hline
\multicolumn{1}{|l|}{L1i cache}                        & 1.5 MiB                                   \\ \hline
\multicolumn{1}{|l|}{L2 cache}                         & 48 MiB                                    \\ \hline
\multicolumn{1}{|l|}{L3 cache}                         & 71.5 MiB                                  \\ \hline
\multicolumn{1}{|l|}{Main memory}                      & 256 GB                                    \\ \hline
\multicolumn{2}{|c|}{\textbf{GPU Information}}                                                     \\ \hline
\multicolumn{1}{|l|}{GPU Model}                        & Persistence-M                             \\ \hline
\multicolumn{1}{|l|}{Memory}                           & 24 GB                                     \\ \hline
\multicolumn{1}{|l|}{Compute Capability}               & 8                                         \\ \hline
\end{tabular}
\end{table}

