\documentclass[10pt]{article} % For LaTeX2e
% \usepackage{tmlr}
% If accepted, instead use the following line for the camera-ready submission:
\usepackage[accepted]{tmlr}
% To de-anonymize and remove mentions to TMLR (for example for posting to preprint servers), instead use the following:
%\usepackage[preprint]{tmlr}

% Optional math commands from https://github.com/goodfeli/dlbook_notation.
\input{math_commands.tex}

\usepackage{hyperref}


\usepackage{url}


%additional packages by authors:
\usepackage{graphicx}
\usepackage{rotating}  % Required for rotated tables
\usepackage{multirow}  % Allows multirow cells
\usepackage{array}     % Allows better column width control
\usepackage{graphicx}  % Required for \resizebox
\usepackage{tabularx}  % For tables that automatically adjust column widths
\usepackage{float}     % Allows use of [H] placement specifier
\usepackage{changepage} % Required for adjustwidth environment (optional)
% \usepackage[numbers]{natbib}
% \usepackage{caption}
\usepackage{listings}
% \usepackage[table,xcdraw]{xcolor} % For table coloring
\usepackage{booktabs}
\usepackage{placeins}
\hypersetup{pageanchor=false}
\usepackage{longtable} 
\usepackage{caption}






\title{The Future of MLLM Prompting is Adaptive: \\A Comprehensive Experimental Evaluation of Prompt \\Engineering Methods for Robust Multimodal Performance}

% Authors must not appear in the submitted version. They should be hidden
% as long as the tmlr package is used without the [accepted] or [preprint] options.
% Non-anonymous submissions will be rejected without review.

\author{\name Anwesha Mohanty \email anwesha.mohanty@ucd.ie \\
      \addr CeADAR: Ireland’s Centre for AI\\
      University College Dublin, Belfield, Dublin 4, Ireland
      \AND
      \name Venkatesh Balavadhani Parthasarathy \email venkatesh.parthasarathy@ucd.ie \\
      \addr CeADAR: Ireland’s Centre for AI \\
      University College Dublin, Belfield, Dublin 4, Ireland
      \AND
      \name Arsalan Shahid \email arsalan.shahid@ucd.ie\\
      \addr CeADAR: Ireland’s Centre for AI \\
      University College Dublin, Belfield, Dublin 4, Ireland
     }

% The \author macro works with any number of authors. Use \AND 
% to separate the names and addresses of multiple authors.

\newcommand{\fix}{\marginpar{FIX}}
\newcommand{\new}{\marginpar{NEW}}

\def\month{10}  % Insert correct month for camera-ready version
\def\year{2025} % Insert correct year for camera-ready version
\def\openreview{\url{https://openreview.net/forum?id=B1L8HrjoA1}} % Insert correct link to OpenReview for camera-ready version

% % Disable hyperlinking of page numbers (but keep others)
% \makeatletter
% \let\Hy@raisedlink\@empty
% \makeatother

\begin{document}


\maketitle

\begin{abstract}
Multimodal Large Language Models (MLLMs) are set to transform how machines process and generate human-like responses by integrating diverse modalities such as text, images, and code. Yet effectively harnessing their capabilities hinges on optimal prompt engineering. In this study, we focus specifically on text–image multimodal reasoning and understanding, evaluating MLLM performance across diverse task categories. We present a comprehensive experimental evaluation of seven prompt engineering methods applied to 13 open-source MLLMs over 24 tasks spanning Reasoning and Compositionality, Multimodal Understanding and Alignment, Complex Code Generation and Execution, and Knowledge Retrieval and Integration. Our approach stratifies models by parameter count into Small (\(<4\)B), Medium (4B–10B), and Large (\(>10\)B) categories and compares prompting techniques including Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought. Our experiments reveal that while Large MLLMs excel in structured tasks such as code generation and execution, achieving accuracies as high as 96.88\% under Few-Shot prompting, and relevance scores in multimodal understanding and alignment reach 100\% under Zero-Shot prompting, all models struggle with complex reasoning and abstract model understanding, often yielding accuracies below 60\% and high hallucination rates. Notably, structured reasoning prompts (Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought) frequently increased hallucination rates (up to 75\% in Small models) and led to longer response times (exceeding 20 seconds in Large MLLMs), while simpler prompting methods (One-Shot and Few-Shot) provided more concise and efficient outputs. Our findings underscore that no single prompting method uniformly optimizes all task types.
Instead, adaptive prompting strategies that combine the strengths of example-based guidance with selective structured reasoning are essential to enhance robustness, efficiency, and factual accuracy in MLLMs. Our work provides critical insights and actionable recommendations for optimizing prompt engineering in text–image multimodal contexts, paving the way for more reliable deployment of MLLMs in real-world applications ranging from AI-assisted coding and knowledge retrieval to visual–textual content understanding.
% 246 words 
\end{abstract}

\section{Introduction}

The rapid evolution of Multimodal Large Language Models (MLLMs) has catalyzed a paradigm shift in bridging visual representation learning with natural language understanding, thereby enabling sophisticated multimodal reasoning and broader real-world applicability. Although conventional Large Language Models (LLMs) have demonstrated impressive scaling behaviors \citep{zeng2025scaling}, the leap toward multimodality is primarily driven by the advent of increasingly capable LLM backbones \citep{liu2024llavanext}. In this work, we focus specifically on the text–image modality, which remains the most mature and widely benchmarked setting for MLLMs, while recognising that the broader class of multimodal systems can also incorporate audio, video, and other sensory data. Nonetheless, a critical gap persists in the seamless integration of visual processing components with language models. Many current MLLMs employ vision transformer-based architectures, most notably CLIP \citep{radford2021learning, zhai2023sigmoid}, as feature extractors, often relegating visual understanding to a secondary role rather than embedding it integrally within the reasoning pipeline. Although alternative approaches, such as self-supervised learning methods exemplified by DINO \citep{oquab2023dinov2}, have shown promise, systematic studies that jointly address architectural design and contextual prompting strategies remain scarce \citep{lu2023reference, liu2023pre}.

MLLMs are inherently designed to process heterogeneous data modalities, thereby expanding their task coverage significantly. However, their effective instruction-following and context comprehension are hindered by suboptimal integration of vision and language modules, inadequate prompt design, and inconsistencies in input representation \citep{liu2023pre, arif2025fixing}. While many evaluations focus on benchmarking MLLM performance across diverse tasks, they frequently overlook critical factors such as:
\begin{itemize}
    \item The compatibility between MLLM architectures and the evaluation dimensions pertinent to specific tasks.
    \item The alignment between training datasets and evaluation benchmarks, ensuring models are optimally calibrated for the tasks they are assessed on.
    \item The effectiveness of prompt engineering techniques in enhancing multimodal understanding and robust instruction-following.
\end{itemize}

To address these shortcomings, our study explores an evaluation-centric and prompt-based framework. We begin by surveying prevalent use cases and delineating the task requirements and evaluation aspects essential for effective multimodal reasoning. This analysis informs our selection of MLLMs, highlighting both architectural diversity and contextual considerations. Subsequently, we rigorously investigate a range of prompt engineering methodologies, examining their impact on enhancing multimodal context integration and overall instruction adherence. In doing so, we propose a comprehensive evaluation framework that elevates multimodal instruction-following as a pivotal performance metric for MLLMs.


\subsection{MLLM Architecture and Applications}

Although LLMs are optimized for text-based inputs and outputs, MLLMs extend these capabilities to incorporate images, videos, and audio, necessitating more complex architectural integrations. Typically, an MLLM comprises three primary components: a Modality Encoder, a Transformation Layer, and an LLM backbone, as depicted in Figure \ref{mllm_architecture} \citep{vaswani2023attentionneed, zhang2024mmllmsrecentadvancesmultimodal}.

\begin{figure}[ht]
    \centering
    \includegraphics[width=1\textwidth, angle=0]{Figures_main/MLLM_Architecture.png}
    \caption{A high-level overview of a typical MLLM pipeline. Multiple input modalities (e.g., images, video, audio) are first processed by dedicated modality encoders (e.g., ViT \citep{dosovitskiy2020image}, CLIP-ViT \citep{radford2021learning}, BEiT \citep{bao2021beit}). The encoded features are then projected or transformed via components such as linear projections, MLPs, or cross-attention to align with the text embedding space. Finally, the LLM backbone (e.g., Qwen, LLaMA, Falcon) integrates these multimodal features for unified reasoning and generation.}
    \label{mllm_architecture}
\end{figure}
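As a rough illustration of the three-component pipeline described above, the following Python sketch traces shapes only: a stand-in modality encoder emits patch features, a single linear projection maps them to the text embedding width, and the projected visual tokens are concatenated with text token embeddings before they would reach the LLM backbone. All dimensions, random weights, and function names are hypothetical placeholders chosen for illustration; they are not taken from any of the evaluated models.

```python
import random

# Hypothetical sizes, chosen only for illustration.
NUM_PATCHES = 4   # visual tokens produced by the modality encoder
VISION_DIM = 6    # width of each encoded patch feature
TEXT_DIM = 8      # width of the LLM token embeddings

random.seed(0)

def encode_image(num_patches=NUM_PATCHES, dim=VISION_DIM):
    """Stand-in for a modality encoder (e.g. a ViT): returns random patch features."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]

def make_projection(in_dim=VISION_DIM, out_dim=TEXT_DIM):
    """Stand-in transformation layer: one linear map of shape (in_dim, out_dim)."""
    return [[random.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

def project(features, weight):
    """Map each patch feature into the text embedding space (row times weight)."""
    out_dim = len(weight[0])
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) for j in range(out_dim)]
        for row in features
    ]

def embed_text(tokens, dim=TEXT_DIM):
    """Stand-in for the LLM's token embedding lookup."""
    return [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in tokens]

def build_llm_input(image_tokens, text_tokens):
    """Concatenate projected visual tokens and text tokens along the sequence axis."""
    return image_tokens + text_tokens

patches = encode_image()
visual_tokens = project(patches, make_projection())
text_tokens = embed_text(["Describe", "the", "image"])
llm_input = build_llm_input(visual_tokens, text_tokens)

# Prints the sequence length and the embedding width of the combined input.
print(len(llm_input), len(llm_input[0]))
```

Real systems may replace the single linear map with an MLP or cross-attention, as Figure~\ref{mllm_architecture} indicates; the sketch fixes only the interface between the three components.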

MLLMs, especially those with the capability to process images and videos, have seen rapid development, as underscored by the HuggingFace VLM Leaderboard \citep{duan2024vlmevalkit}, which now lists more than 200 models introduced since 2022. Notable contributions in this space include the categorizations of vision-language models (VLMs) by \citet{ghosh2024exploringfrontiervisionlanguagemodels} and \citet{Yin_2024}, which analyze architectures, training methodologies, and evaluation metrics.

Specialized applications further highlight the potential of MLLMs. \citet{niu2024textmultimodalityexploringevolution} demonstrate the integration of multimodal data in healthcare for clinical decision-making and medical imaging analysis. \citet{wang2023finvisgptmultimodallargelanguage} introduce FinVis-GPT for financial graph analysis through custom datasets and instruction-based annotations, while \citet{Liang2024.09.29.615524} present DrugChat for the prediction of drug molecule properties. Furthermore, \citet{Bewersdorff_2025} propose a framework for integrating MLLMs into science education, and \citet{yang2024seedstorymultimodallongstory} develop SEED-Story for generating multimodal narratives using novel attention mechanisms.

A common architectural trend among these models is the use of a vision encoder coupled with an LLM, often linked via a transformation layer \citep{zhang2024mmllmsrecentadvancesmultimodal}. While CLIP-based models excel in zero-shot prompting scenarios \citep{kojima2022large}, they frequently fall short on fine-grained tasks. In contrast, ViT-based architectures \citep{dosovitskiy2020image, wu2020visual, xiao2021early} offer robust spatial attention but incur higher computational costs and reduced interpretability.

\subsection{Evaluation Challenges and Recent Advances}

The HuggingFace VLM Leaderboard \citep{duan2024vlmevalkit} reveals heterogeneous performance across benchmarks, ranging from Vision Q\&A and OCR to RealWorldQA, ChartQA, and MathQA. Notably, leading models such as Step-1o \citep{duan2024vlmevalkit} underperform on benchmarks like HallusionBench \citep{guan2024hallusionbench}, highlighting nuanced trade-offs in current evaluation schemes. Most benchmarks follow a zero-shot evaluation approach with their own set of metrics, typically presenting models with a question and multiple-choice options based on an image. The use of simple and uniform prompts for all samples can lead to an underestimation of model capabilities. The sensitivity of both LLMs and MLLMs to prompt variations, as noted by \citet{xie2024tpevaltapmultimodalllms}, further suggests that uniform prompt designs may fail to capture a model's true potential.
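
For concreteness, a uniform zero-shot multiple-choice prompt of the kind described above might be assembled as in the following Python sketch. The template wording, the \texttt{<image>} placeholder, and the function name are hypothetical and are not drawn from any specific benchmark; the point is only that every sample receives the same fixed wrapper regardless of task or model.

```python
from string import ascii_uppercase

# Hypothetical fixed template, applied identically to every sample.
TEMPLATE = (
    "<image>\n"
    "Question: {question}\n"
    "Options:\n"
    "{options}\n"
    "Answer with the option letter only."
)

def build_zero_shot_prompt(question, choices):
    """Format one multiple-choice sample with the uniform template."""
    if not 1 <= len(choices) <= len(ascii_uppercase):
        raise ValueError("expected between 1 and 26 answer choices")
    # Label the choices A, B, C, ... in the order given.
    options = "\n".join(
        f"{letter}. {choice}" for letter, choice in zip(ascii_uppercase, choices)
    )
    return TEMPLATE.format(question=question, options=options)

prompt = build_zero_shot_prompt(
    "Which shape appears next in the sequence?",
    ["Circle", "Square", "Triangle"],
)
print(prompt)
```

Because the wrapper never changes, any model that is sensitive to its particular phrasing is penalized on every sample at once, which is one way a uniform design can understate capability.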

Recent studies have attempted to overcome these limitations through heuristic and ensemble prompting techniques in zero-shot and few-shot settings \citep{brown2020language, sivarajkumar2024empirical}. Enhanced evaluation frameworks, such as the one proposed by \citet{hao2025mllmsreasonmultimodalityemma}, indicate that even state-of-the-art models like GPT-o1 \citep{jaech2024openai} struggle to exceed 50\% accuracy in multimodal reasoning tasks despite employing Chain-of-Thought (CoT) prompting \citep{ge2023chain}. Structured CoT (SCoT) approaches have demonstrated improvements of up to 13.79\% over traditional CoT methods \citep{brown2020language, deepseek2025deepseek}.

Complementing these evaluations, \citet{Jiang_2024} provide a holistic assessment of Large Vision-Language Models (LVLMs) across both specialized tasks (e.g., object detection and medical diagnosis) and general tasks (e.g., object counting and spatial reasoning). Their evaluation, which includes models such as MiniGPT-v2 \citep{chen2023minigpt}, LLaVA-1.5 \citep{liu2024improved}, and Shikra \citep{chen2023shikra}, along with assessments via GPT-4V \citep{OpenAI2025Models}, highlights ongoing challenges including limited cognition, object hallucination, and robustness issues. Additionally, \citet{li2025structured} explore Structured CoT for improving code generation by incorporating programming principles (sequential, branch, and loop) to guide reasoning steps before coding; however, code generation remains underexplored within the MLLM context.

Different models exhibit varying sensitivities to the same prompt changes, and existing evaluation frameworks often fail to address prompt-induced bias, leading to unfair comparisons. These architectural limitations, inconsistencies in reported benchmark accuracies, and current evaluation methodologies raise significant questions about the true potential of these models.

\subsection{Our Approach and Contributions}

To address these challenges, we developed a comprehensive experimental framework to rigorously evaluate 13 open-source MLLMs across four key aspects: Multimodal Reasoning, Model Understanding, Knowledge Retrieval, and Code Generation. The selected models represent a stratified sample based on parameter sizes and are paired with diverse LLM backbones to facilitate a robust evaluation of both performance and real-world applicability. The main contributions of this paper include: 

\begin{enumerate}
    \item A comparative study of seven prompt engineering methods applied to 13 open-source MLLMs, evaluating their performance on 24 tasks across four evaluation aspects. In doing so, we design a diverse set of tasks, create standardized prompt templates, and share both the datasets and templates to ensure reproducibility.
    \item An in-depth analysis of how prompt engineering strategies interact with task types and model scales, providing insights into model performance, reliability, and resource requirements, as well as offering best practices and actionable recommendations for optimizing performance in real-world applications.
\end{enumerate}



\section{Methods}

To understand the impact of prompt engineering techniques across diverse tasks and evaluation metrics, we employed a four-stage experimental design and evaluation framework:

\begin{itemize}
    \item \textbf{Stage 1 – Defining Core Evaluation Aspects:} We define four Evaluation Aspects (EAs) to provide a comprehensive analysis of model performance across reasoning, multimodal interpretation, code generation, and knowledge integration. Detailed discussion of these aspects is provided in Section~\ref{modelEAs}. For each EA, we curated a set of tasks designed to challenge the models’ ability to integrate and process multimodal inputs (primarily images) across real-world scenarios. The corresponding tasks are listed in Tables~\ref{tab:ea1_tasks}, \ref{tab:ea2_tasks}, \ref{tab:ea3_tasks}, and \ref{tab:ea4_tasks}.
    
    \item \textbf{Stage 2 – Review and Selection of MLLMs:} A diverse set of 13 open-source MLLMs was chosen based on architecture, parameter size, and availability (as discussed in Section~\ref{modelselection}). These models showcase the breadth of current MLLM capabilities by combining different image and text models, each built upon a distinct text encoder.
    
    \item \textbf{Stage 3 – Selection of Prompt Engineering Methods:} We apply seven prompting methods: Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought prompting. Section~\ref{PromptEnggTechs} discusses the selected methods in more detail.
    
    \item \textbf{Stage 4 – Evaluation Framework:} Our experimental setup was designed to ensure consistency and reliability across multiple tasks and prompting techniques. Model outputs were evaluated along two primary dimensions. The first focuses on task performance, using four key metrics: accuracy, relevancy, conciseness, and hallucination (detailed in Section~\ref{manualevaluationcriteria}). The second assesses resource consumption, including inference time and memory consumption. We devised a comprehensive manual review process to analyse MLLM outputs against these metrics.
\end{itemize}
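
To make the scale of this design concrete, the Python sketch below enumerates the evaluation grid implied by the four stages: seven prompting methods, 13 models, and 24 tasks (4, 4, 8, and 8 tasks for EA1 to EA4, respectively). The size thresholds follow the Small, Medium, and Large split used in this paper. The function name \texttt{size\_category}, the placeholder model identifiers, the treatment of the 4B and 10B boundaries, and the assumption that every model is run on every task under every method are illustrative assumptions rather than a description of the evaluation harness.

```python
from itertools import product

PROMPT_METHODS = [
    "Zero-Shot", "One-Shot", "Few-Shot", "Chain-of-Thought",
    "Analogical", "Generated Knowledge", "Tree-of-Thought",
]

# Number of tasks per Evaluation Aspect, as listed in the task tables.
TASKS_PER_EA = {"EA1": 4, "EA2": 4, "EA3": 8, "EA4": 8}

def size_category(params_billion):
    """Bucket a model by parameter count: Small (<4B), Medium (4B-10B), Large (>10B).

    Treating exactly 4B and exactly 10B as Medium is an assumption of this sketch.
    """
    if params_billion < 4:
        return "Small"
    if params_billion <= 10:
        return "Medium"
    return "Large"

# Placeholder identifiers standing in for the 13 evaluated models.
models = [f"model_{i:02d}" for i in range(1, 14)]

# Task identifiers follow the EA<n>_T<m> naming used in the task tables.
tasks = [f"{ea}_T{t}" for ea, n in TASKS_PER_EA.items() for t in range(1, n + 1)]

# Assumes a full factorial design: every model on every task with every method.
runs = list(product(models, tasks, PROMPT_METHODS))

print(len(models), len(tasks), len(PROMPT_METHODS), len(runs))
print(size_category(2.0), size_category(7.0), size_category(34.0))
```

Under the full factorial assumption, the grid contains $13 \times 24 \times 7 = 2184$ model, task, and method combinations, each of which yields an output to be reviewed.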


\subsection{Stage 1: Defining Core Evaluation Aspects} \label{modelEAs}

Evaluating MLLMs is crucial to understanding their capabilities, limitations, and applicability across diverse domains. Prior studies \citep{li2024llava, wu2025controlmllm, yu2023mmvet} have explored various evaluation aspects, from perceptual understanding and compositional reasoning to multimodal alignment and task-specific problem solving. Unlike traditional unimodal models, MLLMs require frameworks that assess their ability to process, integrate, and generate outputs from multiple modalities (e.g., text, images, and in some cases audio and video). Reviews such as \citet{nwae403} typically categorize evaluations into general multimodal understanding, task-specific assessments, and trustworthiness metrics, ensuring that models transition from broad reasoning to reliable, specialized real-world applications. Motivated by these insights and based on our review of current applications and evaluations in the MLLM landscape (see Appendix~\ref{rationale-for-selected-EAs}), we select four core Evaluation Aspects (EAs) for their broad impact and practical relevance:

\begin{enumerate}
    \item \textbf{Reasoning and Compositionality (EA1):} This aspect centers on the model’s ability to process textual and visual cues, perform logical deductions, and synthesize disparate information into coherent outputs. It tests capabilities such as Multi-Step Problem Solving, Pattern Recognition, Logical Deduction, and Compositional Synthesis (see Table~\ref{tab:ea1_tasks}).
    \item \textbf{Multimodal Understanding and Alignment (EA2):} This aspect evaluates how well the model aligns, integrates, and interprets information across modalities. It is critical for tasks that demand accurate cross-referencing between text and visual data. The key capabilities to evaluate include cross-modal referencing and alignment, image interpretation and description, and consistency in visual–textual context (see Table~\ref{tab:ea2_tasks}).
    \item \textbf{Complex Code Generation and Execution (EA3):} This aspect evaluates the model’s ability to interpret instructions, extract data from visual inputs, and generate executable code. It is essential for programming tasks where code must be both syntactically correct and logically coherent. The key evaluation capabilities include data extraction from visual input, programmatic transformation and logic construction, and accurate code synthesis, debugging, and explanation (see Table~\ref{tab:ea3_tasks}).
    \item \textbf{Knowledge Retrieval and Integration (EA4):} Focuses on the model’s ability to recall, verify, and merge factual information from textual and visual sources into coherent responses. This is critical for domains such as research, journalism, medicine, and data science. The key capabilities to evaluate include factual recall and verification, domain-specific context and explanation, and cross-modal knowledge synthesis (see Table~\ref{tab:ea4_tasks}).
\end{enumerate}


For each EA, we curated tasks to challenge models in realistic, application-oriented scenarios. Task objectives and key challenges are summarized in the following tables, and design rationales are elaborated in Appendices~\ref{apdx_ea1}, \ref{apdx_ea2}, \ref{apdx_ea3}, and \ref{apdx_ea4}. Full task descriptions and expected outputs are available in the supplementary material, \ref{supp_EA1} to \ref{supp_EA4}.



\begin{table}[ht]
\caption{Overview of Evaluation Aspect 1 (EA1): Reasoning and Compositionality, comprising four tasks (T1 to T4) focused on visual pattern recognition, logical deduction, mathematical reasoning, and narrative synthesis.}
\label{tab:ea1_tasks}
\begin{minipage}{\textwidth}
\renewcommand{\arraystretch}{1.0}
\setlength{\tabcolsep}{8pt}
\begin{tabularx}{\textwidth}{p{3.0cm} X X}
\toprule
\textbf{Task(s)} & \textbf{Objective(s)} & \textbf{Key Challenges}\\
\midrule
\textbf{EA1\_T1:}\\
Pattern Recognition in Visual Sequences 
&
Test the model’s ability to detect and generalise patterns in a sequence of related images or diagrams.
&
Identifying logical/visual patterns and extrapolating rules from limited data.
\\
\midrule
\textbf{EA1\_T2:}\\
Logical Deduction from Text and Simplified Diagram
&
Evaluate the model’s capacity to interpret textual instructions alongside a simple diagram to reach a correct conclusion.
&
Integrating textual clues with diagrammatic cues and ensuring consistency in multi-step reasoning.
\\
\midrule
\textbf{EA1\_T3:}\\
Mathematical Puzzle with Visual Data
&
Assess how the model handles numeric computations and interprets simple visual representations (e.g., shapes or charts).
&
Bridging quantitative reasoning with visual elements and avoiding arithmetic errors.
\\
\midrule
\textbf{EA1\_T4:}\\
Story Synthesis from Text and Image
&
Check the model’s ability to create a coherent narrative by merging textual descriptions and a relevant image.
&
Maintaining logical flow in narrative form and blending visual context with text.
\\
\bottomrule
\end{tabularx}
\end{minipage}
\end{table}



\begin{table}[ht]
\caption{Overview of Evaluation Aspect 2 (EA2): Multimodal Understanding and Alignment, including four tasks (T1 to T4) that test the integration and interpretation of information across text, images, and charts.}
\label{tab:ea2_tasks}
\begin{minipage}{\textwidth}
\renewcommand{\arraystretch}{1.0}
\setlength{\tabcolsep}{8pt}
\begin{tabularx}{\textwidth}{p{3.0cm} X X}
\toprule
\textbf{Task(s)} & \textbf{Objective(s)} & \textbf{Key Challenges}\\
\midrule
\textbf{EA2\_T1:}\\
Image-Text Matching and Explanation 
&
Verify the model’s capacity to match an image with a corresponding text description and explain the match.
&
Correctly identifying key features and ensuring textual alignment with visual elements.
\\
\midrule
\textbf{EA2\_T2:}\\
Inferring Context from Combined Modalities
&
Determine how the model integrates separate text and image inputs to deduce higher-level context.
&
Seamlessly fusing diverse information sources and handling ambiguous or incomplete data.
\\
\midrule
\textbf{EA2\_T3:}\\
Cross-Modal Translation
&
Evaluate the model’s ability to translate visual information (e.g., symbols or icons) into meaningful text.
&
Accurately handling symbolic representations and capturing fine-grained details.
\\
\midrule
\textbf{EA2\_T4:}\\
Aligning Data from Charts and Text
&
Assess how well the model interprets and aligns quantitative data from a chart with textual analysis.
&
Avoiding misinterpretation of graphical data and integrating numerical details with text.
\\
\bottomrule
\end{tabularx}
\end{minipage}
\end{table}



\begin{table}[ht]
\caption{Overview of Evaluation Aspect 3 (EA3): Complex Code Generation and Execution, comprising eight tasks (T1 to T8) that evaluate a model’s ability to generate executable code from visual and textual inputs across a range of structured reasoning and programming challenges.}
\label{tab:ea3_tasks}
\begin{minipage}{\textwidth}
\renewcommand{\arraystretch}{1.0}
\setlength{\tabcolsep}{8pt}
\begin{tabularx}{\textwidth}{p{3.0cm} X X}
\toprule
\textbf{Task(s)} & \textbf{Objective(s)} & \textbf{Key Challenges}\\
\midrule
\textbf{EA3\_T1:}\\
Data Visualization from an Image of a Table
&
Generate a script that converts table data in an image into a visualization (e.g., bar chart).
&
Handling OCR-like interpretation, mapping image data to structured format, and producing valid code.
\\
\midrule
\textbf{EA3\_T2:}\\
Drawing a Shape Based on an Image
&
Produce code that programmatically draws a shape using visual hints from an image.
&
Translating visual references into geometric coordinates and ensuring syntax correctness.
\\
\midrule
\textbf{EA3\_T3:}\\
Calculating a Sum from Text in an Image
&
Write a function to parse textual or numeric data from an image and compute a sum.
&
Accurate text extraction, handling parsing errors, and verifying arithmetic accuracy.
\\
\midrule
\textbf{EA3\_T4:}\\
Creating a Dictionary from an Image of a Chart
&
Convert labels and values in a chart image into a dictionary or key-value structure.
&
Extracting structured data, ensuring correct type conversion, and code clarity.
\\
\midrule
\textbf{EA3\_T5:}\\
Summing Prices from a Shopping List Image
&
Generate code to read item prices from an image of a shopping list and calculate the total.
&
Handling varied text formats, summing accurately, and managing currency symbols or decimals.
\\
\midrule
\textbf{EA3\_T6:}\\
Parsing a Simple CSV Structure from an Image
&
Build a script that interprets an image containing CSV-like text and converts it into a data table.
&
Accurate extraction of rows/columns and handling formatting inconsistencies.
\\
\midrule
\textbf{EA3\_T7:}\\
Generating Fibonacci Sequence Based on Image Instruction
&
Produce code to generate a Fibonacci sequence following instructions specified in an image.
&
Interpreting visual instructions accurately and ensuring logical code correctness.
\\
\midrule
\textbf{EA3\_T8:}\\
Responding to a Flowchart Image
&
Interpret a flowchart diagram and output code or logic to implement the described process.
&
Translating flowchart nodes into algorithmic steps and ensuring overall coherence.
\\
\bottomrule
\end{tabularx}
\end{minipage}
\end{table}



\begin{table}[ht]
\caption{Overview of Evaluation Aspect 4 (EA4): Knowledge Retrieval and Integration, comprising eight tasks (T1 to T8) that assess how effectively a model combines visual cues and textual context to retrieve, interpret, and explain domain-specific knowledge.}
\label{tab:ea4_tasks}
\begin{minipage}{\textwidth}
\renewcommand{\arraystretch}{1.0}
\setlength{\tabcolsep}{8pt}
\begin{tabularx}{\textwidth}{p{3.0cm} X X}
\toprule
\textbf{Task(s)} & \textbf{Objective(s)} & \textbf{Key Challenges}\\
\midrule
\textbf{EA4\_T1:}\\
Historical Monument Identification and Explanation
&
Identify a famous monument from an image and provide its historical context.
&
Handling historical facts, distinguishing similar monuments, and accurately explaining cultural significance.
\\
\midrule
\textbf{EA4\_T2:}\\
Scientific Data Interpretation from Graph and Text
&
Integrate textual and graphical data to answer a scientific question and summarize key findings.
&
Extracting relevant data from graphs, merging with textual context, and ensuring scientific accuracy.
\\
\midrule
\textbf{EA4\_T3:}\\
Medical Image Analysis with Knowledge Integration
&
Provide a brief medical interpretation from an image (e.g., an X-ray) along with a textual description of symptoms.
&
Understanding medical terminology, ensuring factual accuracy, and integrating visual and textual cues.
\\
\midrule
\textbf{EA4\_T4:}\\
Cultural Artifact Interpretation
&
Identify an artifact from an image and explain its cultural/historical background using textual clues.
&
Combining historical and cultural knowledge accurately and presenting a coherent explanation.
\\
\midrule
\textbf{EA4\_T5:}\\
Integrating Knowledge from a Map and Text Description
&
Combine visual map data with textual instructions to address location-based or geographical queries.
&
Accurately interpreting map symbols and reconciling textual directions with visual references.
\\
\midrule
\textbf{EA4\_T6:}\\
Integrating Information from a Chart and Article
&
Merge insights from a chart (e.g., population growth) with an accompanying article to produce a synthesized summary.
&
Ensuring correct numerical interpretation, linking data points to textual arguments, and forming a comprehensive summary.
\\
\midrule
\textbf{EA4\_T7:}\\
Multimodal Fact Checking
&
Verify the factual accuracy of a statement by cross-referencing an image (e.g., a photograph) with textual sources.
&
Cross-validating visual evidence with textual claims and identifying potential inconsistencies.
\\
\midrule
\textbf{EA4\_T8:}\\
Integrating Visual Art and Historical Context
&
Explain an artwork shown in an image, including its historical context and cultural details.
&
Recognizing artistic styles, contextualizing the piece historically, and referencing relevant artistic movements.
\\
\bottomrule
\end{tabularx}
\end{minipage}
\end{table}

\subsection{Stage 2: Review and Selection of MLLMs} \label{modelselection}

This stage consists of two parts. First, we review key models in the current MLLM landscape, encompassing both proprietary and open-source developments. Second, we detail the selection criteria and present the 13 open-source MLLMs chosen for our evaluation.

Proprietary models such as OpenAI’s GPT-4o, GPT-4.5 Preview \citep{OpenAI2025Models}, Anthropic’s Claude 3 \citep{anthropic2024claude}, and Google’s Gemini series \citep{DeepMind2025Gemini} demonstrate strong multimodal reasoning, particularly in instruction-following and contextual understanding. For instance, GPT-4 introduced multimodal inputs to process both text and images, while Claude 3 (including its variants Claude 3 Opus, Sonnet, and Haiku) has been optimized for text–image reasoning tasks. Despite their robust performance, the closed-source nature of these models limits customization, transparency, and independent research. In contrast, the open-source ecosystem has rapidly evolved, providing competitive alternatives with enhanced accessibility and community-driven improvements. Models such as Meta’s LLaMA series \citep{meta2024llama3}, Mistral \citep{mistral2024pixtral}, Falcon \citep{malartic2024falcon2}, and DeepSeek \citep{deepseek2025deepseek} offer greater flexibility in fine-tuning and deployment. Nonetheless, open-source MLLMs still face challenges in matching proprietary models’ instruction-following, contextual learning, and multimodal alignment. A comprehensive discussion of the various MLLMs, including their architectural designs, multimodal processing capabilities, and evaluation suitability, is provided in Appendix~\ref{apdx_review_mllm}.

After conducting an in-depth review of various models, several factors influenced the exclusion of specific models from our selection:
\begin{itemize}
    \item Modality Focus – Models explicitly designed for video, audio, or long-sequence temporal reasoning (e.g., VITA \citep{fu2024vita}, Long-VITA \citep{shen2025long}, mPLUG-Owl3 \citep{ye2024mplug}) were not included, as our evaluation focuses on text-image multimodal reasoning rather than temporal or multi-frame processing.
    \item Task-Specific Specialization – Some models are highly optimised for specific domains rather than general-purpose multimodal reasoning. For example, MoAI \citep{lee2024moai} is designed for OCR-centric tasks, ChatRex \citep{jiang2024chatrex} focuses on object detection, and ViP-LLaVA \citep{cai2024vip} supports region-aware multimodal interaction. As a result, these models are less suitable for a structured comparative analysis.
    \item Reliance on External Modules – Models like Molmo \citep{deitke2024molmo} and Cambrian-1 \citep{tong2025cambrian}, which incorporate external retrieval mechanisms or specialized visual processing modules, introduce dependencies that complicate direct performance comparisons across standardized benchmarks.
    \item Performance-Compute Tradeoff – While Falcon2-11B \citep{malartic2024falcon2} and MiniCPM \citep{yao2024minicpm} are highly efficient, they are optimized for lightweight multimodal interactions rather than advanced vision-language compositionality, making them less aligned with our focus on deeper reasoning and complex multimodal integration.
    \item Maturity and Benchmarking Limitations – Some recently released models, such as Meteor \citep{lee2025meteor} and Cambrian-1 \citep{tong2025cambrian}, lack comprehensive benchmarking on widely used multimodal datasets, making it difficult to systematically compare their performance against well-established counterparts.
\end{itemize}

Taking these factors into account, we structured our model selection process to ensure a balanced, computationally feasible, and diverse evaluation of open-source MLLMs. 

We selected 13 open-source MLLMs for detailed evaluation, focusing on models that offer a balanced combination of scalability, architectural diversity, and multimodal reasoning capabilities. Our selection criteria include:
\begin{itemize}
    \item Parameter Scale and Architectural Diversity – Models are categorized as Small ($<$4B), Medium (4B–10B), and Large ($>$10B) to capture variations in computational demands and performance.
    \item Multimodal Integration – Preference was given to models integrating different vision encoders (e.g., ViT, SigLIP, CLIP, EVA) with various language models (e.g., Qwen, Gemma, Llama, Phi), enabling an exploration of how vision–language pairings influence task performance.
    \item Open-Source Availability – Only models with fully open-source code and weights were considered to ensure reproducibility and community accessibility.
    \item Practical Feasibility – Due to GPU constraints, models exceeding 15B parameters were generally excluded from direct evaluation. However, we incorporated one model above 15B parameters using quantization techniques to examine the trade-offs between model size and inference efficiency.
\end{itemize}

Four models were selected for each of the small and medium categories, and five for the large category owing to the inclusion of a quantized model. This strategy ensures a fair comparison across small, medium, and large models while addressing key questions related to parameter scaling, vision–language integration, and computational efficiency.
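
The size categories above amount to a simple partition of the parameter count. As a sketch, write $p$ for the parameter count in billions and assume the 4B--10B range is inclusive at both ends; the symbol $p$ and the function name $\mathrm{cat}$ are introduced here only for illustration:
\[
\mathrm{cat}(p) =
\left\{
\begin{array}{ll}
\mbox{Small}  & \mbox{if } p < 4, \\
\mbox{Medium} & \mbox{if } 4 \leq p \leq 10, \\
\mbox{Large}  & \mbox{if } p > 10.
\end{array}
\right.
\]
Under this reading, Ovis-1.6 at 10.2B falls in the Large group, which gives the four, four, and five models per category listed in Table~\ref{tab:model_comparison}.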

Table~\ref{tab:model_comparison} summarizes the final list of models along with key specifications such as parameter count, release date, and the underlying language and vision models.



\begin{table}[ht]
\caption{List of Models Finalized for Evaluation. This table includes the model name, parameter size, release date, and details of the language and vision model combination.}
\label{tab:model_comparison}
\centering
\resizebox{\textwidth}{!}{%
\begin{tabularx}{\textwidth}{
  >{\raggedright\arraybackslash}m{0.22\textwidth}  % Model
  >{\centering\arraybackslash}m{0.14\textwidth}    % Params (B)
  >{\centering\arraybackslash}m{0.14\textwidth}    % Release Date
  >{\raggedright\arraybackslash}m{0.23\textwidth}  % Language Model
  >{\raggedright\arraybackslash}m{0.23\textwidth}  % Vision Model
}
\toprule
\textbf{Model} & 
\textbf{Params (B)} & 
\textbf{Release Date} & 
\textbf{Language Model} & 
\textbf{Vision Model} \\
\midrule
InternVL-2 & 1 & 8-Jul-24 & Qwen2-0.5B & InternViT-300M \\
\midrule
Qwen2-VL & 2 & 29-Aug-24 & Qwen2-1.5B & ViT-600M \\
\midrule
MiniMonkey & 2.2 & 9-Aug-24 & InternLM2-1.8B & InternViT-300M \\
\midrule
Paligemma-3B-mix-448 & 3 & 14-May-24 & Gemma-2B & SigLIP-400M \\
\midrule
Phi-3.5 VLM & 4 & 21-May-24 & Phi-3.5 & CLIP ViT-L/14 \\
\midrule
LLaVA OneVision-7B & 8 & 14-Sep-24 & Qwen2-7B & SigLIP-400M \\
\midrule
Ovis 1.5-Llama 3-8B & 8 & 17-Jun-24 & Llama-3-8B-Instruct & SigLIP-400M \\
\midrule
GLM-4v-9B & 9 & 30-Jul-24 & GLM-4-9B & EVA-02-5B \\
\midrule
Ovis-1.6 & 10.2 & 17-Jun-24 & Gemma2-9B-It & SigLIP-400M \\
\midrule
Llama3.2-Vision & 11 & 25-Sep-24 & Llama 3.1 & ViT \\
\midrule
Pixtral & 12 & 17-Sep-24 & Nemo-12B & ViT-400M \\
\midrule
OmChat V2 & 13 & 6-Jul-24 & Qwen2-7B & InternViT-6B \\
\midrule
InternVL-2 & 26 & 8-Jul-24 & InternLM2-20B & InternViT-6B \\
\bottomrule
\end{tabularx}%
}
\end{table}





\subsection{Stage 3: Selection of Prompt Engineering Methods} 
\label{PromptEnggTechs}

Understanding multimodal content requires deep multimodal knowledge: models must not only grasp information within each modality but also accurately infer how these modalities interact to support effective reasoning \citep{yang2023mm}. Prompt engineering has emerged as a straightforward and efficient method for guiding LLMs and MLLMs, enabling enhanced performance in complex reasoning tasks. This approach generally involves two types of prompts: instruction-based and example-based \citep{bhattacharjya2024foundation}. Instruction-based prompts include system-level prompts that establish overarching guidelines and task-specific prompts tailored to particular objectives, while example-based prompts rely on a few illustrative examples to define desired input-output relationships.

Numerous studies \citep{jiang2022promptmaker, zamfirescu2023johnny} have identified two key challenges in prompting: crafting effective prompts and evaluating their efficacy. In particular, example-based approaches have proven effective in guiding large models \citep{mann2020language, wei2022chain, yao2024tree}. Although many interactive systems support prompt engineering, most focus predominantly on textual or limited visual inputs. This narrow focus overlooks the intricate interactions between modalities, thereby limiting the development of prompts that fully leverage the contextual richness of multimodal inputs to enhance reasoning \citep{zamfirescu2023johnny}.

Models can quickly adapt to new downstream tasks in few-shot or even zero-shot settings without requiring retraining \citep{liu2023pre}. This is exemplified by the pioneering Chain-of-Thought (CoT) prompting technique, which prompts LLMs to generate intermediate reasoning steps, mirroring human cognitive processes \citep{wei2022chain, kojima2022large, zhang2022automatic}; the concept has since been extended to the multimodal domain (M-CoT) in several studies \citep{rose2023visual, zhang2023multimodal, ge2023chain}. Analogical Reasoning Prompting leverages shared structural similarities between scenarios to unlock a model’s analogical reasoning abilities \citep{yasunaga2023large}, while Generated Knowledge Prompting encourages models to generate additional background knowledge to enhance reasoning \citep{liu2021generated, liu2023pre}. Tree-of-Thought (ToT) prompting further extends CoT by structuring reasoning into a decision tree that explores multiple pathways before converging on a solution \citep{yao2024tree}.

To effectively guide MLLMs across diverse tasks, we adopt and implement seven distinct prompting techniques:

\begin{enumerate}
    \item Zero-Shot Prompting \citep{radford2019language}
    \item One-Shot Prompting \citep{mann2020language}
    \item Few-Shot Prompting \citep{mann2020language}
    \item Chain-of-Thought (CoT) Prompting \citep{wei2022chain}
    \item Analogical Prompting \citep{yasunaga2023large}
    \item Generated Knowledge Prompting \citep{liu2021generated}
    \item Tree-of-Thought (ToT) Prompting \citep{yao2024tree}
\end{enumerate}

For detailed reviews, prompt templates, and usage scenarios for each prompting technique, please refer to Appendix~\ref{apdx_prompt_methods}. By systematically designing and refining these prompts, our approach aims to generate consistent, rationale-driven outputs across a wide range of multimodal tasks.

\subsection{Stage 4: Evaluation Framework} \label{manualevaluationcriteria}

Our evaluation framework is designed to rigorously assess model outputs across multiple tasks and prompting techniques. All experiments were performed on high-performance Nvidia GPUs (see Appendix~\ref{experimental_setup}). To ensure consistency, inference parameters, including temperature, maximum token length, and decoding strategies, were held constant across all models.
We assessed model outputs along two primary dimensions: task performance and resource consumption. Task performance metrics include accuracy, relevancy, conciseness, and hallucination. Accuracy captures whether the response correctly addresses all components of the task. Relevancy assesses how well the response aligns with the task’s context and objectives. Conciseness evaluates the clarity and brevity of the response. Hallucination measures the extent to which responses include irrelevant, redundant, or fabricated content.
Resource consumption metrics include inference time and memory usage, recorded to evaluate model efficiency.
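
One way to make the task-performance metrics concrete is to treat each reported percentage as the share of evaluated responses that received the corresponding label. This reading, and the notation below, are illustrative assumptions rather than definitions taken from the scoring rubric. For a set of $N$ responses $r_1, \dots, r_N$ produced under one prompting technique, the rate (in percent) for a label $\ell$ is
\[
\mathrm{Rate}_{\ell} = 100 \times \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\left[\, r_i \mbox{ is labelled } \ell \,\right],
\]
where $\ell$ ranges over labels such as \emph{accurate}, \emph{hallucinated}, or \emph{relevant}, and $\mathbf{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise.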

To conduct a structured evaluation, two expert annotators independently assessed the model outputs for each task. For each of the four dimensions (Accuracy, Hallucination, Relevance, and Conciseness), we used well-defined scoring rubrics (see Table~\ref{tab:detailed_evalcriteria} in Appendix E). Accuracy was assessed against expected ground truth answers. For the other three aspects, annotators applied the scoring definitions to assign a rating per response. To reduce bias and improve consistency, each annotator reviewed a disjoint set of tasks and subsequently cross-validated a subset of each other’s evaluations. This peer-review process ensured agreement on borderline cases and enabled calibration between annotators.

Table~\ref{tab:evaluation_thresholds} summarizes the empirical thresholds established for these metrics. Table~\ref{tab:detailed_evalcriteria} supplements these thresholds with detailed scoring criteria to ensure consistent assessment of model performance.

\begin{table}[ht]
    \caption{Empirical thresholds for evaluation metrics based on industry benchmarks and prior research \citep{DeepMind2025Gemini, jiangmmad, huang2023language, zhang2024mme, adler2024gpt, meta2024llama32}. The Accuracy metric determines if all task elements are correctly addressed, while Relevancy ensures that responses remain contextually aligned. Conciseness evaluates clarity and brevity, and Hallucination flags irrelevant or repetitive content.}
    \label{tab:evaluation_thresholds}
    \centering
    \renewcommand{\arraystretch}{1.0}
    \setlength{\tabcolsep}{12pt}
    \begin{tabular}{|l|c|}
        \hline
        \textbf{Metric} & \textbf{Threshold} \\
        \hline
        Accuracy       & $\geq 80\%$ \\
        Hallucination  & $< 5\%$ \\
        Relevancy      & $\geq 90\%$ \\
        Conciseness    & $\geq 80\%$ direct, $< 10\%$ under-explained \\
        \hline
    \end{tabular}
\end{table}



% These criteria are tailored for specific evaluation aspects (EAs). For example:
% \begin{itemize}
%     \item \textbf{Reasoning:} Focuses on the logical flow and completeness of multi-step reasoning.
%     \item \textbf{Knowledge Retrieval:} Emphasizes factual accuracy and completeness.
%     \item \textbf{Model Understanding:} Assesses consistency and clarity of explanations.
%     \item \textbf{Code Generation:} Evaluates syntactic correctness, logical execution, and clarity of accompanying explanations.
% \end{itemize}

\subsubsection{Inter-Annotator Agreement and Evaluation Consistency}
% To ensure objectivity, two independent annotators evaluated model outputs using the criteria in Table~\ref{tab:evaluation_thresholds}. Annotator A assessed EA1 and EA4, while Annotator B evaluated EA2 and EA3. A cross-review process was implemented in which each annotator reviewed the other's scores to identify discrepancies, which were resolved via consensus. A structured guideline document was used to standardize interpretation of each metric. (this paragraph as in the version 1 of TMLR)

%below is the new and more descriptive version according to the reviewer 9DC2 https://openreview.net/forum?id=B1L8HrjoA1&noteId=v8v3KXqtja) 

Since manual evaluation has been adopted to assess model responses across four dimensions (accuracy, hallucination, relevancy, and conciseness), it is essential to ensure consistency and reliability in the evaluation process. To achieve this, we implemented a structured cross-review methodology, ensuring that the scoring process remained objective and reproducible.

% Annotation Process and Cross-Review Strategy
The evaluation was conducted by two independent annotators, each responsible for assessing specific Evaluation Aspects (EAs). To maintain consistency and minimize subjectivity, the following structured approach was adopted. 

% Annotator Assignment  Initial Evaluation
\textbf{Annotator A} evaluated EA1 and EA4, while \textbf{Annotator B} evaluated EA2 and EA3.
Each annotator assessed the model outputs independently, applying the predefined evaluation criteria outlined in Table~\ref{tab:detailed_evalcriteria}.

% Cross-Review for Consistency
To validate scoring consistency, the annotators cross-reviewed each other’s evaluations: Annotator B reviewed EA1 and EA4 (originally assessed by Annotator A), and Annotator A reviewed EA2 and EA3 (originally assessed by Annotator B). The cross-review focused on identifying discrepancies in accuracy, relevancy, conciseness, and hallucination assessments, ensuring that both annotators followed the evaluation framework consistently.

% Resolution of Discrepancies
If any discrepancies were identified during the cross-review, they were flagged for discussion between the two annotators. Consensus resolution was prioritized, where both annotators examined the specific response, revisited the criteria, and agreed on a final label. If necessary, criteria definitions were refined to enhance clarity for future evaluations.

% Ensuring Consistency in Evaluation Metrics
A structured guideline document was used by both annotators to ensure uniformity in interpretation. Each evaluation metric (Accuracy, Relevancy, Conciseness, and Hallucination) was defined with clear decision rules to ensure judgments were consistent across different EAs. By implementing this cross-review and consensus resolution process, we ensured a high level of reliability and reproducibility in our evaluation. This methodology was applied across all EAs, together with any refinements arising from annotator discussions.
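
The cross-review above is described qualitatively, and no agreement statistic is reported. If a numerical summary were wanted, one standard option is Cohen's $\kappa$; it is given here only as a sketch of how agreement could be quantified, not as part of the procedure used in this study. Let $p_o$ be the observed proportion of cross-reviewed responses on which both annotators assign the same label, and let $p_e$ be the proportion of agreement expected by chance from each annotator's label frequencies. Then
\[
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_e = \sum_{k} p_{A,k}\, p_{B,k},
\]
where $p_{A,k}$ and $p_{B,k}$ are the fractions of responses that Annotators A and B assign to label $k$. A value of $\kappa = 1$ indicates perfect agreement, and $\kappa = 0$ indicates agreement no better than chance.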


\subsubsection{Threshold Selection}
We employ a range of metrics to evaluate the performance of AI models, each of which must be quantified to ensure objective assessment. The primary objective of this quantification process is to evaluate various prompt engineering techniques across different models for specific tasks. Our goal is to identify an optimal combination where a given prompt enables the model to generate accurate results with minimal hallucination, high relevance, and concise yet effective explanations. To achieve this, we have established thresholds for each metric, ensuring a structured and rigorous evaluation framework.

The accuracy threshold for the models has been set at 80\% or higher, as maintaining a high accuracy level is crucial given their training on extensive datasets. While the standard benchmark for accuracy is typically 75\%, we have opted for 80\% to ensure improved reliability and effectiveness beyond this baseline. Many industry benchmarks and model evaluation studies support this decision, suggesting that maintaining an accuracy above 80\% is desirable for real-world applications \citep{jiangmmad}. For instance, multimodal models evaluated in benchmark tests like MMLU and Visual Question Answering (VQA) often struggle to exceed 75\%, highlighting the challenge of ensuring consistency across diverse tasks. However, in high-stakes domains such as autonomous systems, healthcare AI, and industrial automation, achieving at least 80\% accuracy is often necessary to minimize errors and improve decision-making reliability \citep{huang2023language, zhang2024mme}. Additionally, reports from frontier AI developers indicate that state-of-the-art models consistently target higher accuracy rates to enhance generalization across multimodal inputs \citep{adler2024gpt, DeepMind2025Gemini, meta2024llama32}. Therefore, while 75\% serves as a common baseline in general AI benchmarks, raising the threshold to 80\% reinforces a commitment to higher performance standards and better real-world applicability.

While LLMs and MLLMs have emerged as powerful tools capable of tackling complex problems through reasoning, explanation, summarization, interpretation, and retrieval, they are also prone to generating false or irrelevant information, a phenomenon commonly referred to as hallucination. This study explores the different types of hallucinations that arise from various prompting techniques, which are ideally used to mitigate hallucinations in these advanced models. However, as discussed in depth in prior work \citep{chakraborty2025hallucination}, even state-of-the-art prompting mechanisms cannot completely eliminate hallucinations. Given the critical role hallucination plays in model reliability, we have specifically evaluated hallucination as a key component of our study. Recent research confirms that leading LLMs exhibit hallucination rates within a measurable range. For instance, Google Gemini-2.0-Flash-001 has a hallucination rate of 0.7\%, OpenAI GPT-4.5 Preview hallucinates 1.2\% of the time, and Claude-3.5-Sonnet exhibits a 4.6\% hallucination rate, as reported in the Vectara Hallucination Leaderboard \citep{vectara_hallucination_leaderboard} and Galileo’s Hallucination Index \citep{galileo_hallucination_index}. These results align with established research on hallucination detection, including SummaC \citep{laban2022summac} and TRUE \citep{honovich2022true}, both of which emphasize the importance of strict factual adherence in AI-generated summaries. Based on these evaluations, a hallucination threshold of $<$ 5\% is a reasonable and empirically supported benchmark for ensuring high factual accuracy in AI systems, particularly in domains requiring trustworthy and verifiable outputs. Since our evaluation focuses on complex tasks requiring reasoning and accurate outputs, we expect minimal hallucination.
Models whose hallucination rate exceeds this threshold exhibit significant degradation in factual grounding, which is especially problematic in long-context reasoning, where misinformation could have severe consequences.

Regarding relevancy, a minimum threshold of 90\% has been set for fully relevant outputs, so that the model consistently delivers responses aligned with the task. In addition, the model is expected not to generate irrelevant results under any circumstances.

In terms of conciseness, the model is expected to provide explanations that align with the complexity of the tasks. Under-explained responses should be kept below 10\%, as these tasks require the model to justify its conclusions effectively. For to-the-point explanations, at least 80\% of the model's responses should be direct and concise. However, some room is allowed for over-explained responses, particularly in cases where prompting techniques, such as analogical prompting, necessitate additional elaboration. Manual analysis has shown that in such cases, generating more analogies can be beneficial to comprehension.
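
The two conciseness thresholds also bound the share of over-explained responses implicitly. The following sketch assumes that every response receives exactly one of the three conciseness labels, so that the under-explained (UE), to-the-point (TP), and over-explained (OE) percentages sum to 100\%, and it reads the 80\% threshold as applying to TP alone:
\[
\mathrm{OE} = 100\% - \mathrm{TP} - \mathrm{UE} \leq 100\% - 80\% - 0\% = 20\%.
\]
Under these assumptions, a model that meets the to-the-point threshold can over-explain in at most 20\% of its responses, which is the room for elaboration referred to above.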

This structured approach ensures that the model performs optimally across multiple dimensions, balancing accuracy, hallucination control, relevancy, and conciseness to deliver high-quality, reliable, and meaningful results.
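
Taken together, the thresholds in Table~\ref{tab:evaluation_thresholds} can be read as a single pass condition for a given model and prompting technique. The symbols below are shorthand introduced here for illustration, with every quantity expressed as a percentage: $\mathrm{Acc}$ for Accuracy, $\mathrm{Hal}$ for Hallucination, $\mathrm{Rel}$ for Relevancy, and $\mathrm{TP}$ and $\mathrm{UE}$ for the direct (to-the-point) and under-explained shares of Conciseness:
\[
\mathrm{Acc} \geq 80 \;\wedge\; \mathrm{Hal} < 5 \;\wedge\; \mathrm{Rel} \geq 90 \;\wedge\; \mathrm{TP} \geq 80 \;\wedge\; \mathrm{UE} < 10.
\]
A combination that fails any one conjunct falls short of the target profile, even if it exceeds the remaining thresholds by a wide margin.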


\section{Results}
\label{sec:results}

This section presents a comprehensive evaluation of model performance across multiple evaluation aspects (EAs) using a diverse set of tasks. The evaluation encompasses the seven prompting techniques introduced in Section~\ref{PromptEnggTechs} and key performance indicators such as Accuracy, Hallucination control, Response Relevance, Irrelevance, and Conciseness. For Conciseness, results are divided into Under-Explained (UE) and a combined measure of To the Point and Over-Explained (TP + OE).

Models are categorized based on their parameter count as follows:
\begin{itemize}
    \item Small MLLMs (\(<4\)B parameters),
    \item Medium MLLMs (4B--10B parameters),
    \item Large MLLMs (\(>10\)B parameters).
\end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Summary of Average Model Performance Across Evaluation Aspects}
\label{sec:summaryresults}

Figures~\ref{fig:EA1perfmetricfig}--\ref{fig:EA4perfmetricfig} report the average performance (in \%) for each model category (Small, Medium, and Large MLLMs) under the seven prompting techniques across the four evaluation aspects.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \subsubsection{EA1: Reasoning and Compositionality Tasks}
In EA1 (Reasoning and Compositionality) tasks, which assess a model’s ability to perform multi-step problem solving and integrate information across modalities, Fig.~\ref{fig:EA1perfmetricfig} highlights several key findings. Few-shot prompting yields the highest accuracy for large MLLMs, achieving 45\% and outperforming both Chain-of-Thought and Tree-of-Thought prompting strategies. In contrast, small MLLMs exhibit notably higher hallucination rates, reaching up to 75\% when using Tree-of-Thought, while medium and large models demonstrate substantially lower hallucination levels. Relevance scores remain consistently high for large MLLMs, at or above 90\%, whereas small models tend to lag in aligning their responses with task context. Finally, in terms of conciseness, small MLLMs are more likely to produce under-explained or verbose responses, in contrast to medium and large models, which maintain clearer and more concise outputs.

% \subsubsection{EA2: Multimodal Understanding and Alignment Tasks}
In EA2 (Multimodal Understanding and Alignment) tasks, which assess how well the model aligns, integrates, and interprets information across modalities, Fig.~\ref{fig:EA2perfmetricfig} summarizes the performance on four tasks across the three MLLM size categories. Large and medium MLLMs achieve near-perfect relevance, close to 100\%, when using Zero-Shot, One-Shot, and Few-Shot prompting techniques. In contrast, small MLLMs demonstrate higher hallucination rates, particularly when using Tree-of-Thought prompting. While accuracy across all model sizes remains moderate, small MLLMs tend to score lower on average. Conciseness metrics further reveal that small models often produce shorter, under-explained responses, whereas medium and large models provide more complete and detailed outputs.

% \subsubsection{EA3: Complex Code Generation and Execution Tasks}
For complex code generation and execution tasks under EA3, the results summarized in Fig.~\ref{fig:EA3perfmetricfig} indicate that large MLLMs achieve the highest accuracy, reaching up to 96.88\% with Few-Shot prompting. Hallucination levels are nearly zero in medium and large MLLMs across several prompting methods, whereas small MLLMs exhibit considerably higher hallucination. Relevance scores remain uniformly high, approaching 100\% for medium and large MLLMs. In terms of response quality, small MLLMs tend to produce under-explained outputs, while medium and large models generate more balanced and detailed code responses. Notably, EA3 displays a distinct conciseness pattern compared to other evaluation aspects, with large MLLMs showing extremely low UE scores (as low as 3.12\%) and very high TP+OE scores (often near or above 90\%). This reflects the nature of code generation tasks, where explicit reasoning and fully elaborated code listings are typically prioritised to ensure correctness and completeness.

% \subsubsection{EA4: Knowledge Retrieval and Integration Tasks}
In knowledge retrieval and integration tasks (EA4), Fig.~\ref{fig:EA4perfmetricfig} shows that large MLLMs achieve the highest accuracy, up to 87.5\%, along with near-perfect relevance (close to 100\%), particularly when using Zero-Shot prompting. In contrast, small MLLMs display higher hallucination rates, exceeding 40\% in some cases, while medium MLLMs fall between the two in terms of performance. Additionally, medium MLLMs are more likely to produce under-explained outputs compared to both small and large categories.

Notably, conciseness in EA3 requires a different interpretation than in other EAs. In EA1, EA2, and EA4, excessive explanation may reduce clarity, making concise, to-the-point answers preferable. In contrast, for EA3’s code generation tasks, detailed, step-by-step reasoning and fully commented code (reflected in higher TP+OE scores and lower UE) are advantageous because they improve correctness, reproducibility, and debugging. The observed trend in EA3, with large models producing more fully explained outputs and fewer under-explained responses, therefore reflects a task-appropriate conciseness preference, even though it diverges from the patterns seen in non-code EAs.
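
Because conciseness is reported as two complementary shares, either value determines the other. As a worked check, assuming UE and TP+OE together cover all evaluated responses:
\[
(\mathrm{TP} + \mathrm{OE}) = 100\% - \mathrm{UE}, \qquad \mbox{e.g.} \quad 100\% - 3.12\% = 96.88\%.
\]
The lowest UE score quoted for large MLLMs in EA3 (3.12\%) therefore corresponds to a TP+OE share of 96.88\%, in line with the high TP+OE values noted above.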

\begin{figure}[ht]
    \centering
    \includegraphics[width=1\textwidth, angle=0]{Figures_main/radarsmllm-EA1-annotated.png}
    \caption{EA1: Reasoning and Compositionality Tasks Results Summary. Radar plots (a–g) present the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models across prompting techniques: (a) ZS = Zero-Shot, (b) OS = One-Shot, (c) FS = Few-Shot, (d) CoT = Chain-of-Thought, (e) Anl = Analogical, (f) GK = Generated Knowledge, and (g) ToT = Tree-of-Thought. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, and Conciseness (Under-Explained – UE, To the Point – TP, and Over-Explained – OE). These plots illustrate how prompting strategies and model scale shape reasoning behaviour. See Section~\ref{sec:summaryresults} for detailed discussion and Table~\ref{tab:reasoning_results_summary} in the appendix for comprehensive statistics.}
    \label{fig:EA1perfmetricfig}
\end{figure}


\begin{figure}[ht]
    \centering
    \includegraphics[width=1\textwidth, angle=0]{Figures_main/radarsmllm-EA2-annotated.png}
    \caption{EA2: Multimodal Understanding and Alignment Tasks Results Summary. Radar plots (a–g) present the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models across prompting techniques: (a) ZS = Zero-Shot, (b) OS = One-Shot, (c) FS = Few-Shot, (d) CoT = Chain-of-Thought, (e) Anl = Analogical, (f) GK = Generated Knowledge, and (g) ToT = Tree-of-Thought. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, and Conciseness (Under-Explained – UE, To the Point – TP, and Over-Explained – OE). These plots illustrate how different prompting strategies and model scales influence multimodal understanding and alignment. See Section~\ref{sec:summaryresults} for detailed discussion and Table~\ref{tab:model_understanding_tasks_summary_results} in the appendix for comprehensive statistics.}
    \label{fig:EA2perfmetricfig}
\end{figure}


\begin{figure}[ht]
    \centering
    \includegraphics[width=1\textwidth, angle=0]{Figures_main/radarsmllm-EA3-annotated.png}
    \caption{EA3: Complex Code Generation and Execution Tasks Results Summary. Radar plots (a–g) show the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models across prompting techniques: (a) ZS = Zero-Shot, (b) OS = One-Shot, (c) FS = Few-Shot, (d) CoT = Chain-of-Thought, (e) Anl = Analogical, (f) GK = Generated Knowledge, and (g) ToT = Tree-of-Thought. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, and Conciseness (Under-Explained – UE, To the Point – TP, and Over-Explained – OE). These plots highlight how model scale and prompting strategies affect performance in code generation and execution tasks. See Section~\ref{sec:summaryresults} for detailed discussion and Table~\ref{tab:code_generation_tasks_summary_results} in the appendix for comprehensive statistics.}
    \label{fig:EA3perfmetricfig}
\end{figure}


\begin{figure}[ht]
    \centering
    \includegraphics[width=1\textwidth, angle=0]{Figures_main/radarsmllm-EA4-annotated.png}
    \caption{EA4: Knowledge Retrieval and Integration Tasks Results Summary. Radar plots (a–g) present the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models across prompting techniques: (a) ZS = Zero-Shot, (b) OS = One-Shot, (c) FS = Few-Shot, (d) CoT = Chain-of-Thought, (e) Anl = Analogical, (f) GK = Generated Knowledge, and (g) ToT = Tree-of-Thought. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, and Conciseness (Under-Explained – UE, To the Point – TP, and Over-Explained – OE). These plots demonstrate how different prompting strategies and model sizes influence factual retrieval and integration capabilities. See Section~\ref{sec:summaryresults} for detailed discussion and Table~\ref{tab:knowledge_retrieval_results_summary} in the appendix for comprehensive statistics.}
    \label{fig:EA4perfmetricfig}
\end{figure}



% \begin{table}[ht]
% \centering
% \caption{
% EA1: Reasoning and Compositionality Tasks Results Summary. This table presents the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models on reasoning tasks. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, Conciseness (Under Explained - UE, Over Explained - OE). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See \ref{sec:summaryresults} for detailed discussion and interpretation.
% }
% \label{tab:reasoning_results_summary}

% \renewcommand{\arraystretch}{0.9}
% \setlength{\tabcolsep}{4pt}
% \begin{tabular}{|p{3.5cm}|p{1.7cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
% \hline
% \textbf{Metrics} & \textbf{Size} & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
% Accuracy 
%   & S-MLLMs & 31.25 & 18.75 & 31.25 & 18.75 & 25 & 6.25 & 6.25 \\ \cline{2-9}
%   & M-MLLMs & 31.25 & 37.5 & 31.25 & 18.75 & 37.5 & 25 & 18.75 \\ \cline{2-9}
%   & L-MLLMs & 30 & 30 & 45 & 25 & 25 & 20 & 35 \\ \hline
% Hallucination 
%   & S-MLLMs & 37.5 & 50 & 37.5 & 31.25 & 37.5 & 50 & 75 \\ \cline{2-9}
%   & M-MLLMs & 6.25 & 12.5 & 0 & 12.5 & 0 & 12.5 & 43.75 \\ \cline{2-9}
%   & L-MLLMs & 0 & 0 & 5 & 15 & 5 & 15 & 30 \\ \hline
% Relevance (F + P)
%   & S-MLLMs & 75 & 56.25 & 62.5 & 75 & 68.75 & 56.25 & 56.25 \\ \cline{2-9}
%   & M-MLLMs & 93.75 & 87.5 & 81.25 & 100 & 100 & 100 & 75 \\ \cline{2-9}
%   & L-MLLMs & 95 & 90 & 90 & 90 & 95 & 90 & 95 \\ \hline
% Irrelevance 
%   & S-MLLMs & 25 & 43.75 & 37.5 & 25 & 31.25 & 43.75 & 43.75 \\ \cline{2-9}
%   & M-MLLMs & 6.25 & 12.5 & 18.75 & 0 & 0 & 0 & 25 \\ \cline{2-9}
%   & L-MLLMs & 5 & 10 & 10 & 10 & 5 & 10 & 5 \\ \hline
% Conciseness (UE)
%   & S-MLLMs & 25 & 12.5 & 18.75 & 25 & 31.25 & 6.25 & 0 \\ \cline{2-9}
%   & M-MLLMs & 50 & 50 & 43.75 & 43.75 & 62.5 & 43.75 & 43.75 \\ \cline{2-9}
%   & L-MLLMs & 50 & 75 & 70 & 50 & 60 & 45 & 50 \\ \hline
% Conciseness (TP+OE)
%   & S-MLLMs & 75 & 87.5 & 81.25 & 75 & 68.75 & 93.75 & 100 \\ \cline{2-9}
%   & M-MLLMs & 50 & 50 & 56.25 & 56.25 & 37.5 & 56.25 & 56.25 \\ \cline{2-9}
%   & L-MLLMs & 50 & 25 & 30 & 50 & 40 & 55 & 50 \\ \hline
% \end{tabular}
% \end{table}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% \begin{table}[ht]
% \centering
% \caption{
% EA2: Multimodal Understanding and Alignment Tasks Results Summary. This table displays the average performance (in \%) of Small (S-MLLMs, \(<4\)B), Medium (M-MLLMs, 4B--10B), and Large (L-MLLMs, \(>10\)B) models on tasks requiring multimodal understanding. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, Conciseness (Under Explained - UE, Over Explained - OE). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See \ref{sec:summaryresults} EA2.
% \label{tab:model_understanding_tasks_summary_results}
% }
% \renewcommand{\arraystretch}{0.9}
% \setlength{\tabcolsep}{4pt}
% \begin{tabular}{|p{3.5cm}|p{1.7cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
% \hline
% \textbf{Metrics} & \textbf{Size} & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
% Accuracy 
%   & S-MLLMs & 31.25 & 6.25 & 12.5 & 12.5 & 18.75 & 6.25 & 0 \\ \cline{2-9}
%   & M-MLLMs & 43.75 & 37.5 & 37.5 & 37.5 & 56.25 & 43.75 & 43.75 \\ \cline{2-9}
%   & L-MLLMs & 56.25 & 43.75 & 37.5 & 43.75 & 37.5 & 37.5 & 43.75 \\ \hline
% Hallucination 
%   & S-MLLMs & 12.5 & 25 & 37.5 & 25 & 37.5 & 43.75 & 50 \\ \cline{2-9}
%   & M-MLLMs & 0 & 0 & 0 & 0 & 6.25 & 6.25 & 6.25 \\ \cline{2-9}
%   & L-MLLMs & 0 & 6.25 & 0 & 0 & 25 & 6.25 & 12.5 \\ \hline
% Relevance (F + P)
%   & S-MLLMs & 87.5 & 75 & 75 & 81.25 & 75 & 81.25 & 62.5 \\ \cline{2-9}
%   & M-MLLMs & 100 & 100 & 100 & 100 & 100 & 100 & 100 \\ \cline{2-9}
%   & L-MLLMs & 100 & 100 & 100 & 100 & 93.75 & 93.75 & 93.75 \\ \hline
% Irrelevance 
%   & S-MLLMs & 12.5 & 25 & 25 & 18.75 & 25 & 18.75 & 37.5 \\ \cline{2-9}
%   & M-MLLMs & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \cline{2-9}
%   & L-MLLMs & 0 & 0 & 0 & 0 & 6.25 & 6.25 & 6.25 \\ \hline
% Conciseness (UE)
%   & S-MLLMs & 50 & 50 & 50 & 37.5 & 37.5 & 31.25 & 31.25 \\ \cline{2-9}
%   & M-MLLMs & 81.25 & 93.75 & 87.5 & 75 & 56.25 & 50 & 56.25 \\ \cline{2-9}
%   & L-MLLMs & 68.75 & 68.75 & 75 & 56.25 & 50 & 56.25 & 68.75 \\ \hline
% Conciseness (TP+OE)
%   & S-MLLMs & 50 & 50 & 50 & 62.5 & 62.5 & 68.75 & 68.75 \\ \cline{2-9}
%   & M-MLLMs & 18.75 & 6.25 & 12.5 & 25 & 43.75 & 50 & 43.75 \\ \cline{2-9}
%   & L-MLLMs & 31.25 & 31.25 & 25 & 43.75 & 50 & 43.75 & 31.25 \\ \hline
% \end{tabular}
% \end{table}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%



% \begin{table}[ht]
% \centering
% \caption{
% EA3: Complex Code Generation and Execution Tasks Results Summary. This table reports the average performance (in \%) of Small, Medium, and Large MLLMs on code generation tasks. See detailed discussion in \ref{sec:summaryresults} EA3. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, Conciseness (Under Explained - UE, Over Explained - OE). Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought.}
% \label{tab:code_generation_tasks_summary_results}
% \renewcommand{\arraystretch}{0.9}
% \setlength{\tabcolsep}{4pt}
% \begin{tabular}{|p{3.5cm}|p{1.7cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|p{1.2cm}|}
% \hline
% Metrics & Size & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
% Accuracy 
%   & S-MLLMs & 46.88 & 56.25 & 53.12 & 50 & 25 & 40.62 & 31.25 \\ \cline{2-9}
%   & M-MLLMs & 78.12 & 87.50 & 90.62 & 84.38 & 65.62 & 78.12 & 78.12 \\ \cline{2-9}
%   & L-MLLMs & 84.38 & 87.50 & 96.88 & 93.75 & 78.12 & 78.12 & 84.38 \\ \hline
% Hallucination 
%   & S-MLLMs & 37.5 & 28.12 & 31.25 & 31.25 & 40.62 & 37.5 & 46.88 \\ \cline{2-9}
%   & M-MLLMs & 3.12 & 0 & 0 & 0 & 12.5 & 6.25 & 3.12 \\ \cline{2-9}
%   & L-MLLMs & 0 & 0 & 0 & 0 & 15.62 & 6.25 & 6.25 \\ \hline
% Relevance (F + P)
%   & S-MLLMs & 81.25 & 78.12 & 84.37 & 87.50 & 71.88 & 71.88 & 75 \\ \cline{2-9}
%   & M-MLLMs & 100 & 100 & 100 & 100 & 100 & 100 & 100 \\ \cline{2-9}
%   & L-MLLMs & 100 & 100 & 100 & 100 & 96.88 & 100 & 100 \\ \hline
% Irrelevance 
%   & S-MLLMs & 18.75 & 21.88 & 15.63 & 12.5 & 28.12 & 28.12 & 25 \\ \cline{2-9}
%   & M-MLLMs & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \cline{2-9}
%   & L-MLLMs & 0 & 0 & 0 & 0 & 3.12 & 0 & 0 \\ \hline
% Conciseness (UE)
%   & S-MLLMs & 31.25 & 56.25 & 37.5 & 37.5 & 21.88 & 34.38 & 21.88 \\ \cline{2-9}
%   & M-MLLMs & 21.88 & 31.25 & 31.25 & 18.75 & 3.12 & 12.25 & 3.12 \\ \cline{2-9}
%   & L-MLLMs & 9.38 & 15.62 & 21.88 & 12.5 & 6.25 & 12.25 & 3.12 \\ \hline
% Conciseness (TP+OE)
%   & S-MLLMs & 68.75 & 43.75 & 62.5 & 62.5 & 78.12 & 65.62 & 78.12 \\ \cline{2-9}
%   & M-MLLMs & 78.12 & 68.75 & 68.75 & 81.25 & 96.88 & 87.5 & 96.88 \\ \cline{2-9}
%   & L-MLLMs & 90.62 & 84.38 & 78.13 & 87.5 & 93.75 & 87.5 & 96.88 \\ \hline
% \end{tabular}
% \end{table}

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


% \begin{table}[ht]
% \centering
% \caption{
% EA4: Knowledge Retrieval and Integration Tasks Results Summary. This table presents the average performance (in \%) of Small, Medium, and Large MLLMs on knowledge retrieval tasks. Performance metrics include Accuracy, Hallucination, Relevance (Fully and Partially Relevant), Irrelevance, Conciseness (Under Explained - UE, Over Explained - OE). The prompting techniques assessed include Zero-Shot (ZS), One-Shot (OS), Few-Shot (FS), Chain-of-Thought (CoT), Analogical (Anl), Generated Knowledge (GK), and Tree of Thought (ToT). See \ref{sec:summaryresults} EA4 for in-depth commentary on performance patterns and hallucination.
% }
% \label{tab:knowledge_retrieval_results_summary}
% \renewcommand{\arraystretch}{0.9}
% \setlength{\tabcolsep}{4pt}
% \begin{tabular}{|p{3.3cm}|p{1.7cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|p{1.3cm}|}
% \hline
% Metrics & Size & ZS & OS & FS & CoT & Anl & GK & ToT \\ \hline
% Accuracy 
%   & S-MLLMs & 43.75 & 34.38 & 37.5 & 37.5 & 21.88 & 28.12 & 32.26 \\ \cline{2-9}
%   & M-MLLMs & 71.88 & 59.38 & 62.5 & 62.5 & 53.12 & 53.12 & 62.5 \\ \cline{2-9}
%   & L-MLLMs & 87.5 & 77.5 & 75 & 77.5 & 75 & 53.12 & 64.1 \\ \hline
% Hallucination 
%   & S-MLLMs & 25 & 31.25 & 34.38 & 25 & 40.62 & 43.75 & 51.61 \\ \cline{2-9}
%   & M-MLLMs & 0 & 0 & 3.12 & 0 & 6.25 & 0 & 6.25 \\ \cline{2-9}
%   & L-MLLMs & 0 & 0 & 0 & 0 & 7.5 & 0 & 2.56 \\ \hline
% Relevance (F + P)
%   & S-MLLMs & 81.25 & 78.12 & 71.88 & 78.12 & 68.75 & 68.75 & 67.74 \\ \cline{2-9}
%   & M-MLLMs & 100 & 100 & 96.88 & 100 & 100 & 96.88 & 93.75 \\ \cline{2-9}
%   & L-MLLMs & 97.5 & 100 & 100 & 100 & 100 & 96.88 & 94.87 \\ \hline
% Irrelevance 
%   & S-MLLMs & 18.75 & 21.88 & 28.12 & 21.88 & 31.25 & 31.25 & 32.26 \\ \cline{2-9}
%   & M-MLLMs & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ \cline{2-9}
%   & L-MLLMs & 2.5 & 0 & 0 & 0 & 0 & 3.12 & 5.13 \\ \hline
% Conciseness (UE)
%   & S-MLLMs & 53.12 & 31.25 & 37.5 & 28.12 & 15.62 & 21.88 & 19.35 \\ \cline{2-9}
%   & M-MLLMs & 28.12 & 62.5 & 62.5 & 43.75 & 53.12 & 46.88 & 34.38 \\ \cline{2-9}
%   & L-MLLMs & 82.5 & 87.5 & 82.5 & 55 & 55 & 46.88 & 41.03 \\ \hline
% Conciseness(TP+OE)
%   & S-MLLMs & 46.88 & 68.75 & 62.5 & 71.88 & 84.38 & 78.12 & 80.65 \\ \cline{2-9}
%   & M-MLLMs & 71.88 & 37.5 & 37.5 & 56.25 & 46.88 & 53.12 & 65.42 \\ \cline{2-9}
%   & L-MLLMs & 17.5 & 12.5 & 17.5 & 45 & 45 & 53.12 & 58.97 \\ \hline
% \end{tabular}
% \end{table}

%%%%%%%%%%%%%%%%%%%
% Response Time, Output Length, and Memory Utilization 
% A comprehensive analysis was conducted on the response time and character length of model outputs. Descriptive statistics for these metrics have been computed for all four tasks and are provided in Tables~\ref{tab:time_statistics_for_reasoning}--\ref{tab:char_statistics_for_knowledge_retrieval}. For each task, separate tables list the average (AVG), standard deviation (STD), median (MEDIAN), minimum (MIN), and maximum (MAX) values.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Performance Profiling of MLLMs}
%: Response Time, Output Length, and Memory Utilization
To better understand the computational efficiency and generative behavior of the evaluated models, we analysed response time, output length, and memory utilization across all four evaluation aspects. Descriptive statistics, including average (AVG), standard deviation (STD), median (MEDIAN), minimum (MIN), and maximum (MAX), were computed for response time and output length and are presented in Tables~\ref{tab:time_statistics_for_reasoning}--\ref{tab:char_statistics_for_knowledge_retrieval}.
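
For reference, the reported statistics can be read under the standard definitions, sketched here for $n$ recorded values $x_1,\dots,x_n$ of a given metric. The symbols $n$, $x_i$, and $\bar{x}$ are notation introduced only for this illustration, and the sample ($n-1$) form of STD is an assumption, since the normalization is not stated.
\[
\mathrm{AVG} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\mathrm{STD} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i-\bar{x}\right)^{2}} .
\]
MEDIAN is the middle value of the sorted observations, and MIN and MAX are the smallest and largest observations, respectively.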

The results are discussed in three parts. First, we examine model-specific trends in response time across prompting methods and task types, highlighting differences across small, medium, and large MLLMs. Next, we present evaluation aspect-specific insights, comparing how models behave across reasoning, multimodal understanding, code generation, and knowledge retrieval tasks. Finally, we report the memory footprint of each model, noting significant variation based on architecture and scale.

\subsubsection{Model-Specific Observations}
For small MLLMs, CoT prompting exhibited stable median processing times across tasks, ranging approximately from 5.5 to 6.2 seconds. Among all techniques, Analogical and ToT prompting resulted in the longest processing times, with knowledge retrieval tasks requiring the most time and code generation tasks the least. 

Medium MLLMs showed similar trends, with Analogical prompting remaining among the slowest techniques across tasks. Few-Shot prompting was generally the fastest, although in multimodal understanding and code generation tasks, One-Shot prompting was marginally quicker.

For large MLLMs, Analogical and ToT prompting consistently led to the highest processing times. Few-Shot prompting was typically faster across tasks, except in reasoning and code generation, where the time differences were minimal. CoT prompting was notably stable for large models, with median processing durations ranging between 13.5 and 16 seconds.

Within the Large model group, response times naturally vary with parameter count, architecture, and runtime configuration. For example, in EA1-T1 (Reasoning and Compositionality; see Table~\ref{tab:ea1_tasks}), One-Shot prompting shows comparatively modest variation: 0.48\,s for Llama3.2-Vision (11B), 1.89\,s for OmChat V2 (13B), 8.05\,s for Pixtral-12B, 10.66\,s for InternVL2-26B, and 13.09\,s for Ovis-1.6 (10.2B), so InternVL2-26B is only about 22$\times$ slower than the fastest model. In contrast, Analogical prompting exhibits the widest spread, ranging from 0.40\,s for Llama3.2-Vision to 77.69\,s for InternVL2-26B (nearly 194$\times$ slower), with intermediate values for OmChat V2 (4.08\,s), Pixtral-12B (11.03\,s), and Ovis-1.6 (14.17\,s). These differences are expected given the diversity in architecture, parameter count, and quantized/runtime configurations. Crucially, despite absolute latency differences, InternVL2-26B aligns with the group's overall trends in accuracy, hallucination, relevance, and conciseness, supporting its inclusion in the Large category for comparative completeness. This behaviour is consistently observed across all 24 tasks, where similar prompting styles exhibit comparable latency patterns within the Large model group.
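
The slowdown factors quoted for InternVL2-26B follow directly from the reported latencies, as this short arithmetic check illustrates (each ratio is taken against the fastest model, Llama3.2-Vision, under the same prompting technique):
\[
\frac{10.66\ \mathrm{s}}{0.48\ \mathrm{s}} \approx 22.2 \quad \mbox{(One-Shot)},
\qquad
\frac{77.69\ \mathrm{s}}{0.40\ \mathrm{s}} \approx 194.2 \quad \mbox{(Analogical)} .
\]
By the same calculation, the slowest One-Shot response in this example, Ovis-1.6 at 13.09\,s, corresponds to a factor of $13.09/0.48 \approx 27.3$.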


\subsubsection{Evaluation Aspect-Specific Observations}
\label{EAspecificobs}
For EA1 (Reasoning and Compositionality) tasks, Table~\ref{tab:time_statistics_for_reasoning} presents the response time statistics and Table~\ref{tab:char_statistics_for_reasoning} shows the corresponding output lengths. For Small MLLMs, response times are relatively short (AVG $\approx$ 5.3--6.2\,s with ZS, OS, and FS) and outputs are concise, while Large MLLMs exhibit higher variance in response time (AVG of 22.97\,s with ZS, rising to 29.29\,s with Anl) and produce more verbose outputs.
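
One way to compare spread across groups with very different mean latencies is the coefficient of variation (CV), the ratio of STD to AVG. This measure is not part of the original analysis and is introduced here only as an illustration, using the ZS rows of Table~\ref{tab:time_statistics_for_reasoning}:
\[
\mathrm{CV}_{\mathrm{Small}} = \frac{5.24}{6.24} \approx 0.84,
\qquad
\mathrm{CV}_{\mathrm{Medium}} = \frac{4.62}{9.30} \approx 0.50,
\qquad
\mathrm{CV}_{\mathrm{Large}} = \frac{26.38}{22.97} \approx 1.15 .
\]
A value above 1 for Large MLLMs means that the standard deviation exceeds the mean, which is consistent with a small number of very slow responses (MAX of 101.69\,s against a MEDIAN of 12.74\,s).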

For EA2 (Multimodal Understanding and Alignment) tasks, Tables~\ref{tab:time_statistics_for_model_understanding} and \ref{tab:char_statistics_for_model_understanding} provide the response time and output length statistics. Small MLLMs respond faster (AVG $\approx$ 5.91\,s with ZS) and generate shorter outputs (AVG $\approx$ 917 characters with ZS), whereas Large MLLMs require significantly more time (AVG $\approx$ 24.67\,s with ZS) and produce longer responses (AVG $\approx$ 1417 characters with ZS).
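
As a quick arithmetic check on these Zero-Shot figures, the Large-to-Small ratios for latency and output length can be compared directly (an illustrative calculation, not part of the original analysis):
\[
\frac{24.67\ \mathrm{s}}{5.91\ \mathrm{s}} \approx 4.2,
\qquad
\frac{1417}{917} \approx 1.5 .
\]
Large MLLMs thus take roughly four times longer while producing about one and a half times as many characters, which suggests that output length alone does not account for the added latency.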

For EA3 (Complex Code Generation and Execution) tasks, Tables~\ref{tab:time_statistics_for_code_generation} and \ref{tab:char_statistics_for_code_generation} report the response time and output length statistics. Large MLLMs show the highest accuracy with Few-Shot prompting, but they respond more slowly and produce longer outputs than smaller models.

For EA4 (Knowledge Retrieval and Integration) tasks, Tables~\ref{tab:time_statistics_for_knowledge_retrieval} and \ref{tab:char_statistics_for_knowledge_retrieval} detail the response time and output length statistics. In these tasks, Large MLLMs achieve high accuracy with minimal hallucination and produce longer, more detailed responses than Small and Medium models.



\begin{table}[ht]
\centering
\caption{
Response Time Statistics for EA1 (Reasoning and Compositionality) Tasks. Descriptive statistics (in seconds: AVG, STD, MEDIAN, MIN, MAX) are provided for each prompting technique for Small, Medium, and Large MLLMs. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
\label{tab:time_statistics_for_reasoning}
}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
\textbf{Model Category} & \textbf{Prompt Technique} & \textbf{AVG} & \textbf{STD} & \textbf{MEDIAN} & \textbf{MIN} & \textbf{MAX} \\ \hline
Small MLLM  & ZS  & 6.24  & 5.24  & 5.17  & 0.49  & 14.51 \\ \cline{2-7} 
            & OS  & 5.55  & 4.73  & 4.80  & 0.49  & 13.95 \\ \cline{2-7} 
            & FS  & 5.30  & 4.69  & 4.45  & 0.48  & 14.52 \\ \cline{2-7} 
            & CoT & 6.51  & 3.91  & 6.15  & 0.87  & 13.91 \\ \cline{2-7} 
            & Anl & 7.47  & 4.68  & 7.51  & 0.81  & 14.48 \\ \cline{2-7} 
            & GK  & 4.83  & 3.93  & 3.36  & 0.55  & 13.95 \\ \cline{2-7} 
            & ToT & 9.51  & 6.26  & 10.33 & 0.47  & 19.22 \\ \hline
Medium MLLM & ZS  & 9.30  & 4.62  & 8.09  & 1.30  & 17.10 \\ \cline{2-7} 
            & OS  & 9.52  & 5.74  & 8.69  & 1.32  & 24.29 \\ \cline{2-7} 
            & FS  & 9.00  & 5.73  & 7.33  & 1.27  & 23.12 \\ \cline{2-7} 
            & CoT & 10.86 & 5.89  & 8.63  & 4.74  & 25.75 \\ \cline{2-7} 
            & Anl & 12.35 & 5.87  & 11.80 & 3.34  & 25.38 \\ \cline{2-7} 
            & GK  & 12.02 & 7.39  & 9.49  & 4.00  & 29.60 \\ \cline{2-7} 
            & ToT & 13.52 & 5.74  & 11.88 & 7.49  & 30.02 \\ \hline
Large MLLM  & ZS  & 22.97 & 26.38 & 12.74 & 1.90  & 101.69 \\ \cline{2-7} 
            & OS  & 19.57 & 25.01 & 10.90 & 0.40  & 101.08 \\ \cline{2-7} 
            & FS  & 21.28 & 26.46 & 10.73 & 0.44  & 101.24 \\ \cline{2-7} 
            & CoT & 26.07 & 30.67 & 13.70 & 0.48  & 101.71 \\ \cline{2-7} 
            & Anl & 29.29 & 35.76 & 16.72 & 0.39  & 120.92 \\ \cline{2-7} 
            & GK  & 20.95 & 24.55 & 13.24 & 0.67  & 101.56 \\ \cline{2-7} 
            & ToT & 26.67 & 26.80 & 16.12 & 0.55  & 101.85 \\ \hline
\end{tabular}

\end{table}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




\begin{table}[ht]
\centering
\caption{
Response Length (Character) Statistics for EA1 (Reasoning and Compositionality) Tasks. This table reports the average output lengths (in characters) along with STD, MEDIAN, MIN, and MAX values for Small, Medium, and Large MLLMs across the prompting techniques. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
\label{tab:char_statistics_for_reasoning}
}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
Model Category & Prompt Technique & AVG   & STD   & MEDIAN & MIN  & MAX  \\ \hline
Small MLLM  & ZS  & 878  & 819  & 751  & 11   & 2294 \\ \cline{2-7} 
            & OS  & 763  & 656  & 839    & 11   & 1712 \\ \cline{2-7} 
            & FS  & 785  & 753  & 658  & 11   & 2699 \\ \cline{2-7} 
            & CoT & 960  & 589  & 1069 & 17   & 1869 \\ \cline{2-7} 
            & Anl & 1147 & 823  & 1157   & 67   & 2800 \\ \cline{2-7} 
            & GK  & 670  & 535  & 607    & 1    & 1946 \\ \cline{2-7} 
            & ToT & 1475 & 1008 & 1550 & 1    & 2744 \\ \hline
Medium MLLM & ZS  & 1082 & 596  & 1105 & 82   & 2294 \\ \cline{2-7} 
            & OS  & 1094 & 645  & 873    & 81   & 2174 \\ \cline{2-7} 
            & FS  & 1022 & 677  & 889    & 73   & 2124 \\ \cline{2-7} 
            & CoT & 1151 & 496  & 1074 & 452  & 2402 \\ \cline{2-7} 
            & Anl & 1482 & 663  & 1406   & 397  & 3190 \\ \cline{2-7} 
            & GK  & 1339 & 707  & 1193   & 409  & 3343 \\ \cline{2-7} 
            & ToT & 1645 & 705  & 1573 & 691  & 3845 \\ \hline
Large MLLM  & ZS  & 1417 & 780  & 1313   & 81   & 3111 \\ \cline{2-7} 
            & OS  & 1179 & 630  & 1164 & 86   & 2437 \\ \cline{2-7} 
            & FS  & 1206 & 708  & 1083 & 79   & 2580 \\ \cline{2-7} 
            & CoT & 1459 & 642  & 1410 & 438  & 2810 \\ \cline{2-7} 
            & Anl & 1630 & 660  & 1595   & 392  & 2725 \\ \cline{2-7} 
            & GK  & 1435 & 574  & 1439   & 453  & 2457 \\ \cline{2-7} 
            & ToT & 1768 & 905  & 1602   & 770  & 4432 \\ \hline
\end{tabular}
\end{table}



\begin{table}[htbp]
\centering
\caption{Response Time Statistics for EA2 (Multimodal Understanding and Alignment) Tasks. This table presents the response time (in seconds) descriptive statistics for each prompting technique for Small, Medium, and Large MLLMs. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
}
\label{tab:time_statistics_for_model_understanding}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
Model Category & Prompt Technique & AVG  & STD  & MEDIAN & MIN  & MAX  \\ \hline
Small MLLM  & ZS  & 5.91  & 4.40  & 4.76  & 0.42  & 15.01 \\ \cline{2-7} 
            & OS  & 3.60  & 2.08  & 3.84  & 0.78  & 7.67  \\ \cline{2-7} 
            & FS  & 3.80  & 2.32  & 3.96  & 0.51  & 8.55  \\ \cline{2-7} 
            & CoT & 5.90  & 3.80  & 5.64  & 0.61  & 14.24 \\ \cline{2-7} 
            & Anl & 8.54  & 5.46  & 10.26 & 0.65  & 16.08 \\ \cline{2-7} 
            & GK  & 6.21  & 4.54  & 5.12  & 0.61  & 14.53 \\ \cline{2-7} 
            & ToT & 6.95  & 4.97  & 6.68  & 0.62  & 18.72 \\ \hline
Medium MLLM & ZS  & 8.69  & 3.25  & 8.32  & 2.64  & 14.68 \\ \cline{2-7} 
            & OS  & 9.05  & 3.69  & 7.71  & 4.37  & 15.40 \\ \cline{2-7} 
            & FS  & 9.25  & 3.90  & 7.91  & 4.09  & 17.42 \\ \cline{2-7} 
            & CoT & 11.98 & 3.02  & 12.01 & 7.73  & 19.84 \\ \cline{2-7} 
            & Anl & 14.05 & 4.89  & 14.94 & 7.96  & 25.33 \\ \cline{2-7} 
            & GK  & 11.29 & 4.98  & 11.24 & 3.72  & 21.71 \\ \cline{2-7} 
            & ToT & 13.85 & 7.07  & 12.18 & 3.63  & 32.82 \\ \hline
Large MLLM  & ZS  & 24.67 & 27.87 & 15.15 & 0.95  & 102.57 \\ \cline{2-7} 
            & OS  & 19.82 & 21.79 & 13.15 & 0.60  & 77.78  \\ \cline{2-7} 
            & FS  & 19.70 & 24.63 & 14.13 & 0.76  & 113.42 \\ \cline{2-7} 
            & CoT & 25.06 & 27.28 & 15.86 & 0.87  & 103.77 \\ \cline{2-7} 
            & Anl & 31.67 & 34.42 & 20.38 & 0.68  & 103.31 \\ \cline{2-7} 
            & GK  & 25.67 & 29.44 & 13.93 & 0.90  & 99.00  \\ \cline{2-7} 
            & ToT & 28.89 & 29.90 & 19.34 & 0.94  & 119.25 \\ \hline
\end{tabular}
\end{table}

\begin{table}[htbp]
\centering
\caption{
Response Length (Character) Statistics for EA2 (Multimodal Understanding and Alignment) Tasks. This table lists the output length (in characters) descriptive statistics for each prompting technique for Small, Medium, and Large MLLMs. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
}
\label{tab:char_statistics_for_model_understanding}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
Model Category & Prompt Technique & AVG   & STD   & MEDIAN & MIN  & MAX  \\ \hline
Small MLLM  & ZS  & 917  & 751  & 706  & 4    & 2100 \\ \cline{2-7} 
            & OS  & 565  & 419  & 522  & 62   & 1558 \\ \cline{2-7} 
            & FS  & 608  & 492  & 617  & 4    & 1790 \\ \cline{2-7} 
            & CoT & 894  & 614  & 799  & 20   & 1979 \\ \cline{2-7} 
            & Anl & 1319 & 841  & 1640 & 11   & 2380 \\ \cline{2-7} 
            & GK  & 1008 & 895  & 620  & 7    & 2612 \\ \cline{2-7} 
            & ToT & 1087 & 802  & 1113 & 8    & 2553 \\ \hline
Medium MLLM & ZS  & 1007 & 452  & 967  & 202  & 1844 \\ \cline{2-7} 
            & OS  & 985  & 416  & 931  & 280  & 1857 \\ \cline{2-7} 
            & FS  & 1030 & 444  & 1048 & 406  & 1899 \\ \cline{2-7} 
            & CoT & 1151 & 496  & 1074 & 452  & 2402 \\ \cline{2-7} 
            & Anl & 1482 & 663  & 1406   & 397  & 3190 \\ \cline{2-7} 
            & GK  & 1339 & 707  & 1193   & 409  & 3343 \\ \cline{2-7} 
            & ToT & 1645 & 705  & 1573 & 691  & 3845 \\ \hline
Large MLLM  & ZS  & 1417 & 780  & 1313   & 81   & 3111 \\ \cline{2-7} 
            & OS  & 1179 & 630  & 1164 & 86   & 2437 \\ \cline{2-7} 
            & FS  & 1206 & 708  & 1083 & 79   & 2580 \\ \cline{2-7} 
            & CoT & 1459 & 642  & 1410 & 438  & 2810 \\ \cline{2-7} 
            & Anl & 1630 & 660  & 1595   & 392  & 2725 \\ \cline{2-7} 
            & GK  & 1435 & 574  & 1439   & 453  & 2457 \\ \cline{2-7} 
            & ToT & 1768 & 905  & 1602   & 770  & 4432 \\ \hline
\end{tabular}
\end{table}


\begin{table}[htbp]
\centering
\caption{
Response Time Statistics for EA3 (Complex Code Generation and Execution) Tasks. Descriptive statistics (in seconds: AVG, STD, MEDIAN, MIN, MAX) for response time across prompting techniques for Small, Medium, and Large MLLMs. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
}
\label{tab:time_statistics_for_code_generation}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
Model Category & Prompt Technique & AVG  & STD  & MEDIAN & MIN  & MAX  \\ \hline
Small MLLM  & ZS  & 5.35  & 3.95  & 4.97  & 0.62  & 15.46 \\ \cline{2-7} 
            & OS  & 3.47  & 2.69  & 2.98  & 0.47  & 14.68 \\ \cline{2-7} 
            & FS  & 4.64  & 3.50  & 3.82  & 0.65  & 14.63 \\ \cline{2-7} 
            & CoT & 6.28  & 4.67  & 6.05  & 0.64  & 15.03 \\ \cline{2-7} 
            & Anl & 7.23  & 4.16  & 7.13  & 0.85  & 14.67 \\ \cline{2-7} 
            & GK  & 5.18  & 3.94  & 4.63  & 0.57  & 14.68 \\ \cline{2-7} 
            & ToT & 8.50  & 5.64  & 8.28  & 0.63  & 20.91 \\ \hline
Medium MLLM & ZS  & 6.77  & 3.55  & 6.63  & 1.53  & 17.50 \\ \cline{2-7} 
            & OS  & 4.48  & 1.38  & 4.26  & 2.15  & 7.73  \\ \cline{2-7} 
            & FS  & 4.60  & 1.44  & 4.32  & 1.96  & 7.21  \\ \cline{2-7} 
            & CoT & 7.40  & 4.70  & 6.43  & 1.58  & 24.34 \\ \cline{2-7} 
            & Anl & 13.08 & 6.02  & 11.37 & 5.65  & 31.50 \\ \cline{2-7} 
            & GK  & 7.46  & 3.29  & 7.61  & 1.98  & 15.04 \\ \cline{2-7} 
            & ToT & 12.67 & 6.21  & 11.02 & 3.31  & 30.59 \\ \hline
Large MLLM  & ZS  & 16.48 & 15.63 & 11.41 & 1.90  & 68.60 \\ \cline{2-7} 
            & OS  & 12.63 & 11.82 & 11.37 & 0.31  & 50.79 \\ \cline{2-7} 
            & FS  & 12.02 & 10.90 & 10.82 & 0.37  & 50.76 \\ \cline{2-7} 
            & CoT & 19.70 & 23.71 & 13.89 & 0.53  & 97.30 \\ \cline{2-7} 
            & Anl & 34.01 & 38.85 & 20.91 & 0.44  & 126.27\\ \cline{2-7} 
            & GK  & 17.78 & 20.62 & 13.34 & 0.65  & 88.48 \\ \cline{2-7} 
            & ToT & 29.46 & 33.48 & 20.44 & 0.62  & 124.55 \\ \hline
\end{tabular}
\end{table}



\begin{table}[htbp]
\centering
\caption{
Response Length (Character) Statistics for EA3 (Complex Code Generation and Execution) Tasks. This table reports the output length (in characters) descriptive statistics for each prompting technique for Small, Medium, and Large MLLMs. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
}
\label{tab:char_statistics_for_code_generation}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
Model Category & Prompt Technique & AVG   & STD   & MEDIAN & MIN  & MAX  \\ \hline
Small MLLM  & ZS  & 577  & 417   & 562  & 41   & 1621 \\ \cline{2-7} 
            & OS  & 364  & 340   & 256  & 9    & 1690 \\ \cline{2-7} 
            & FS  & 565  & 494   & 392  & 26   & 2054 \\ \cline{2-7} 
            & CoT & 736  & 619   & 807    & 10   & 2189 \\ \cline{2-7} 
            & Anl & 1009   & 592   & 1134 & 61   & 2321 \\ \cline{2-7} 
            & GK  & 637  & 555   & 396    & 4    & 1880 \\ \cline{2-7} 
            & ToT & 1129   & 772   & 1192 & 2    & 2468 \\ \hline
Medium MLLM & ZS  & 559  & 345   & 467    & 157  & 1426 \\ \cline{2-7} 
            & OS  & 353    & 177   & 316  & 176  & 1039 \\ \cline{2-7} 
            & FS  & 374  & 201   & 341  & 176  & 899  \\ \cline{2-7} 
            & CoT & 671  & 461   & 513    & 131  & 1571 \\ \cline{2-7} 
            & Anl & 1304   & 550   & 1176   & 478  & 2946 \\ \cline{2-7} 
            & GK  & 691  & 365   & 655  & 163  & 1383 \\ \cline{2-7} 
            & ToT & 1292   & 599   & 1311 & 161  & 2827 \\ \hline
Large MLLM  & ZS  & 995  & 591   & 875    & 112  & 2799 \\ \cline{2-7} 
            & OS  & 980    & 813   & 771    & 170  & 3585 \\ \cline{2-7} 
            & FS  & 780  & 489   & 748  & 112  & 2143 \\ \cline{2-7} 
            & CoT & 1130   & 654   & 1074 & 142  & 2451 \\ \cline{2-7} 
            & Anl & 1692   & 672   & 1770 & 485  & 3246 \\ \cline{2-7} 
            & GK  & 1016   & 485   & 1080 & 117  & 1911 \\ \cline{2-7} 
            & ToT & 1527   & 657   & 1520 & 122  & 2807 \\ \hline
\end{tabular}
\end{table}



\begin{table}[htbp]
\centering
\caption{
Response Time Statistics for EA4 (Knowledge Retrieval and Integration) Tasks. This table provides descriptive statistics (in seconds) for response time across the seven prompting techniques for Small, Medium, and Large MLLMs. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
}
\label{tab:time_statistics_for_knowledge_retrieval}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
Model Category & Prompt Technique & AVG  & STD  & MEDIAN & MIN  & MAX  \\ \hline
Small MLLM  & ZS  & 8.08  & 5.11  & 8.56  & 0.74  & 18.01 \\ \cline{2-7} 
            & OS  & 5.54  & 3.17  & 6.02  & 0.49  & 12.51 \\ \cline{2-7} 
            & FS  & 6.24  & 3.84  & 6.21  & 0.60  & 14.19 \\ \cline{2-7} 
            & CoT & 6.19  & 3.62  & 6.14  & 0.62  & 13.41 \\ \cline{2-7} 
            & Anl & 9.97  & 5.59  & 11.71 & 0.86  & 18.20 \\ \cline{2-7} 
            & GK  & 5.99  & 4.01  & 6.37  & 0.65  & 14.59 \\ \cline{2-7} 
            & ToT & 8.17  & 5.64  & 8.01  & 0.82  & 19.09 \\ \hline
Medium MLLM & ZS  & 9.95  & 3.19  & 9.54  & 4.72  & 17.92 \\ \cline{2-7} 
            & OS  & 9.31  & 4.29  & 8.00  & 4.31  & 21.60 \\ \cline{2-7} 
            & FS  & 8.42  & 3.70  & 7.44  & 2.81  & 20.88 \\ \cline{2-7} 
            & CoT & 11.48 & 4.68  & 9.97  & 5.85  & 24.76 \\ \cline{2-7} 
            & Anl & 14.83 & 6.26  & 12.80 & 4.34  & 26.58 \\ \cline{2-7} 
            & GK  & 12.49 & 6.39  & 10.63 & 0.95  & 29.66 \\ \cline{2-7} 
            & ToT & 13.44 & 5.80  & 11.99 & 0.76  & 30.51 \\ \hline
Large MLLM  & ZS  & 29.54 & 36.69 & 14.72 & 2.22  & 122.41 \\ \cline{2-7} 
            & OS  & 21.42 & 24.86 & 13.45 & 0.62  & 117.56 \\ \cline{2-7} 
            & FS  & 21.69 & 24.82 & 13.33 & 0.75  & 89.18  \\ \cline{2-7} 
            & CoT & 27.32 & 32.82 & 15.00 & 0.90  & 122.06 \\ \cline{2-7} 
            & Anl & 34.61 & 39.30 & 19.70 & 0.78  & 123.03 \\ \cline{2-7} 
            & GK  & 26.50 & 32.38 & 14.46 & 0.95  & 123.01 \\ \cline{2-7} 
            & ToT & 31.71 & 36.70 & 19.34 & 0.95  & 123.46 \\ \hline
\end{tabular}
\end{table}


\begin{table}[htbp]
\centering
\caption{
Response Length (Character) Statistics for EA4 (Knowledge Retrieval and Integration) Tasks. This table provides output length (in characters) descriptive statistics for Small, Medium, and Large MLLMs across the seven prompting techniques. Abbreviations: ZS = Zero-Shot, OS = One-Shot, FS = Few-Shot, CoT = Chain-of-Thought, Anl = Analogical, GK = Generated Knowledge, ToT = Tree-of-Thought. See Section~\ref{EAspecificobs}.
}
\label{tab:char_statistics_for_knowledge_retrieval}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|c|l|l|l|l|l|l|}
\hline
\textbf{Model Category} & \textbf{Prompt Technique} & \textbf{AVG} & \textbf{STD} & \textbf{MEDIAN} & \textbf{MIN} & \textbf{MAX} \\ \hline
Small MLLM  & ZS  & 1236  & 780  & 1186 & 39   & 2649 \\ \cline{2-7}
            & OS  & 859 & 500  & 987 & 9    & 1642 \\ \cline{2-7}
            & FS  & 984 & 664  & 987  & 10   & 2483 \\ \cline{2-7}
            & CoT & 956 & 597  & 980  & 9    & 2371 \\ \cline{2-7}
            & Anl & 1572  & 892  & 2018 & 62   & 2509 \\ \cline{2-7}
            & GK  & 939.5 & 710  & 935  & 7    & 2680 \\ \cline{2-7}
            & ToT & 1266  & 890  & 1266 & 29   & 2588 \\ \hline
Medium MLLM & ZS  & 1161  & 325  & 1203 & 449  & 1725 \\ \cline{2-7}
            & OS  & 1042  & 389  & 1012 & 453  & 2111 \\ \cline{2-7}
            & FS  & 974 & 430  & 837  & 212  & 2332 \\ \cline{2-7}
            & CoT & 1281  & 398  & 1253 & 673  & 2375 \\ \cline{2-7}
            & Anl & 1742  & 488  & 1772 & 676  & 2743 \\ \cline{2-7}
            & GK  & 1415  & 734  & 1314 & 588  & 4174 \\ \cline{2-7}
            & ToT & 1520  & 584  & 1404 & 624  & 3439 \\ \hline
Large MLLM  & ZS  & 1655  & 599  & 1726 & 607  & 3879 \\ \cline{2-7}
            & OS  & 1436  & 552  & 1321 & 612  & 2719 \\ \cline{2-7}
            & FS  & 1310  & 508  & 1217 & 613  & 2597 \\ \cline{2-7}
            & CoT & 1590  & 663  & 1469 & 448  & 3175 \\ \cline{2-7}
            & Anl & 2103  & 1263 & 1855 & 542  & 6551 \\ \cline{2-7}
            & GK  & 1523  & 691  & 1295 & 549  & 3074 \\ \cline{2-7}
            & ToT & 1892  & 940  & 1747 & 579  & 4965 \\ \hline
\end{tabular}
\end{table}


\subsubsection{Model Memory Allocation}
\label{modelmemoryalloc}
Table~\ref{tab:model_memory_size_by_category} lists the selected MLLMs along with their categorization (Small, Medium, or Large), model size (in billions of parameters), and allocated GPU memory (in GB). For example, the Pixtral 12B model uses about 35\,GB, while the InternVL2-1B model uses only 0.05\,GB. These variations highlight the impact of architecture-specific optimizations in addition to model size.
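
To make this comparison concrete, the allocated memory can be normalized by parameter count. The ratio below (GB per billion parameters) is an illustrative quantity introduced here, not a metric reported in the original analysis; the values are taken from Table~\ref{tab:model_memory_size_by_category}:
\[
\mbox{Pixtral: } \frac{35.30}{12} \approx 2.94,
\qquad
\mbox{InternVL2-26B: } \frac{25.27}{26} \approx 0.97,
\qquad
\mbox{InternVL2-1B: } \frac{0.05}{1} = 0.05 .
\]
For orientation, weights stored in 16-bit precision alone occupy roughly 2\,GB per billion parameters. Ratios well below this value would be consistent with quantized weights or a different runtime configuration, although the cause is not specified for individual models.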

In summary, our analysis reveals consistent patterns across prompting techniques and model sizes. Analogical prompting typically resulted in the longest response times and the most verbose outputs, followed by Tree-of-Thought (ToT). In contrast, Few-Shot and One-Shot prompting were generally faster and produced more concise outputs. Among task types, code generation was the fastest to process, while response time tended to increase with model size. 

\begin{table}[htbp]
\centering
\normalsize  % ensures it matches body font size
\caption{
Model Memory Allocation and Categorization. This table lists individual MLLMs along with their category (Small, Medium, or Large), model size (in billions of parameters), and allocated GPU memory (in GB). See Section~\ref{modelmemoryalloc}.
}
\label{tab:model_memory_size_by_category}
\renewcommand{\arraystretch}{0.9}
\setlength{\tabcolsep}{4pt}
% \resizebox{\textwidth}{!}{%
\begin{tabular}{|c|l|l|l|}
\hline
\textbf{Model Category} & \textbf{Model Name}    & \textbf{Model Size (B)} & \textbf{Allocated Memory (GB)} \\ \hline
\multirow{4}{*}{Small MLLM} & InternVL2-1B         & 1    & 0.05 \\ \cline{2-4} 
                           & Qwen2-vl             & 2    & 4.16 \\ \cline{2-4} 
                           & MiniMonkey           & 2.2  & 4.49 \\ \cline{2-4} 
                           & Paligemma            & 3    & 10.96 \\ \hline
\multirow{4}{*}{Medium MLLM} & Phi3.5               & 4    & 7.77  \\ \cline{2-4} 
                           & Llava-one-vision     & 8    & 15.05 \\ \cline{2-4} 
                           & Ovis1.5-8B           & 8    & 17.33 \\ \cline{2-4} 
                           & Glm-4v               & 9    & 25.94 \\ \hline
\multirow{5}{*}{Large MLLM}  & Ovis1.6              & 10.2 & 19.02 \\ \cline{2-4} 
                           & Llama3.2-vision      & 11   & 11.54 \\ \cline{2-4} 
                           & Pixtral              & 12   & 35.30 \\ \cline{2-4} 
                           & Omchat               & 13   & 24.62 \\ \cline{2-4} 
                           & InternVL2-26B        & 26   & 25.27 \\ \hline
\end{tabular}
% }
\end{table}




\section{Discussion}
\label{sec:discussion}

Our evaluation reveals significant variations in MLLM performance across task types and prompting techniques. Overall, larger models consistently outperform medium and small models, especially in tasks such as knowledge retrieval and code generation, yet tasks requiring complex reasoning and nuanced understanding still yield relatively low accuracies (often below 60\%). This disparity indicates that while scaling improves certain capabilities, even the largest models struggle with multi-step reasoning and abstract deduction.

For tasks that require multi-step reasoning and compositional problem solving, our results show that providing multiple examples (Few-Shot prompting) enhances accuracy in large models. However, more structured prompting approaches (such as Chain-of-Thought, Analogical, and Tree-of-Thought) tend to increase hallucination rates, particularly in smaller models. This suggests that, although structured prompts are designed to guide logical inference by encouraging intermediate reasoning steps, they can sometimes introduce extraneous or confabulated details that ultimately undermine output quality. 

In tasks demanding multimodal understanding and alignment (EA2), large models achieve near-perfect relevance scores when using simpler prompting strategies (e.g., Zero-Shot and One-Shot). This implies that pre-trained multimodal embeddings in these models are highly effective at integrating text and visual inputs. 

Zero-Shot prompting emerged as the most effective technique in EA2, achieving the highest accuracy and lowest hallucination rates across model sizes. This suggests that MLLMs may rely heavily on pre-trained multimodal embeddings rather than explicit reasoning across different modalities. In contrast, complex reasoning-based prompts, such as Analogical and Tree-of-Thought, degraded performance, indicating that multimodal models struggle when required to interpret and synthesize abstract relationships between text and images. These results highlight limitations in current MLLMs’ spatial and contextual awareness, which are critical for applications such as visual question answering (VQA), AI-generated content moderation, and automated medical image interpretation. While MLLMs can extract information from multimodal inputs, they lack deep semantic alignment, a challenge that must be addressed before deploying these models in high-risk environments. Given these limitations, current open-source MLLMs cannot be relied upon for reasoning-intensive tasks without human oversight or external validation mechanisms.

Code generation tasks exhibited the highest accuracies across all model sizes. The structured nature of programming tasks appears to benefit from Few-Shot prompting, which provides clear examples that guide both syntactic and semantic generation. Nonetheless, hallucination remains an issue, especially in smaller models. This is a critical concern in contexts such as software development and cybersecurity, where errors can have severe consequences.

Knowledge retrieval tasks also demonstrate the advantage of scaling: large MLLMs achieve the highest accuracy and relevance, particularly with Zero-Shot prompting. However, these models sometimes present outputs with unwarranted confidence, even when portions of the retrieved information are incorrect. This lack of reliable verification is problematic in domains that demand high factual accuracy, such as legal, medical, and scientific applications.

Hallucination remained a fundamental challenge across all models and prompting strategies, particularly in tasks requiring abstract reasoning. Analogical, Generated Knowledge, and Tree-of-Thought prompting exhibited the highest hallucination rates, suggesting that the current implementations of structured reasoning within MLLMs remain unreliable. This is especially concerning in safety-critical applications where factual correctness is imperative, such as AI-generated medical reports, legal document drafting, and financial risk assessments.

Our analysis of response times and output lengths further underscores the trade-offs inherent in different prompting techniques. More complex methods like Analogical and Tree-of-Thought prompting require longer processing times and produce more verbose outputs, while One-Shot and Few-Shot prompting yield faster and more concise responses. Although larger models generally incur higher computational costs and longer response times, the improvements in accuracy and relevance, particularly for multimodal understanding and knowledge retrieval, often justify these trade-offs.

In summary, no single prompting method optimally addresses every task. The effectiveness of a prompting strategy is highly dependent on the nature of the task and the model scale. For instance, Few-Shot prompting appears best suited for structured tasks like code generation, while simpler prompting techniques are more effective for multimodal alignment. These findings suggest that hybrid approaches, combining example-based prompts with selective structured reasoning, may offer a promising path toward more reliable and contextually aware multimodal reasoning.
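
The dependence on task and model scale described above can be written as a simple selection rule. This is an illustrative formalization under our own assumptions, not a procedure evaluated in this study: $\mathcal{P}$ denotes the set of seven prompting techniques, $A(p,t,s)$ and $H(p,t,s)$ denote the measured accuracy and hallucination rate of technique $p$ on task type $t$ at model scale $s$, and $\lambda \geq 0$ is a hypothetical weight expressing how strongly hallucination is penalized.
\[
p^{*}(t,s) = \arg\max_{p \in \mathcal{P}} \bigl[ A(p,t,s) - \lambda\, H(p,t,s) \bigr]
\]
With $\lambda = 0$, the rule selects the most accurate technique for each task type and scale; larger values of $\lambda$ favour techniques with lower hallucination rates, as may be appropriate in safety-critical settings.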

% \textbf{Table Navigation Summary:}
% \begin{itemize}
%     \item \textbf{Tables 7, 11–12:} EA1 – Reasoning \& Compositionality
%     \item \textbf{Tables 8, 13–14:} EA2 – Multimodal Understanding \& Alignment
%     \item \textbf{Tables 9, 15–16:} EA3 – Code Generation \& Execution
%     \item \textbf{Tables 10, 17–18:} EA4 – Knowledge Retrieval \& Integration
%     \item \textbf{Table 19:} Model Memory Allocation 
% \end{itemize}

\subsection{Implications and Use Cases}

These findings have critical implications for deploying MLLMs in real-world scenarios. While large MLLMs demonstrate strong retrieval and structured output generation, their shortcomings in logical reasoning and multimodal alignment indicate that they are currently unsuitable for fully autonomous decision-making systems in healthcare, finance, or legal domains. Instead, their most effective applications lie in the following areas:
\begin{itemize}
    \item \textbf{AI-assisted software development}, where few-shot prompting can improve code generation workflows while integrating human validation to mitigate errors.
    \item \textbf{Automated knowledge retrieval systems}, where large MLLMs can assist in search and summarization tasks but require additional verification mechanisms to ensure factual accuracy.
    \item \textbf{AI-powered tutoring systems}, where structured output generation can support educational applications, but deeper logical reasoning capabilities must be further refined.
    \item \textbf{Visual question answering and multimodal content moderation}, where large MLLMs can be used to process images and text but require improvements in contextual alignment.
\end{itemize}
Conversely, caution must be exercised when integrating MLLMs into fields where reasoning-based accuracy is paramount. Current models struggle to maintain logical consistency in long-form reasoning tasks, limiting their utility in legal contract analysis, autonomous robotic planning, and financial forecasting.


\subsection{Future Research Directions}

The findings of this study highlight several critical areas for future research to enhance the reliability and effectiveness of MLLMs. One promising direction is the development of hybrid prompting strategies, where a combination of few-shot examples and explicit logical structuring could improve reasoning-intensive tasks by guiding models through step-by-step inference. Additionally, memory-augmented models could be explored to enable MLLMs to reference factual information more effectively, reducing the likelihood of hallucinations and improving long-term contextual understanding. Another essential avenue is the advancement of explainability and verification frameworks, particularly for high-stakes applications in legal, medical, and financial domains, where the factual consistency of generated content is crucial.

Furthermore, integrating neurosymbolic AI approaches, which combine deep learning with symbolic reasoning, could enhance logical inference capabilities, especially in tasks requiring structured decision-making. In the context of multimodal alignment, research should focus on improving spatial awareness, cross-modal dependencies, and semantic consistency, ensuring that MLLMs can effectively interpret and synthesize diverse inputs. While this study was deliberately scoped to text–image tasks for reproducibility, extending evaluations to temporal modalities such as video remains a natural next step, and we flag this as a priority for future multimodal benchmarking.

Equally important is a deeper diagnostic analysis of prompting methods. Although this paper compared seven prompt families, future work should conduct fine-grained ablation studies, such as toggling intermediate steps in CoT or the scaffolding in Analogical prompting, to isolate which elements contribute most to accuracy or hallucination. This will provide a clearer attribution of risk and reliability across prompting strategies.
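
One way such an ablation could be quantified is sketched below. The notation is hypothetical and is not part of the evaluation reported in this paper: $P$ denotes a full prompt template, $P \setminus c$ the same template with component $c$ removed (for example, the intermediate reasoning steps in CoT), and $A(\cdot)$ and $H(\cdot)$ the accuracy and hallucination rate measured on a fixed task set.
\[
\Delta_{A}(c) = A(P) - A(P \setminus c), \qquad \Delta_{H}(c) = H(P) - H(P \setminus c)
\]
Under these assumptions, a component with $\Delta_{A}(c) > 0$ and $\Delta_{H}(c) \leq 0$ improves accuracy without increasing hallucination, whereas $\Delta_{H}(c) > 0$ marks a component whose presence raises the hallucination rate.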

Finally, bias and robustness studies remain indispensable, as the failure of reasoning-based prompting techniques suggests that current architectures may not generalize well to unseen logical structures. Investigating dataset biases and refining training methodologies will be critical to ensuring that MLLMs become more reliable, fair, and interpretable across a wider range of real-world applications. Addressing these challenges through targeted research will be crucial in advancing MLLMs beyond pattern recognition, enabling them to perform more consistent, factually grounded, and contextually aware reasoning in complex decision-making tasks.

Beyond these directions, exploring adaptive prompting strategies and self-correcting mechanisms could provide a pathway toward enhancing MLLMs’ generalizability and reliability across diverse domains. This study motivates advancing AI systems from reactive models to proactive, agentic entities \citep{russell2020artificial} capable of sustained, goal-oriented reasoning. As agentic AI continues to evolve, robust evaluation frameworks such as the one presented in our work will be essential for ensuring that MLLMs are not only technically proficient but also trustworthy, interpretable, and capable of autonomous knowledge synthesis in complex real-world scenarios.



\section{Conclusion}
This study systematically evaluated open-source MLLMs across a range of model sizes, using a structured benchmarking framework to assess their performance on four key evaluation aspects spanning 24 tasks. By employing seven prompting techniques (Zero-Shot, One-Shot, Few-Shot, Chain-of-Thought, Analogical, Generated Knowledge, and Tree-of-Thought), which together combine example-based guidance with structured reasoning, the evaluation provided insights into how these models process multimodal inputs and generate outputs aligned with expected task solutions. These strategies were empirically tested and quantitatively validated, with full prompt templates and outputs documented in the supplementary materials. Our findings highlight the varying effectiveness of different prompting strategies in enhancing MLLMs' interpretability, consistency, and reasoning depth. While some models demonstrated strong performance in certain multimodal translation and cross-modal reasoning tasks, challenges persist in areas requiring deeper contextual understanding, abstraction, and nuanced interpretation of complex inputs. The evaluation underscores the need for improved prompt engineering methodologies and more robust benchmarks tailored to multimodal AI. Future work will focus on refining the evaluation criteria, expanding the dataset scope, and integrating real-world application scenarios to further stress-test MLLMs. This includes extending the framework to additional modalities and settings, such as video comprehension, auditory processing, and multi-turn interactions, to build a more comprehensive evaluation paradigm. Ultimately, these efforts aim to inform the design of next-generation MLLMs that are not only technically proficient but also adaptable, context-aware, and aligned with practical, human-centric applications.









\FloatBarrier

\bibliography{main}
\bibliographystyle{tmlr}

% \appendix
% \section{Appendix}
% You may include other additional sections here.
\appendix
\input{appendix}

\newpage

\renewcommand{\thesection}{S\arabic{section}} 
\setcounter{section}{0}
\section*{Supplementary Material}

\addcontentsline{toc}{section}{Supplementary Material}
\input{supplementary}

\end{document}
