%\documentclass{uai2025} % for initial submission
\documentclass[accepted]{uai2025} % after acceptance, for a revised version; 
% also before submission to see how the non-anonymous paper would look like 
                        
%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2025} % ptmx math instead of Computer
                                         % Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2025} % newtx fonts (improves upon
                                          % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
\usepackage{natbib} % has a nice set of citation styles and commands
    \bibliographystyle{plainnat}
    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage{amsmath}
\usepackage{algorithm}
\usepackage{algpseudocode}
\usepackage{graphicx}
\usepackage{subcaption}
\usepackage{array}
\usepackage{authblk}
\usepackage{colortbl}
\usepackage{ulem}
\usepackage{amssymb}
\usetikzlibrary{shapes.geometric, arrows, fit, positioning}

% TikZ style definitions
\tikzset{
    input/.style={rectangle, rounded corners, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=blue!15},
    process/.style={rectangle, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=orange!20},
    loss/.style={ellipse, minimum width=2cm, minimum height=1cm, text centered, draw=black, fill=red!10},
    output/.style={rectangle, rounded corners, minimum width=3cm, minimum height=1cm, text centered, draw=black, fill=green!20},
    arrow/.style={thick, ->, >=stealth}
}
%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Letting Uncertainty Guide Your Multimodal Machine Translation}

% The standard author block has changed for UAI 2025 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{Wuyi Liu$^*$}
\author[1]{Yue Gao$^*$}
\author[2]{Yige Mao}
\author[1]{Jing Zhao$^\dagger$}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Dept.\\
    East China Normal University\\
    Shanghai, China
}
\affil[2]{%
    Beihang University\\
    Beijing, China\\
}

\begin{document}
\maketitle
% Use the following commands to suppress footnote numbering
\renewcommand{\thefootnote}{\fnsymbol{footnote}}
\footnotetext{$^*$ Equal contribution.}
\footnotetext{$^\dagger$ Corresponding author: jzhao@cs.ecnu.edu.cn}
% Restore the original numbering style
\renewcommand{\thefootnote}{\arabic{footnote}}
% Reset the footnote counter
\setcounter{footnote}{0}

\begin{abstract}
%However, current approaches lack explicit mechanisms to quantify and manage translation uncertainty, leading to incomplete utilization of visual information and even potential degradation of translation quality when using visual information.%
Multimodal Machine Translation (MMT) leverages additional modalities, such as visual data, to enhance translation accuracy and resolve linguistic ambiguities inherent in text-only approaches. Recent work predominantly integrates image information via attention mechanisms or feature fusion techniques.
However, current approaches lack explicit mechanisms to quantify and manage uncertainty during the translation process, so the way image information is used remains a black box. This makes it difficult to address the incomplete utilization of visual information, or even the degradation of translation quality that visual information can cause. To address these challenges, we introduce a novel Uncertainty-Guided Multimodal Machine Translation (UG-MMT) framework that redefines how translation systems handle ambiguity through systematic uncertainty reduction. Designed with plug-and-play flexibility, our framework integrates into existing MMT systems with minimal modification while delivering significant performance gains.

\end{abstract}

\section{Introduction}\label{sec:intro}
In text-only machine translation, a sentence such as ``a man is walking on the bank'' is often easy to handle: the preposition ``on'' suggests that ``bank'' means ``riverbank.'' However, text alone does not always provide sufficient information to resolve the ambiguity of certain words. This is where multimodal translation comes in. \citet{yao-wan-2020-multimodal} defined multimodal machine translation (MMT) as a novel machine translation task that aims to design better translation systems using context from the additional image modality.

In recent years, MMT research has made significant strides in improving translation accuracy by incorporating visual data alongside textual inputs. Various models have been developed to harness these additional modalities, each offering a distinct approach to improving translation performance. For example, the \textit{Multimodal Transformer} \citep{yao-wan-2020-multimodal} employs cross-modal attention to dynamically align relevant image regions with the corresponding parts of the text being translated, thereby enhancing contextual interpretation. \textit{Inversion Knowledge Distillation} \citep{peng2023distillimagenowhereinversion} improves MMT outputs by distilling image information, thus minimizing the need for direct visual input during inference. Additionally, Valhalla \citep{li2022valhalla} leverages visual hallucination strategies, generating more robust translations by simulating visual contexts even in the absence of actual visual data.

% \begin{table}[t]
% \centering
% \caption{Visual Gate Weights in MMT Models after Training}
% \begin{tabular}{lccc}
% \toprule
% Models & Test2016 & Test2017 & MSCOCO \\
% \midrule
% Gated Fusion & 4.5e-21 & 7.0e-17 & 9.7e-21 \\
% RMMT & 8.6e-13 & 4.0e-13 & 3.5e-14 \\
% \bottomrule
% \end{tabular}
% \label{table:gate_weights}
% \end{table}
%We experimentally identified a significant shortcoming in current MMT methods regarding the effective utilization of visual information: %(1) During training, the gating weights for the visual modality diminish to near-zero in later epochs, effectively reducing these multimodal systems to text-only translation models (as shown in Table~\ref{table:gate_weights}); (2) Even more concerning, the addition of visual information often increases translation uncertainty. 
Although these methodological advances have improved translation performance, our empirical analysis reveals a limitation of current approaches. Instances in which adding the visual modality increases uncertainty account for approximately 45\% of cases in the Gated Fusion model \citep{wu2021goodmisconceivedreasonsempirical} and 37\% of cases in the Revisit MMT model \citep{wu2021goodmisconceivedreasonsempirical}. This is concerning because it conflicts directly with the foundational principle of MMT, whose core objective is to leverage the visual modality to reduce ambiguity in text and to resolve linguistic uncertainties that purely textual models struggle to address.
% \begin{table}[t] 
% \centering 
% \caption{Proportion of Instances Where Visual Modality Increased Uncertainty} 
% \begin{tabular}{lc} 
% \toprule 
% Models & Instances with Increased Uncertainty \\ 
% \midrule 
% Gated Fusion & 45\% of cases \\ 
% Revisit MMT  & 37\% of cases \\ 
% \bottomrule 
% \end{tabular}
% \label{table:uncertainty change}
% \vspace{0.5em}
% \caption*{\footnotesize Note: The proportions are calculated based on the number of instances where uncertainty increased versus decreased in the final epoch of training.}
% \end{table}

The key issue is that existing models lack a clear metric for expressing and measuring uncertainty, making it impossible to quantify whether visual information actually helps in reducing ambiguity. This stands in contrast to other domains, such as multi-class classification tasks \citep{sensoy2018evidential}, where uncertainty modeling is well-established. Nevertheless, recent works such as MAP \citep{ji2023map} and UNO \citep{tian2020uno} have demonstrated that uncertainty modeling can also be effectively leveraged in multimodal learning tasks. %Similarly, in multimodal decision-making systems, explicit uncertainty estimation helps identify modality conflicts, improving overall prediction robustness. However, uncertainty modeling in these domains benefits from specific properties like finite output spaces and bounded interactions between modalities, such as vision and text.
However, the application of uncertainty modeling to machine translation remains largely unexplored. This is primarily due to the intrinsic complexities of the translation process. First, machine translation involves generating sequences across an effectively infinite output space, where conventional uncertainty modeling metrics such as probability distributions over finite class sets become inapplicable. Second, ambiguities can propagate across sequence tokens, creating cascading uncertainties that traditional approaches fail to address. Third, the cross-modal interactions in MMT involve richer contextual dependencies, often introducing unpredictable noise rather than resolving ambiguities. For instance, visual context may mislead the model when images contain irrelevant or conflicting information.

%Addressing these challenges requires rethinking how uncertainty is modeled in multimodal scenarios. While other multimodal tasks address confidence at the final decision level, machine translation demands token-level and sequence-level uncertainty integration, ensuring that cross-modal fusion dynamically adapts to both textual and visual ambiguities. Furthermore, unlike tasks with static decision boundaries, translation requires adaptive uncertainty management over sequential steps to maintain contextual coherence.

To address these gaps, we propose the Uncertainty-Guided Multimodal Machine Translation (UG-MMT) framework. By explicitly modeling uncertainty at the token and sequence levels, UG-MMT not only quantifies the contribution of visual information but also guides the cross-modal fusion process toward consistent ambiguity reduction, addressing the fundamental challenges outlined above.

Our approach introduces three key innovations:
\begin{itemize}
    \item  A novel uncertainty modeling framework specifically designed for sequence generation, which captures ambiguity at both token and sentence levels while handling the infinite output space of translation.
    \item An uncertainty-guided cross-modal fusion mechanism that explicitly optimizes for uncertainty reduction, ensuring visual information serves its intended purpose of disambiguation.
    \item Comprehensive validation demonstrating that UG-MMT significantly outperforms existing approaches across multiple datasets and metrics, achieving state-of-the-art performance on several standard benchmarks.
\end{itemize}

Through extensive experiments on established MMT frameworks such as Gated Fusion and Revisit MMT \citep{wu2021goodmisconceivedreasonsempirical}, we demonstrate that our uncertainty-guided approach consistently improves performance across all evaluation metrics, validating the effectiveness of explicit uncertainty modeling in multimodal translation.



\section{Related Work}
\subsection{Multimodal Machine Translation}
Multimodal Machine Translation extends traditional neural machine translation by introducing additional modalities, such as images, to reduce linguistic ambiguity and improve translation robustness. Recent studies have explored various strategies for integrating visual information into translation pipelines, which can be broadly categorized into three main approaches.

The first category involves feature concatenation methods, where visual and textual features are simply combined during encoding. For example, \citet{yao-wan-2020-multimodal} utilized a multi-modal Transformer framework, concatenating the visual features from pre-trained image embeddings with textual inputs. Similarly, \citet{takushima2019multimodal} incorporated global image embeddings to construct multimodal feature representations for improving translation performance.

The second key direction leverages cross-modal interactive attention mechanisms. \citet{nishihara2020supervised} enhanced translation by allowing the model to attend to both token-level textual information and region-level visual features. \citet{zhao2022region} further extended this line of research by proposing a cross-modal interaction module, integrating visual and textual features through region-level and word-level attention mechanisms.

The third prominent approach is the gated fusion mechanism, which dynamically controls the contributions of different modalities based on their contextual importance. For instance, \citet{wu2021goodmisconceivedreasonsempirical} proposed a multimodal fusion method using independently encoded text and image representations, integrating them through a gating mechanism. Building upon this, \citet{lin2020dynamic} designed a model that incorporated dynamic context-guided capsule networks for robust visual feature extraction, followed by a gating mechanism to align and fuse modalities.

Despite these developments, prior work has largely focused on architectural improvements to enhance multimodal embeddings without explicitly addressing the uncertainty inherent in cross-modal alignment. As noted in Table~\ref{table:uncertainty change after}, visual information can sometimes increase, rather than decrease, translation uncertainty, emphasizing the need for uncertainty-aware multimodal translation frameworks.

\subsection{Multimodal Uncertainty Learning}
Uncertainty modeling has become an essential tool for quantifying model confidence and managing ambiguity, particularly in high-stakes AI applications. Traditional uncertainty estimation techniques are well-established in tasks like multi-class classification~\citep{sensoy2018evidential}, where methods such as Dirichlet-based evidential learning enable models to represent and quantify classification uncertainty effectively.

Recently, the field has expanded to multimodal uncertainty learning, focusing on integrating uncertainty estimation into multimodal tasks. For instance, \citet{jung2024beyond} proposed a Bayesian framework for generalizing uncertainty estimation to multimodal settings, achieving state-of-the-art results in uncertainty-aware learning. \citet{ji2023map} introduced MAP, a multimodal uncertainty-aware vision-language pre-training model, modeling sequence-level interactions between visual and textual data to align probabilistic representations. \citet{gao2024embracing} further emphasized the importance of aleatoric uncertainty in multimodal fusion, demonstrating its impact on improving prediction robustness across different modalities.

Beyond vision-language models, specific multimodal applications such as emotion recognition~\citep{chen2022modeling} and intention detection~\citep{trick2019multimodal} have introduced task-specific uncertainty modeling frameworks. \citet{chen2022modeling} designed a hierarchical uncertainty module that captures both context-level and modality-level uncertainties, enabling more accurate predictions in conversational scenarios. \citet{trick2019multimodal} proposed an uncertainty-reduction pipeline for intention recognition, demonstrating that explicit cross-modal uncertainty management significantly improves system robustness.

Furthermore, \citet{ott2018analyzing} and \citet{wang2020inference} focus on uncertainty in neural machine translation (NMT), addressing uncertainty calibration from the perspectives of data distribution and the inference stage, respectively. \citet{ott2018analyzing} leverage uncertainty estimation tools to measure distributional discrepancies in the data and tackle the tendency of NMT models to assign low probabilities to low-frequency words in the predictive distribution, which results in a lack of diversity in the translation outputs. \citet{wang2020inference} propose a stepwise label smoothing method to quantify confidence calibration bias in NMT during the inference stage.

Despite these advancements, the application of multimodal uncertainty learning to complex generation tasks, such as machine translation, remains underexplored. Unlike classification tasks with finite output classes, translation involves generating sequences over an effectively infinite output space, where ambiguities propagate across tokens. This fundamental challenge necessitates new strategies for token-level uncertainty modeling and sequence-level uncertainty fusion, as proposed in this paper.

\section{Methods}
% \subsection{Gated Fusion and Revisit MMT}
% The Gated Fusion model\citep{wu2021goodmisconceivedreasonsempirical} enhances translation by utilizing a gating mechanism to integrate visual and textual information. The textual representations \(H_{\text{text}}\) are fused with visual embeddings via a gating matrix \(\Lambda\), calculated as follows:
% \begin{equation} \Lambda = \sigma(W_\Lambda \text{Embed image}(z) + U_\Lambda H_{\text{text}}) \end{equation}
% The output representation for translation is then given by:
% \begin{equation} H = H_{\text{text}} + \Lambda \cdot \text{Embed image}(z) \end{equation}
% where \(\sigma\) is the sigmoid function, \(W_\Lambda\) and \(U_\Lambda\) are learnable parameters, and \(\text{Embed image}(z)\) denotes the visual feature embeddings obtained from a pre-trained CNN model.

% The Revisit model leverages image retrieval to enhance translation accuracy, formulated as:
% \begin{equation} p(y | x, z) = \prod_{i} p_\theta(y_i | x, z, y_{<i}) \end{equation}
% This formulation allows the incorporation of the most relevant visual content to aid each translation step.

\subsection{Uncertainty Learning for Translation Models}
To achieve our goal of reducing uncertainty through new modalities in MMT, the model must first be able to quantify uncertainty. In multi-class classification tasks, modeling uncertainty with Dirichlet distributions is effective and well-established \citep{sensoy2018evidential}. However, machine translation presents unique challenges for uncertainty modeling. Unlike classification, which operates on a fixed and bounded label space, translation generates sequences over a large vocabulary, so the space of possible outputs is effectively infinite. Each token-level prediction depends not only on the source input but also on all prior tokens in the output sequence, creating cascading dependencies. This sequential nature amplifies uncertainty, as small ambiguities in earlier tokens can propagate and affect subsequent predictions. Furthermore, the integration of visual modalities introduces richer yet noisier feature spaces, which complicates precise uncertainty estimation. To address uncertainty in this complex setting, we build on the principles of evidential learning but adapt them to sequence generation. The natural starting point is to transform the logits predicted at each time step during translation into a probabilistic model that quantifies the evidence for each candidate token.

In transformer-based translation models, the final outputs are typically a series of logits, which are the unnormalized scores for each word in the vocabulary. These logits represent the raw predictions of the model before any normalization or activation is applied. Upon passing these logits through a softmax layer, they are converted into probabilities, indicating the likelihood of each word being the correct one at a given position. 

Essentially, this turns each decoding step into a multi-class classification problem in which the classes are the \(V\) entries of the target vocabulary. Given this setup, the logits, denoted as \(z\), can be regarded as the model's predictions before normalization. We transform these logits into evidence values using a ReLU activation function: the logit \(z_w\) of each vocabulary entry \(w\) yields the evidence \(e_w = \mathrm{ReLU}(z_w)\). The evidence is then incremented by one to give the Dirichlet parameters \(\alpha_w = e_w + 1\). Because ReLU can return exactly zero, this shift guarantees that every \(\alpha_w\) is strictly positive, as the Dirichlet distribution requires. These Dirichlet parameters are then used in further computations to model the uncertainty and variability inherent in the translation task.

% By converting logits into evidence and subsequently to Dirichlet parameters, we leverage the natural multi-class characteristics of the logits, aligning them with a probabilistic framework that can effectively manage uncertainty and provide robust predictions.

Given these parameters, the uncertainty \(u\) and belief masses \(b_w\) for each word \(w\) in the vocabulary are formulated as follows:
\begin{equation}
b_w = \frac{\alpha_w - 1}{S} \quad \text{and} \quad u = \frac{V}{S},
\end{equation}
where \(S = \sum_{w=1}^{V} \alpha_w\) represents the Dirichlet strength and \(V\) denotes the size of the target vocabulary. The uncertainty is inversely related to the total evidence, encapsulating the ``I do not know'' stance when evidence is low.
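
As a minimal worked illustration (the numbers are hypothetical, chosen for easy arithmetic rather than taken from any experiment), consider a toy vocabulary of size \(V = 4\) and logits \(z = (3, -1, 1, 0)\) at a single decoding step:
\begin{align*}
e &= \mathrm{ReLU}(z) = (3, 0, 1, 0), \\
\alpha &= e + 1 = (4, 1, 2, 1), \\
S &= 4 + 1 + 2 + 1 = 8, \\
b &= \tfrac{1}{8}\,(3, 0, 1, 0) = (0.375,\; 0,\; 0.125,\; 0), \\
u &= \tfrac{V}{S} = \tfrac{4}{8} = 0.5 .
\end{align*}
The belief masses and the uncertainty sum to one, \(\sum_{w} b_w + u = 0.5 + 0.5 = 1\). In the extreme case where no logit is positive, all evidence vanishes, so \(S = V\) and \(u = 1\), the maximal uncertainty.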

While these formulations offer a sound theoretical basis for capturing uncertainty, their direct application to machine translation tasks proved challenging due to the nature of sequence generation. A naive implementation of uncertainty modeling would employ the Kullback-Leibler (KL) divergence term in the loss function to align the predicted Dirichlet distribution with the target probabilities. However, this method often over-penalizes high uncertainty predictions and discourages exploration in ambiguous contexts. This issue becomes particularly noticeable in translation tasks, where the immense vocabulary size amplifies the impact of such penalties. Tokens associated with synonyms, polysemy, or cultural nuances inherently exhibit higher uncertainty, and over-penalizing these cases can hinder the model’s ability to flexibly adapt to the diversity and complexity of language.

Moreover, relying solely on a Dirichlet-based uncertainty loss, without the standard cross-entropy loss, can significantly degrade performance during training. Cross-entropy explicitly measures the probability that the model assigns to the correct ground-truth token at each time step. This ensures token-level precision by driving the predicted distribution \(p(y_i \mid \text{context})\) toward the correct output \(y_i\). In translation tasks, where maintaining token-by-token alignment is critical, such direct supervision is indispensable. The uncertainty losses play a different role: \(\mathcal{L}_{\text{err}}\) (Equation~\ref{eq:Lerr}) focuses on the squared error between the predicted probabilities and the target distribution, while \(\mathcal{L}_{\text{var}}\) (Equation~\ref{eq:Lvar}) regularizes the Dirichlet variance. While these components improve the calibration of uncertainty, they do not maximize the likelihood of the correct token as directly as cross-entropy does, and thus fall short on token-level precision in sequence generation tasks.  %By replacing cross-entropy with uncertainty-driven losses, the model lacked a direct objective to align the predicted token distribution with the ground truth. This led to poor token-level precision and degraded the overall translation performance.

To address these challenges, we adopted a hybrid approach. Instead of treating uncertainty as the primary optimization goal, it was incorporated as an auxiliary regularization term into the standard cross-entropy loss. This adjustment preserved the strengths of cross-entropy for token-level accuracy while leveraging uncertainty regularization to calibrate the predictions for ambiguous tokens. Specifically, our final loss function comprises:

\begin{itemize} 
\item \(\mathcal{L}_{\text{err}}\): This measures the squared error between the true token distribution and the model's predicted token probabilities across all tokens in a sentence:
\begin{equation}
\mathcal{L}_{\text{err}} = \sum_{i=1}^{N} \sum_{w=1}^{V} (y_{iw} - \hat{p}_{iw})^2,
\label{eq:Lerr}
\end{equation}
where \(N\) is the number of tokens in the given sentence, \(y_{iw}\) is the \(w\)-th entry of the one-hot target distribution at position \(i\), and \(\hat{p}_{iw}\) is the predicted probability of vocabulary entry \(w\) at position \(i\).

\item \(\mathcal{L}_{\text{var}}\): This captures the uncertainty in predictions by incorporating the variance from the Dirichlet distribution across all tokens in the sentence:
\begin{equation}
\mathcal{L}_{\text{var}} = \sum_{i=1}^{N} \sum_{w=1}^{V} \frac{\hat{p}_{iw} (1 - \hat{p}_{iw})}{S_i + 1},
\label{eq:Lvar}
\end{equation}
where \(S_i = \sum_{w=1}^{V} \alpha_{iw}\) is the Dirichlet strength for the \(i\)-th token in the sentence.
\end{itemize}
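
To make the two terms concrete, the following sketch evaluates them for a single target token (\(N = 1\)). It assumes, for illustration only, that the predicted probability is the Dirichlet mean \(\hat{p}_{w} = \alpha_{w}/S\), as in standard evidential learning \citep{sensoy2018evidential}; the Dirichlet parameters and the target below are hypothetical toy values. With a vocabulary of four entries, \(\alpha = (4, 1, 2, 1)\), \(S = 8\), and the one-hot target \(y = (1, 0, 0, 0)\), we obtain \(\hat{p} = (0.5,\; 0.125,\; 0.25,\; 0.125)\) and
\begin{align*}
\mathcal{L}_{\text{err}} &= (1 - 0.5)^2 + 0.125^2 + 0.25^2 + 0.125^2 \\
&= 0.34375, \\
\mathcal{L}_{\text{var}} &= \frac{0.25 + 0.109375 + 0.1875 + 0.109375}{8 + 1} \\
&= \frac{0.65625}{9} \approx 0.0729 .
\end{align*}
Because of the denominator \(S + 1\), the variance term shrinks as the total evidence grows while \(\hat{p}\) stays fixed.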

The variance term in Equation~\ref{eq:Lvar} is pivotal: it pushes the model toward more confident predictions, encouraging it to exploit image data to manage uncertainty in multimodal contexts. However, simply adding this term can easily lead to overconfident predictions, which limits the performance improvement. To counter this, we adopt label-smoothed cross-entropy, also used by our baseline models, which prevents the model from becoming overly confident by assigning small probabilities to incorrect options. This allows us to manage uncertainty effectively without imposing rigid constraints on the prediction distributions, thus maintaining the quality of the translations.

The total loss function thus combines the label-smoothed cross-entropy \(\mathcal{L}_{\text{CE}}\) with these two components:
\begin{equation}
L(\Theta) = \mathcal{L}_{\text{CE}} + \lambda_1 \mathcal{L}_{\text{err}} + \lambda_2 \mathcal{L}_{\text{var}}.
\label{eq:Ltheta}
\end{equation}
where \(\lambda_1\) and \(\lambda_2\) weight the two regularization terms. This formulation enables the model to simultaneously minimize prediction errors and account for uncertainty, thereby calibrating the confidence of its predictions more comprehensively. By combining the multimodal translation logits with this loss function, our approach improves the robustness of token predictions under uncertain conditions.

\subsection{Uncertainty-Guided Multimodal Machine Translation} 

\begin{figure*}[!ht]
    \centering
    
    \includegraphics[width=\textwidth]{uai2025-template/Figures/figure1.drawio.pdf}
    \caption{Architecture of the proposed Uncertainty-Guided Multimodal Machine Translation (UG-MMT) framework. 
    The left side of the figure illustrates the multimodal translation pipeline, incorporating both textual and visual features. 
    Text sequences are processed via word and positional embeddings, while images are transformed into visual embeddings. 
    These features are fused using a Gated Fusion mechanism before being passed through the Transformer decoder for sequence generation. 
    The right panel highlights the uncertainty modeling process. Text-only and multimodal logits are transformed into evidence values via the ReLU activation function. A higher evidence value indicates stronger confidence, resulting in lower uncertainty. The figure also shows the computation of the relative uncertainty difference ($\Delta u$), where the color depth reflects the magnitude of $\Delta u$. Specifically, when multimodal uncertainty ($u_{\text{multi}}$) exceeds text-only uncertainty ($u_{\text{text}}$), $\Delta u > 0$, shown as deeper-colored nodes. In contrast, when $u_{\text{multi}} \leq u_{\text{text}}$, $\Delta u = 0$ due to the ReLU activation, effectively ignoring such cases.
    The uncertainty loss incorporates both absolute multimodal uncertainty ($u_{\text{multi}}$) and relative uncertainty difference ($\Delta u$), guiding the model to leverage visual features effectively for ambiguity reduction.
    %Our approach ensures that the visual modality contributes meaningfully to disambiguating linguistic ambiguities in the translation process
    }
    \label{fig:ug-mmt-architecture}
\end{figure*}

The overall architecture of our proposed framework is illustrated in Figure~\ref{fig:ug-mmt-architecture}. This framework integrates both textual and visual modalities through a Gated Fusion mechanism. Text sequences are processed via word and positional embeddings, while images are transformed into visual embeddings. These features are dynamically fused and passed through a Transformer decoder for sequence-level language generation.

Incorporating additional modalities like images into translation tasks is intended to help disambiguate and reduce uncertainty, thereby improving translation accuracy. However, after successfully integrating uncertainty modeling into the multimodal translation task, we observed that the inclusion of images did not consistently result in reduced uncertainty across various scenarios. %, see in Table~\ref{table:uncertainty change}\%. 
This observation indicated that the model was not effectively leveraging images to resolve textual ambiguities, contradicting the fundamental goal of multimodal translation. From the data presented in Table~\ref{table:uncertainty change after}, we can infer that the current translation models show some potential for using the visual modality to reduce uncertainty. %However, this objective remains somewhat ambiguous and under-realized, as the proportions indicate that increased uncertainty still occurs in a significant number of instances. 
This suggests that while the models have the capacity to improve translation accuracy through multimodal integration, the strategies for leveraging visual information are not yet fully optimized.

To address this challenge, we sought a metric that assesses the impact of images on uncertainty. We chose the difference in uncertainty between translations with and without images, with the objective that including images should consistently lower uncertainty. A straightforward approach is to add this difference to the loss function as a regularization term. Yet merely maximizing the uncertainty reduction risks the model overemphasizing the role of images. To prevent such an imbalance, we apply the ReLU function so that the regularization only activates when the multimodal uncertainty surpasses the text-only uncertainty:
\begin{equation}
\Delta u = \text{ReLU}\left(\frac{u_{\text{multi}}}{u_{\text{text}} + \epsilon} - 1.0\right)
\end{equation}
where $u_{\text{multi}}$ is the uncertainty of the multimodal translation, $u_{\text{text}}$ is the uncertainty of the text-only translation, and $\epsilon$ is a small constant that avoids division by zero. The ratio measures the relative change in uncertainty between the multimodal and text-only settings, so the regularization term activates only when multimodal uncertainty surpasses text-only uncertainty. Compared with directly subtracting the uncertainties ($u_{\text{multi}} - u_{\text{text}}$), the ratio-based form provides smoother and more balanced adjustments. Because it normalizes by $u_{\text{text}}$, the penalty depends on the relative rather than the absolute size of the uncertainties, which mitigates sensitivity to very large or very small values. It also avoids the abrupt gradient contributions that simple subtraction can produce, which enhances training stability and prevents the model from over-relying on images. Finally, the ReLU restricts optimization to cases where multimodal uncertainty truly exceeds text-only uncertainty, so the regularization targets only the scenarios that matter for reducing overall ambiguity.
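
A few hypothetical values illustrate the behavior of this term (all numbers are illustrative, and $\epsilon$ is neglected for readability):
\begin{align*}
u_{\text{text}} = 0.50,\; u_{\text{multi}} = 0.60 &: \quad \Delta u = \text{ReLU}(1.2 - 1.0) = 0.2, \\
u_{\text{text}} = 0.50,\; u_{\text{multi}} = 0.40 &: \quad \Delta u = \text{ReLU}(0.8 - 1.0) = 0, \\
u_{\text{text}} = 0.05,\; u_{\text{multi}} = 0.06 &: \quad \Delta u = \text{ReLU}(1.2 - 1.0) = 0.2 .
\end{align*}
The first and third cases receive the same penalty because the relative increase is $20\%$ in both, whereas the plain differences ($0.10$ and $0.01$) differ by an order of magnitude. The second case shows that no penalty is applied when the image already lowers uncertainty.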

Another critical component of the loss function is \(u_{\text{multi}}\), which explicitly penalizes high multimodal uncertainty during training. This term is crucial for ensuring that the additional modalities, particularly the visual inputs, actively contribute to reducing ambiguity within the translation process. Without directly enforcing an uncertainty penalty, the model might ignore the uncertainty from multimodal inputs or fail to optimize it effectively.

While \(\Delta u\) encourages relative uncertainty reduction to optimize the visual modality's contribution, \(u_{\text{multi}}\) focuses on the absolute multimodal uncertainty, playing a complementary role in the loss function. Incorporating \(u_{\text{multi}}\) ensures that the multimodal system minimizes overall uncertainty in every scenario, independent of the relative differences between modalities. This term serves several important purposes.

First, minimizing \(u_{\text{multi}}\) directly penalizes high levels of multimodal uncertainty, driving the model toward producing sharper and more confident probability distributions. These sharper distributions improve token-level precision during sequence generation, aligning with the goal of increasing prediction accuracy. By enforcing this absolute certainty, the model learns to construct more robust feature representations from both the textual and visual inputs. Second, \(u_{\text{multi}}\) prevents potential exploitation of the \(\Delta u\) term. When only a relative uncertainty difference is regularized, the model might retain an overall high uncertainty in multimodal predictions while artificially lowering \(\Delta u\). This could undermine the true goal of reducing ambiguities. The inclusion of \(u_{\text{multi}}\) ensures that uncertainty optimization is not just relative but also absolute, pushing the system toward reliably low uncertainty in multimodal contexts.

Overall, the inclusion of \(u_{\text{multi}}\) complements \(\Delta u\) by addressing both the absolute uncertainty minimization and the relative uncertainty difference, ensuring a more balanced and principled approach to optimizing multimodal predictions.
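
To illustrate with hypothetical values why the relative term alone may be insufficient, consider $u_{\text{text}} = 0.90$ and $u_{\text{multi}} = 0.85$, again neglecting $\epsilon$:
\begin{equation*}
\Delta u = \text{ReLU}\left(\tfrac{0.85}{0.90} - 1\right) \approx \text{ReLU}(-0.056) = 0.
\end{equation*}
In this case the relative term provides no training signal even though the multimodal prediction remains highly uncertain, while the absolute term $u_{\text{multi}} = 0.85$ still contributes to the loss and continues to push the uncertainty down.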

To integrate this into our training process, we define the loss function as follows:
\begin{equation}
\mathcal{L} = u_{\text{multi}} + \beta \cdot \Delta u + \lambda \cdot L(\Theta)
\end{equation}

where \( \beta \) is a scaling factor dependent on the training epoch, \( \lambda \) weights the regularization term, and \( L(\Theta) \) is the regularization term defined in Section~\ref{eq:Ltheta}.

\begin{algorithm}
\caption{Uncertainty-Guided Multimodal Machine Translation}
\begin{algorithmic}[1]
\Require Text logits \(z_{\text{text}}\), multimodal logits \(z_{\text{multi}}\)
\Ensure Loss \(\mathcal{L}\) that encourages the new modality to reduce uncertainty
\State \(e_{\text{text}} \gets \text{ReLU}(z_{\text{text}}) + 1\)
\State \(e_{\text{multi}} \gets \text{ReLU}(z_{\text{multi}}) + 1\)
\State \(u_{\text{text}} \gets \frac{V}{\sum e_{\text{text}}}\)
\State \(u_{\text{multi}} \gets \frac{V}{\sum e_{\text{multi}}}\)
\State \(\Delta u \gets \text{ReLU}\left(\frac{u_{\text{multi}}}{u_{\text{text}} + \epsilon} - 1.0\right)\)
\State \(\mathcal{L} \gets u_{\text{multi}} + \beta \cdot \Delta u + \lambda \cdot L(\Theta)\)
\end{algorithmic}
\end{algorithm}
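
The steps of the algorithm can be traced on a toy example. The logits below are hypothetical, a vocabulary of size $V = 4$ is assumed only for readability, and $\epsilon$ is neglected. For $z_{\text{text}} = (2, -1, 0, 1)$ and $z_{\text{multi}} = (5, -1, 0, 1)$:
\begin{align*}
e_{\text{text}} &= (3, 1, 1, 2), & u_{\text{text}} &= \tfrac{4}{7} \approx 0.571, \\
e_{\text{multi}} &= (6, 1, 1, 2), & u_{\text{multi}} &= \tfrac{4}{10} = 0.400, \\
\Delta u &= \text{ReLU}(0.7 - 1) = 0. & &
\end{align*}
Here the image sharpens the prediction, so the $\Delta u$ term is inactive. If instead $z_{\text{multi}} = (1, -1, 0, 1)$, then $e_{\text{multi}} = (2, 1, 1, 2)$, $u_{\text{multi}} = 4/6 \approx 0.667$, and $\Delta u = \text{ReLU}(7/6 - 1) \approx 0.167$, so the penalty is applied.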
\section{Experiments}

\subsection{Dataset}
In this section, we evaluate our framework on the widely used Multi30K benchmark \citep{W16-3210}. The training and validation sets consist of 29,000 and 1,014 instances, respectively. We evaluate on Test2016, Test2017 (1,000 instances), and MSCOCO \citep{elliott-EtAl:2017:WMT} (461 challenging out-of-domain samples). Following \citet{wu2021goodmisconceivedreasonsempirical}, we merge the source and target sentences in the officially preprocessed version of Multi30K to build a joint vocabulary. We then apply the byte pair encoding (BPE) algorithm \citep{sennrich2015neural} with 10,000 merge operations to segment words into subwords, which yields a vocabulary of 9,712 (9,544) tokens for En-De (En-Fr).

\subsection{Setup}

Our experimental setup closely follows that of Gated Fusion and Revisit MMT \citep{wu2021goodmisconceivedreasonsempirical}, keeping all other variables fixed so that the impact of our introduced component can be isolated. For optimization, we used the Adam optimizer with hyperparameters $\beta_1 = 0.9$ and $\beta_2 = 0.98$. The learning rate increased linearly from $10^{-7}$ to 0.005 during the warm-up phase and then decayed in proportion to the number of updates.
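
The schedule can be written compactly. As a sketch, assuming the inverse-square-root decay that is commonly paired with linear warm-up in Transformer training (both the decay form and the number of warm-up updates $T_w$ are assumptions made for illustration), the learning rate at update $t$ is
\begin{equation*}
\eta(t) =
\begin{cases}
\eta_0 + (\eta_{\max} - \eta_0)\, t / T_w, & t \le T_w, \\
\eta_{\max} \sqrt{T_w / t}, & t > T_w,
\end{cases}
\end{equation*}
with $\eta_0 = 10^{-7}$ and $\eta_{\max} = 0.005$.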

Each training batch contained up to 16,384 source/target tokens. We applied label smoothing with a weight of 0.1 and a dropout rate of 0.3 to prevent overfitting. Training stopped early if the validation loss did not improve for 10 consecutive epochs \citep{zhang2020neural}. During inference, we averaged the last 10 checkpoints and performed beam search with a beam size of 5 to select the best translation candidates. We report 4-gram BLEU and METEOR scores on all test sets. All models were trained and evaluated on a single machine with one RTX 4090 GPU (5--10 minutes for the entire training process).
\begin{table}[!ht]
    \centering
    \caption{Proportion of Instances Where Visual Modality Increased Uncertainty}
    \resizebox{\columnwidth}{!}{% Begin resizing
    \begin{tabular}{lcc}
        \toprule
        Models & Before UG-MMT Integration & After UG-MMT Integration \\ 
        \midrule
        Gated Fusion & 45.0\% of cases & 0.0\% of cases \\ 
        Revisit MMT  & 37.0\% of cases & 0.0\% of cases \\ 
        \bottomrule
    \end{tabular}
    }% End resizing
    
    \label{table:uncertainty change after}
    \vspace{0.5em}
\end{table}

\subsection{Results}
\begin{table*}[t]
\centering
\caption{Comparison with existing MMT systems on Multi30K dataset (B: BLEU, M: METEOR)}
\resizebox{\textwidth}{!}{
\begin{tabular}{l|cccccc|cccccc}
\toprule
System & \multicolumn{6}{c|}{En$\rightarrow$De} & \multicolumn{6}{c}{En$\rightarrow$Fr} \\
& \multicolumn{2}{c}{Test2016} & \multicolumn{2}{c}{Test2017} & \multicolumn{2}{c|}{MSCOCO} & \multicolumn{2}{c}{Test2016} & \multicolumn{2}{c}{Test2017} & \multicolumn{2}{c}{MSCOCO} \\
 & B & M & B & M & B & M & B & M & B & M & B & M \\
\midrule
\multicolumn{13}{c}{\textit{Existing Traditional MMT Systems}} \\
\midrule
Multimodal Self-attn \citep{yao-wan-2020-multimodal} & 41.02 & - & 33.36 & - & 29.88 & - & 61.8 & - & 53.46 & - & 44.52 & - \\
Gated Fusion\textsuperscript{$\diamond$} \citep{wu2021goodmisconceivedreasonsempirical}& 41.56 & 68.17 & 32.74 & 60.99 & 29.04 & 56.00 & 61.05 & 80.1 & 54.09 & 75.47 & 44.25 & 69.12 \\
Revisit MMT\textsuperscript{$\diamond$} \citep{wu2021goodmisconceivedreasonsempirical}& 40.8  & 68.01 & 32.94 & 61.33 & 28.83 & 56.02 & 62.05 & 81.12 & 53.79 & 76.28 & 44.87 & 69.33 \\
%Selective Attn\citep{li2022visionfeaturesmultimodalmachine} & 41.84 & 68.64 & \textbf{34.32} & 61.42 & \textbf{30.22} & \textbf{56.91} & 62.24 & 81.41 & 54.52 & 76.30 & 44.82 & \textbf{70.63} \\
IKD-MMT \citep{peng2023distillimagenowhereinversion} & 41.28 & 58.93 & \textbf{33.83} & 53.21 & \textbf{30.17} & 48.93 & - & - & - & - & - & - \\
%VALHALLA(V)\citep{li2022valhalla} & 41.9 & 69.03 & \textbf{34.10} & \textbf{62.30} & \textbf{30.30} & \textbf{57.20} & 62.2 & 81.23 & 55.1 & 75.80 & \textbf{45.60} & \textbf{71.1} \\
MGNMT(TF PCL-O) \citep{YIN2023103986} & 40.4 & 58.4 & 32.5 & 52.0 & 29.0 & 48.5 & 61.3 & 75.8 & 54.4 & 70.7 & - & - \\
%ProMul-Trans\citep{10202177} &42.0& 59.4 &34.0&52.5& 30.2 &49.6 &62.3& 77.2 &54.0& 72.0& 45.3 &66.4 \\
%ConVisPiv\citep{GUO2024106403} & 42.64 & 60.56 & 34.84 & 54.72 & 29.69 & 50.12 & 62.56 & 77.09 & 55.83 & 73.18 & 46.10 & 67.67 \\
RG-MMT-EDC \citep{10401981} & 42.00 & 60.20 & 33.40 & 53.70 & 30.00 & 49.60 & \textbf{62.90} & 77.20 & \textbf{55.80} & 72.00 & 45.10 & 64.90 \\
\midrule
\textbf{UG-MMT+Gated Fusion} (Ours) & \textbf{42.82} & \textbf{69.11} & \underline{33.78} &\textbf{61.49} & 28.93 & \underline{56.03} & 62.01 & \underline{81.41} & 54.43 & \underline{76.47} & \textbf{45.31} & \textbf{69.93} \\
\textbf{UG-MMT+RMMT} (Ours) & \underline{42.01} & \underline{68.59} & 33.2 & \underline{61.44} & \underline{30.01} & \textbf{56.6} & \underline{62.28} & \textbf{81.56} & \underline{54.47} & \textbf{76.67} & \underline{45.16} & \underline{69.76} \\
\bottomrule
\end{tabular}   }
\vspace{0.5em}
\caption*{\footnotesize Note: \textsuperscript{$\diamond$} denotes previous MMT methods that we reproduced under the settings described in the Setup subsection. Best results are shown in \textbf{bold}, and second-best results are \underline{underlined}. `-' indicates unavailable results.}
\label{table:sota_comparison}
\end{table*}
To position our work within the broader context of multimodal translation research, we compared our approach with existing MMT systems. Table~\ref{table:sota_comparison} shows the comparison on the Multi30K dataset. Notably, our UG-MMT-enhanced models achieve the highest BLEU score among the compared systems on Test2016 for En$\rightarrow$De translation (42.82). This result supports the effectiveness of our uncertainty-guided approach and demonstrates its potential to advance the field of multimodal translation.

%Furthermore, our method has been experimentally verified to successfully eliminate instances of increased uncertainty, achieving consistent uncertainty reduction across all cases. 
In our preliminary analysis of existing multimodal translation models, we observed a concerning phenomenon: visual information frequently increased uncertainty in translation decisions. After integrating our UG-MMT framework, this pattern changes dramatically, as shown in Table~\ref{table:uncertainty change after}: the proportion of instances in which the visual modality increased uncertainty drops from 45.0\% (Gated Fusion) and 37.0\% (Revisit MMT) to 0.0\%. This transformation in uncertainty management precedes and directly contributes to improved translation performance, suggesting that uncertainty reduction serves as a driving force for enhanced translation quality rather than merely being a byproduct of better translations. This relationship between uncertainty reduction and performance improvement is further supported by the experimental results in Table~\ref{table:performance}. The integration of UG-MMT yields improvements across multiple evaluation settings. In particular, for the En$\rightarrow$De translation task, we observe an improvement of 1.26 BLEU points on Test2016 with Gated Fusion, while RMMT shows a similar gain of 1.21 BLEU points. These improvements hold across nearly all test sets and both language pairs, the only exception being a 0.11 BLEU drop on MSCOCO En$\rightarrow$De with Gated Fusion, which demonstrates the robustness of our uncertainty-guided approach.
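
For reference, the Test2016 En$\rightarrow$De gains quoted above follow directly from the BLEU scores in Table~\ref{table:performance}; the relative figures are our own rounding:
\begin{align*}
\text{Gated Fusion:}\quad & 42.82 - 41.56 = 1.26 \quad (\approx 3.0\% \text{ relative}), \\
\text{RMMT:}\quad & 42.01 - 40.80 = 1.21 \quad (\approx 3.0\% \text{ relative}).
\end{align*}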

To further validate whether uncertainty-guided translation truly leads to more accurate and contextually appropriate translations, we conducted a detailed qualitative analysis. The examples in Table~\ref{tab:example2} provide concrete evidence of improved translation quality. In the first example, UG-MMT demonstrates superior verb disambiguation, correctly translating ``scanning'' where the baseline model incorrectly used ``winning''. The second example showcases improved handling of complex scene understanding, with more precise role identification and better context integration.


\begin{table*}[t] % The * indicates a two-column table, and [t] places it at the top of the page 
\centering  
    \caption{Effect of integrating UG-MMT into Gated Fusion and RMMT models on BLEU scores for En$\rightarrow$De and En$\rightarrow$Fr tasks.}
\begin{tabular}{c|l|c|c|c|c|c|c} 
\toprule & \multicolumn{1}{c|}{Model} & \multicolumn{3}{c|}{En$\rightarrow$De} & \multicolumn{3}{c}{En$\rightarrow$Fr} \\ \cline{3-8} \# & & Test2016 & Test2017 & MSCOCO & Test2016 & Test2017 & MSCOCO \\ 
\midrule \multicolumn{8}{c}{\textit{Baseline Models}} \\
\midrule 1 & Gated Fusion & 41.56 & 32.74 & 29.04 & 61.05 & 54.09 & 44.25 \\  
2 & RMMT & 40.8 & 32.94 & 28.83 & 62.05 & 53.79 & 44.52 \\
\midrule \multicolumn{8}{c}{\textit{Baseline Models With UG-MMT}} \\
\midrule
3 & Gated + UG & 42.82 $\textcolor{green}{\uparrow}$\textcolor{green}{1.26} & 33.78 $\textcolor{green}{\uparrow}$\textcolor{green}{1.04} & 28.93 $\textcolor{red}{\downarrow}$\textcolor{red}{0.11} & 62.01 $\textcolor{green}{\uparrow}$\textcolor{green}{0.96} & 54.43 $\textcolor{green}{\uparrow}$\textcolor{green}{0.34} & 45.31 $\textcolor{green}{\uparrow}$\textcolor{green}{1.06} \\
4 & RMMT + UG & 42.01 $\textcolor{green}{\uparrow}$\textcolor{green}{1.21} & 33.2 $\textcolor{green}{\uparrow}$\textcolor{green}{0.26} & 30.01 $\textcolor{green}{\uparrow}$\textcolor{green}{1.18} & 62.33 $\textcolor{green}{\uparrow}$\textcolor{green}{0.28} & 54.47 $\textcolor{green}{\uparrow}$\textcolor{green}{0.68} & 45.16 $\textcolor{green}{\uparrow}$\textcolor{green}{0.64} \\
\bottomrule
\end{tabular}
\caption*{\footnotesize Note: All baseline models were re-implemented and evaluated in our experimental environment using identical hyperparameters as specified in Section 4.1. Green arrows ($\textcolor{green}{\uparrow}$) indicate improvements over our re-implemented baselines, while red arrows ($\textcolor{red}{\downarrow}$) indicate decreased performance.}
    \label{table:performance}
\end{table*}

\begin{table*}[t]
\centering
\caption{Example of translation improvement using UG-MMT}
\renewcommand{\arraystretch}{1.2} % increase row height
\setlength{\tabcolsep}{6pt} % increase column spacing
\resizebox{\textwidth}{!}{
\begin{tabular}{c p{0.8\textwidth}}
\hline
\raisebox{-1.3\height}{\includegraphics[width=0.15\textwidth]{uai2025-template/images/example1.jpg}} & % vertically center the image
\begin{tabular}[t]{@{}l@{}}
\textbf{SRC:} \underline{the gentleman is scanning the image} that the woman in the blue shirt is providing him. \\

\textbf{MMT:} der herr \sout{gewinnt}  das bild der frau im blauen hemd, \sout{die ihn anhält}. \\

\textit{(The gentleman wins the image of the woman in the blue shirt who stops him.)} \\

\textbf{UG-MMT:} der herr \textbf{scannt} das bild von der frau im blauen hemd. \\
\textit{(The gentleman scans the image from the woman in the blue shirt.)} \\

\textbf{REF:} der herr scannt das bild, das ihm die frau im blauen hemd zeigt. \\
\textit{(The gentleman scans the image that the woman in the blue shirt shows him.)}
\end{tabular} \\
\end{tabular}
}
\resizebox{\textwidth}{!}{
\begin{tabular}{c p{1\textwidth}}
\hline
\raisebox{-0.9\height}{\includegraphics[width=0.15\textwidth]{uai2025-template/images/example2.jpg}} & % vertically center the image
\begin{tabular}[t]{@{}l@{}}
\textbf{SRC:} \underline{a clerk in a convenience store} asks \underline{a customer buying alcohol} for his age and identification. \\

\textbf{MMT:} ein \sout{kunde} in einem \sout{nachbarschaftsladen schreibt}  einen kunden \sout{für seine alkohol und identisch}. \\
\textit{(A customer in a neighborhood store writes a customer for his alcohol and identical.)} \\

\textbf{UG-MMT:} ein \textbf{verkäufer} in einem \textbf{laden fragt einen kunden nach seinem ausweis} beim alkoholkauf. \\
\textit{(A clerk in a store asks a customer for his identification when buying alcohol.)} \\

\textbf{REF:} ein mitarbeiter in einem laden fragt einen kunden, der alkohol kauft, nach seinem alter und einem ausweis. \\
\textit{(An employee in a store asks a customer who is buying alcohol for his age and identification.)}
\end{tabular} \\
\hline
\end{tabular}}

\label{tab:example2}
\end{table*}

\section{Analysis}
\subsection{Ablation Study}
\begin{table}[h]
\centering
\caption{Ablation Study Results on Test2016}
\begin{tabular}{cccc|c}
\toprule
$u_{\text{multi}}$ & $L(\Theta)$ & $\Delta u$ & BLEU & $\Delta$ \\
\midrule
& & & 41.57 & - \\
\checkmark & & & 42.07 & \textcolor{green}{+0.50} \\
& \checkmark & & 41.79 & \textcolor{green}{+0.22} \\
& & \checkmark & 40.80 & \textcolor{red}{-0.77} \\
\checkmark & \checkmark & \checkmark & \textbf{42.82} & \textcolor{green}{+1.25} \\
\bottomrule
\end{tabular}
\label{tab:ablation}
\end{table}
To systematically evaluate the contribution of each component in the proposed UG-MMT framework, we performed ablation experiments using Gated Fusion as the baseline model. By introducing different components of UG-MMT (\(u_{\text{multi}}\), \(\Delta u\), and \(L(\Theta)\)) individually and in combination, we analyzed their effects on translation performance, measured by BLEU scores on the Test2016 dataset. The results of our experiments are summarized in Table~\ref{tab:ablation}.

Introducing \(u_{\text{multi}}\) alone led to a BLEU score improvement from 41.57 to 42.07 (\(+0.50\)). This improvement can be attributed to \(u_{\text{multi}}\) encouraging the model to minimize token-level uncertainty during sequence generation. By decreasing the overall uncertainty, the model is incentivized to prioritize predictions with stronger evidence \(e_k\). This preference for confident predictions forces the model to output sharper probability distributions, favoring correct predictions while suppressing incorrect ones. The prioritization of low-uncertainty predictions also amplifies the training signal on errors: when the model makes an incorrect prediction, the sharpness of the prediction results in a higher cross-entropy loss than in the standard setting. This reinforcement effect yields stronger gradient signals, encouraging the model to improve its generation consistency over time. As a result, \(u_{\text{multi}}\) contributes to model convergence and robustness by aligning predictions with token-level confidence and evidence.

On the other hand, incorporating \(L(\Theta)\) alone improved the BLEU score to 41.79 (\(+0.22\)). The relatively modest improvement indicates that \(L(\Theta)\) acts primarily as a stabilizing regularization term rather than directly optimizing for accuracy. By leveraging Dirichlet-based evidential learning \citep{sensoy2018evidential}, \(L(\Theta)\) helps to balance uncertainty distributions across predictions, particularly in challenging translation tasks. This regularization ensures that the uncertainty model remains well-calibrated, reducing overfitting while preparing the system to utilize uncertainty effectively in conjunction with other components.

When \(\Delta u\) was introduced as a standalone component, the BLEU score decreased to 40.80 (\(-0.77\)). This regression highlights the challenges of relying on cross-modal uncertainty differences without proper regularization. By definition, \(\Delta u\) quantifies the relative gap between the multimodal and text-only uncertainties. However, without a regularization term to keep the uncertainty estimates reliable, \(\Delta u\) fails to capture the ``true'' uncertainty gap and thus cannot meaningfully guide the optimization process. In essence, \(\Delta u\) requires calibrated uncertainty estimates from both settings to quantify their disparity. Without such calibration, the model cannot accurately assess the comparative ``value'' of the visual input for ambiguity resolution, leading to inconsistent predictions and compromised performance.

When all three components were integrated, the BLEU score increased to 42.82 (\(+1.25\)), the best result in the ablation, indicating a synergistic interaction among \(u_{\text{multi}}\), \(L(\Theta)\), and \(\Delta u\). Each component provides unique benefits:
\begin{itemize}
    \item \(u_{\text{multi}}\) encourages confident and low-uncertainty predictions at the token level, improving translation consistency.
    \item \(L(\Theta)\) ensures stable and well-calibrated uncertainty estimation, preventing overfitting and misalignment of cross-modal predictors.
    \item \(\Delta u\) reinforces the contribution of visual inputs by dynamically prioritizing uncertainty reduction across modal inputs, ensuring that image features are utilized effectively to resolve textual ambiguities.
\end{itemize}

The complete framework not only improves prediction accuracy but also ensures that the visual modality consistently contributes to reducing translation uncertainty, as shown by the elimination of uncertainty increases reported in Table~\ref{table:uncertainty change after}.

\subsection{Understanding Uncertainty}
In this section, we aim to demonstrate through experiments that our Uncertainty-Guided Multimodal Machine Translation (UG-MMT) framework possesses the ability to comprehend and manage uncertainty effectively. To establish this capability, we need examples that elicit high uncertainty outputs alongside those that convey low uncertainty.

Given our focus on translation tasks, a common scenario that naturally arises is the out-of-vocabulary (OOV) situation. OOV refers to cases where the translation encounters words not present in the existing vocabulary, analogous to the occurrence of unseen categories in multiclass classification tasks. These situations should theoretically prompt high uncertainty outputs, indicating the model's recognition of unfamiliar terms. Conversely, words well within the vocabulary should yield lower uncertainty, showing confidence in prediction. Hence, we utilize the OOV scenario as a test to verify whether our model can accurately understand and express uncertainty.
\begin{figure}[!htb]
  \centering
  \includegraphics[width=\linewidth]{uai2025-template/Figures/hot_w_cbar.drawio.pdf}
  
  \caption{Example of UG-MMT handling uncertainty}\label{fig:uncertaintyvalue}
\end{figure}

Through our experiments, particularly on the Test2017 and MSCOCO datasets, we observed that OOV words consistently resulted in elevated uncertainty scores. This explicit signaling reflects the model's cautious approach when faced with unknowns, dynamically incorporating contextual cues to refine predictions. For instance (see Figure~\ref{fig:uncertaintyvalue}), in translating ``a man in camouflage and a black hat mounting a horse,'' the term ``camouflage,'' which is absent from the dataset, induced a heightened uncertainty score (0.6), whereas more familiar terms like ``man'' showed minimal uncertainty (0.005). This distribution underscores the model's ability to distinguish between OOV words and familiar vocabulary, adapting its prediction strategy accordingly.
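
These two scores can be related through the definition $u = V / S$ used in our algorithm, where $S = \sum e$ denotes the total evidence. Assuming both scores are computed over the same $V$, we have $S = V / u$, so the ratio of total evidence is independent of $V$:
\begin{equation*}
\frac{S_{\text{man}}}{S_{\text{camouflage}}} = \frac{u_{\text{camouflage}}}{u_{\text{man}}} = \frac{0.6}{0.005} = 120.
\end{equation*}
Under this reading, the model accumulates roughly two orders of magnitude more evidence for the familiar token than for the OOV token.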

Quantitative analysis further confirmed that sentences containing high-uncertainty tokens typically achieved lower BLEU and METEOR scores. This correlation highlights the value of uncertainty flags in guiding the model to adjust its predictions amidst linguistic ambiguity. By enabling the system to recognize and act upon uncertainty, UG-MMT enhances both translation accuracy and reliability.

\section{Conclusion}
We proposed UG-MMT, an Uncertainty-Guided Multimodal Machine Translation framework that systematically integrates uncertainty modeling into multimodal translation tasks. By explicitly modeling token-level and sequence-level uncertainties, UG-MMT ensures effective utilization of visual information to resolve linguistic ambiguities. In our experiments, UG-MMT eliminates the cases in which the visual modality increases uncertainty and achieves the highest En$\rightarrow$De BLEU score on Multi30K Test2016 among the compared systems. These results highlight the importance of combining uncertainty modeling with cross-modal fusion, paving the way for more robust applications of multimodal translation.


\begin{acknowledgements} % will be removed in pdf for initial submission,
						 % (without ‘accepted’ option in \documentclass)
                         % so you can already fill it to test with the
                         % ‘accepted’ class option
    This work was supported by the National Natural Science Foundation of China under Project 62476089.
\end{acknowledgements}

% References
\bibliography{uai2025-template}



\end{document}
