\documentclass{uai2023} % for initial submission
% \documentclass[accepted]{uai2023} % after acceptance, for a revised
                                    % version; also before submission to
                                    % see how the non-anonymous paper
                                    % would look like

%% There is a class option to choose the math font
% \documentclass[mathfont=ptmx]{uai2023} % ptmx math instead of Computer
% Modern (has noticeable issues)
% \documentclass[mathfont=newtx]{uai2023} % newtx fonts (improves upon
 % ptmx; less tested, no support)
% NOTE: Only keep *one* line above as appropriate, as it will be replaced
%       automatically for papers to be published. Do not make any other
%       change above this note for an accepted version.

%% Choose your variant of English; be consistent
\usepackage[american]{babel}
% \usepackage[british]{babel}

%% Some suggested packages, as needed:
%\usepackage{natbib} % has a nice set of citation styles and commands
%    \bibliographystyle{plainnat}
%    \renewcommand{\bibsection}{\subsubsection*{References}}
\usepackage{mathtools} % amsmath with fixes and additions
% \usepackage{siunitx} % for proper typesetting of numbers and units
\usepackage{booktabs} % commands to create good-looking tables
\usepackage{tikz} % nice language for creating drawings and diagrams
\usepackage[numbers]{natbib}
% for cross referencing the main text
% PLEASE ONLY USE xr IN THE SUPPLEMENTARY MATERIAL. 
% In the main paper, hard code any cross-reference to the supplementary material. 
\usepackage{xr} 
\externaldocument{main}

%% Provided macros
% \smaller: Because the class footnote size is essentially LaTeX's \small,
%           redefining \footnotesize, we provide the original \footnotesize
%           using this macro.
%           (Use only sparingly, e.g., in drawings, as it is quite small.)

%% Self-defined macros
\newcommand{\swap}[3][-]{#3#1#2} % just an example

\title{Rebuttal}

% The standard author block has changed for UAI 2023 to provide
% more space for long author lists and allow for complex affiliations
%
% All author information is automatically removed by the class for the
% anonymous submission version of your paper, so you can already add your
% information below.
%
% Add authors
\author[1]{\href{mailto:<jj@example.edu>?Subject=Your UAI 2023 paper}{Jane~J.~von~O'L\'opez}{}}
\author[1]{Harry~Q.~Bovik}
\author[1,2]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[1]{Further~Coauthor}
\author[3]{Further~Coauthor}
\author[3,1]{Further~Coauthor}
% Add affiliations after the authors
\affil[1]{%
    Computer Science Dept.\\
    Cranberry University\\
    Pittsburgh, Pennsylvania, USA
}
\affil[2]{%
    Second Affiliation\\
    Address\\
    …
}
\affil[3]{%
    Another Affiliation\\
    Address\\
    …
  }
  
  \begin{document}
  
\onecolumn %% Turn this off if single column is desired for the supplement
\maketitle

\appendix
\section{Reviewer 7Br3 -- confidence 2}
\textbf{Overall, I have a positive impression of the paper. Perhaps more details and analysis about the joint framework. For example, how do the authors deal with the difference in the dimensionality between visual and textual features?} \\

Q: How the multimodal features are processed. \\
A: Thank you for your comments. In Eq. 4, we apply a linear layer followed by a ReLU activation to project the visual and textual features into a common dimension. We will also release our code and dataset with the final version of the paper (the rebuttal policy does not allow external links here).
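
As an illustration, and assuming one such layer per modality, the projection can be sketched as follows. The symbols $x_v$ (visual feature of dimension $d_v$), $x_t$ (textual feature of dimension $d_t$), $W_v$, $W_t$, $b_v$, $b_t$, and the shared dimension $d$ are notation introduced for this response and may differ from the notation of Eq.~4:
\begin{align*}
  h_v &= \mathrm{ReLU}(W_v x_v + b_v), & W_v &\text{ is a } d \times d_v \text{ matrix},\\
  h_t &= \mathrm{ReLU}(W_t x_t + b_t), & W_t &\text{ is a } d \times d_t \text{ matrix},
\end{align*}
so that $h_v$ and $h_t$ both have dimension $d$ regardless of the original dimensions $d_v$ and $d_t$.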


\section{Reviewer 6kSQ -- confidence 2}
\textbf{There are aspects of the work that I did not quite follow - in particular, for example, where does the knowledge graph come from?} \\

Q: The source of the knowledge graph. \\
A: Thank you for your comments. Following previous work \cite{SIGIR22-MEL}, we extract our knowledge graph from Wikidata, a knowledge base containing information on a large number of entities \cite{wikidata}. We have also carefully revised the paper and corrected the typos.


\section{Reviewer X4AE -- confidence 4}
\textbf{The proposed modules and learning mechanisms are not innovative enough. For example, cross-modal attention is similar to the existing multimodal work. It is recommended to provide more details in the Supplementary Material.} \\

Q: Innovation and Details. \\
A: Thank you for your comments. In this paper, we study a new multi-mention scenario in the entity linking task and propose a joint learning framework to address it. We focus on handling multi-mention cases efficiently and leave innovations in cross-modal fusion to future work. Our main technical contributions are a pairwise training scheme and a multi-mention collaborative ranking module. Specifically, the pairwise training scheme uses contrastive learning to measure the correlation between two different mentions during training. The multi-mention collaborative ranking module, which is based on similarity scores, handles cases with more than two mentions to be linked. % Although the technique of contrastive learning has been applied to various kinds of tasks, 
We find that these methods can be adapted to efficiently measure the semantic correlation between candidate entities of different mentions. We will also revise our supplementary material to include more implementation details. Finally, we have uploaded our dataset and code to GitHub and will add the link in the revised paper to facilitate further research (the rebuttal policy does not allow external links here).
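
As a purely illustrative sketch of such a pairwise scheme, consider the positive and negative candidates $e_1^{pos}, e_1^{neg}$ of mention 1 and $e_2^{pos}, e_2^{neg}$ of mention 2. The margin $\gamma$, the similarity function $s(\cdot,\cdot)$, and the hinge form below are assumptions made for this response; the exact objective is the one given in the paper. A contrastive objective of this kind could take the form
\begin{align*}
  \mathcal{L}_{pair} ={}& \max\bigl(0,\ \gamma - s(e_1^{pos}, e_2^{pos}) + s(e_1^{pos}, e_2^{neg})\bigr) \\
  &+ \max\bigl(0,\ \gamma - s(e_1^{pos}, e_2^{pos}) + s(e_2^{pos}, e_1^{neg})\bigr),
\end{align*}
which pulls the two positive entities together while pushing each positive entity away from the other mention's negative entity.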

% out of the scope of this paper

\section{Reviewer KBKs -- confidence 3}
\textbf{The authors claim that unlike previous approaches that separately embed the context and entities, their approach jointly embeds the textual features of both the entities and the context. Can the authors comment on the impact of this on the runtimes? Typo in the related work section on entity linking - ``which actually widely exit (exist)''} \\

Q: Impact on the runtime. \\
A: Thank you for your comments. We discuss the runtime for training and testing separately. During training, our joint learning framework allows the model to be trained in parallel, with no additional training-time overhead. \textcolor{red}{We also observe that the baseline GHMFC \cite{SIGIR22-MEL} reaches its best state after about 70 epochs at about 110s per epoch on the Wiki-MEL dataset, while our method takes only 11 epochs at about 204s per epoch. These results indicate that our method converges faster than this baseline. }
During testing, the running time grows with the number of candidate entities, since we must concatenate the mention with each candidate entity to obtain the corresponding representations. Our method therefore does require more testing time. However, our task focuses on fine-grained ranking measured by the top-1 metric rather than coarse ranking such as the top-20 metric, so a small candidate set suffices. Following \cite{wikidiverse}, when given 10 candidate entities, the baseline GHMFC \cite{SIGIR22-MEL} takes 63.2s to process the whole Wiki-MEL test set of 5,256 cases, while our method takes 174.5s. 
% In our experiments, to be consistent with the previous works, we set the number of candidate entities to 100 and the baseline GHMFC \cite{SIGIR22-MEL} may take 200s to process the whole test set with about 5000 cases in the Wiki-MEL dataset, while our method may take 25mins. Whereas, it is worth noting that although our method requires more testing time, the difference in runtime between GHMFC and our method can be reduced by reducing the number of candidate entities, which is a feasible way since our task focuses more on the fine ranking with the top-1 metric rather than the coarse ranking like the top-20 metric. 
We will add this runtime comparison to the supplementary material in the final version. 
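
For reference, the figures above imply the following approximate totals (simple arithmetic on the reported numbers, not additional measurements):
\begin{align*}
  \text{training, GHMFC:} &\quad 70 \times 110\,\text{s} = 7700\,\text{s}, \\
  \text{training, ours:} &\quad 11 \times 204\,\text{s} = 2244\,\text{s} \quad (\text{about } 3.4 \text{ times less}), \\
  \text{testing, GHMFC:} &\quad 63.2\,\text{s} / 5256 \approx 12.0\,\text{ms per case}, \\
  \text{testing, ours:} &\quad 174.5\,\text{s} / 5256 \approx 33.2\,\text{ms per case} \quad (\text{about } 2.8 \text{ times more}).
\end{align*}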


\section{Reviewer 7KRu -- confidence 4}
\textbf{1. The multi-mention entity linking problem has been the subject of active research. I'm not sure if the paper can claim to be the first study. See below for citations. Regarding Weakness \#1: The multi-mention entity linking problem has been the subject of active research. This is one of the major problems of interest in the entity linking community, with techniques ranging from graphical models to linear programming to pagerank. There are so many previous works on this, just to cite a few papers that all have hundreds of citations: references} \\
\textbf{2. The multimodal aspect is interesting, but I would appreciate a broader discussion on its applicability. Are we restricted to only mentions with pictures? It seems that the paper only focuses on people entities. In what applications do we have such pictures?} \\
\textbf{3. The method for "multi-mention collaborative ranking"-- Can you explain a bit more why it makes sense to first sort the mentions and then add the similarity scores for the top candidates? I don't fully follow how this procedure address multiple mentions and what exactly it is implementing (I think it ends up implementing some kind of voting by using multiple mentions' scores, but not sure about this because this is only done at test time.)} \\
\textbf{4. I am curious what kind of accuracy can be achieved if we construct a face recognition system on the image part of the data. How strong is the current image component compared with an off-the-shelf face recognition system?} \\
\textbf{5. Weakness 1 must be addressed. The entire field has been very much concerned with multiple mentions and collective inference for a decade, so the paper cannot claim to be a "first study."} \\


Q1: Incorrect use of ``first study'' with respect to collective entity linking.\footnote{1. Apologize for the insufficient investigation; recently, plenty of works have focused on the multimodal entity linking scenario... 2. Customize a traditional collective entity linking method, NCEL, and attribute the reason to 1) NCEL being suited to documents with many mentions and 2) contrastive learning.} \\
A1: Thank you for your suggestions. First, we have carefully surveyed the work on collective entity linking from the past decade, and we apologize for claiming ``the first study on collective entity linking'', which resulted from our insufficient survey. However, whereas the previous collective entity linking task involves only textual information, we found that the multi-mention scenario also arises in the multimodal entity linking task, which has attracted considerable interest recently. We are therefore the first to study multi-mention entity linking in the multimodal scenario. 

Second, following your comments, we also adapt a collective entity linking method, NCEL \cite{COLING18-NCEL}, to our datasets and evaluate it. NCEL adopts a GCN to model the connections between the candidate entities of the current mention and those of neighboring mentions. It achieves only 2.1, 10.6, 21.1, and 41.3 on the Wiki-MEL dataset and 8.2, 11.3, 18.4, and 31.5 on the NYTimes-MEL dataset for the top-1, top-5, top-10, and top-20 metrics, respectively. We attribute this poor performance to two reasons. The first is that traditional collective entity linking methods \cite{COLING18-NCEL, IJCAI19-CEL} design various modules to measure the connections among candidate entities of different mentions; they target document-level linking and suit cases with many mentions (more than 12 mentions per context). In contrast, the current multimodal entity linking task is based on sentence-level linking with no more than 5 mentions, and a sentence usually has only one mention. 
% Therefore, the traditional collective entity linking approaches do not perform well under our circumstances. 
The second is that these collective entity linking approaches 
% pay more attention to the connection among different candidate entities, but 
ignore the negative impact of negative candidate entities. In contrast, the contrastive learning in our framework, 
% aims at narrowing the semantic distance between the positive entity of mention 1 and mention 2, while enlarging the distance between the positive entity of mention 1 and the negative entity of mention 2, 
which considers the correlation at both the positive and negative levels, can boost the linking performance by a large margin. Finally, we will carefully correct this mistake in the final version, cite the corresponding papers on collective entity linking in the related work, and add NCEL \cite{COLING18-NCEL} as a collective entity linking baseline in our experiments.  

Q2: Discussion of the method's applicability.\footnote{Our task can adapt to cases where multiple objects, such as animals and buildings, co-occur in the same text and image. It is simply easy for us to collect a PERSON dataset.} \\
A2: Thank you for your comments. First, our proposed multimodal multi-mention entity linking method is not restricted to mentions with pictures: we also achieve state-of-the-art performance when only textual information is available (see Table 2). Second, as for the sources of such pictures, many applications, such as knowledge-based visual question answering (KB-VQA) \cite{hypergraph-transformer, vqa2} and product search, provide both visual and textual contexts. Third, our method is not restricted to PERSON entities and can adapt to other kinds of entities, such as buildings and animals, as long as these entities appear together in the same text and image. 



%A2: Thanks for your question. Our proposed multimodal multi-mention entity linking method is not restricted to the PERSON entities and can adapt to multiple kinds of entities, as long as these entities appear together in the same text and image, such as buildings, animals, etc. %It is worth noting that we consider this multi-mention scenario as an additional situation, and our framework can 
% It is worth noting that our framework is not limited to the multi-mention scenario, and it can handle the single-mention cases as well. 
%\textcolor{red}{Moreover, in some knowledge-based visual question answering tasks (KB-VQA) \cite{hypergraph-transformer, vqa2}, both visual images and textual captions will be given to link to the correct entities and then answer the question based on these entities. Finally, our work is not restricted to only mentions with pictures. Actually, we also achieve the SOTA performance when only the textual information is given (see Table 2).}
%In addition, recent multimodal entity linking works regard the visual modality as the complementary information to benefit the feature learning \cite{SIGIR22-MEL, wikidiverse, gan2021multimodal}, and we also achieve the SOTA performance when only the textual information is given.}

Q3: Clarification of ``multi-mention collaborative ranking''. \\
A3: Thank you for your comments. We clarify the procedure in detail here. 
% During the training stage, in Section 3.4, we design a pairwise training scheme to measure the correlation between candidate entities of mention 1 (e$_1^{pos}$ and e$_1^{neg}$) and candidate entities of mention 2 (e$_2^{pos}$ and e$_2^{neg}$). Here, the ``pos'' and ``neg'' represent the positive and negative candidate entities for the corresponding mention. Specially, we first calculate the similarity scores between the above four candidate entities, and then adopt the contrastive learning to narrow the semantic similarity between e$_1^{pos}$ and e$_2^{neg}$, while enlarging the semantic dissimilarity between e$_1^{pos}$ and e$_2^{neg}$, and e$_2^{pos}$ and e$_1^{neg}$. In this way, the model can capture the potential connection (such as the same occupation, era, etc) between positive entities for the mention 1 and 2. 
\textcolor{red}{Take a test case with three mentions and $|E|$ candidate entities per mention as an example. We first sort the mentions by their matching scores (between each mention and its candidate entities) to find the mention with the highest confidence. We then calculate the similarity scores between the candidate entities of mention 1 (the highest confidence) and those of mention 2 (the second highest confidence). For the entity candidates of mention 3, we consider their semantic distances to the entity candidate combinations of mentions 1 and 2. Exhaustively, this would require similarity scores for all $|E|^3$ combinations. To reduce the time and space cost, we keep only the top-$|E|$ of the $|E|^2$ potential entity candidate combinations for mentions 1 and 2, ranked by their similarity scores. In this way, the final top-$|E|$ combinations of candidate entities for mentions 1, 2, and 3, out of the $|E|^3$ potential combinations, are those with the highest confidence. Because the pruning must retain the combinations with the highest confidence, we sort the mentions by matching score at the first stage and process them from the highest to the lowest confidence.}
% In short, we hope to utilize the entities of mention 1 with higher confidence to assist in linking entities of mention 2 with lower confidence. 
We will add more details on this module to make it clearer.
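
To make the saving concrete, consider $M$ mentions with $|E|$ candidates each, and assume (as one reading of the procedure above) that each step scores every retained combination against every candidate of the next mention and keeps the top $|E|$:
\begin{align*}
  \text{exhaustive scoring:} &\quad |E|^{M} \text{ combinations}, \\
  \text{with top-}|E|\text{ pruning:} &\quad (M-1)\,|E|^{2} \text{ scored combinations}.
\end{align*}
For $M = 3$ and $|E| = 10$, this gives $2 \times 10^{2} = 200$ scored combinations instead of $10^{3} = 1000$.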


Q4: The feasibility of a face recognition system. \\
A4: Thank you for your question. Current face recognition systems usually require high-quality, clear, frontal face images. However, the query images in our entity linking task are mainly downloaded from the web, and we find that they exhibit varied styles (such as the painting in the left part of Fig. 1), image variety (``Charlie Chaplin'' has dissimilar image mentions in different scenarios in Fig. 1), and facial ambiguity (several human entities share the same facial information, e.g., movie stars and the characters they play). 
% mentioned in \cite{gan2021multimodal}. 
Therefore, a face recognition system may not perform well. Moreover, as previous multimodal entity linking works have noted \cite{SIGIR22-MEL, gan2021multimodal, wikidiverse}, the visual modality is mainly treated as complementary information in the entity linking task. In particular, Table 4 of \cite{wikidiverse} shows that model performance is unsatisfactory when only visual information is available.


\bibliographystyle{plain}
\bibliography{reference}

\end{document}
