\section{Introduction}
\input{sections/01_introduction}

\section{Background} \label{sec:background}
\input{sections/02_background}

\section{Approach} \label{sec:methodology}
\input{sections/03_approach}

\section{Experimental Setup}
\input{sections/04_experimental_setup}

\section{Experiments} \label{sec:experiments}
\input{sections/05_experiments}

\section{Conclusion}
\input{sections/06_conclusion}

\section*{Limitations} \label{sec:limitations}
\input{sections/limitations}

\section*{Acknowledgements}
We would like to thank the anonymous reviewers for their helpful comments and feedback.
We would also like to thank Rotem Dror, Sharon Goldwater and Grzegorz Chrupała for consulting, and the native speakers that consulted and validated our annotation tool: Assaf Porat, Kozue Watanabe, Arie Cattan, and Yilin Geng.
This work was supported in part by the Israel Science Foundation (grant no. 2424/21), the Israeli Ministry of Science and Technology (grant no. 2336), and by the HUJI-UoM joint PhD program.

\section*{Ethics Statement}
We use publicly available resources in our experiments, in accordance with their license agreements. The datasets are fully anonymized and do not contain personal information about the caption annotators or any information that could reveal the identity of the photographed subjects.





\subsection{Cognitive Studies} \label{sec:cog_studies}

\textbf{Numerals.}
\emph{Subitizable} numbers are numbers that are rapidly and accurately visually counted by humans. Studies have shown that the threshold for subitizability is 4~\cite{kaufman1949discrimination, mandler1982subitizing}, with~\citet{barr2013generation} showing that humans tend to describe non-subitizable numbers using quantifiers (e.g.,  \emph{many}). In Section~\ref{sec:corpus_analysis} we confirm this result at scale.
\citet{chesney2015counts} asked participants to count objects in a given image, and found that participants were less likely to include objects located in frames (windows, mirrors or picture frames) in their count, suggesting that visual cues influence linguistic choices.

\textbf{Negation.}
Several studies have challenged the traditional view that images cannot express negation~\cite{worth1981studying, khemlani2012negation}. \citet{giora2009we} use visual negation markers (e.g., red cross road signs) to study neural processing of visual negation. \citet{oversteegen2014can} ask Dutch native speakers to describe images of objects missing integral parts (e.g., a woman without a mouth) and show that the descriptions are likely to contain a negation word.

\textbf{Passive voice.}
\citet{myachykov2012determinants} show that English native speakers have a stronger preference for using passive-voice when describing transitive events with visual cueing of their attention toward the agent (compared to the control condition). 

\textbf{Transitivity.}
\citet{rissman2019occluding} show that participants had a preference for intransitive descriptions of visual events (a person acting on an inanimate object) when the person was removed by cropping the image (whereas transitive descriptions were preferred in the base condition).

\subsection{Computational Studies}
\gabis{I prefer to put this in a related work section towards the end of the paper, let's get to our stuff as quickly as we can.}
Computational studies on how vision constrains language are rare. However, several studies examined various aspects of the linguistic properties studied in this work, typically focusing on individual properties and/or languages.

\textbf{Negation.}
A series of studies~\cite{van2016pragmatic, van2017cross}, investigated negation in Flickr30k image descriptions using a smaller set of negation words compared to our study, comparing the use of negation in English, German, and Dutch, and finding no significant differences.
\citet{dobreva2021investigating} show that the performance of vision and language models decreases when the text contains negation, but did not show that this decrease is caused by negation-related visual features. Text-only models also have difficulty processing negations~(e.g.,~\citet{ettinger2020bert}), and the drop in performance could be due to the text encoder alone.

The line of work most similar to this study train models to predict whether images from comics~\cite{sato2021visual} or real life~\cite{sato2021can} express negation, achieving chance-level results. In contrast to the current study, they used a single dataset, a single language (Japanese), and a single linguistic property (negation).

\textbf{Transitivity.}
\citet{nikolaus2019compositional} show that captioning models generalize better to unseen action -- object pairs when the action is transitive, hypothesizing that this improvement is due to the additional arguments (e.g. cake) that images describing transitive events (e.g. eating) contain.

\textbf{Verbal vs. nominal constructions.}
\citet{su2021dependency} study syntactic parsing and compare the Part-Of-Speech (POS) tag of the root of predicted and gold dependency trees of MSCOCO English captions, showing that the gold distribution is approximately 60:40 in favour of nouns, while models tend to never produce trees with a verb root.

\subsection{Languages} \label{sec:languages}

We study English (\textbf{En}), German (\textbf{De}), Chinese (\textbf{Zh}), and Japanese (\textbf{Jp}) for three main reasons. First, multiple language families are required to study language-agnostic constraints imposed by images. Second, all of these have large image-caption datasets with non-translated captions. Third, all of these have publicly available tools for annotation of some of the linguistic properties we study.

\subsection{Annotation of Linguistic Properties} \label{sec:prop_annotation}

Below, we describe the automatic annotation of occurrences of linguistic properties in captions.
All annotation methods were validated by asking in-house native speakers to verify a random sample of 100 (50 positive and 50 negative) instances per property and language. Across all languages and properties, accuracy exceeded 92\%, confirming that our automatic annotations are of high quality.

For Japanese we only study the use of numerals since we were not able to achieve accurate annotation for the other properties.

\textbf{Numerals (Num).}
We use Microsoft's Recognize-Text package\footnote{\href{https://github.com/microsoft/Recognizers-Text}{github.com/microsoft/Recognizers-Text}} to identify the use of numerals in all languages.
We ignore numerals with value of 1 for the following reasons: (1)
In German and Chinese, the same word can refer to the number \emph{one} or the determiner \emph{a};
(2) In Japanese, several non-numeral words contain the character for 1 
\begin{CJK*}{UTF8}{gbsn}
(一),
\end{CJK*}
confusing the recognizing algorithm.

\textbf{Negation words (Neg).}
We use the list of English negation words composed by~\citet{dobreva2021investigating}, and add the word \emph{nope}. We translate all words in the English list into the other languages, and verify the resulting lists with a native speaker.\footnote{All negation words are listed in Appendix~\ref{sec:app_prop_annotation}.}

\textbf{Verbal vs. nominal descriptions (Verb).}
We label captions with the root part-of-speech tag of their dependency tree, identified using Stanza's dependency parser~\cite{qi2020stanza}. We only consider captions with a single root which is a verb or a noun, filtering 0.8\% of the captions.
Note that we consider sentences where the root corresponds to the English verb \emph{to be} (\emph{sein} in German,
\begin{CJK*}{UTF8}{gbsn}
有
\end{CJK*}
in Chinese)
as noun roots, as no activity is described.

\textbf{Transitivity of main verb (Tran).}
We use Stanza's dependency parser and filter all captions with at least one of the following: (1) a non-verb root, (2) more or less than a single root, (3) the verb \emph{ be} (or its equivalents in languages other than English) as a root, filtering 47\% of the captions. After filtering, a caption is labeled as transitive if its root verb has a child labeled as a direct object, and intransitive otherwise.\footnote{In German and Chinese we automatically identify edge cases missed by the parser, see Appendix~\ref{sec:app_prop_annotation}.}

\textbf{Passive voice (Pass).}
We use the passive voice identifier tool for
English and German~\cite{ramm2017annotating}. For Chinese we search for the passive indicator
\begin{CJK*}{UTF8}{gbsn}
被,
\end{CJK*}
filtering cases where it is part of another word.\footnote{Words containing 
\begin{CJK*}{UTF8}{gbsn}
被
\end{CJK*}
are listed in Appendix~\ref{sec:app_prop_annotation}.}

\subsection{Datasets} \label{sec:datasets}

We use the following datasets: Pascal~\cite{rashtchian2010collecting}, MSCOCO~\cite{lin2014microsoft}, Flickr30k~\cite{young2014image}, Multi30k~\cite{elliott2016multi30k}, Flickr8kcn~\cite{li2016adding}, AIC-ICC~\cite{wu2017ai}, COCO-CN~\cite{li2019coco}, YJCaptions~\cite{miyazaki2016cross}, STAIR-captions~\cite{yoshikawa2017stair}. Table~\ref{tab:datasets} presents additional information.
We only use datasets with original captions generated by native speakers and avoid using datasets with captions translated from English.\footnote{See Appendix~\ref{sec:translated_captions_analysis} for a comparison of original and translated captions.} In addition to captions, MSCOCO and Flickr30k contain object classes and bounding box annotations.
A description of the data collection process for each dataset is provided in Appendix~\ref{sec:dataset_collect}.

\input{tables_and_plots/experimental_setup/datasets}

\textbf{Property expression probability.}
Given an image $I$, a set of captions $c_1, ..., c_{N_I}$, and an annotation function $f_p(c_i)$ mapping caption $c_i$ to $1$ if it expresses property $p$, and to $0$ otherwise, we define the probability that image $I$ expresses $p$ as

\begin{equation*}
P_p(I) = \frac{\sum_{i=1}^{N_I} f_p(c_i)}{N_I}
\end{equation*}

Given a language $\mathcal{L}$ we denote with  $P_{p,\mathcal{L}}$ the same computation with the set of captions filtered to include only captions in $\mathcal{L}$.
Given a set of images $S$ we denote the expected probability of expressing property $p$ across all images in $S$ as

\begin{equation*}
\mathbb{E}_{I \in S}[P_p(I)] = \frac{\sum_{I \in S} P_p(I)}{|S|}
\end{equation*}

Table~\ref{tab:prevalence} presents the expected probability of each property $p$ occurring in captions in language $\mathcal{L}$, $\mathbb{E}_{I \in S_\mathcal{L}}[P_{p,\mathcal{L}}(I)]$, where $S_\mathcal{L}$ is the set of all images in all datasets of language $\mathcal{L}$.

\input{tables_and_plots/corpus_analysis/prevalence}
\subsection{Predicting Properties from Images} \label{sec:classifiers}

We study the task of predicting, given an image, whether human annotators will use a particular linguistic property when describing it. The input is a raw image and the output is binary, indicating whether the descriptions express the property.

\textbf{Models.}
Our model consists of a visual encoder~\citep[ResNet50,][]{he_et_al_2016} to embed the raw image, followed by a set of binary SVM classifiers, one per linguistic property.\footnote{We also experimented with neural classifiers, but SVM performed significantly better: see Appendix \ref{sec:neural_classifier} for details.} We investigate four different pre-training methods with varying levels of supervision from different modalities.

First, we randomly initialize the visual encoder (no pre-training; \textbf{None}), avoiding unwanted bias through pre-training with human annotated information. Using a random encoder renders the task for the classifier more difficult, and the classifier might perform poorly even if linguistic properties are highly correlated with visual features, so we consider \textbf{None} as a lower bound.

To equip our model with some prior visual knowledge, we use MoCo~\cite{he2020momentum}, a self-supervised pre-training method based only on visual signals (\textbf{MoCo}). MoCo creates multiple manipulated versions of an image and trains the encoder to predict if two manipulated images correspond to the same original.

We also include ImageNet~\cite{deng_et_al_2009} pre-training (\textbf{IN}). The visual encoder is first trained to classify images in the ImageNet dataset, and then the classification head is discarded. Although semantic information is provided in ImageNet pre-training through class-labels, no textual input is provided which {\it describes} the visual scene.

Finally, we use \textbf{CLIP}~\cite{radford_et_al_2021} pre-training. CLIP is a multimodal self-supervised model, trained to project images and corresponding captions to similar vectors in a joint space. We use CLIP's visual encoder, discarding the text encoder. This method is pre-trained with explicit textual input, and hence its predictions will be skewed by the prior probability of linguistic properties in general language, obscuring the correlation with image features. In terms of raw performance, we consider CLIP as an upper bound.

We study two settings: monolingual (all images from datasets in a single language) and multilingual (all images from all datasets).
In each setting, for each linguistic property $p$, we compute the probability of all relevant images to express $p$ and binarize the data by using the median probability value as a threshold above which the image is considered to express $p$. Finally, we create a balanced dataset\footnote{Although balancing the test set is usually considered a bad practice, in this study we only study image-text correlation and our classifiers would not be used for classifying new samples in the future.} of images
that express $p$ and those that do not. We evaluate our models using 5-fold cross-validation.
Table~\ref{tab:generated_datasets} shows the statistics of the generated datasets (note that the size of the datasets is smaller than in Table~\ref{tab:prevalence} because the data was balanced using down sampling). Implementation details are in Appendix~\ref{sec:model_details}.

\input{tables_and_plots/predicting_linguistic_properties_from_images/generated_datasets}

\paragraph{Multilingual results.}
Results are presented in Table~\ref{tab:multilingual_results}. First, we observe that except for the model without pre-training, all models predict all properties above chance levels, supporting the hypothesis that linguistic properties are constrained by visual context.
Second, results for the two non-textual pre-training methods (MoCo, IN) were significantly higher than the lower bound (None) and lower than the upper bound (CLIP) in all properties.
Finally, numerals seem easiest to predict, which concurs with our corpus analysis where we find that mentions of numerals were easiest to link to visual properties (Section~\ref{sec:corpus_analysis}).

\paragraph{Monolingual results.}
We applied MoCo, the best performing method without human annotated pre-training, individually to each language (Table~\ref{tab:monolingual_results}). Note that model performance does not always correlate with training data size (Table~\ref{tab:generated_datasets}): in English, the verb root dataset was the largest but the classifier achieved the lowest accuracy; and prediction accuracy was high for passive voice in Chinese despite a small dataset.
Across all languages, use of numerals was predicted most reliably.

\input{tables_and_plots/predicting_linguistic_properties_from_images/multilingual_results}

\input{tables_and_plots/predicting_linguistic_properties_from_images/monolingual_results}

\subsection{Corpus Analysis} \label{sec:corpus_analysis}



In this section we show that large image captioning corpora not only allow us to build predictive models to test hypotheses about the constraints of visual properties on language, but also support large-scale corpus studies. Our goal is to correlate image properties (e.g., the type or number of objects in an image) with linguistic choice (e.g., the use of numerals). The ground truth image properties are typically unavailable, but we can use additional information in MSCOCO and Flickr30k as proxies. In particular, we use the fact that the corpora are multilingually aligned (each image contains captions in different languages, all generated by native speakers) and they contain additional annotations (class labels and bounding boxes). 

We take the expression of numerals as a test case, since it was the one most accurately predicted in Section \ref{sec:classifiers}. We emphasize, however, that the approach generalizes to other properties as well.


Although both MSCOCO and Flickr30k contain object classes and bounding box annotations, MSCOCO's granularity is much higher (80 classes compared to 10 classes), so we only use MSCOCO's class and bounding box annotations in our analysis.
German is excluded from the class and bounding box analysis as there is no German version of MSCOCO with original captions.

\paragraph{Images containing animals are most likely to be described using numerals across languages.}
For each MSCOCO class $c$, we find the set $S_c$ of all images instantiating that class and compute $\mathbb{E}_{I \in S_c}[P_{\text{num}}(I)]$. We note that the expected $P_{\text{num}}(I)$ of some classes might be lower simply because they are more likely to occur in singles, and avoid this bias by filtering out images with a single instantiation of $c$ from $S_c$.\footnote{No classes were completely filtered out; only two classes (\emph{toaster}, \emph{hair-drier}) were left with less than 80 images.}

Figure~\ref{fig:num_coco_classes} shows the 5 classes with the highest and lowest $\mathbb{E}_{I \in S_c}[P_{\text{num}}(I)]$ for each language. In all languages, images depicting animals are most likely to be described with numerals.

\begin{figure} [tb]
    \centering
    \includegraphics[width=8cm]{figures/data_analysis/numbers/coco_classes.pdf}
    \caption{MSCOCO classes with highest and lowest expected numeral expression probability $\mathbb{E}_{I \in S_c} [P_{\text{num}, \mathcal{L}}(I)]$, for
    $\mathcal{L} \in \{ \text{En}, \text{Zh}, \text{Jp} \}$.
    The probability of classes of animals is high in all languages.}
    \label{fig:num_coco_classes}
\end{figure}

\paragraph{Our findings corroborate cognitive findings, placing the human subitizability threshold at 4.}
We use MSCOCO bounding boxes annotation to investigate whether the use of numerals in image descriptions reflects the subitizability threshold (see Section~\ref{sec:cog_studies}).
For each integer $k$, we find the set $S_k$ of all images with $k$ labeled bounding boxes, and compute $\mathbb{E}_{I \in S_k}[P_{\text{num}}(I)]$. We also label captions with quantifiers (e.g., \emph{some, a few}\footnote{The full lists are in Appendix~\ref{sec:app_prop_annotation}.}) and compute $\mathbb{E}_{I \in S_k}[P_{\text{quant}}(I)]$.
Figure~\ref{fig:num_coco_bboxes} shows the results, for all $k$ where $|S_k| \geq 100$.
In all languages, $\mathbb{E}_{I \in S_k}[P_{\text{num}}(I)]$ initially increases with a clear peak at 4, while quantifiers expression steadily increases.

\begin{figure}[tb]
    \centering
    \includegraphics[width=8cm]{figures/data_analysis/numbers/coco_bboxes.pdf}
    \caption{Expected probability of expressing the use of numerals and quantifiers $\mathbb{E}_{I \in S_k} [P_{p, \mathcal{L}}(I)]$ as a function of the number of bounding boxes in MSCOCO, for
    $\mathcal{L} \in \{ \text{En}, \text{Zh}, \text{Jp} \}$ and $p \in \{ \text{num}, \text{quant} \}$.
    All $k$ with $|S_k| < 100$ were removed from the plot. Red line: subitizability threshold. In all languages, the probability increases up to 4 objects (consistent with cognitive studies) and then decreases. Quantifiers expression probability increases steadily.}
    \label{fig:num_coco_bboxes}
\end{figure}

\paragraph{Captions of the same image in different languages tend to agree on numerals usage.}
We use the multilingual datasets Flickr30k (En, De, Zh) and MSCOCO
(En, Zh, Jp), identify a list of images with captions in all respective languages $\{ I_k \}_{k=1}^{N}$, and compute the list of probabilities of numerals expression for each image $L_{\mathcal{L}} = \{ P_{\text{num}, \mathcal{L}}(I_k) \}_{k=1}^{N}$, in each language $\mathcal{L}$. Next, we compute the Pearson correlation coefficient of $L_{\mathcal{L}_1},L_{\mathcal{L}_2}$ for each pair of languages $\mathcal{L}_1,\mathcal{L}_2$. The results are shown in Table~\ref{tab:num_agreement}. The correlation is {\it high} ($>0.5$;  \citet{cohen2013statistical}) across all languages and datasets.

\input{tables_and_plots/corpus_analysis/num_agreement}

\subsection{Additional Insights} \label{sec:insights}

The proposed methodology can also be used as an exploration method for further cognitive research. In this section, we present findings obtained by manually investigating extreme cases of property expression. This is an exploratory analysis, presenting preliminary findings that may lead to future research in a more controlled setting.

\textbf{Use of numeral expressions.}
We manually inspect all images that use numerals in all captions across all languages in Flickr30k (N=105). The top images in Figure~\ref{fig:cons_high_images} are representative examples. All images depict multiple participants taking similar roles and positioned in a regular pattern (e.g., all the children in the upper right image in Figure~\ref{fig:cons_high_images} are swinging and facing the camera). 
The bottom of Figure~\ref{fig:cons_high_images} shows comparable images, which were never described using numerals. Here, participants appear in different poses and roles. 
We hypothesize that people count more easily and accurately when objects are arranged in a regular pattern, compared to a random formation~\cite{burgess1983precision}. 

\begin{figure}
    \centering
    \includegraphics[width=6cm]{figures/data_analysis/numbers/cons_high_and_low_images.pdf}
    \caption{Top: images described using numerals in all languages. Bottom: images described without numerals. Images taken from Flickr30k.}
    \label{fig:cons_high_images}
\end{figure}

We also present differences in the use of numerals across languages.
We analyze images for which at least two captions use numerals with the same numeral value in each language, but different values across languages (N=46).
We find two main reasons for cross-language inconsistencies: First, different languages tend to either count all participants in a single group or split them into smaller groups based on gender, role, or age.\footnote{Examples for all partition types are in Appendix~\ref{sec:app_visual_examples}.}
These differences may be due to different annotation guidelines or different cultural backgrounds of the annotators.

Second, the multilingual datasets were originally created for English captioning, making the selected images highly related to English and especially North American culture.\footnote{This is a well known problem in multimodal datasets, previously discussed by~\citet{liu2021visually}.}
For example, in the sports domain,
the datasets contain mainly images of Basketball and Baseball, popular sports in the United States. While English annotators use a detailed description, commonly mentioning the players' shirt number, German and Chinese descriptions are mostly short and count the number of players in the image (Figure~\ref{fig:sports_dis}).

\begin{figure} [tb]
\centering
\begin{small}
\begin{tabular}{p{5cm}l}
{\bf En:} Basketball player wearing a
white, number 23 jersey jumps up
with the ball while guarded by
number 13 on the opposite team & \multirow{5}{*}{\includegraphics[width=0.12\textwidth,clip,trim=23cm 2cm 0 0]{figures/data_analysis/numbers/disagreement/sports_disagreement.pdf}}\\
{\bf De: } Zwei Männer spielen Basketball\\
{\bf Zh:} \begin{CJK*}{UTF8}{gbsn}
有两个男人正在打篮球
\end{CJK*}
\end{tabular}
\end{small}
    \caption{An image of a basketball game. The English captions are highly detailed, while both the German and Chinese caption translates to {\it Two men are playing basketball}. Image taken from Flickr30k, captions taken from Flickr30k (En), Mutli30k (De), Flickr8kcn (Zh).}
    \label{fig:sports_dis}
\end{figure}

\textbf{Passive.}
\label{ssec:corpus_passive}
We notice that in images with high probability for using passive voice, the patient is commonly located at the center of the scene either by the pose of the camera or the borders of the image. We hypothesize that this visual feature is correlated with the use of passive voice. The right image of Figure~\ref{fig:intro_image} shows one example. More examples are in Appendix~\ref{sec:app_visual_examples}.


\subsection{Discussion}

Our experiments suggest that various linguistic properties are predictable from visual context, most notably in the case of the use of numeral expressions.
Our classifiers were able to predict the presence of numerals in captions with high accuracy. Correspondingly, our corpus analysis provides evidence that the type and number of objects in the image constrain the use of numerals. Both results hold across different languages, and present high agreement between languages in the selection of images that are described with numerals. This lends support to the hypothesis that visual context constrains the use of numerals across a variety of languages from different families, and that such trends can be studied using the proposed methodology.

A surprising result is that without pretraining of the visual encoder (\textbf{None}), above chance-level performance can be obtained, most notably for the numerals property. A randomly initialized visual encoder applies a random dimensionality reduction to the input image, and the fact that the SVM classifier was able to learn to predict the presence of numerals in the captions of images at above chance level following this random transformation supports the hypothesis that this property correlates with visual features.




\section{Implementation Details}
\label{sec:impl_details}

\subsection{Linguistic Properties Annotation} \label{sec:app_prop_annotation}

\paragraph{Use of numerals}
In the bounding boxes experiment in section \ref{sec:corpus_analysis} we search for quantifiers. Following are the lists of quantifiers we search for in each language. English: \emph{some, a lot of, many, lots of, a few, several, a number of}. Chinese:
\begin{CJK*}{UTF8}{gbsn}
些, 多.
\end{CJK*}
Japanese:
\begin{CJK*}{UTF8}{gbsn}
多くの, たくさん, いくつか.
\end{CJK*}

\paragraph{Use of negation words}
Following are the lists of negation words for each language.

English: \emph{not, isn't, aren't, doesn't, don't, can't, cannot, shouldn't, wont, wouldn't, no, none, nobody, nothing, nowhere, neither, nor, never, without, nope}.

German: \emph{nicht, kein, nie, niemals, niemand, nirgendwo, nirgendwohin, nirgends, weder, ohne, nein, nichts, nee}. We lemmatize the words in the sentence before searching in this list.

Chinese:
\begin{CJK*}{UTF8}{gbsn}
不, 不是, 不能, 不可以, 没, 没有, 没什么, 从不, 并不, 从没有, 并没有, 无人, 无处, 无, 别, 绝不.
\end{CJK*}
We use the Jieba tokenizer\footnote{\href{https://github.com/fxsjy/jieba}{github.com/fxsjy/jieba}}. We also identify cases where one of the words above is part of a longer non-negation word and filter those cases. Following is the list of non-negation words:
\begin{CJK*}{UTF8}{gbsn}
别着, 不小心, 不一样.
\end{CJK*}

\paragraph{Use of passive verbs}
In Chinese we search for the passive indicator
\begin{CJK*}{UTF8}{gbsn}
被,
\end{CJK*}
filtering cases where it is part of the
\begin{CJK*}{UTF8}{gbsn}
被子
\end{CJK*}
word (meaning quilt), a common word in the AIC-ICC dataset.

\paragraph{Transitivity}
In German and Chinese we identify several important edge cases in which the Stanza parser is consistently incorrect, which we fix manually. All edge cases were verified by native speakers.

In German we identify sentences containing a node which is a child of the root and labeled with the \emph{PTKVZ} POS tag, and label these as intransitive.

In Chinese we identify sentences where (1) the lemma of the root word ends with the preposition token
\begin{CJK*}{UTF8}{gbsn}
在;
\end{CJK*}
(2) the lemma of the word following the root word is
\begin{CJK*}{UTF8}{gbsn}
在;
\end{CJK*}
or (3) the lemma of the word following the root word starts with the preposition token
\begin{CJK*}{UTF8}{gbsn}
向,
\end{CJK*}
and label these as intransitive.

\subsection{Model Details} \label{sec:model_details}

\subsubsection{SVM Classifier}

We use the SVC model from the sklearn Python package with the RBF kernel and default hyper-parameters.

\subsubsection{Neural Classifier} \label{sec:neural_classifier}

We use a feed-forward neural network with 1 or 2 hidden layers, with different activation functions (ReLU, Sigmoid, Tanh). In all configurations, the SVM classifier performed better.

\subsubsection{Pre-trained Backbone Models}

For MoCo and CLIP we use the models provided in the officially published code. For ImageNet pre-training we use the pre-trained model provided by the PyTorch package. In all cases, model contains 25.6M parameters.

\subsection{Training}

Training with the largest training set (the transitivity multilingual setting, see table \ref{tab:generated_datasets}) took 30 hours on a single GM204GL GPU.

\section{Dataset Collection Details} \label{sec:dataset_collection}
\label{sec:dataset_collect}

Following is a brief description of the process of data collection for each of the datasets.

\textbf{Pascal Sentences} \cite{rashtchian2010collecting} contains the set of images from the PASCAL Visual Object Classes Challenge \cite{everingham2008pascal} with captions generated by Amazon's Mechanical Turk workers. The annotators were instructed to (1) describe the image in a single sentence including the main characters, the setting or the relation of the objects; (2) If possible, include adjectives such as colors, spacing, emotion, or quantity; (3) pay attention to grammar and spelling.


\textbf{Flickr30k} \cite{young2014image} is a large English image-caption dataset. The objects in each image are segmented using bounding boxes and classified into one of 10 classes. Annotators were crowdsource workers and were asked to ``write sentences that describe the depicted scenes, situations, events and entities (people, animals, other objects)''.

Multi30k \cite{elliott2016multi30k} is a German version of Flickr30k. It contains both original and translated captions. Translations are generated by professional translators, original captions were generated by crowdworkers via the Crowdflower platform. Instructions were translated from the English instructions of Flickr30k.

Flickr8kcn \cite{li2016adding} is a Chinese version of the smaller Flickr8k dataset on which the Flickr30k dataset was based. Descriptions were generated by crowdworkers that were asked to ``write sentences describing salient objects and scenes in every image, from their own point of views''.

\textbf{MSCOCO} \cite{lin2014microsoft} is another large English image-caption dataset with additional annotations (object classes and bounding boxes). The captions were generated using human subjects on Amazon's Mechanical Turk. The annotators were given the following instructions:
\begin{itemize}
    \item Describe all the important parts of the scene.
    \item Do not start the sentences with ``There is''.
    \item Do not describe unimportant details.
    \item Do not describe things that might have happened in the future or past.
    \item Do not describe what a person might say.
    \item Do not give people proper names.
    \item The sentences should contain at least 8 words.
\end{itemize}

COCO-CN \cite{li2019coco} is a Chinese version of MSCOCO, annotated by a group of volunteers and paid undergraduate students. Annotators were instructed that the caption shall cover the main objects, actions and scene in a given image, and were provided with suggested captions retrieved in the following process: all the captions in the original MSCOCO dataset were machine-translated to Chinese, and the 5 most relevant suggestions for each image were chosen by a model. However, they were asked to provide their own descriptions, and only draw inspiration from the suggestions. In addition, they manually translated 5000 captions.

YJCaptions \cite{miyazaki2016cross} is a Japanese version of MSCOCO. Captions were generated using Yahoo! crowdsourcing, where signing up requires a Japanese proficiency, leading the authors to assume that participants were fluent in Japanese. Annotation guidelines can be translated to English as ``Please explain the image using 16 or more Japanese characters. Write a single sentence as if you were writing an example sentence to be included in a textbook for learning Japanese. Describe all the important parts of the scene; do not describe unimportant details. Use correct punctuation. Write a single sentence, not multiple sentences or a phrase''.

STAIR-captions \cite{yoshikawa2017stair} is another Japanese version of MSCOCO. Annotation guidelines can be translated to English as ``(1) A caption must contain more than 15 letters. (2) A caption must follow the da/dearu style (one of the writing styles in Japanese). (3) A caption must describe only what is happening in an image and the things displayed therein. (4) A caption must be a single sentence. (5) A caption must not include emotions or opinions about the image''.


\textbf{AIC-ICC} \cite{wu2017ai} is a large Chinese image--caption dataset. The annotators were instructed to (1) include key objects/attributes, locations and human actions; (2) generate fluent captions; (3) use Chinese idioms or descriptive adjectives.

\section{Additional Visual Examples} \label{sec:app_visual_examples}

\paragraph{Numerals disagreement}
Further to the numerals disagreement analysis in Section~\ref{sec:insights}, we present examples of images that were described by captions in multiple languages with numeral value disagreement caused by differences in partition of the participants. For each of these images, the captions in one language do not partition the participants while the captions in the other is partitioning based on gender (Figure~\ref{fig:part_dis1a}), role (Figure~\ref{fig:part_dis1b}) or age (Figure~\ref{fig:part_dis2}).

\begin{figure} [tb]
    \centering
    \includegraphics[width=7cm]{figures/data_analysis/numbers/disagreement/appendix/3272071680.jpg}
    \caption{An image taken from Flickr30k.
    The English caption splits participants based on gender: ``A man in a beret and thin mustache gestures to two women in conversation''.
    The Chinese caption does not split participants at all:
    \begin{CJK*}{UTF8}{gbsn}
    ``三个人正在谈话''
    \end{CJK*}
    (Three people are talking).}
    \label{fig:part_dis1a}
\end{figure}

\begin{figure} [!tb]
    \centering
    \includegraphics[width=7cm]{figures/data_analysis/numbers/disagreement/appendix/364213568.jpg}
    \caption{An image taken from Flickr30k.
    The English caption splits participants based on role: ``One dog is chasing another dog that is carrying something in its mouth along the beach''.
    The German caption does not split participants at all:
    ``Zwei weiß-braune Hunde, die am Strand laufe''
    (Two white and brown dogs running on the beach).}
    \label{fig:part_dis1b}
\end{figure}

\begin{figure} [tb]
    \centering
    \includegraphics[width=5cm]{figures/data_analysis/numbers/disagreement/appendix/2694426634.jpg}
    \caption{An image taken from Flickr30k.
    The English caption splits participants based on age: ``A man and two children in life jackets in a boat on a lake''.
    The Chinese caption does not split participants at all:
    \begin{CJK*}{UTF8}{gbsn}
    ``坐在船上出海的三个人''
    \end{CJK*}
    (Three people on a boat going out to the sea).}
    \label{fig:part_dis2}
\end{figure}

\paragraph{Passive voice}
Figure \ref{fig:all_passive} shows three images with high probability for the use of passive voice. In the upper right image the passive participant is centered by the pose of the camera, while in the other two images the borders of the image locates the passive participant in the center.

\begin{figure} [tb]
    \centering
    \includegraphics[width=7cm]{figures/data_analysis/passive/all_passive.jpg}
    \caption{images with high probability for the use of passive voice. In all images, the passive participant is centered by the pose of the camera or the borders of the image.}
    \label{fig:all_passive}
\end{figure}

\section{Original vs. Translated Captions} 
\label{sec:translated_captions_analysis}

When studying multimodal tasks in non-English languages (e.g., multimodal machine translation~\cite{hitschler2016multimodal}, visual question answering~\cite{gupta2020unified}), it is common to translate an existing English image-caption corpus into the target language using crowd sourcing or translation APIs. We show that captions generated in this setting are not representative of the target language.
We use the Multi30k dataset (De) and the COCO-CN dataset (Zh), both of which
contain original as well as translated captions in the target language.
We use the statistical method described in Section~\ref{sec:corpus_analysis} in the Cross-lingual analysis paragraph to compute the agreement of English and translated captions, and compare it with the agreement of original and translated captions.
As shown in Figure~\ref{fig:translated_agreement}, in 9/10 cases the English-Translated agreement is higher than Original-Translated agreement, suggesting that translated captions are not representative of the target language. The effect is most pronounced with negation.

\begin{figure} [tb]
\centering
\input{tables_and_plots/corpus_analysis/translated_agreement}
\caption{English -- Translated agreement (En-De$_T$ and En-Zh$_T$) and Original -- Translated agreement (De$_O$-De$_T$ and Zh$_O$-Zh$_T$) for German and Chinese, for different linguistic properties.}
\label{fig:translated_agreement}
\end{figure}








