\section{Introduction}

One of the first approaches to mechanical drawing can be traced back several decades to \cite{mccorduck1991aaron}. Since then, roboticists have studied a variety of approaches to robot-based drawing. Recent developments in generative models such as Generative Adversarial Networks (GANs), Vision Transformers (ViTs), and diffusion models have further spurred this research direction with their ever-improving image synthesis quality and semantic understanding.



Existing style transfer and image-to-image translation techniques treat the process as a concerted mapping of pixels \cite{mccorduck1991aaron,CycleGAN2017} or as continuous pixel-space optimization \cite{style_transfer_cvpr2016}. However, painting is an expression of an artist's ideas and emotions through a visual language. It is a dynamic process that begins with the artist's vague notion and evolves over the course of painting, producing a picture that satisfies the artist's semantic and high-level objectives~\cite{hertzmann2022algorithmicPainting}. This process stands in contrast to how deep learning methods have been used to create artwork.
To incorporate the dynamics of the painting process, \cite{frida2022} proposed the Framework and Robotics Initiative for Developing Arts (FRIDA). FRIDA aims to give robots the ability to paint by taking inputs in the form of images, text, and sketches to produce an artistic visualization that is consistent with the painter's intent. It accomplishes this by formulating painting as a planning problem, with the canvas as the state space and the brush strokes as the available actions. This allows the robot to combine content creation with action planning and to follow a continuous content optimization strategy.

However, sketch and text inputs have notable limitations when it comes to conveying more abstract semantics in an image. Specifically, it can be challenging to fully express the complexity and nuances of sounds and emotions using something discrete such as text. For instance, laughter is a complex sound characterized by distinctive loudness and rhythmic properties. While people can easily hear the differences between laughs, it can be challenging to describe them in language. Moreover, art has traditionally served as a powerful medium for expressing and evoking emotions, and throughout history it has been used to communicate human feelings and experiences. Studies have shown that different visual elements such as color, shape, and composition can evoke different emotions in viewers \cite{color_emo_relationship,achlioptas2021artemis}. Emotions form an integral part of the painting process, and it is crucial to understand the relationship between visual art and emotions to capture the painter's actual intent. By focusing on sounds and the emotions behind them, we aim to initiate a more nuanced perceptual understanding of the painting process, which, downstream, can also be applied to a richer understanding of ordinary paintings.

In this context, we propose using sound and speech as input modalities for painting, an approach we call \textit{robot synesthesia}; synesthesia is a perceptual phenomenon in which a person may perceive visuals when listening to sounds.
We extend the FRIDA framework with an additional intuitive and powerful image generation planning method driven by sound and speech semantics. For general natural sound guidance, following \cite{sound_guided_cvpr}, we encode images into a latent representation using $CLIP$~\cite{radford2021-clip} and audio using $CLIP_{Audio}$~\cite{sound_guided_cvpr}.
Using gradient descent, the simulated painting is altered to decrease the difference between these encodings.



\begin{figure*}[ht]
    \centering
   
    \includegraphics[width=\textwidth]{figures/method.png}\vspace{-7pt}
    \caption{Following FRIDA~\protect\cite{frida2022}, human users specify their artistic intentions via text, styles, and sketches. We add audio as an input in this paper. These intentions guide planning in simulation, and the resulting plan is carried out by a Rethink Sawyer robot. The painting process is monitored via camera perception, and the plan can be modified during the painting process.}
    \label{fig:approach}
\end{figure*}

We treat speech as a special case of sounds differentiated from natural sounds by having content specified in language. We incorporate transcription and speech emotion recognition modules to enhance the understanding of the audio input. 
The transcribed text is encoded into the same latent space as the painting and, just as with audio, we form a loss function by comparing the encodings. Emotion (e.g., anger or excitement) is predicted from the input speech. We use a pretrained image2emotion model~\cite{szymczak2022emotionPrediction} to predict the emotion of the generated painting and form a loss function between that predicted emotion and the speech's emotion.

While other sound-guided image manipulation methods are constrained by the dataset used to train their generator model~\cite{sound_guided_cvpr}, FRIDA does not rely on a pretrained generator and can, therefore, use sound guidance with arbitrary images.

Our contributions are as follows:
\begin{enumerate}
    \item A method for sound-guided image manipulation with general content in the painting domain;
    \item A novel speech-to-painting method with speech decoupled into content via text and mood via speech emotion; and
    \item Open-source code, freely available at \url{https://github.com/pschaldenbrand/Frida}.
\end{enumerate}



\section{Related Work}

\subsection{Robot Painting}

Most prior work on robot painting and stroke-based rendering formalizes the task as rendering a given image using given primitives, materials, and tools~\cite{huang2019learningToPaint,schaldenbrand2021contentMaskedLoss,singh2021intelli,hertzmann1998painterly}. Painting, as a form of art, involves creatively expressing a message, which makes it a higher-level task than simply printing an image under challenging constraints. Proper robot painting involves achieving an artist's high-level goals. To this end, FRIDA~\cite{frida2022} introduced multiple inputs to the system in addition to a reference image to allow the human user further expression.

\subsection{Sound-Image Encoding}

Encoding sounds and images into the same latent space in a differentiable manner allows the two modalities to be compared, so that one can be altered to match the other. \cite{sound_guided_cvpr} train a sound encoder, $CLIP_{Audio}$, to encode sounds into the same latent space as images.
$CLIP_{Audio}$ was trained on the VGG-Sound dataset~\cite{chen2020vggsound}, which contains over 200k 10-second audio clips spanning 309 classes, captured from YouTube videos with no more than two clips per video. The dataset's sound categories can be broadly divided into people, animals, music, sports, nature, vehicles, homes, and tools, among others.







\begin{figure}[t]
    \centering
   
    \includegraphics[width=\columnwidth]{figures/just_audio_column.png}\vspace{-7pt}
    \caption{Paintings generated using sounds described in labels above each painting.}
    \label{fig:just_audio}
\end{figure}

\begin{figure}[ht]
    \centering
    \small
    \includegraphics[width=\columnwidth]{female_singing_frida.png}\vspace{-7pt}
    \caption{Painting various examples from VGG-Sound~\protect\cite{chen2020vggsound} categorized as ``female singing'' from just audio versus using a description of the audio. A description of the sound used and a still from the video that it was captured from are displayed left of each painting.}
    \label{fig:female_singing}
\end{figure}
 


\subsection{Sound-Guided Image Manipulation}

There is a growing body of research on cross-modal generative models that use audio samples for guidance-based tasks. Previous work in this field has mostly focused on music rather than sound semantics, following a cross-modal learning strategy for style transfer from music to image~\cite{lee_crossing_in_style_2020}. Other work maps music embeddings to a visual embedding space using StyleGAN models~\cite{traumer_2021}.

More recent studies have shown growing interest in using the semantics of sound to navigate the latent space of StyleGAN models.
\cite{sound_guided_cvpr} demonstrate that the latent space of a pretrained image synthesis model, StyleGAN2, can be manipulated such that the original generated image better fits a given audio sample. The generated image and input sound are encoded into the same latent space, and comparing these encodings forms a loss function that can be backpropagated. Stochastic gradient descent then alters the latent input to produce images that are more aligned with the input sound.
They use layerwise masking to keep compact content information within the style latent code, and the edited latent code is passed into the StyleGAN2 generator to obtain the modified image. This follows the StyleCLIP~\cite{styleclip_2021} methodology, applied to sound-guided image manipulation.

While this methodology has been successful for image manipulation, previous work fails to create generalizable output images fitting the sound descriptions because it relies on pretrained image synthesis models that can only generate images within their training distribution, such as churches, faces, or artwork. These prior works therefore use sound only to manipulate a particular image that the synthesis model is capable of producing. In contrast, our approach operates in a more general content domain and can generate images purely from audio inputs.


\section{Background}
 The proposed approach builds on the FRIDA robotic painting system~\cite{frida2022} to incorporate sounds and emotions as inputs that control the appearance and content of the generated painting. In FRIDA's painting environment, brush strokes form the action space and are parameterized by values representing stroke length, bend, and thickness as well as the location, orientation, and color of the stroke. Given a set of brush stroke actions and a picture of the canvas, FRIDA's simulation environment can differentiably render the strokes and layer them onto the canvas image, forming a simulated painting represented as an RGB image. This predicted appearance of the painting can be used in loss functions, and because the renderer is differentiable, the brush stroke parameters can be optimized using gradient descent to decrease this loss.
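
 As a sketch of this optimization, using notation that is ours rather than FRIDA's, let $\theta$ denote the brush stroke parameters, $c$ the canvas image, $R$ the differentiable renderer, and $\mathcal{L}$ a loss on the simulated painting $p$. A plain gradient descent step with a hypothetical step size $\eta$ would then read
\[
 p = R(\theta, c), \qquad \theta \leftarrow \theta - \eta \, \nabla_{\theta}\, \mathcal{L}(p).
\]
 The optimizer and hyperparameters actually used may differ; the point is only that the gradient of the loss reaches $\theta$ through $R$.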
 
 \section{Approach}
 
 Our approach, depicted in Figure~\ref{fig:approach}, adds sound input to the FRIDA system. Because sounds fall into vastly different categories, we create approaches for two of them: 1) natural sounds, which are diverse sound samples from various sources, and 2) speech, which is a specific subset of sounds characterized by language and tone.

 \subsection{Natural Sounds Guidance}

 For natural sound guidance, we use the pretrained $CLIP_{Audio}$~\cite{sound_guided_cvpr} audio-image encoder, trained on a wide variety of labelled sounds from YouTube videos.
 We encode the simulated painting, $p$, and input audio, $a$, into the same latent space using $CLIP$~\cite{radford2021-clip} and $CLIP_{Audio}$ respectively. The encodings are compared using cosine distance to form a loss function. In practice, the painting is augmented using various perspective warps and crops, as is customary in CLIP-guided image synthesis, for robust loss backpropagation.

 

\begin{equation}
 \begin{aligned}
 f_a &= CLIP_{Audio}(a); & f_p &= CLIP(p)
\end{aligned}
\end{equation}
\begin{equation}
L_{\text{NS}}(p, a) = \cos(f_p, f_a)
\end{equation}

 We note that this method is also used for sound-guided image manipulation where an additional input image is added as an initial context. 
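
 For concreteness, the cosine distance in $L_{\text{NS}}$ can be written out as follows. This explicit form and the averaging over augmentations are a sketch in our own notation, not a statement of the exact implementation; $K$ and the random augmentations $T_1, \dots, T_K$ of the painting are hypothetical symbols.
\[
 \cos(f_p, f_a) = 1 - \frac{f_p \cdot f_a}{\lVert f_p \rVert \, \lVert f_a \rVert}, \qquad
 L_{\text{NS}}(p, a) \approx \frac{1}{K} \sum_{k=1}^{K} \cos\bigl(CLIP(T_k(p)),\, f_a\bigr).
\]
 Under this reading, the loss is $0$ when the two encodings point in the same direction and grows as they diverge, so minimizing it pulls the painting's encoding toward that of the audio.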

\subsection{Speech Guidance}

Speech is a special subset of sound with two distinct features: language and tone. Our approach to speech-guided painting decouples speech into text and emotion. This is in contrast to our natural sound guidance approach, which is of limited use with speech inputs because the $CLIP_{Audio}$ model cannot comprehend the language used in the speech.

The input speech is transcribed to text, $t$, using the Whisper~\cite{radford2022Whisper} model. This text can then guide the painting using FRIDA's existing text-to-painting methodology, which encodes both the text and the simulated painting into the same latent space using CLIP and compares them with cosine distance as a loss function, as seen in the first term of Equation~\ref{eq:loss_speech}.


\begin{equation}
 \begin{aligned}
 f_t &= CLIP(t); & f_p &= CLIP(p) \\
 e_p &= E_{img}(p); & e_s &= E_{speech}(a)
\end{aligned}
\label{eq:feat_extract}
\end{equation}
\begin{equation}
L_{\text{S}}(p, a) = \cos(f_p, f_t) + \cos(e_s, e_p)
\label{eq:loss_speech}
\end{equation}

\begin{figure*}
    \centering
    \small
    \includegraphics[width=\textwidth,height=6cm]{emotion_grid.png}\vspace{-7pt}
    \caption{An image grid depicting the results of emotion guidance with varying guidance strengths. From left to right, each column showcases the progression of the synthesized paintings, with increasing levels of emotion guidance strength, as indicated by the numbers.}
    \label{fig:emotion_grid}
\end{figure*}

\begin{table}[]
\footnotesize
\begin{tabular}{@{}llllll@{}}
\toprule
ArtEmis                                                  & RAVDESS  & CREMA-D & TESS     & SAVEE & IEMOCAP                                                  \\ \midrule
amusement                                                & happy    & HAP   & happy    &     & hap, exc                                                 \\
anger                                                    & angry    & ANG   & angry    & a   & fru, ang                                                 \\
awe                                                      &          &       &          &     &                                                          \\
contentment                                              & calm     &       &          &     &                                                          \\
disgust                                                  & disgust  & DIS   & disgust  & d   & dis                                                      \\
excitement                                               & surprise &       & surprise & su  & sur                                                      \\
fear                                                     & fear     & FEA   & fear     & f   & fea                                                      \\
sadness                                                  & sad      & SAD   & sad      & sa  & sad                                                      \\
\begin{tabular}[c]{@{}l@{}}something\\ else\end{tabular} & neutral  & NEU   & neutral  & n   & \begin{tabular}[c]{@{}l@{}}neu, xxx, \\ oth\end{tabular} \\ \bottomrule
\end{tabular}\vspace{-7pt}
\caption{Correspondence between emotion labels in different datasets. From left to right: ArtEmis~\protect\cite{achlioptas2021artemis}, RAVDESS~\protect\cite{livingstone2018ravdess}, CREMA-D~\protect\cite{cao2014cremaD}, TESS~\protect\cite{dupuis2010tess}, SAVEE~\protect\cite{jackson2014surrey}, and IEMOCAP~\protect\cite{busso2008iemocap}.}
\label{tab:emotions}
\end{table}


We adapt an existing framework for Speech Emotion Recognition (SER) trained on four datasets containing speech with labelled emotions~\cite{burnwal2020-speechEmotionKaggle}. Features such as the Mel Frequency Cepstral Coefficients (MFCC), Mel spectrogram, and chromagram are extracted from the input speech waveform, and a convolutional neural network is trained to predict the emotion. For robustness, we incorporate a fifth dataset for training: the Interactive Emotional Dyadic Motion Capture (IEMOCAP)~\cite{busso2008iemocap} dataset, which contains over 12 hours of audio-visual data of people acting, annotated with emotions. Not all of the datasets contain the same set of emotion labels, and some use different terminology for similar emotions. Table~\ref{tab:emotions} summarizes how we map these emotion labels onto one another.

We guide the painting to have an emotional appearance with the second term of the loss function in Equation~\ref{eq:loss_speech}. A pretrained image-to-emotion prediction model~\cite{szymczak2022emotionPrediction} ($E_{img}$) predicts the emotion of the simulated painting, which is then compared to the emotion predicted from the input speech ($E_{speech}$). The image-to-emotion model was trained on the ArtEmis dataset~\cite{achlioptas2021artemis}, which contains 80k artworks with labeled emotions. We use gradient descent to change the brush strokes such that the emotion predicted from the painting is similar to the emotion predicted from the given speech.

\subsection{Multi-Modal Guidance}


\begin{equation}
\ddot{p} = \operatorname*{arg\,min}_p \left[ L_{\{NS|S\}}(p, a) + \sum_{i=1}^{4} w_i l_i \right]
\label{eq:objective}
\end{equation}

FRIDA makes painting plans by optimizing the brush stroke parameters to achieve an objective made up of a weighted sum of loss functions. Equation~\ref{eq:objective} displays FRIDA's objective function with our two losses, $L_{NS}$ and $L_S$, included for natural sound and speech guidance respectively; the subscript $\{NS|S\}$ indicates that one of the two is used depending on the type of audio input.
The FRIDA paper introduces four other loss functions, $l_i$, that connect the painting to the modalities of text, images, sketches, and styles.
The optimal painting plan, $\ddot{p}$, is the plan $p$ that minimizes this sum, given the weight $w_i$ for each loss function.
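
As an illustration, substituting the speech loss of Equation~\ref{eq:loss_speech} into Equation~\ref{eq:objective} gives the objective for speech-guided painting written out in full:
\[
\ddot{p} = \operatorname*{arg\,min}_p \left[ \cos(f_p, f_t) + \cos(e_s, e_p) + \sum_{i=1}^{4} w_i l_i \right].
\]
The guidance strength varied in Figure~\ref{fig:emotion_grid} is not written out in these equations. One way to expose it, which is our assumption rather than a description of the implementation, would be a scalar weight $\lambda \geq 0$ (hypothetical notation) multiplying the emotion term $\cos(e_s, e_p)$.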



 
 

 






\begin{figure}[t]
    \centering
    \includegraphics[width=\columnwidth]{figure3.png}
    \caption{Sound-guided image manipulation. The figure shows the source images (top row), their paintings drawn by FRIDA (middle row), and the paintings drawn by FRIDA with the proposed sound semantics (bottom row).}
    \label{fig:audio_edits}
\end{figure}

\section{Results}

\subsection{Natural Sounds Guidance}

We introduce sound semantics to the FRIDA framework and demonstrate how they can be used to produce a variety of contemporary paintings, offering the user greater freedom and control. Using only audio as guidance, as in Figure~\ref{fig:just_audio}, our approach is capable of generating paintings that capture various natural sounds.

We conducted a survey to quantify the results of Figure~\ref{fig:just_audio} and measure the correspondence between the input audio and the generated painting. In the survey, participants listened to one of the six audio samples used to generate the images in Figure~\ref{fig:just_audio}, then were shown the six paintings and asked to ``select the image below that looks like it was created from the above sound''. A random selection would result in 16.7\% accuracy as there were six options; however, participants selected the correct painting 43.3\% of the time. We conducted the survey through Amazon Mechanical Turk, recruiting 28 participants. Each audio sample was evaluated by five different participants. The confusion matrix is shown in Table~\ref{tab:confusion_matrix}. Many of the sounds were similar, which is reflected in the frequent confusion of the thunder, thunderstorm, and raining sounds.
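
The reported accuracy can be checked against the counts in Table~\ref{tab:confusion_matrix}. With six audio samples and five responses each there are 30 responses in total, and the diagonal entries count the correct ones:
\[
\frac{3+2+2+1+0+5}{6 \times 5} = \frac{13}{30} \approx 43.3\%, \qquad \text{compared with a chance level of } \frac{1}{6} \approx 16.7\%.
\]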

\begin{table}[]
\centering
\small
\begin{tabular}{rcccccc}
\textbf{}    & \rotatebox[origin=l]{90}{Drill} & \rotatebox[origin=l]{90}{Explosion} & \rotatebox[origin=l]{90}{Fire} & \rotatebox[origin=l]{90}{Raining} & \rotatebox[origin=l]{90}{Thunder} & \rotatebox[origin=l]{90}{Thunderstorm} \\
Drill        & \textbf{3} & 1 & 0 & 0 & 0 & 1  \\
Explosion    & 0 & \textbf{2} & 1 & 0 & 2 & 0  \\
Fire         & 0 & 0 & \textbf{2} & 1 & 2 & 0  \\
Raining      & 0 & 1 & 0 & \textbf{1} & 1 & 2  \\
Thunder      & 2 & 0 & 0 & 2 & \textbf{0} & 1  \\
Thunderstorm & 0 & 0 & 0 & 0 & 0 & \textbf{5} 
\end{tabular}
\caption{The confusion matrix from our survey where participants listened to audio then decided which of the six paintings in Figure~\ref{fig:just_audio} was generated using that audio. Rows display true labels and columns show the predicted labels.}
\label{tab:confusion_matrix}
\end{table}

Sound is nuanced and challenging to describe accurately with language. We paint multiple examples from the held-out set of the VGG-Sound~\cite{chen2020vggsound} dataset that were all categorized as ``female singing'' (Figure~\ref{fig:female_singing}). We attempt to describe each sound and capture stills from the YouTube videos from which the sounds were scraped. Despite all the sounds sharing a category and having some similarities in sound, instrumentation, and atmosphere, the generated paintings are vastly dissimilar to each other. The paintings generated from the audio were also vastly different from those generated from only the text description of the sound, which suggests that sound is a valuable input modality that other modalities cannot represent well.


\subsection{Natural Sound Guided Image Manipulation}




Figure~\ref{fig:audio_edits} further shows that this technique enables a wide variety of unique painting manipulations that extend the existing framework. We contrast the image editing capabilities of sound and text in Figure~\ref{fig:audio_text} to emphasize the distinctions between the proposed sound modality and the existing text modality. The figure shows that sound-based guidance tends to lead the painting generation along a more distinct, and arguably more meaningful, path than text-based guidance. These results indicate that audio-based semantic guidance allows the user to make creative adjustments and synthesize unique artworks.



\begin{figure}[ht]
    \centering
    \includegraphics[width=\columnwidth]{text_audio.png}
    \caption{Comparison of modified paintings with text-guided and sound-guided semantic manipulation. The figure shows the source images (first column), their paintings drawn by FRIDA with text guidance (middle column), and the paintings drawn by FRIDA with audio guidance (last column).}
    \label{fig:audio_text}
\end{figure}




\begin{figure}
    \centering
    \includegraphics[width=\columnwidth]{figures/frida_music.png}
    \caption{Painting using various pop songs as input. Genres of the songs from left to right are Disco, Traditional American Folk, and Rap/Hip-Hop.}
    \label{fig:frida_music}
\end{figure}



\begin{figure*}
    \centering
    \small
   
    \includegraphics[width=\textwidth]{figures/emotion_and_speech.png}\vspace{-7pt}
    \caption{Top row images: Painting with only guidance from the emotions in the ArtEmis dataset. Bottom row: painting with emotion and the text ``A house and a tree.''}%
    \label{fig:pure_emotion}\vspace{-10pt}%
\end{figure*}

\begin{figure}[ht]
    \centering \small
    \includegraphics[width=0.8\columnwidth]{audio_and_other_modes_vertical.png}\vspace{-7pt}
    \caption{Combining audio as an input modality with other modalities that FRIDA can handle.}%
    \label{fig:audio_and_other_modes}%
\end{figure}

\subsection{Emotion Guidance}
We paint using only emotional guidance in Figure~\ref{fig:pure_emotion} to show its effects in isolation. The content is not usually apparent in these images, but the colors and shapes do correlate with their given emotion. Examples include the shape of an animal with its tongue out for amusement, sports teams for excitement, and religious iconography for awe. Furthermore, in Figure~\ref{fig:emotion_grid}, we demonstrate a gradual progression of the influence of emotion guidance, from minimal to maximal guidance strength. As we increase the strength, we again observe changes in the color and content of the synthesized paintings in relation to the intended emotions. The color scheme is more vivid for `amusement', duller for `sadness', and leans towards red for `anger'. Additionally, for emotions such as `sadness', we can also discern a change in facial expression between the reference image and the synthesized paintings.


In a Mechanical Turk survey similar to the sound study, we presented participants with one of the eight emotions in Figure~\ref{fig:pure_emotion} (omitting ``Something Else'') in writing, along with the paintings generated using the eight emotions in randomized order. Participants were asked to identify which painting looked most like the given emotion. The 75 participants were correct 26.5\% of the time (a random guess would be correct 12.5\% of the time).
The most frequently correctly identified emotion was Awe, at 60\% correct, and the most common confusion was Amusement, which was incorrectly identified as Excitement 80\% of the time.
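
For reference, with eight candidate paintings the chance level is
\[
\frac{1}{8} = 12.5\%,
\]
so the observed 26.5\% is about $26.5 / 12.5 \approx 2.1$ times chance, compared with $43.3 / 16.7 \approx 2.6$ times chance in the sound study.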

\subsection{Sound and Various Other Modality Guidance}

We paint with audio inputs paired with FRIDA's existing modalities in Figure~\ref{fig:audio_and_other_modes}.  Loss function weights are adjusted to allow the appearance of both modalities to become prominent.  The results are abstract, but represent both modalities strongly.

\subsection{Music Guidance}

The VGG-Sound dataset used to train our sound encoder contains some music in various forms. Music is a modality that is extremely challenging to represent accurately in any other form, such as language or images. We experiment with how well the sound encoder generalizes to music in Figure~\ref{fig:frida_music}. The results are abstract but bear some resemblance to the content and general atmosphere of the given songs.


\section{Discussion}

This work lays groundwork for creative human-robot interaction research. Increasing the input space of the robotic painting system should increase the control of the user, as they have more modalities with which to communicate their high-level goals. This may allow the human interacting with such a system to feel more ownership over the artwork created, such that the human and robot act as collaborators rather than the robot serving as creative automation.

Sound can act as an accessible input modality for people with visual impairments who like to paint. In future work, we hope to engage with different communities of people who face physical barriers to painting, to enable more people to express themselves through visual art.

\subsection{Limitations}

The range of content our approach is capable of producing depends on the training data of the models it uses. The image-to-emotion prediction model we use is trained on the ArtEmis dataset~\cite{achlioptas2021artemis}, which pairs paintings with their labelled emotions. Because art data was used, the model is biased towards guiding our images towards this distribution and away from other image distributions, such as photographs. While the VGG-Sound dataset~\cite{chen2020vggsound} is large, with more than 300 categories, it is still not completely general, and we see its limitations in examples such as music.

As in other generative art tasks, evaluating the quality of produced outputs remains challenging.


\section{Conclusions}

When compared to text, sound can contain additional information beyond semantic meaning, including subtle nuances and emotions. In this paper, we present an approach for using sound and speech to guide the robotic painting process. Preliminary survey results suggest that participants can identify the intended sounds and emotions in the generated paintings at rates above chance. The proposed approach has been fully integrated with FRIDA~\cite{frida2022}, an existing robotic painting framework, to produce physical paintings.
By allowing artists and researchers to synthesize images from musical inputs, emotions, and natural sounds, we open up new avenues for exploration and innovation in creative AI. Our approach enables image synthesis purely from sound or emotion inputs, whereas previous work is limited to sound-guided image manipulation. Because our approach does not rely on a pretrained image synthesis model, it is not constrained by a generator's training dataset; the images it can produce are constrained only by the image-audio encoders.

\section*{Ethics Statement}

Some pretrained models used to create content in our approach were trained on large datasets, such as VGG-Sound~\cite{chen2020vggsound}, that are not completely vetted for harmful information as they are scraped from the internet. Models such as CLIP~\cite{radford2021-clip}, which we use for generating visual content from text, carry biases and harmful representations from their training data~\cite{birhane2021stereotypesInLAION}. For this reason, we do not recommend that this work be used in scenarios where people may be negatively impacted by these biases.

\section*{Acknowledgement}
This work was supported by NSF IIS-2112633 and the Technology Innovation Program (20018295, Meta-human: a virtual cooperation platform for a specialized industrial services) funded by the Ministry of Trade, Industry \& Energy (MOTIE, Korea).


\bibliographystyle{named}
