\documentclass[pmlr]{jmlr}% new name PMLR (Proceedings of Machine Learning Research)
% Template adapted for the 1st Workshop on Emerging AI Technologies for Music, as part of AAAI
% https://amaai-lab.github.io/EAIM2026/

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e
\usepackage{amsbsy,bm,upgreek,nicefrac,fixmath}
\usepackage{microtype}
 %\usepackage{rotating}% for sideways figures and tables
\usepackage{longtable}% for long tables

 % The booktabs package is used by this sample document
 % (it provides \toprule, \midrule and \bottomrule).
 % Remove the next line if you don't require it.
\usepackage{booktabs}
 % The siunitx package is used by this sample document
 % to align numbers in a column by their decimal point.
 % Remove the next line if you don't require it.
%\usepackage[load-configurations=version-1]{siunitx} % newer version
\usepackage{siunitx}

\usepackage{multicol,multirow}
\usepackage{float}
\usepackage{makecell}
%\usepackage{subcaption}

\usepackage[table]{xcolor}
\definecolor{gray1}{RGB}{225, 225, 225}
\definecolor{gray2}{RGB}{195, 195, 195}
\setlength{\fboxsep}{1pt} % Remove padding inside colorbox

 % The following command is just for this sample document:
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

 % Define an unnumbered theorem just for this sample document:
\theorembodyfont{\upshape}
\theoremheaderfont{\scshape}
\theorempostheader{:}
\theoremsep{\newline}
\newtheorem*{note}{Note}

 % change the arguments, as appropriate, in the following:
\jmlrvolume{303}
\jmlryear{2026}
\jmlrworkshop{EAIM2026 at AAAI}

%\title[Short Title]{Template for EAIM2026\titletag{\thanks{sample footnote}}}
\title[Sleep Music Generation]{A Novel Diffusion Model Based Approach for Sleep Therapeutic Music Generation}

 \author{\Name{Timo Hromadka} \Email{th716@cam.ac.uk}\\
  \Name{Kevin Monteiro} \Email{kidrm2@cam.ac.uk}\\
  \Name{Sam Nallaperuma} \Email{snn26@cam.ac.uk}\\
  \addr Department of Computer Science and Technology,\\
        University of Cambridge}

\editors{D. Herremans, K. Bhandari, A. Roy, S. Colton, M. Barthet}

\begin{document}

\maketitle

\begin{abstract}
Sleep disorders, particularly insomnia, and mental health conditions affect a significant fraction of adults worldwide, posing serious mental and physical health risks. Music therapy offers a promising, low-cost, and non-invasive treatment, but current approaches rely heavily on expert-curated playlists, limiting scalability and personalization. We propose a low-cost generative system that leverages recent advances in diffusion models to synthesize music for therapy. We focus on insomnia and curate a dataset of waveform sleep music to generate audio tailored to sleep. To ensure real-world feasibility, we optimize our system for training and use on a single GPU, balancing quality and efficiency through extensive ablation studies. We show through subjective human evaluations that our generated music matches or outperforms existing baselines in both perceived quality and relevance to sleep therapy, at a fraction of the computational cost.
\end{abstract}
\begin{keywords}
Diffusion Models, Music Therapy, Insomnia, Mental Health, Sleep Music Generation
\end{keywords}

\section{Introduction}
\label{sec:introduction}

It is estimated that nearly a third of the world's population suffers from insomnia \citep{bhaskar_prevalence_2016}, with one in ten suffering from chronic insomnia \citep{ellis_chronic_2023, riemann_insomnia_2022}.
Mental health conditions such as depression and anxiety are also known to be comorbid with insomnia \citep{palagini2022sleep, blank2015health}, and together they contribute to many lifestyle diseases with a severe impact on health, such as impaired metabolic, immunologic, and cardiovascular function. Additionally, sleep disruption increases susceptibility to numerous chronic conditions \citep{ramar_sleep_2021, medic_short-_2017, van_cauter_impact_2007, gebara_effect_2018}. The progressive decline in sleep duration among the public is considered a public health epidemic \citep{consensus_conference_panel_joint_2015}, and the most common treatments for insomnia are drug-based and have often been associated with adverse side effects.

Music therapy is emerging as an effective, non-invasive, and low-cost approach to treating sleep and mental health disorders
\citep{huang_insomnia_2023, mohamad_zamani_insomnia_2022, jespersen_music_2015, lee2013does}, as there is strong evidence that music can improve sleep quality \citep{jespersen_music_2015, dickson_music_2020, loewy_music_2020, wang_music_2014} and support the recovery approach in mental health care \citep{lee2013does}. However, producing music for therapy on demand
is underexplored \citep{richter_towards_2019}. Existing systems are static and lack the adaptability to cater to individual needs. Moreover, these approaches rely on expert therapists to curate music selections, limiting their accessibility and scalability for broader public use.

To advance the emerging field of therapeutic music generation, this paper focuses on low-cost sleep music generation that can be trained and deployed efficiently, making it accessible for wider use, and that operates on publicly available sleep music. To summarize, our main contributions are as follows: \textbf{(a)} we are the first to focus on generative sleep therapy using waveform audio, a more expressive and widely available audio format that allows for greater adaptability to personal preferences, surpassing the limitations of previous MIDI-based or algorithm-based approaches; \textbf{(b)} we are the first to apply the BigVGAN vocoder to music generation, demonstrating its superior performance despite being originally trained for speech synthesis; \textbf{(c)} we propose multiple model architectures that offer various trade-offs between generation quality and computational efficiency, with our fastest models leveraging a latent diffusion framework combined with the BigVGAN vocoder, and we specifically focus on training models on a single consumer GPU, employing a minimal latent diffusion architecture and carefully chosen hyperparameters to maximize efficiency without compromising performance; \textbf{(d)} we curate a specialized dataset of sleep music; and \textbf{(e)} through comprehensive objective and subjective evaluations, including human listener studies, we show that our models outperform existing methods in terms of music relevance to sleep therapy. We achieve these improvements while maintaining a smaller model size and reduced computational requirements compared to other state-of-the-art music models.

\section{Background in Computer Music Generation}
\label{sec:background}

Previously, music intervention studies often relied on generic sleep music with no expert curation \citep{loewy_music_2020, wang_music_2014, richter_towards_2019}. While some studies improve upon this by involving expert-selected music \citep{dickson_music_2020, jespersen_music_2015}, it has been shown that music preference is individual and could affect the outcome of music therapy \citep{chang_effects_2012, yamasato_how_2020}. Moreover, a critical gap remains: across nearly all studies there is a severe lack of analysis of the specific characteristics of sleep music \citep{loewy_music_2020}, such as melody, harmony, and rhythm. Implementing personalized music interventions on a large scale remains costly and logistically challenging.

Music generation methods can be broadly categorized into MIDI-based and waveform-based approaches. 
%MIDI (Musical Instrument Digital Interface) is a symbolic representation of music that encodes information about musical notes, such as pitch, duration, velocity, and timing, in a standardized computer-readable format. 
Transformers \citep{vaswani_attention_2023} have been widely employed to sequentially model MIDI notes, treating the task as a language modeling problem \citep{huang_pop_2020, qu_mupt_2024, zhang_learning_2023, jiang_transformer_2020}. However, MIDI only provides instructions for synthesizing sound; it does not directly produce audio and requires additional processing through instruments or software synthesizers, and thus lacks the expressiveness and fidelity required for generating full, synchronized soundtracks. Moreover, MIDI datasets often need to be created manually, whereas sleep music is readily available in audio (waveform) format.

On the other hand, learning from raw audio (waveform) provides both flexibility and ease of access to training data. Instead of needing to build music from MIDI instructions, music can be created directly from waveforms, allowing for more natural-sounding and expressive results. Additionally, this approach benefits from the large amount of sleep music available as waveforms, whereas MIDI versions of these songs are rarely accessible.

Specifically for sleep music, only two music generation approaches have been put forth so far. The first uses MIDI-based music generation \citep{yang_sleepgan_2022}, while the other utilizes an algorithmic approach using randomization and Markov chains to construct music elements \citep{tulilaulu_sleep_2012}. Thus, no work to date has focused on generating expressive sleep music from waveform audio. As a result, existing methods lack the ability to produce music that is highly expressive, natural-sounding, and adaptable to users' preferences, limiting their effectiveness. Since sleep-specific work is this sparse, we next review the most popular approaches in the literature for general music generation.

Recently, diffusion models have shown exceptional performance in generative AI tasks involving waveform audio. Existing approaches in the literature can be categorized as follows:
\textbf{(a)} Spectrogram- and Waveform-based Diffusion Models, which either process spectrogram data using encoders such as VAEs and employ a neural vocoder to convert spectrograms back into audio \citep{chen_musicldm_2023, liu_audioldm_2023, huang_noise2music_2023, yang_diffsound_2023, huang_make--audio_2023}, or operate directly on waveforms \citep{schneider_mousai_2023, schneider_archisound_2023, li_jen-1_2023};
\textbf{(b)} Diffusion Transformers (DiT), which generate audio by operating in the latent space, and use a combined diffusion model and transformer approach \citep{agostinelli_musiclm_2023, ning_diffrhythm_2025};
\textbf{(c)} Auto-regressive Language Models (most commonly transformers), which generate audio based on trained codecs \citep{copet_simple_2023, evans_long-form_2024, lan_high_2024, wu_towards_2024, wu_codec-superb_2024}; and
\textbf{(d)} Alternative Generative Methods, such as Consistency Autoencoders (CAE) \citep{pasini_music2latent_2024}.

However, in the field of music generation, few released models can be used on a single consumer GPU; most require large amounts of compute for both training and inference. Our research addresses this limitation by developing lightweight models that can be trained and deployed for inference on a single consumer GPU.
%For a better comparison with our paper with papers from literature, please visit the Appendix (Table~\ref{tab:num_training_steps}).

\section{Proposed Approach}
Our framework consists of four main components: \textbf{(a)} the Waveform Processor, \textbf{(b)} the Variational Autoencoder (VAE), \textbf{(c)} the Diffusion Model, and \textbf{(d)} the Vocoder.
Figure~\ref{fig:proposed_framework} depicts the overall pipeline of the proposed approach.
%For a diagram of all four components, see Figure~\ref{fig:proposed_framework}).


\subsection{Framework Components}

\begin{figure}[h!]
    \centering
    \hspace*{-0.05\linewidth} 
    \includegraphics[width=1.1\linewidth]{images/background/proposed_approach_new_4.pdf}
    \caption{
        Diagram of our proposed framework, which consists of four components: Waveform Processor, VAE, Diffusion, and Vocoder. 
        Raw audio (waveform) of length \( T_s \) from our sleep dataset is downsampled to \( \text{sr} = 22050\,\text{Hz} \) and converted to mono. It is then transformed into mel-spectrograms (\( X_m \)) with a time (x) resolution of \( T_{px} \) and a frequency (y) resolution of \( F_{mb} \). A pre-trained VAE encodes the mel-spectrogram into a compressed 2-dimensional latent space (\( z \)) with a compression ratio \( r \). During training (bold arrows), the diffusion model processes the batch once to predict the added noise in training samples, computing the loss for backpropagation. During inference (dashed arrows), the process starts with Gaussian noise and performs 1000 steps of ancestral sampling to generate a novel sample, which is passed through the VAE decoder to reconstruct the mel-spectrogram. Finally, the Vocoder converts the mel-spectrogram back into audio.
    }
    \label{fig:proposed_framework}
\end{figure}

The \textbf{Waveform Processor (a)} starts the pipeline by taking raw audio waveforms $x \in \mathbb{R}^{T_s}$ from the training dataset, where $T_s$ is the length of the audio signal in samples (e.g., an audio file with a sampling rate of 44.1\,kHz has 44,100 samples per second). To reduce computational cost, the audio is downmixed to mono and resampled to 22.05\,kHz, unless stated otherwise. The processed waveform is then transformed into a mel-spectrogram $\mathbb{X}_{\text{m}} \in \mathbb{R}^{T_{px} \times F_{mb}}$, where $T_{px}$ is the time resolution and $F_{mb}$ is the number of mel-frequency bins. The mel-spectrograms are sliced to the appropriate length for the next stage.
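
As a rough guide to how these parameters interact, the audio duration covered by one mel-spectrogram follows from its width $T_{px}$, the hop length $h$, and the sampling rate $\text{sr}$. The sketch below assumes a hop length of $h = 256$ samples, a value inferred from the sample lengths reported in Table~\ref{tab:combined_model_performance} rather than stated explicitly, and ignores edge padding; $T_{\text{sec}}$ is a symbol introduced here for illustration only:
\begin{equation*}
T_{\text{sec}} \approx \frac{T_{px} \cdot h}{\text{sr}}, \qquad
\frac{2048 \times 256}{22050\,\text{Hz}} \approx 23.8\,\text{s}, \qquad
\frac{512 \times 256}{22050\,\text{Hz}} \approx 5.9\,\text{s}.
\end{equation*}
Under the same assumption, doubling the hop length doubles the duration covered by a spectrogram of fixed width, while doubling the sampling rate to 44.1\,kHz halves it (e.g., $2048 \times 256 / 44100\,\text{Hz} \approx 11.9\,\text{s}$).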

Next, the \textbf{Variational Autoencoder (VAE) (b)} encodes mel-spectrograms into a compact latent space, enabling efficient training and inference. Specifically, $\mathbb{X}_{\text{m}} \in \mathbb{R}^{T_{px} \times F_{mb}}$ is encoded into $\mathbf{z} \in \mathbb{R}^{C \times \frac{T_{px}}{r} \times \frac{F_{mb}}{r}}$, where $C$ is the number of latent channels and $r$ is the downsampling factor. The decoder reverses this process, reconstructing a mel-spectrogram from the latent representation. Once the VAE is trained, its parameters are frozen during the training and inference stages of the diffusion model. The configuration of our VAE is based on the Stable Diffusion repository \citep{rombach_high-resolution_2022}. The VAE has an embedding dimension of 1, meaning the latent space is compressed to a single channel.
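
For illustration, consider the $4\times$ VAE listed in Table~\ref{tab:combined_model_performance} with a single latent channel ($C = 1$). Assuming that ``$4\times$'' denotes a downsampling factor of $r = 4$ along each axis, and taking the 2048$\times$128 mel-spectrogram configuration, the shapes would work out as follows:
\begin{equation*}
\mathbb{X}_{\text{m}} \in \mathbb{R}^{2048 \times 128}
\;\longmapsto\;
\mathbf{z} \in \mathbb{R}^{1 \times 512 \times 32},
\qquad
\frac{2048 \times 128}{1 \times 512 \times 32} = 16 = r^2 .
\end{equation*}
Under this assumption the diffusion model operates on 16 times fewer elements than in pixel space, which is consistent with the shorter sampling times reported for the latent configurations.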

Then, the \textbf{Diffusion Model (c)} generates novel samples $\mathbf{z}' \in \mathbb{R}^{C \times \frac{T_{px}}{r} \times \frac{F_{mb}}{r}}$ from Gaussian noise, either in pixel space (if no VAE is used) or in latent space (if the VAE is used). Using a U-Net backbone, the model progressively denoises the input noise through a reverse diffusion process over $N$ iterative steps. During training, the model learns to predict the noise added at each step of the forward diffusion process. At inference, it generates novel samples by iteratively reversing this process, starting from pure Gaussian noise.
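
The training objective described above can be sketched with the standard DDPM formulation, which we assume here for illustration; the symbols $\mathbf{z}_0$, $\mathbf{z}_t$, $\bar{\alpha}_t$, $\beta_s$, and $\boldsymbol{\epsilon}_\theta$ are introduced only for this sketch. A clean sample $\mathbf{z}_0$ is noised in closed form, and the network $\boldsymbol{\epsilon}_\theta$ is trained to recover the noise that was added:
\begin{align*}
\mathbf{z}_t &= \sqrt{\bar{\alpha}_t}\,\mathbf{z}_0 + \sqrt{1 - \bar{\alpha}_t}\,\boldsymbol{\epsilon},
\qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s), \\
\mathcal{L} &= \mathbb{E}_{\mathbf{z}_0,\, \boldsymbol{\epsilon},\, t}
\left[ \left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t) \right\|_2^2 \right],
\end{align*}
where $\beta_s$ is the noise schedule and $t$ is drawn uniformly from $\{1, \dots, N\}$. In the pixel-space setting, $\mathbf{z}_0$ is simply the mel-spectrogram itself.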

Finally, the \textbf{Vocoder Component (d)} converts the generated mel-spectrograms $\mathbb{X}_{\text{m}}' \in \mathbb{R}^{T_{px} \times F_{mb}}$ back into waveform audio $\mathbb{X}' \in \mathbb{R}^{T_s}$. Because mel-spectrograms discard phase information, this conversion is not straightforward. We experiment with two approaches to convert back to audio: the deterministic Griffin-Lim algorithm and the BigVGAN neural vocoder (available from the official NVIDIA BigVGAN GitHub repository). Specifically, we use the v2 model version (\texttt{bigvgan\_v2\_44khz\_128band\_256x}), which was pretrained on a diverse dataset, originally for speech synthesis; we show that it also performs well for music generation.

\subsection{Training Dataset}
To train our model, we selected a suitable dataset of sleep music. We used samples from the publicly available Spotify Sleep Playlist Dataset \citep{scarratt_audio_2023}.
%, filtering for songs explicitly labeled with "sleep" as one of their genres to exclude upbeat tracks. 
After filtering, the dataset has 19,000 30-second samples, amounting to $\sim$158 hours of music.
%The curated dataset can be found on our HuggingFace, under the username <redacted> (username redacted for submission privacy). 
For brevity, this paper often refers to this dataset as \texttt{SSD} (\textbf{S}potify \textbf{S}leep \textbf{D}ataset).

\subsection{Hyperparameters and Training Details}

\label{hyperparameters_and_training_details}

This section describes important hyperparameters and technical details of the relevant components used in our approach. An overview of the parameters is shown in Table~\ref{tab:hyperparameters_table}.


\begin{table}[]
\centering
\scalebox{0.9}{
\footnotesize
\renewcommand{\arraystretch}{1.1}
\begin{tabular}{p{2cm}|p{4cm}p{8.5cm}}
\toprule
\textbf{Component} & \textbf{Hyperparameter} & \textbf{Value} \\ 
\midrule
\multirow{5}{*}{\makecell{\textit{Waveform}\\\textit{Processor}}} 
    & Sampling Rate                      & 22.05\,kHz or 44.1\,kHz                                                                     \\ 
    & Mel-spec Pixel Width               & 512\,px or 2048\,px                                                                    \\ 
    & Mel Bands                          & \colorbox{lightgray}{Variable}                                                                                 \\ 
    & Hop Length                         & \colorbox{lightgray}{Variable}                                                                                \\ 
    & FFT Size                           & \colorbox{lightgray}{Variable}                                                                                 \\ 
\midrule
\multirow{10}{*}{\makecell{\textit{VAE}}} 
    & KL Divergence Reg. Term            & \(1 \times 10^{-6}\)                                                                   \\ 
    & Discriminator                      & After 50,001 steps                                                                     \\ 
    & Base Channel Width                 & 32                                                                                     \\ 
    & Channel Multipliers                & \([1, 2, 4, 4]\)                                                                       \\ 
    & Input/Output Channels              & 1                                                                                      \\ 
    & Residual Blocks per Level          & 2                                                                                      \\ 
    & Learning Rate                      & \(4.5 \times 10^{-6}\)                                                                 \\ 
    & Batch Size                         & 32                                                                                     \\ 
    & EMA                                & \(\text{inv. gamma} = 1.0, \text{power} = \frac{3}{4}, \text{max decay} = 0.9999\) \\ 
    & Compression Rate                   & \colorbox{lightgray}{Variable}                                                                                \\ 
\midrule
\multirow{11}{*}{\makecell{\textit{Diffusion}\\\textit{Model}}} 
    & Down Blocks                        & {\footnotesize\texttt{[DownBlock, DownBlock, DownBlock, DownBlock, AttnDownBlock, DownBlock]}} \\ 
    & Up Blocks                          & {\footnotesize\texttt{[UpBlock, AttnUpBlock, UpBlock, UpBlock, UpBlock, UpBlock]}} \\ 
    & Out Block Channels                 & \([128, 128, 256, 256, 512, 512]\)                                                    \\ 
    & Learning Rate                      & \(10^{-4}\) w/ 500 warmup steps                                                       \\ 
    & Adam Optimizer                     & \(\beta_1 = 0.95, \beta_2 = 0.999, \epsilon = 10^{-8}, \text{weight decay} = 10^{-6}\) \\ 
    & EMA                                & \(\text{inv. gamma} = 1.0, \text{power} = \frac{3}{4}, \text{max decay} = 0.9999\) \\ 
    & Training Steps                     & 200k Steps                                                                       \\ 
    & Noising Parameters                 & \(\beta_1 = 10^{-4}, \beta_T = 0.02\)                                                 \\ 
    & Noising Schedule                   & Cosine                                                                                \\ 
    & Inference Steps                    & 1000 DDPM steps or 100 DDIM steps                                                   \\ 
    & Batch Size                         & 16 or 8                                                                               \\ 
\midrule
\multirow{2}{*}{\makecell{\textit{Vocoder}}} 
    & Neural Vocoder                     & \texttt{bigvgan\_v2\_44khz\_128band\_256x}                                          \\ 
    & Griffin-Lim Iterations             & 32                                                                                     \\ 
\bottomrule
\end{tabular}
}
\caption{Hyperparameters for each component. Most parameters are fixed throughout the paper, following standard practice in the literature; those that are the focus of the paper vary across individual experiments and are marked as `\colorbox{lightgray}{Variable}'.}
\label{tab:hyperparameters_table}
\end{table}


\section{Evaluation Setup}
We perform objective and subjective evaluations of the generated music and compare our model to recent baseline models.

\subsection{Objective Assessments}
In our experiments, generated samples are evaluated using the Fr\'{e}chet Audio Distance (FAD) metric \citep{kilgour_frechet_2019}. 
FAD measures the similarity between the distributions of two sets of audio tracks, one generated and one reference, by computing the Fr\'{e}chet distance \citep{heusel_gans_2018}, also known as the Wasserstein-2 distance, between Gaussians fitted to their respective embeddings.
We leverage the library
and incorporate the improved parameters for FAD calculation from \citet{gui_adapting_2024}, addressing their recent finding that traditional FAD metrics may not reliably align with human judgment. Specifically, the widely used VGGish \citep{hershey_cnn_2017} embedding model has been shown to inadequately reflect human perception. Instead, we use the CLAP model backbone \citep{wu_large-scale_2024}, which aligns more closely with human evaluations and comes in two variants: \texttt{clap-laion-audio} and \texttt{clap-laion-music}.
This is in line with other recent works \citep{novack_ditto_2024, manor_zero-shot_2024},
which also adopt the CLAP model backbone to compute embeddings for FAD calculation.
To remain consistent with recent literature, we nonetheless report VGGish scores for comparison. For brevity, we henceforth refer to FAD scores calculated using each of the three models as \texttt{FAD\textsubscript{cla}}, \texttt{FAD\textsubscript{clm}}, and \texttt{FAD\textsubscript{vgg}}, respectively.
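
For reference, the Fr\'{e}chet distance between two Gaussians has a closed form. Writing $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ for the mean and covariance of the reference and generated embeddings (notation introduced here for illustration), the score is
\begin{equation*}
\text{FAD} = \left\| \mu_r - \mu_g \right\|_2^2
+ \operatorname{Tr}\left( \Sigma_r + \Sigma_g - 2 \left( \Sigma_r \Sigma_g \right)^{1/2} \right).
\end{equation*}
Lower values indicate that the generated set lies closer to the reference set in the embedding space, so the choice of embedding model (VGGish or CLAP) directly determines which differences the score is sensitive to.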

We use the FMA Pop dataset \citep{defferrard_fma_2017} as the reference set for FAD calculation to evaluate music quality, following recommendations for generative music tasks \citep{gui_adapting_2024}. The FMA Pop dataset provides a more reliable benchmark for assessing music quality than the commonly used MusicCaps dataset \citep{agostinelli_musiclm_2023}. Next, to specifically evaluate how closely our generated samples align with sleep music characteristics, we compute FAD scores using our curated Spotify Sleep Dataset as a reference set. Throughout this paper, we report FAD scores for both reference sets, FMA Pop and the Spotify Sleep Dataset. To prepare our reference sets for FAD calculation, we download the FMA Pop dataset \citep{defferrard_fma_2017} from the official repository.

\subsection{Subjective Assessments}
We conduct a subjective human study following the same procedure as in existing studies~\citep{li_jen-1_2023, copet_simple_2023, kreuk_audiogen_2023}, to evaluate the generated music. 


Human participants were asked to rate: \textbf{(a)} the overall perceived quality of the generated audio samples (\textbf{Qual}), and \textbf{(b)} the relevance of the generated audio samples to sleep music (\textbf{Rel}), both on a scale of 1 to 100. We leverage the Amazon Mechanical Turk platform to recruit participants and ensure they are paid at least the UK national minimum wage. Noisy annotations and outliers are dropped: we discard responses from participants who did not listen to the full audio samples and from annotators who rated the reference audio samples below 85. All audio samples were also normalized to $-20.0$\,dBFS for fairness.

\subsection{Baseline Models}
Based on the findings from our literature review, we select AudioLDM \citep{liu_audioldm_2023} and MusicGen \citep{copet_simple_2023} as baseline models for comparison with our proposed model.
We choose these models as they have publicly available APIs to generate samples and also offer configurations with small enough model sizes, which is in line with our goal of developing lightweight models for resource-constrained environments. We deploy these baseline models using the publicly available implementations on HuggingFace and generate samples by providing the prompt ``relaxing sleep music perfect for sleep therapy'' as well as other inputs based on the respective authors' recommendations.
Sleep music generated by our model is filtered (similar to how the best candidates are selected in the official implementation of AudioLDM \citep{liu_audioldm_2023}) to ensure the quality of the presented samples. We compute individual FAD scores \citep{gui_adapting_2024} for these generated samples and select the subset with the best scores for the subsequent stages.

\section{Evaluation Results}
%We then include a brief analysis of the musical alignment between samples generated by our model and sleep music by comparing extracted musical features.
In this section, we first present a comparison between different model configurations followed by our objective and subjective evaluation results.

\subsection{Model Configuration Comparison}
Our ablation experiments showed that FAD scores can be improved through small architectural changes. To further enhance performance, we replace the Griffin-Lim algorithm for mel-spectrogram-to-audio conversion with a neural vocoder. We showcase all the main models and parameters side by side in Table~\ref{tab:combined_model_performance}.

Each long-sample model, corresponding to 2048$\times$128-pixel mel-spectrograms, was trained for approximately 72 hours, while each short-sample model, using 512$\times$128-pixel mel-spectrograms, was trained for around 36 hours. All models were trained on a single A100 GPU. We selected these training durations to remain compute-efficient and to roughly match the number of training steps of other small models in the literature. Latent models performed worse than pure mel-spectrogram diffusion models, with noticeable improvements observed when employing the BigVGAN vocoder. The impact of the vocoder is more pronounced for shorter samples, suggesting that the generated mel-spectrograms are of higher quality. BigVGAN consistently outperforms the Griffin-Lim algorithm across all configurations, particularly when paired with a VAE during the diffusion process. This highlights BigVGAN's ability to mitigate reconstruction artifacts inherent in latent models. The strongest-performing setup combines a mel-spectrogram diffusion process with the BigVGAN vocoder.
In terms of efficiency, VAE-based configurations are markedly faster, with generation times reduced from 49.75 to 3.44 seconds for longer samples and from 12.13 to 1.44 seconds for shorter samples. The BigVGAN vocoder enhances sample quality while introducing only minimal computational overhead, further solidifying its advantage over the Griffin-Lim algorithm. Additionally, VAE models achieve faster-than-real-time generation. Across \textit{all} experiments, the DDIM sampler is also capable of faster-than-real-time generation; however, this paper focuses on reporting results using the DDPM sampler. These findings demonstrate the feasibility of utilizing diffusion models for real-time music generation systems.
We select the overall best-performing model to carry forward into the human evaluation
section: a mel-spectrogram-based diffusion model with the BigVGAN vocoder, producing samples of $\sim$12 seconds (\textit{Mel-spec + BigVGAN}).
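
To make the real-time comparison concrete, one can compute a real-time factor as the ratio of generated audio duration to sampling time, with values above 1 indicating faster-than-real-time generation. The name RTF and the symbols below are introduced here for illustration; the numbers are taken from the 22.05\,kHz rows of Table~\ref{tab:combined_model_performance}:
\begin{align*}
\text{RTF} &= \frac{T_{\text{audio}}}{T_{\text{sampling}}}, \\
\text{long, VAE (DDPM):} \quad & \frac{23.8}{3.44} \approx 6.9, \qquad
\text{short, VAE (DDPM):} \quad \frac{5.9}{1.44} \approx 4.1, \\
\text{long, no VAE (DDPM):} \quad & \frac{23.8}{49.75} \approx 0.48, \qquad
\text{long, no VAE (DDIM):} \quad \frac{23.8}{4.975} \approx 4.8 .
\end{align*}
By this measure, the pixel-space model with the DDPM sampler is the only one of these four cases that falls below real time, and the VAE yields a roughly $14\times$ reduction in DDPM sampling time for long samples ($49.75 / 3.44 \approx 14.5$).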

\begin{table}[h!]
\scalebox{0.87}{
\scriptsize
\centering
\begin{tabular}{p{3.4cm}|p{1.2cm}p{1.3cm}p{4.3cm}|p{0.8cm}p{0.8cm}|p{0.8cm}p{0.8cm}}
\toprule
\textbf{Configuration}      & \textbf{Sample Length, s (kHz)} & \textbf{Sampling Time DDPM/ DDIM (s)} & \textbf{Model Size (\#Params/MB)} & \multicolumn{2}{p{2.4cm}|}{\raggedright\textbf{FMA Pop Reference Set}} & \multicolumn{2}{p{2.4cm}}{\raggedright\textbf{SSD \newline Reference Set}} \\ 
                            &                           &                                        &              & \textbf{FAD\textsubscript{cla}} & \textbf{FAD\textsubscript{clm}} & \textbf{FAD\textsubscript{cla}} & \textbf{FAD\textsubscript{clm}} \\ 
\midrule
\multicolumn{8}{l}{\textbf{
\begin{tabular}[c]{@{}l@{}}2048$\times$128 Samples \\ (Longer)\end{tabular}
}} \\ 
Mel-spec               & 23.8 (22.05kHz)             & 49.75/4.975           & 113M/455MB           & 0.824                              & 0.917                              & 0.446 & 0.481 \\
Mel-spec (hl512 nfft1024)      & 47.5 (22.05kHz)              & 49.75/4.975           & 113M/455MB            & 0.949                              & 0.904                              & 0.697 & 0.736 \\
Mel-spec + 4$\times$ \colorbox{gray1}{VAE}             & 23.8 (22.05kHz)              & 3.44/0.344        & 113+\colorbox{gray1}{1.3}M/455+\colorbox{gray1}{5}MB            & 1.154                              & 1.190                              & 0.696 & 0.719 \\
Mel-spec + \colorbox{gray2}{BigVGAN}            & 11.9 (44.1kHz)              & 50.93/6.155      & 113+\colorbox{gray2}{112}M/455+\colorbox{gray2}{451}MB            & \textbf{0.607}                     & \textbf{0.630}                     & \textbf{0.357} & \textbf{0.460} \\
Mel-spec + 4$\times$ \colorbox{gray1}{VAE} + \colorbox{gray2}{BigVGAN}   & 11.9 (44.1kHz)              & 4.62/1.524           & 113+\colorbox{gray1}{1.3}+\colorbox{gray2}{112}M/455+\colorbox{gray1}{5}+\colorbox{gray2}{451}MB           & -                         & -                             & - & - \\ 
\midrule
\multicolumn{8}{l}{\textbf{
\begin{tabular}[c]{@{}l@{}}512$\times$128 Samples\\ (Shorter)\end{tabular}
}} \\ 
Mel-spec               & 5.9 (22.05kHz)               & 12.13/1.213           & 113M/455MB           & 0.988                              & 1.012                              & 0.692 & 0.628 \\
Mel-spec (hl512 nfft1024)      & 11.9 (22.05kHz)             & 12.13/1.213           & 113M/455MB           & 0.807                     & 0.879                              & 0.484 & \textbf{0.506} \\
Mel-spec + 4$\times$ \colorbox{gray1}{VAE}             & 5.9 (22.05kHz)              & 1.44/0.144            & 113+\colorbox{gray1}{1.3}M/455+\colorbox{gray1}{5}MB           & 1.119                              & 1.129                              & 0.894 & 0.760 \\
Mel-spec + \colorbox{gray2}{BigVGAN}            & 3.0 (44.1kHz)                 & 11.585/1.343         & 113+\colorbox{gray2}{112}M/455+\colorbox{gray2}{451}MB           & \textbf{0.806}                     & \textbf{0.807}                     & 0.613 & 0.671 \\
Mel-spec + 4$\times$ \colorbox{gray1}{VAE} + \colorbox{gray2}{BigVGAN}   & 3.0 (44.1kHz)                 & 1.645/0.349            & 113+\colorbox{gray1}{1.3}+\colorbox{gray2}{112}M/455+\colorbox{gray1}{5}+\colorbox{gray2}{451}MB           & 0.832                              & 0.821                              & \textbf{0.478} & 0.541 \\ 
\bottomrule
\end{tabular}
}
\caption{FAD Score Evaluation on the FMA\_Pop Dataset for Models With Different Configurations and Sample Sizes. Shorter Sample Sizes (512$\times$128) and Longer Sample Sizes (2048$\times$128) are indicated in the Table. FAD scores for the longer Mel-spectrogram + 4$\times$ \colorbox{gray1}{VAE} + \colorbox{gray2}{BigVGAN} configuration are omitted as training was unable to converge within the allocated training time.}
\label{tab:combined_model_performance}
\end{table}


\subsection{Comparison with the Baselines}
\label{sec:comparison_with_baseline}

\begin{table}[h!]
\scriptsize
\centering
\scalebox{0.9}{
\renewcommand{\arraystretch}{1.4}
\begin{tabular}{p{1.7cm}p{1.4cm}|p{1.7cm}p{1.7cm}|p{1cm}p{1cm}p{1.2cm}|p{1cm}p{1cm}p{1cm}}
\toprule
%\multirow{2}{*}{\textbf{Model\\(50 samples)}} 
\multirow{2}{*}{\textbf{\begin{tabular}[c]{@{}l@{}}Model\\(100 samples)\end{tabular}}} 
& \multirow{2}{*}{\textbf{\# Params}} & \multicolumn{2}{c|}{\raggedright\textbf{}} & \multicolumn{3}{c|}{\raggedright\textbf{FMA Pop Reference Set}} & \multicolumn{3}{c}{\raggedright\textbf{SSD Reference Set}} \\
& & \textbf{Qual} $\uparrow$ & \textbf{Rel} $\uparrow$ & \(\mathbf{FAD_{\text{cla}} \downarrow}\) & \(\mathbf{FAD_{\text{clm}} \downarrow}\) & \(\mathbf{FAD_{\text{vgg}} \downarrow}\) & \(\mathbf{FAD_{\text{cla}} \downarrow}\) & \(\mathbf{FAD_{\text{clm}} \downarrow}\) & \(\mathbf{FAD_{\text{vgg}} \downarrow}\) \\
\midrule

Sleep Dataset (human composed)  & - & 94.72 {\tiny $\pm$ 0.81} & 92.31 {\tiny $\pm$ 2.23} & $0.704^\ast$ & $0.827^\ast$ & $11.656^\ast$ & 0.027 & 0.022 & 0.336 \\
\midrule
AudioLDM-S & 185M & 66.14 {\tiny $\pm$ 5.20} & 65.07  {\tiny $\pm$ 5.59} & \textbf{0.642} & 0.834 & \textbf{5.483} & 0.864 & 0.825 & 9.069 \\
MusicGen-S & 300M & 83.08 {\tiny $\pm$ 3.34} & 82.41 {\tiny $\pm$ 3.74} & 0.851 & \textbf{0.825} & 10.272 & 0.616 & 0.693 & 4.212 \\
\midrule
Proposed & \textbf{115M} & \textbf{83.70} {\tiny $\pm$ 3.23} & \textbf{85.74} {\tiny $\pm$ 3.03} & 0.823 & 0.849 & 11.609 & \textbf{0.251} & \textbf{0.416} & \textbf{2.782} \\

\bottomrule
\end{tabular}
}
\caption{Subjective (Qual and Rel) and Objective (FAD) comparison between the baseline models (AudioLDM, MusicGen) and our proposed model. FAD scores are computed using FMA Pop and the Spotify Sleep Dataset as reference as indicated. Lower FAD scores indicate greater similarity to the reference set and are typically preferred. Higher Qual and Rel, from the subjective evaluation surveys, indicate better perceived audio quality and relevance to sleep music and are therefore better.}

\label{tab:obj_subj_results}
\end{table}

Both objective and subjective results are reported in Table~\ref{tab:obj_subj_results}.
%
We first compare our proposed model with the selected baseline models using objective metrics.
%
When using FMA Pop as the reference set, our model achieves FAD scores comparable to those of MusicGen-S while using only about one third of the parameters. In contrast, when using the Spotify Sleep Dataset as the reference set, the proposed model considerably outperforms the baseline models, demonstrating better objective alignment with sleep music characteristics.
%
Next, we present the mean opinion scores and 95\% confidence intervals from the human evaluation study (subjective results). Again, the proposed model performs similarly to MusicGen-S in both audio quality and perceived relevance to sleep music, and it outperforms AudioLDM-S on both metrics.

We also observe high FAD values for the Spotify Sleep Dataset (100 samples) when using FMA Pop as the reference set. The FMA Pop set consists of studio recordings of pop songs, while the Spotify Sleep Dataset consists of sleep music. Sleep music refers to audio that is typically instrumental and calming. 
It often features slow tempos and may incorporate nature sounds such as rain, ocean waves, and wind. This difference in musical style between the Sleep Dataset and the FMA Pop set could explain the observed divergence, underscoring the importance of genre-specific reference sets and evaluation metrics when assessing generative models for specialized tasks such as sleep music generation.
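
For reference, FAD is commonly defined as the Fr\'{e}chet distance between two multivariate Gaussians fitted to embeddings of the reference and generated audio. The notation below ($\mu_r, \Sigma_r$ for the reference set and $\mu_g, \Sigma_g$ for the evaluated set) is introduced here for illustration only:
\begin{equation*}
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right).
\end{equation*}
Because both the mean and covariance terms depend on the reference embeddings, the same audio can score very differently against FMA Pop and against the Spotify Sleep Dataset, as the human-composed row of Table~\ref{tab:obj_subj_results} illustrates ($0.704$ versus $0.027$ for $\mathrm{FAD}_{\text{cla}}$).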

Our model has considerably fewer parameters and required far less training time than the baseline models.
%the training time for our model is much smaller too. 
We train our proposed model for 2 days on one A100 GPU with a batch size of 8 (circa 200k steps), and our VAE for a similar budget (200k steps on a single A100 GPU). MusicGen-S is trained for 1.5 million steps on 32 GPUs, and AudioLDM's VAE alone is trained for 1.5 million steps on a single GPU. We therefore not only match the baselines' performance but do so with substantially lower computational requirements. %For a more detailed breakdown, see the breakdown in the appendix (Table~\ref{tab:num_training_steps})
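
To make the comparison concrete, the figures quoted above can be turned into simple ratios. This is only back-of-the-envelope arithmetic: it ignores differences in batch size, GPU type, and training data, and it uses the parameter counts from Table~\ref{tab:obj_subj_results}:
\begin{align*}
\frac{\text{MusicGen-S steps}}{\text{our steps}} &\approx \frac{1.5 \times 10^{6}}{2 \times 10^{5}} = 7.5, \\
\frac{\text{our parameters}}{\text{MusicGen-S parameters}} &= \frac{115\,\text{M}}{300\,\text{M}} \approx 0.38, \\
\frac{\text{our parameters}}{\text{AudioLDM-S parameters}} &= \frac{115\,\text{M}}{185\,\text{M}} \approx 0.62.
\end{align*}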

On the whole, the proposed model outperforms AudioLDM-S and achieves performance that is comparable to, if not better than, MusicGen-S.


\section{Conclusion and Future Work}

In this work, we developed lightweight generative models tailored specifically for sleep music. Objective and subjective evaluations show that our models produce high-quality audio, often outperforming other approaches in the literature. We also demonstrate the successful application of the BigVGAN vocoder to music generation, achieving high fidelity. Our experiments examined key design choices for model architectures, balancing efficiency (training and sampling speed) against output quality. The results indicate that our lightweight models generate sleep music that strongly resembles real examples, as supported by low Fr\'{e}chet Audio Distance (FAD) scores and similarities across acoustic and audio features. Key findings include the following: \textbf{(a)} using a curated sleep music dataset enables our models to achieve superior quality and sleep-music relevance, as rated by human listeners, with significantly fewer parameters than existing methods; \textbf{(b)} pretrained BigVGAN vocoders, originally designed for speech, are capable of high-quality music generation; \textbf{(c)} alternative mel-spectrogram configurations (e.g., non-standard \texttt{hop\_length}, \texttt{n\_fft}, and \texttt{mel\_bands}) outperform conventional literature settings; and \textbf{(d)} VAE compression exhibits diminishing returns, with excessive compression degrading audio quality, as indicated by higher FAD scores. Based on these findings, we select a middle ground of 4$\times$ compression for lightweight diffusion model training, balancing efficiency with audio quality.

Future research will focus on investigating which musical features best support specific sleep phases and on integrating user data to enable adaptive, real-time music generation that corresponds to a user's sleep state. Continuous generation techniques such as successive conditioning or outpainting offer promising directions.



%\acks{Acknowledgements go here.}

\bibliography{eaim}


\end{document}
