\documentclass[pmlr]{jmlr}

 % The following packages will be automatically loaded:
 % amsmath, amssymb, natbib, graphicx, url, algorithm2e

 %\usepackage{rotating}% for sideways figures and tables
\usepackage{longtable}% for long tables

\usepackage{booktabs}
\usepackage{siunitx} % newer version % newer version
\newcommand{\cs}[1]{\texttt{\char`\\#1}}

\jmlrvolume{303}
\jmlryear{2026}
\jmlrworkshop{EAIM2026 at AAAI}

\title[Silence as Music]{Silence as Music: Controllable and Interpretable AI for Strategic Silence Placement}

 % Use \Name{Author Name} to specify the name.

 % Spaces are used to separate forenames from the surname so that
 % the surnames can be picked up for the page header and copyright footer.
 
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % *** Make sure there's no spurious space before \nametag ***

% Author information should be kept anonymous for the double-blind policy. Add in this information for the camera ready version.
 % Double-blind placeholder (replace in camera ready)
 \author{\Name{Gokul Srinath Seetha Ram} \Email{s.gokulsrinath@gmail.com}\\
 \addr Independent Author}

 % Three or more authors with the same address:
 % \author{\Name{Author Name1} \Email{an1@sample.com}\\
 %  \Name{Author Name2} \Email{an2@sample.com}\\
 %  \Name{Author Name3} \Email{an3@sample.com}\\
 %  \Name{Author Name4} \Email{an4@sample.com}\\
 %  \Name{Author Name5} \Email{an5@sample.com}\\
 %  \Name{Author Name6} \Email{an6@sample.com}\\
 %  \Name{Author Name7} \Email{an7@sample.com}\\
 %  \Name{Author Name8} \Email{an8@sample.com}\\
 %  \Name{Author Name9} \Email{an9@sample.com}\\
 %  \Name{Author Name10} \Email{an10@sample.com}\\
 %  \Name{Author Name11} \Email{an11@sample.com}\\
 %  \Name{Author Name12} \Email{an12@sample.com}\\
 %  \Name{Author Name13} \Email{an13@sample.com}\\
 %  \Name{Author Name14} \Email{an14@sample.com}\\
 %  \addr Address}


 % Authors with different addresses:
 % \author{\Name{Author Name1} \Email{abc@sample.com}\\
 % \addr Address 1
 % \AND
 % \Name{Author Name2} \Email{xyz@sample.com}\\
 % \addr Address 2
 %}

\editors{D. Herremans, K. Bhandari, A. Roy, S. Colton, M. Barthet}

\begin{document}

\maketitle

\begin{abstract}
AI music systems increasingly emphasize controllability and interpretable design. We propose a system that treats silence as a first-class compositional element and enables interactive shaping of silence placement through transparent analysis, cultural presets, and steerable controls. Our method constructs multiple candidate rest patterns from phrase boundaries, melodic tension, rhythmic heuristics, and cultural weights, then selects a mask via a quality function balancing rhythmic entropy, groove preservation, and structural coherence. We present baselines (random 10/25\%, phrase-only, tension-only, weak-beats), a proxy for language model without silence prompting, and our hybrid predictor. Across four canonical melodies and three cultural presets, our approach increases rhythmic variety while preserving groove and phrase alignment relative to baselines, offering an interpretable framework for co-creative composition. We release an API, offline demos, audio examples (WAV), and a comprehensive experiment suite to support interactive composition, pedagogy, and performance.
\end{abstract}
\begin{keywords}
controllability; interpretability; symbolic music; silence; rests; pedagogy; performance systems
\end{keywords}

\section{Introduction}
\label{sec:intro}

Silence is not absence; it is agency. In performance, rests breathe; in composition, space shapes expectation and release. While recent AI music systems excel at generating notes, they rarely provide fine-grained control over rests as intentional, expressive events. We propose a controllable, interpretable system that elevates silence to a steerable and culturally aware component of the compositional palette.

Our design goals are practical and interpretable: provide intuitive controls (density, phrase emphasis, tension emphasis, groove preservation), preserve transparency through analysis artifacts (phrase boundaries, tension points), and respect cultural practices via presets. The system outputs a binary silence mask aligned to the input melody, along with a compact explanation of why each rest is proposed.

Contributions:
\begin{itemize}
  \item A controllable pipeline for strategic rest placement using phrase, tension, rhythmic cues, and cultural presets.
  \item A selection function balancing rhythmic entropy, groove, and structural coherence with interpretable analysis outputs.
  \item An experiment suite with strong baselines, ablations, audio examples, and an API/CLI for reproducibility and co-creative workflows.
\end{itemize}

We summarize the pipeline in Section~\ref{sec:method}; Figures~\ref{fig:radar} and~\ref{fig:heat_mc} present key results; additional advanced visualizations appear in Section~\ref{sec:figures}.

Despite rapid advances in controllable symbolic generation, few systems explicitly expose silence as a tunable dimension of musical intention. EAIM’s emphasis on interpretability and human–AI collaboration directly motivates our approach.

\section{Related Work}
Silence in composition and performance is central to phrasing, tension, and form across traditions, yet has been underrepresented in AI systems which predominantly focus on generating pitches and durations. Controllable symbolic generation and long-range musical structure have advanced with sequence models \citep{huang2019music,mogren2023figaro,chen2024sympac}, and controllability in audio/music generation continues to mature \citep{copet2023musicgen,huang2024ruleguided,zhu2025ftg}. Expressive performance modeling has a long history in timing and dynamics \citep{widmer2003expression,medel2016performance,bresin2002directormusices}. Prior work often treats rests as byproducts of duration sampling or as implicit pauses emergent from performance timing. In contrast, we elevate rests to explicit, controllable targets with user-facing parameters and culturally grounded presets. Our quality metrics follow interpretable rhythmic measures (inter-onset interval variability; groove on strong beats) \citep{toussaint2004npvi,witek2014groove,madison2014groove,nelias2022swing}, and structural alignment via phrase boundaries \citep{cambouropoulos2001lbdm,kranenburg2020rulemining,guan2025phrase,hernandez2023graphstructure}, favoring transparent, musically meaningful objectives over opaque composites.

While prior controllable models target timbre or pitch structure, none formalize rests as explicit optimization targets; our formulation extends these frameworks toward silence control.

\section{Method}
\label{sec:method}

Given a monophonic melody of length $n$, we predict a binary mask $S\in\{0,1\}^n$ (1 = rest). We generate a set of candidate masks and select the best according to a transparent quality function.

\subsection{Formal Problem Statement}
Let $M=(m_1,\dots,m_n)$ be a symbolic melody on a discrete grid of $n$ time steps. A rest mask is $S\in\{0,1\}^n$ with $S_i=1$ indicating a rest at index $i$. The objective is to maximize a quality functional $Q$ measuring rhythmic variety, pulse preservation, and structural alignment:
\[\small
Q(M,S) = w_H\,H(S) + w_G\,G(S) + w_C\,C(M,S),\quad (w_H,w_G,w_C)\in\mathbb{R}_{\ge 0}^3.
\]
Subject to invariants: (i) no adjacent strong-beat rests; (ii) mask density within user- or preset-specified bounds; (iii) rest placements respect basic metrical constraints.

\subsection{Analysis Layer}
We compute: (i) phrase boundaries $B$ via melodic contour differentials and periodicity; (ii) tension points $T$ where leaps exceed a threshold $\tau$ semitones relative to local contour (default $\tau=3$); (iii) rhythmic context (strong/weak beats under a default 4/4 assumption); and (iv) a lightweight harmonic proxy from pitch class tendencies. The analyzer is deterministic, fast, and provides interpretable artifacts for explanation.

\subsection{Candidate Generation}
We construct candidates using complementary strategies:
\begin{itemize}
  \item Phrase: rests at boundaries in $B$ to create breathing points.
  \item Tension: rests following elements in $T$ to build anticipation.
  \item Weak-beat: rests on off-beats to increase syncopation while preserving downbeats.
  \item Cultural: weights strategies based on a preset (e.g., boundary emphasis in Western classical; rhythmic complexity in jazz; tala/raga sensitivity for Indian classical).
  \item Hybrid: union of selected strategies with conflict resolution to avoid adjacent strong-beat rests and excessive density.
\end{itemize}

\subsection{Quality Function and Selection}
For a candidate $S$, we compute: rhythm entropy $H(S)$ (diversity of inter-onset intervals, i.e., IOI variability \citep{toussaint2004npvi}), groove factor $G(S)$ (proportion of notes on strong beats \citep{witek2014groove,madison2014groove}), and structural coherence $C(S)$ (alignment of rests with $B$ \citep{cambouropoulos2001lbdm}). We select
\[
S^* = \arg\max_S \big(w_H H(S) + w_G G(S) + w_C C(S)\big)
\]
with hard constraints enforcing pulse continuity (no adjacent strong-beat rests). Default $(w_H, w_G, w_C)$ favor groove conservation and boundary coherence; users can adjust weights through the API.

\subsection{Prompt Refinement (Optional)}
When a language model is available, we optionally refine placements with a concise prompt:
\begin{quote}\small
Original: C C G G A A G F F E E D D C\\
Random: C SILENCE G G A SILENCE G F F E E D D C\\
Instruction: Place SILENCE to enhance phrasing (breathing at boundaries), build tension before resolution, preserve groove. Output the same length using note names and SILENCE.
\end{quote}
We also evaluate a no-silence-prompt proxy by removing rest guidance to isolate the effect of explicit silence control.

\subsection{Complexity and Implementation}
The analyzer and candidate generation are linear in $n$; metric evaluation is also linear. The system runs in real time for short melodies, enabling interactive usage. Advanced model components are guarded and degrade gracefully to the lightweight hybrid when unavailable.

% (Removed template Operator Names section)

\section{System Overview}
\textbf{Analysis.} Phrase detection, melodic contour/tension, harmonic/rhythmic cues.\\
\textbf{Prediction.} Lightweight hybrid of boundaries, tension, rhythmic heuristics; optional transformer.\\
\textbf{Cultural Adapter.} Presets for Western classical, jazz, Indian classical, African, East Asian, world music.\\
\textbf{User Controls.} Density, boundary vs. tension emphasis, groove preservation; explanations derived from analysis artifacts.

\begin{figure}[htbp]
\floatconts{fig:schematic}{\caption{Minimalist system schematic summarizing the end-to-end controllable silence pipeline enabling real-time user interaction.}}{\includegraphics[width=0.9\linewidth]{experiments_output/fig_system_schematic}}
\end{figure}

\begin{table}[t]
\floatconts{tab:cultural}{\caption{Cultural presets (example weights; 0--1 scale).}}{
\begin{tabular}{lccc}
\toprule
Preset & Boundary wt & Tension wt & Weak-beat wt \\
\midrule
Western classical & 0.8 & 0.4 & 0.2 \\
Jazz               & 0.4 & 0.7 & 0.8 \\
Indian classical   & 0.6 & 0.6 & 0.3 \\
\bottomrule
\end{tabular}}
\end{table}

\section{Experiments}
\label{sec:experiments}
\paragraph{Setup.} We test four canonical melodies (Twinkle, Mary Had a Little Lamb, Happy Birthday, C Major Scale) across three cultural presets (Western classical, jazz, Indian classical). We select canonical melodies to ensure interpretability and reproducibility across cultural presets. Methods: random 10\%, random 25\% (density-matched control), phrase-only, tension-only, weak-beats, no-silence-prompt proxy, and our hybrid. Metrics: $H, G, C$, and overall quality $Q$ with $(w_H, w_G, w_C)$ set to preserve groove and coherence while rewarding rhythmic variety.

\paragraph{Protocol and Statistics.} Each method is evaluated per melody and context, producing per-instance metrics written to CSV. We report means across melodies and contexts and visualize distributions (violin/box plots). For significance, we perform paired comparisons versus random baselines with two-tailed tests on per-instance $Q$, reporting $p$-values and Cohen's $d$ effect sizes; 95\% CIs are shown where space permits. Audio A/B examples accompany each melody for perceptual verification.

\paragraph{Results.} The hybrid consistently increases $H$ over random and single-strategy baselines while preserving $G$ (near-baseline) and maintaining high $C$. The no-silence-prompt proxy underperforms the hybrid, indicating the utility of explicit silence control. Heatmaps show robustness across cultural presets, with boundary emphasis particularly effective in Western classical and rhythmic emphasis effective in jazz. A summary table (auto-generated) reports mean$\pm$std $Q$ per method.

\paragraph{Ablations.} Removing boundary cues lowers $C$; removing tension cues reduces perceived anticipation; disabling groove constraint harms $G$; the no-silence-prompt proxy underperforms the hybrid in $Q$.

\paragraph{Audio Examples.} We provide WAVs for original, random, strategic (ours) synthesized from symbolic notes with a simple harmonic model, enabling qualitative inspection of temporal structure.



\subsection{Extended Analyses}
\textbf{Per-context robustness.} \figureref{fig:heat_mc} indicates robust $Q$ across presets, with stylistic differences aligning with cultural weights.\newline
\textbf{Multi-metric profile.} \figureref{fig:radar} shows consistent gains in $H$ while preserving $G$ and maintaining high $C$.

\subsection{Sensitivity Analysis}
We sweep weights $(w_H,w_G,w_C)$ and the density cap to examine stability of $Q$. \figureref{fig:sensw} shows mean $Q$ as a function of emphasizing each component in turn. Results indicate a broad plateau where $Q$ remains stable, with moderate emphasis on groove and coherence producing the most robust performance.

\subsection{Effect Sizes and CIs}
We compute paired differences in $Q$ per melody\,$\times$\,context against random\_25; \figureref{fig:delta} summarizes mean deltas. We report effect sizes in the text and provide confidence intervals in the supplementary table.

% (Removed template Tables section)

\subsection{Figures}
\label{sec:figures}

% (Removed template explanatory text)


% (Removed template includeteximage note)


% (Removed template sidewaysfigure note)

% (Removed template graphicspath note)

% (Removed template Sub-Figures section)


% Advanced experiment figures
\begin{figure}[htbp]
\floatconts{fig:avg_overall}{\caption{Average Overall Quality across methods.}}{\includegraphics[width=0.75\linewidth]{experiments_output/avg_overall_quality}}
\end{figure}

\begin{figure}[htbp]
\floatconts{fig:radar}{\caption{Method metric radar ($H$, $G$, $C$, Overall).}}{\includegraphics[width=0.75\linewidth]{experiments_output/fig_radar_metrics}}
\end{figure}

\begin{figure}[htbp]
\floatconts{fig:violin}{\caption{Overall Quality distributions by method.}}{\includegraphics[width=0.75\linewidth]{experiments_output/fig_violin_overall}}
\end{figure}

\begin{figure}[htbp]
\floatconts{fig:heat_mc}{\caption{Heatmap: Overall Quality by method x context.}}{\includegraphics[width=0.8\linewidth]{experiments_output/fig_heatmap_method_context}}
\end{figure}

\begin{figure}[htbp]
\floatconts{fig:ablation}{\caption{Ablations: phrase/tension/weak-beats vs. hybrid for each metric.}}{\includegraphics[width=0.9\linewidth]{experiments_output/fig_ablation_bars}}
\end{figure}

\begin{figure}[htbp]
\floatconts{fig:delta}{\caption{Delta Overall vs. random\_25 baseline (heatmap).}}{\includegraphics[width=0.8\linewidth]{experiments_output/fig_delta_vs_random}}
\end{figure}

\begin{figure}[htbp]
\floatconts{fig:sensw}{\caption{Sensitivity: $Q$ vs. weight emphasis (one component varied, others share remainder).}}{\includegraphics[width=0.75\linewidth]{experiments_output/fig_sensitivity_weights}}
\end{figure}


% Summary table
\input{experiments_output/table_results}

% Placeholder significance table (to be updated with computed stats)
\input{experiments_output/table_significance}

% Minimal polyphonic sanity table
\input{experiments_output/table_polyphony}

% (Removed template sub-figure examples)

% (Removed template Sub-Tables section)

% (Removed template Algorithms section)

\section{Discussion}
Strategic rests create space for breath, anticipation, and clarity. Results suggest a practical recipe: modest rest density aligned with phrase boundaries and selected tension points increases rhythmic variety without sacrificing pulse. Unlike black-box sequence models, our system exposes its internal heuristics, enabling both pedagogy and explainable composition. Beyond monophony, polyphonic extension invites voice-specific pacing, counter-rest design, and cadence-aware rests, with the same interpretability principles.

\section{Applications}
\textbf{Education and pedagogy.} Phrasing coach with real-time visual rests and A/B audio; assignments targeting boundary awareness.\newline
\textbf{Accessibility.} Visual pacing for hearing-impaired users and focused listening support.\newline
\textbf{Performance systems.} Live silence shaping for breathing points; adaptive rests by section.\newline
\textbf{Production workflows.} Arrangement gap suggestion for density control and tension sculpting.\newline
\textbf{Cross-cultural composition.} Preset-guided starting points refined with expert feedback.

\section{Implementation Details}
The analyzer, candidate generation, and metric evaluation are pure Python and operate in linear time in sequence length. Optional transformer components are lazily loaded and fully guarded; the system defaults to the lightweight hybrid when unavailable. The API exposes controls (density, weights) through a simple JSON schema; our CLI scripts provide one-command reproduction for experiments, figures, and audio. All figures included here were generated by these scripts against the released CSV. An anonymized code repository link will be provided upon acceptance.

\section{Failure Cases}
Our analysis reveals characteristic edge cases: (i) dense scalar runs with uniform metrical accents can over-encourage weak-beat rests unless groove emphasis is increased; (ii) anacruses and pickup notes may be misinterpreted as mid-phrase positions without explicit up-beat handling; (iii) highly syncopated lines benefit from stricter density caps to avoid local clustering of rests. In practice, increasing $w_G$ and reducing the density cap mitigates (i, iii), while a simple pickup detector addresses (ii). These analyses highlight the need for adaptive metrical priors and genre-specific density tuning, directions we plan to explore.

\section{Supplementary Materials and Artifacts}
We provide: (i) CSV of results with per-instance metrics; (ii) figures auto-generated from CSV; (iii) MIDI and WAV assets per melody\,$\times$\,method\,$\times$\,context under \texttt{assets/}; (iv) API/CLI for prediction and batch analysis. Cultural and context-aware considerations motivate our preset design \citep{jordanous2020creativity,passmore2023diversity}. See Appendix for concise reproduction steps.

\section{Ethics and Cultural Considerations}
We aim for respectful cultural adaptation: presets are conservative defaults, not replacements for expertise. Audio artifacts disclose processing; any generative assistance will be acknowledged in camera-ready. If user data is collected, we will obtain consent, minimize retention, and anonymize. We welcome expert feedback to refine presets and mitigate cultural bias.

\section{Limitations and Future Work}
Current evaluation focuses on monophonic symbolic inputs; extending to multi-voice textures and audio-first settings is future work. We also plan larger, preregistered listener studies with genre stratification, DAW integration for production workflows, cadence-aware detectors, and UI prototypes for parameter steering and explanation browsing. Finally, we will study personalization, learning user-specific rest preferences over time.

\section{Conclusion}
We introduced a human-centered, interpretable system that treats silence as a controllable, culturally aware compositional element. By elevating rests to first-class outputs, our method enables musicians and researchers to co-create musical space intentionally, balancing rhythmic variety, groove, and structure. Beyond its technical formulation, this work reframes silence as a creative decision rather than absence, inviting new exploration in composition, pedagogy, and AI-driven performance.

\acks{We thank musicians and reviewers whose feedback shaped the co-creative focus of this work.}
\clearpage
\bibliography{eaim}



\end{document}
