% Appendix-ready LaTeX snippet for the EMO-STA family-adaptation overview.

\subsection{How We Adapted Each Benchmark Family for EMO-STA}
\label{app:emo-sta-family-adaptations}

For every benchmark family, we converted the original single-task setup into a small collection of related subtasks that share one evolving artifact, one evaluator family, and the same shared-then-adapt workflow. In the shared phase, the evolving program is optimized against the average score across the family. We then spawn task-specific continuations from the shared archive and compare them against direct single-task baselines with matched interfaces and budgets. The main design goal in each adaptation is to expose a reusable structural prior that can be discovered during shared optimization, while still leaving enough task-specific variation for adaptation to matter. In this appendix summary, we describe all of the EMO-STA benchmark families used in our study except \texttt{r\_robust\_regression} and the original \texttt{k\_module\_problem} variant, which we omit for brevity.
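
As a compact restatement of the shared phase, the family-level objective can be written as below. The symbols \(p\) (a candidate program), \(K\) (the number of subtasks in a family), and \(s_k\) (the evaluator score on subtask \(k\)) are notation introduced here for illustration only; they are not names used by the benchmark code.
\[
  F_{\mathrm{shared}}(p) \;=\; \frac{1}{K} \sum_{k=1}^{K} s_k(p).
\]
Under this notation, each task-specific continuation optimizes a single \(s_k\), starting from programs drawn from the shared archive, whereas the single-task baseline optimizes the same \(s_k\) without that starting point.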

\paragraph{Function minimization.}
We adapted the original standalone \texttt{SinCosXY} example into a four-task family of public two-dimensional objectives: \texttt{SinCosXY}, \texttt{Ackley}, \texttt{Rastrigin}, and \texttt{Rosenbrock}. Rather than allowing the evolving code to adapt to a single named landscape, the EMO-STA version requires one generic derivative-free optimizer that receives \texttt{objective\_fn} and \texttt{bounds} from the evaluator. We also translate the benchmark functions before evaluation so that the exact optima are not trivially exposed to the candidate. This turns the original benchmark into a family-level optimization problem over reusable search heuristics rather than a single hardcoded objective.
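
One way to read the translation step is sketched below. The shift vector \(\delta\) and the notation \(\tilde f\) are assumptions introduced for illustration; the text above does not specify how the shift is chosen or how the bounds are adjusted.
\[
  \tilde f(x) \;=\; f(x - \delta),
  \qquad
  \arg\min_{x} \tilde f(x) \;=\; x^{\star} + \delta
  \quad \mbox{where } x^{\star} = \arg\min_{x} f(x).
\]
Under this reading, a candidate that hardcodes a textbook optimum, such as the origin for \texttt{Rastrigin}, no longer lands on the minimizer of the evaluated function whenever \(\delta \neq 0\).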

\paragraph{Signal processing.}
We adapted the original signal-processing benchmark, which evaluated a single algorithm across several synthetic signal types, into four explicit EMO-STA subtasks: trend+sine, multifrequency, chirp, and step changes. All tasks use the same causal interface, \texttt{process\_signal(noisy\_signal, window\_size)}, and each subtask fixes its own signal length and noise regime. The candidate observes only the noisy input signal and never sees the clean target, task identifier, or generating formula. We excluded the random-walk case from the EMO-STA family so that the shared tasks remain closely related while still exhibiting distinct denoising and trend-recovery behavior.

\paragraph{Circle packing.}
We adapted the original fixed-\(n\) AlphaEvolve-style unit-square circle-packing benchmark into a family over nearby circle counts. The EMO-STA training tasks use \(n \in \{20, 22, 24, 26\}\), all through a shared \texttt{construct\_packing(n)} interface, so the evolving code must implement a reusable packing strategy rather than a solver tailored to one specific \(n\). A key design choice is that the evaluator uses known per-task reference totals \texttt{target\_sum\_radii}, and the shared phase scores candidates by the average normalized target ratio \(\texttt{sum\_radii} / \texttt{target\_sum\_radii}\) across tasks. This matters because absolute sums of radii are not directly comparable across different \(n\): larger \(n\) permits larger totals, so averaging raw sums would not define a coherent shared objective and would bias evolution toward the tasks with more circles. The known per-\(n\) targets instead put the tasks on a common scale and encourage the shared phase to discover packing logic that is simultaneously close to each task-specific optimum. We also defined evaluation-only holdouts at \(n \in \{21, 23, 25\}\) to measure whether the shared representation transfers to unseen but nearby problem sizes.
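
Written out, the shared score described above takes the following form. The index set \(\mathcal{T}\), the superscript \((n)\), and the name \(S_{\mathrm{shared}}\) are notation assumed here for illustration; only \texttt{sum\_radii} and \texttt{target\_sum\_radii} come from the evaluator.
\[
  S_{\mathrm{shared}}
  \;=\;
  \frac{1}{|\mathcal{T}|}
  \sum_{n \in \mathcal{T}}
  \frac{\texttt{sum\_radii}^{(n)}}{\texttt{target\_sum\_radii}^{(n)}},
  \qquad
  \mathcal{T} = \{20, 22, 24, 26\}.
\]
A candidate that matches every reference total contributes a term equal to \(1\) for each \(n\), independent of how large the raw sum is, which is what makes the four terms comparable.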

\paragraph{Circle packing in rectangles.}
We also define a second circle-packing EMO-STA family that keeps the same shared constructive structure but changes the container geometry. In this family, the evolving program still implements a generic \texttt{construct\_packing(n)} interface, but it must now choose a rectangle width \(\alpha\), use height \(2-\alpha\), and pack circles into a perimeter-4 rectangle whose aspect ratio is itself part of the optimization problem. The four public tasks use \(n \in \{20, 21, 22, 23\}\). Although each individual task still seeks to maximize the sum of radii subject to the rectangle constraints, the shared phase again uses known task-specific \texttt{target\_sum\_radii} values and averages normalized target ratios rather than raw sums. That normalization is essential here because attainable absolute sums vary with both \(n\) and the geometry of the best rectangle, so raw summed radii do not provide a useful family-level signal for shared evolution. Using known targets makes the tasks commensurate and encourages the shared phase to learn geometry-aware packing behavior that transfers across nearby \(n\) values and across changing aspect ratios.
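
The geometry of this family can be made explicit as follows. The symbols \((x_i, y_i)\) and \(r_i\) for circle centers and radii, and the choice to place one rectangle corner at the origin, are assumptions introduced for illustration; the constraints are the standard containment and non-overlap conditions rather than a transcription of the evaluator.
\[
  2\bigl(\alpha + (2 - \alpha)\bigr) \;=\; 4,
\]
\[
  r_i \le x_i \le \alpha - r_i,
  \qquad
  r_i \le y_i \le (2 - \alpha) - r_i,
  \qquad
  (x_i - x_j)^2 + (y_i - y_j)^2 \;\ge\; (r_i + r_j)^2 \quad (i \neq j).
\]
The first identity confirms that the perimeter is \(4\) for every admissible width. The per-task objective is then to maximize \(\sum_{i=1}^{n} r_i\) jointly over the centers, the radii, and \(\alpha \in (0, 2)\).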

\paragraph{Heilbronn triangle.}
We adapted the Heilbronn triangle benchmark into a four-task EMO-STA family over nearby point counts inside one fixed canonical unit-area triangle with vertices \((0,0)\), \((2,0)\), and \((0,1)\). The evolving code uses a generic \texttt{construct\_points(n)} / \texttt{run\_heilbronn(n)} interface and must maximize the minimum triangle area induced by all triples of points. As in the circle-packing families, the evaluator uses known task-specific reference values, here \texttt{target\_min\_area}, and shared evolution optimizes the average normalized target ratio \(\texttt{min\_triangle\_area} / \texttt{target\_min\_area}\) across tasks rather than raw minimum areas. This matters because the attainable optimum changes substantially with \(n\): as more points are placed in the same canonical triangle, the best achievable minimum area decreases, so averaging raw areas would let the smaller-\(n\) tasks, whose attainable areas are larger, dominate the shared objective. The normalization instead puts the \(n \in \{9, 10, 11, 12\}\) tasks on a common scale and encourages the shared phase to learn reusable geometric placement structure across nearby sizes before adaptation retunes the construction for one exact value of \(n\).
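
Two small checks make the setup above concrete. The point notation \(p_i = (x_i, y_i)\) and the name \(A_{\min}\) are assumptions introduced here; the area expression is the standard determinant formula, not a transcription of the evaluator. First, the container has area \(\frac{1}{2} \cdot 2 \cdot 1 = 1\), as stated. Second, the quantity being maximized is
\[
  A_{\min}(p_1, \dots, p_n)
  \;=\;
  \min_{1 \le i < j < k \le n}
  \frac{1}{2}
  \bigl| (x_j - x_i)(y_k - y_i) - (x_k - x_i)(y_j - y_i) \bigr| .
\]
The minimum ranges over \(n(n-1)(n-2)/6\) triples, that is \(84\), \(120\), \(165\), and \(220\) triples for \(n = 9, 10, 11, 12\) respectively.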

\paragraph{K-module balanced.}
We adapted the original public 4-module, 5-option K-module problem into a harder hidden-family EMO-STA benchmark. The EMO-STA version uses 6 modules with 6 opaque options each and defines four hidden target tasks. The family is constructed so that each task agrees with the shared consensus on exactly half of the modules (three of six), which creates a shared optimum that is genuinely useful but not identical to any one task-specific optimum. We intentionally hide both task identities and target configurations from prompts and artifacts, so the evolving code must learn a generic compositional strategy that can later be retuned for one specific hidden target.
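
The increase in difficulty can be quantified by simple counting, assuming (as we read the description above) that each module's option is chosen independently of the others:
\[
  5^{4} = 625
  \quad\longrightarrow\quad
  6^{6} = 46656 .
\]
These are counts of distinct configurations, not evaluation budgets. With agreement on \(6/2 = 3\) modules, a program that reaches the shared consensus still has three modules left to change for any given hidden target.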

\paragraph{Symbolic regression.}
We adapted the original generated-per-problem LLM-SRBench symbolic-regression workflow into a single narrow EMO-STA family built from the \texttt{phys\_osc} subset. Instead of generating a separate program, evaluator, and configuration for each equation, the EMO-STA version uses one shared evaluator and one generic interface over four oscillator tasks (\texttt{PO11}, \texttt{PO17}, \texttt{PO30}, and \texttt{PO37}). All tasks expose the same input variables \((x, t, v)\) and the same target \texttt{dv\_dt}. The evaluator provides only numerical datasets through this interface and does not reveal the ground-truth equations, so the shared phase must discover reusable modeling structure rather than memorize symbolic task-specific formulas.

\paragraph{SLDBench-3D.}
We adapted the original seven-task SLDBench into a two-task EMO-STA subset containing \texttt{vocab\_scaling\_law} and \texttt{data\_constrained\_scaling\_law}. Both tasks are canonicalized to the same three-column schema, \([\texttt{model\_size\_like}, \texttt{diversity\_like}, \texttt{total\_data\_like}]\), so the evolving code learns one reusable law form and one fitting routine rather than separate ones for each task-specific raw schema. During evaluation, group-local coefficients are always refit on the local data, which means EMO-STA shares the law code and fitting procedure, not the fitted coefficients. This preserves the intended scaling-law structure while making the family amenable to shared optimization.
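
The split between what is shared and what is refit can be sketched as follows. All symbols are assumptions introduced for illustration: \(f\) denotes the evolved law form, \(z_i\) the three canonical input columns of example \(i\), \(y_i\) its target, \(g\) a data group, and \(\ell\) a fitting loss. The text above does not specify the loss or the optimizer.
\[
  \hat{\theta}_g
  \;=\;
  \arg\min_{\theta} \sum_{i \in g} \ell\bigl(f(z_i; \theta),\, y_i\bigr),
  \qquad
  \hat{y}_i = f(z_i; \hat{\theta}_g) \quad (i \in g).
\]
In this notation, the shared phase evolves \(f\) and the procedure that computes \(\hat{\theta}_g\), while the coefficients \(\hat{\theta}_g\) themselves are recomputed for every group and never carried across tasks.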

\paragraph{Rust adaptive sort.}
We adapted the original standalone Rust sorting benchmark into four deterministic regimes: random, nearly sorted, reverse sorted, and duplicates. The EMO-STA evaluator compiles each candidate implementation once and then benchmarks it on task-specific datasets drawn from the selected regime, so one \texttt{adaptive\_sort} implementation is reused across all tasks. This forces the shared phase to discover broadly useful control logic for switching among sorting behaviors, while still allowing task-specific adaptation to calibrate the implementation to one regime. We intentionally omitted the original \texttt{partially\_sorted} regime from the initial family and reserve it as a possible holdout or generalization check.
