Exploring validation metrics for offline model-based optimisation with diffusion models

Published: 14 Jun 2024, Last Modified: 14 Jun 2024. Accepted by TMLR.
Abstract: In model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of reward with respect to a black-box function called the (ground truth) oracle, which is expensive to compute since it involves executing a real-world process. In offline MBO we wish to do so without assuming access to such an oracle during training or validation, which makes evaluation non-straightforward. While an approximation to the ground truth oracle can be trained and used in its place during model validation to measure the mean reward over generated candidates, this evaluation is approximate and vulnerable to adversarial examples. Measuring the mean reward of generated candidates over this approximation is one such 'validation metric'; we are interested in the more fundamental question of which validation metrics correlate the most with the ground truth. Answering it involves proposing validation metrics and quantifying them over many datasets for which the ground truth is known, for instance simulated environments. This is encapsulated in our proposed evaluation framework, which is also designed to measure extrapolation, the ultimate goal behind leveraging generative models for MBO. While our evaluation framework is model-agnostic, we specifically evaluate denoising diffusion models due to their state-of-the-art performance, and derive insights such as a ranking of the most effective validation metrics and a discussion of important hyperparameters.
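The core evaluation idea in the abstract can be sketched in a few lines: treat each model checkpoint as producing a batch of candidates, score those candidates with both an approximate oracle (the validation metric) and the true oracle (available here because the datasets are simulated), and measure how well the two rankings agree via Pearson correlation. This is a minimal illustrative sketch, not the paper's actual code; the oracle functions and data below are hypothetical stand-ins.

```python
import numpy as np

def mean_reward(candidates, oracle):
    """Mean predicted reward of a batch of candidates under a given oracle."""
    return float(np.mean([oracle(x) for x in candidates]))

def metric_ground_truth_correlation(checkpoints, approx_oracle, true_oracle):
    """Pearson correlation between a validation metric (mean reward under the
    approximate oracle) and the ground-truth mean reward, computed across a
    set of generative-model checkpoints."""
    val = np.array([mean_reward(c, approx_oracle) for c in checkpoints])
    gt = np.array([mean_reward(c, true_oracle) for c in checkpoints])
    return float(np.corrcoef(val, gt)[0, 1])

# Toy demonstration: five "checkpoints", each a batch of 64 candidates,
# shifted so that later checkpoints have genuinely higher reward.
rng = np.random.default_rng(0)
checkpoints = [rng.normal(size=(64, 8)) + i for i in range(5)]
approx = lambda x: float(np.sum(x))                          # stand-in proxy oracle
true = lambda x: float(np.sum(x)) + rng.normal(scale=0.1)    # noisy "ground truth"
r = metric_ground_truth_correlation(checkpoints, approx, true)
```

A validation metric that ranks checkpoints the same way the true oracle does yields a correlation near 1; the paper asks this question for several candidate metrics (agreement, Fréchet distance, etc.) across multiple datasets.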
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
# Third revision (26/05/2024)
- Minor revisions as per the AC's comments and recommendations.
# Second revision (24/10/2023)
- Updated the abstract and introduction for clarity, as well as Figure 2.
- Replaced all occurrences of 'score' with 'reward' for the $y$ variable, since the former can be confused with the definition of 'score' in the context of score-based generative models, i.e. $\nabla_{x} \log p(x)$.
- Some minor formatting changes.
# First revision
## New results
- We added classifier guidance results for Superconductor in the main table of results (Table 3). We also added our results for Hopper50 as an extra column. (Note that there are no numbers quoted from Design Bench in this column, since Hopper50 is a modification of the original Hopper dataset to be compatible with our evaluation framework.)
- The Pearson correlation plots (Figure 5) are now shown for all four datasets.
- Re-ran the classifier guidance (c.g.) experiments for Ant and Kitty due to a bug where the wrong oracle was used during generation.
## Paper clarity
- The table describing the five validation metrics was simplified to be more straightforward and save space.
- A new table was added describing each of the four datasets, e.g. the value of $\gamma$ and the number of training and testing examples.
- 4/5 of the validation metrics are now explained in one section.
- A substantial change is that diffusion models are now introduced at the _very start_ of the paper rather than in the experiments section of the old draft. This matters because we restructured the paper to emphasise that, while the proposed evaluation framework is _model-agnostic_, we have specifically chosen to explore diffusion models as they are currently state-of-the-art in many generative modelling tasks and are therefore worth exploring in offline MBO. We have also updated the paper title to reflect this.
- In Section 3 (Related Work) we discuss the relationship between this work and Bayesian optimisation, which is a common approach to online MBO.
- Most importantly, we have made the take-away message of this work clearer: it is stated in Section 1.2 ("Contributions"), re-emphasised in Section 4.1 ("Results"), and again in the conclusion. In summary:
  - The top three performing metrics, in descending order, are agreement, Fréchet distance, and the validation score.
  - Empirically, classifier-free guidance and classifier-based guidance perform roughly equally well, but the latter is a convenient parameterisation if one wants to view the diffusion model as a form of pre-training for _online MBO_ (which uses, for instance, Bayesian probabilistic models).
  - Consistent with other work in generative modelling, the trade-off between sample quality and sample diversity is important; this is also shown qualitatively in the Pearson correlation scatterplots when points are coloured by the guidance hyperparameter $w$.
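The guidance hyperparameter $w$ mentioned above controls the quality/diversity trade-off in classifier-free guidance. As a reference point, the standard combination (Ho & Salimans) of the conditional and unconditional noise predictions is sketched below; the function name and array inputs are hypothetical and the paper's actual implementation may differ in detail.

```python
import numpy as np

def cfg_eps(eps_uncond, eps_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional noise
    prediction towards (and past) the conditional one.
    w = 0 recovers the purely conditional prediction; larger w pushes
    samples further towards the conditioning signal, trading diversity
    for sample quality."""
    return (1.0 + w) * eps_cond - w * eps_uncond

# w = 0 leaves the conditional prediction unchanged; w > 0 amplifies
# the difference between the conditional and unconditional predictions.
e_u = np.array([0.0, 0.5])
e_c = np.array([1.0, 1.0])
guided = cfg_eps(e_u, e_c, w=2.0)
```

Colouring the Pearson correlation scatterplots by this $w$, as the paper does, makes the quality/diversity trade-off visible across checkpoints.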
Code: https://github.com/christopher-beckham/validation-metrics-offline-mbo
Assigned Action Editor: ~Kevin_Swersky1
Submission Number: 1171