Synergistic Benefits of Joint Molecule Generation and Property Prediction

Published: 05 Jan 2026 · Last Modified: 05 Jan 2026 · Accepted by TMLR · License: CC BY 4.0
Abstract: Modeling the joint distribution of data samples and their properties makes it possible to construct a single model for both data generation and property prediction, with synergistic benefits reaching beyond purely generative or predictive models. However, training joint models presents daunting architectural and optimization challenges. Here, we propose Hyformer, a transformer-based joint model that successfully blends generative and predictive functionalities, using an alternating attention mechanism and a joint pre-training scheme. We show that Hyformer is simultaneously optimized for molecule generation and property prediction, while exhibiting synergistic benefits in conditional sampling, out-of-distribution property prediction and representation learning. Finally, we demonstrate the benefits of joint learning in a drug design use case of discovering novel antimicrobial peptides.
Certifications: J2C Certification
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. We report the mean and standard deviation over 3 random seeds in Tables 1, 4 and 8, and confidence intervals in Table 6. Moreover, in Tables 2 and 5, we revised the bolding of the results to account for statistical significance at the 95% confidence level. We modified the corresponding captions and the discussion of the results accordingly. We used Welch's t-test as our statistic and intentionally left the bolding in Tables 1, 3, 4 and 6 unchanged, as competing methods did not report standard deviations.
2. We added an explicit list of the task and metric definitions used throughout the paper in Benchmark Task Definitions (Appendix F) and Benchmark Task Metrics (Appendix G). Moreover, we corrected the metric names FCD and KL Div. to FCD score and KL Div. score, respectively, and resolved issues with "Diversity" in Section 5.3.
3. We corrected the following typos:
   - the "the the" repetition noted by the reviewer;
   - we replaced the word "pre-train" with "jointly pre-train" throughout Section 5;
   - we corrected the typo in Section 5.3, changing "Arginine (R) and Arginine (K)" to "Arginine (R) and Lysine (K)";
   - we corrected the caption of Figure 2 to "\textbf{(a)} Amino-acid distributions between the pre-training data and unconditionally generated sequences.".
4. We added dataset versions or acquisition dates in Appendices F and H.
5. We added a justification for the higher-than-expected ClinTox scores (99.5) in Table 3.
6. We added missing information on the generation process of the AMPs in Appendix I.6.
7. We added an extended novelty statement in Appendix C.
8. We added acceptance rates and predictor calibration for conditional sampling in Table 9, with a corresponding discussion in Appendix I.1.
9. We extended the Discussion in Appendix C with a framing of property prediction as a purely generative task and with molecular interaction considerations.
10. We replaced Lemma 4.1 with: "As a simple consequence of Bayes' rule, the above procedure yields a correct conditional sampling procedure, as $$p(\mathbf{x} \mid y \in Y) \propto p(y \in Y \mid \mathbf{x})\,p(\mathbf{x}),$$ for $y \in Y \subseteq \mathcal{Y}$ such that $p(y \in Y) > 0$." Additionally, we removed the proof of Lemma 4.1 from the Appendix.
11. We included Acknowledgments in Appendix B.

We thank the Reviewers for all the constructive suggestions that helped improve the manuscript.
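The replaced Lemma 4.1 says that drawing $\mathbf{x} \sim p(\mathbf{x})$ from the generative head and accepting it with probability $p(y \in Y \mid \mathbf{x})$ from the predictive head yields samples from $p(\mathbf{x} \mid y \in Y)$. A minimal rejection-sampling sketch of this procedure follows, using toy stand-ins for the two heads: `sample_prior` and `predict_prob` are hypothetical placeholders, not Hyformer's actual API.

```python
import random

random.seed(0)

# Hypothetical stand-in for the generative head: sample x ~ p(x),
# here uniform over the integers 0..9.
def sample_prior():
    return random.randrange(10)

# Hypothetical stand-in for the predictive head: p(y in Y | x),
# here monotonically increasing in x.
def predict_prob(x):
    return x / 9.0

def conditional_sample(n_samples):
    """Rejection sampling: draw x ~ p(x), accept with probability
    p(y in Y | x). By Bayes' rule, accepted samples follow
    p(x | y in Y) ∝ p(y in Y | x) p(x)."""
    accepted = []
    while len(accepted) < n_samples:
        x = sample_prior()
        if random.random() < predict_prob(x):
            accepted.append(x)
    return accepted

samples = conditional_sample(1000)
```

With this toy predictor the accepted samples skew toward large `x`, since the acceptance probability grows with `x`; the empirical mean rises well above the prior mean of 4.5. The acceptance rate equals $p(y \in Y)$, which is why the lemma requires $p(y \in Y) > 0$.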
Code: https://github.com/szczurek-lab/hyformer
Assigned Action Editor: ~Adam_Arany1
Submission Number: 5773