Enhanced BioT5+ for Molecule-Text Translation: A Three-Stage Approach with Data Distillation, Diverse Training, and Voting Ensemble

Published: 06 Jul 2024, Last Modified: 28 Jul 2024
Venue: Language and Molecules Workshop, ACL 2024 (Oral)
License: CC BY 4.0
Keywords: BioT5+, data distillation, ensemble learning, molecule-text translation
Abstract: This paper presents our enhanced BioT5+ method for the Language + Molecules shared task at the ACL 2024 Workshop. The task involves ``translating'' between molecules and natural language, comprising molecule captioning and text-based molecule generation on the \textit{L+M-24} dataset. Our method consists of three stages. In the first stage, we distill data from various models. In the second stage, combined with the \textit{extra} version of the provided dataset, we train diverse models for a subsequent voting ensemble, and further enhance these base models with Transductive Ensemble Learning (TEL). Lastly, all models are integrated using a voting ensemble method. Experimental results demonstrate that our enhanced BioT5+ achieves superior performance on the \textit{L+M-24} dataset. On the final leaderboard\footnote{\url{https://language-plus-molecules.github.io/\#leaderboard}}, our method (team name: \textbf{qizhipei}) ranks \textbf{first} in the text-based molecule generation task and \textbf{second} in the molecule captioning task, highlighting its efficacy and robustness in translating between molecules and natural language. The pre-trained BioT5+ models are available at \url{https://github.com/QizhiPei/BioT5}.
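The abstract does not spell out the voting criterion used in the final stage. As a rough illustration only, a common consensus-voting scheme for generative ensembles selects the candidate that agrees most, on average, with the other members' outputs; the sketch below is a minimal, hypothetical version of that idea (the `vote` helper and the string-similarity measure are assumptions, not the authors' implementation):

```python
# Minimal consensus-voting sketch for a generation ensemble.
# NOTE: `vote` and the similarity measure are illustrative assumptions,
# not the authors' actual voting method.
from difflib import SequenceMatcher


def vote(candidates: list[str]) -> str:
    """Return the candidate most similar, on average, to all others."""
    def sim(a: str, b: str) -> float:
        return SequenceMatcher(None, a, b).ratio()

    best, best_score = candidates[0], float("-inf")
    for i, c in enumerate(candidates):
        others = [o for j, o in enumerate(candidates) if j != i]
        score = sum(sim(c, o) for o in others) / max(len(others), 1)
        if score > best_score:
            best, best_score = c, score
    return best


# Example: SMILES candidates from three ensemble members for one prompt.
print(vote(["CCO", "CCO", "CCN"]))  # -> "CCO", the consensus candidate
```

The same selection rule applies symmetrically to the captioning direction, with generated captions in place of SMILES strings.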
Archival Option: The authors of this submission want it to appear in the archival proceedings.
Submission Number: 5