TASK-RELEVANT FEATURES OUTPERFORM LEARNED REPRESENTATIONS FOR DRUG-MICROBIOME RETRIEVAL

04 Feb 2026 (modified: 04 Mar 2026) · Submitted to ICLR 2026 Workshop LMRL · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Track: tiny / short paper (up to 5 pages)
Keywords: microbiome, representation learning, cross-context generalisation, functional annotation
TL;DR: For drug-microbiome retrieval, biologically informed features (enzyme profiles) beat a pretrained foundation model — feature choice matters more than model scale.
Abstract: Which representation best organises drugs by mechanism of action from microbiome perturbation data? We compare eight representations spanning two feature types (101 genera, 2,538 enzyme commission numbers), three compression methods (raw, PCA, VAE), and a pretrained foundation model (MGM). Within a single community, no representation significantly outperforms others. In leave-one-community-out cross-validation over 8 communities, the six taxonomy- and EC-based representations are statistically indistinguishable (MAP@10 0.48–0.51 at ATC level 2, all pairwise p > 0.35), while MGM trails significantly (p ≤ 0.02). The compression method does not matter; the input features do. EC profiles are more conserved across communities than taxonomy profiles (Wilcoxon p = 3.5×10⁻¹²), yet both feature families outperform the foundation model.
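The abstract's headline metric is MAP@10 at ATC level 2. The paper's exact retrieval protocol is not given on this page, so the following is only a minimal sketch of how MAP@k is conventionally computed over ranked neighbour lists; the function names and the assumption that relevance means "shares the query drug's ATC class" are illustrative, not taken from the paper:

```python
import numpy as np

def average_precision_at_k(retrieved_labels, query_label, k=10):
    """AP@k: mean of precision@i over the ranks i <= k where a
    relevant item (same class label as the query) is retrieved."""
    hits = 0
    precisions = []
    for i, label in enumerate(retrieved_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precisions.append(hits / i)
    return float(np.mean(precisions)) if precisions else 0.0

def map_at_k(all_retrieved, query_labels, k=10):
    """MAP@k: AP@k averaged over all queries. `all_retrieved[j]` is the
    ranked list of neighbour labels for query j (query itself excluded)."""
    return float(np.mean([
        average_precision_at_k(r, q, k)
        for r, q in zip(all_retrieved, query_labels)
    ]))
```

Under this convention, a query whose top-3 neighbours have labels ["A", "B", "A"] against a query label "A" scores AP@3 = (1/1 + 2/3)/2 ≈ 0.833.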
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Presenter: ~Saif_Ur-Rehman1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 33