Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment

Published: 28 May 2026, Last Modified: 28 May 2026, GenBio 2026 Poster, CC BY 4.0
Keywords: biological language models, contrastive learning, logit-space alignment, variant ranking, protein-ligand binding, TCR-peptide, mutation scoring
Abstract: Pretrained biological language models expose per-token probability distributions through masked-token prediction, providing the likelihood interface central to sequence design, variant scoring, and mechanistic interpretation. Yet these distributions are learned from broad unlabeled corpora and are not naturally conditioned on task-specific biological contexts such as interaction partners, cellular environments, or therapeutic interventions. Existing contextual matching methods distort this interface with pooled embeddings, contrastive latent spaces, or task-specific prediction heads. We introduce LogiCA (Logit-space Contrastive Alignment), a framework for context-conditioned prediction that adapts pretrained biological language models by performing contrastive learning directly in output-logit space. Using gated cross-modal adapters compatible with each model’s native token head, LogiCA preserves the pretrained per-token likelihood interface and converts contextualized token log-likelihoods into matching scores. Alignment is therefore defined through context-sensitive token probabilities rather than proximity in a shared embedding space. This enables learning from sparse paired data across models with distinct vocabularies, without requiring a shared tokenizer, decoder, or embedding space. LogiCA is particularly effective for mutation-local variant ranking, where variant comparisons reduce to context-conditioned likelihoods of mutant tokens at perturbed sites. Across protein–ligand binding, TCR–peptide activity, and drug-conditioned resistance prediction, LogiCA yields substantial gains over prior state-of-the-art methods, including matched latent-contrastive and conditional-MLM baselines, while retaining a token-level interface for interpretation and generation. On held-out-gene single-mutation drug-resistance prediction, LogiCA raises AUC from ~0.55, the near-random level of latent-space baselines, to ~0.65.
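The idea of turning context-conditioned token log-likelihoods into matching scores can be illustrated with a minimal sketch. The abstract does not specify the scoring rule or the contrastive objective, so everything below is an assumption made for illustration: the mean-log-likelihood score, the InfoNCE-style loss, the log-likelihood-ratio form of the mutation score, and all function names are hypothetical and are not taken from the LogiCA implementation. The logits are assumed to come from a pretrained token head after some context-conditioning adapter, which is not modeled here.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def matching_score(logits: Sequence[Sequence[float]], tokens: Sequence[int]) -> float:
    """Mean log-likelihood of the observed tokens (assumed scoring rule).

    logits[t] holds the vocabulary logits at position t, assumed to be
    produced by the native token head under a given context.
    """
    per_token = [log_softmax(row)[tok] for row, tok in zip(logits, tokens)]
    return sum(per_token) / len(per_token)


def info_nce(scores: Sequence[Sequence[float]], temperature: float = 1.0) -> float:
    """InfoNCE-style loss over a score matrix (assumed objective).

    scores[i][j] is the matching score of sequence i under context j;
    true pairs are assumed to lie on the diagonal.
    """
    losses = []
    for i, row in enumerate(scores):
        scaled = [s / temperature for s in row]
        losses.append(-log_softmax(scaled)[i])
    return sum(losses) / len(losses)


def mutation_score(site_logits: Sequence[float], wt: int, mut: int) -> float:
    """Log-likelihood ratio of mutant vs. wild-type token at one perturbed site."""
    lp = log_softmax(site_logits)
    return lp[mut] - lp[wt]
```

Under these assumptions, a batch of paired sequences and contexts would be scored with `matching_score` for every pairing, the resulting matrix passed to `info_nce`, and variants at a single site ranked by `mutation_score` computed from the context-conditioned logits at that site. Because each score is read from a model's own token distribution, no shared vocabulary or embedding space is needed in this sketch, in line with the abstract's description.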
Submission Number: 123