Understanding Protein-DNA Interactions by Paying Attention to Protein and Genomics Foundation Models

Dhruva Rajwade; Erica Wang; Aryan Satpathy; Alexander Brace; Hongyu Guo; Arvind Ramanathan; Shengchao Liu; Anima Anandkumar

Published: 13 Oct 2024; Last Modified: 01 Dec 2024; Venue: AIDrugX Poster; License: CC BY 4.0

Keywords: Protein-DNA interactions, Cross-Attention, Binding map prediction, Finetuning, Protein language models, Genomics language models

TL;DR: We couple protein and DNA foundation models with a cross-attention module to infer protein-DNA binding at single-amino-acid and single-nucleotide resolution.

Abstract: Protein-nucleic acid (NA) interactions play a key role in gene regulation. There is strong motivation to understand these interactions, with the goal of engineering them to solve biological problems. Current methods for quantifying protein-NA interactions are mainly experimental and are costly in both time and money. To mitigate this, deep learning methods have recently been applied to predict protein-DNA contacts. Although promising, these methods are computationally expensive and face challenges in accuracy. To address these challenges, we propose Seq2Contact, a novel method that predicts protein-NA binding at the level of single nucleotides (DNA) and single amino acids (protein). Seq2Contact uses protein and DNA foundation models to obtain amino acid-specific and nucleotide-specific embeddings, and then applies a cross-attention module to obtain the binding contact maps. We use a sequence-similarity-based clustering method to split the train and test data, and show empirically that Seq2Contact achieves state-of-the-art performance, beating existing baselines by almost 20% (F1-score) for protein-NA binding prediction. Our method is also more computationally efficient, with up to 80% less memory cost and more than 90% less inference time. Code is available at https://github.com/DhruvaRajwade/Seq2Contact
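
To make the idea of a residue-by-nucleotide contact map concrete, the following is a minimal sketch, not the Seq2Contact implementation. It assumes that per-residue and per-nucleotide embeddings are already available (random arrays stand in for foundation model outputs here) and scores every residue-nucleotide pair with scaled dot-product cross-attention logits followed by a per-pair sigmoid. The function name, the projection matrices `w_q` and `w_k`, all dimensions, and the 0.5 threshold are hypothetical choices made for illustration.

```python
import numpy as np


def cross_attention_contact_map(protein_emb, dna_emb, w_q, w_k):
    """Score every (residue, nucleotide) pair.

    protein_emb: (n_residues, d_protein) per-residue embeddings
    dna_emb:     (n_nucleotides, d_dna) per-nucleotide embeddings
    w_q, w_k:    projections into a shared d_model-dimensional space
    Returns an (n_residues, n_nucleotides) matrix with values in (0, 1).
    """
    queries = protein_emb @ w_q                      # (n_residues, d_model)
    keys = dna_emb @ w_k                             # (n_nucleotides, d_model)
    # Scaled dot-product logits, one per residue-nucleotide pair.
    logits = queries @ keys.T / np.sqrt(queries.shape[-1])
    # Assumption: a per-pair sigmoid turns logits into contact scores.
    return 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
# Random stand-ins for foundation model embeddings (hypothetical sizes).
protein_emb = rng.normal(size=(12, 32))   # 12 residues
dna_emb = rng.normal(size=(20, 16))       # 20 nucleotides
# Untrained projections; in practice these would be learned.
w_q = rng.normal(size=(32, 8)) * 0.1
w_k = rng.normal(size=(16, 8)) * 0.1

contact_map = cross_attention_contact_map(protein_emb, dna_emb, w_q, w_k)
# Pairs whose score exceeds a hypothetical 0.5 threshold.
binding_pairs = np.argwhere(contact_map > 0.5)
print(contact_map.shape)
```

Each row of `contact_map` corresponds to one amino acid and each column to one nucleotide, which is the resolution the abstract describes. With untrained random projections the scores carry no biological meaning; the sketch only shows the shape of the computation.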

Submission Number: 44
