Track: Tiny Paper Track
Keywords: Phylogenetic classification, Hierarchical softmax, Protein taxonomy, Evolutionary modeling, Deep learning, Transformer models, ESM, Sequence classification, Taxonomic lineage
TL;DR: We introduce HATax-ESM, a hierarchical softmax-based deep learning model that classifies protein sequences according to phylogenetic lineages.
Abstract: Understanding the evolutionary relationships between protein sequences is crucial for phylogenetic classification, mutation prediction, and functional annotation. The NCBI Taxonomy database contains over one million distinct taxa, but many classes have very few representative sequences, creating extreme class imbalance. Traditional sequence similarity-based methods like BLAST are widely used for taxonomic classification but are computationally expensive and ineffective for de novo sequences without close homologs.
Recent deep learning approaches, such as PhyloTransformer and TEMPO, have leveraged transformer-based architectures for phylogenetic tasks. However, these models do not impose explicit hierarchical constraints, which limits their ability to ensure phylogenetic consistency. Inspired by phylogenetic tree-guided learning, we introduce HATax-ESM, a model that combines a frozen ESM feature extractor with attention pooling and hierarchical softmax-based classification.
HATax-ESM improves computational efficiency while enforcing structured taxonomic predictions. By conditioning the classification probabilities at each taxonomic level on the parent taxon, our model ensures predictions follow valid phylogenetic lineages, making it a robust alternative to traditional similarity-based methods. This structured approach allows for better generalization, particularly for underrepresented taxa, enhancing protein classification and evolutionary inference.
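The level-wise conditioning described above can be illustrated with a minimal sketch. This is not the authors' implementation: the toy taxonomy `CHILDREN`, the helper names `softmax`, `lineage_probabilities`, and `predict`, and the example scores are all hypothetical, and plain Python stands in for the ESM encoder and attention pooling, which are omitted. The sketch only shows the general idea of hierarchical softmax over a tree: a softmax is taken among the children of each taxon, and the probability of a lineage is the product of those conditional probabilities along its path.

```python
import math

# Hypothetical toy taxonomy (an assumption for illustration only):
# each parent taxon maps to the children that are valid beneath it.
CHILDREN = {
    "root": ["Bacteria", "Eukaryota"],
    "Bacteria": ["Proteobacteria", "Firmicutes"],
    "Eukaryota": ["Chordata", "Arthropoda", "Streptophyta"],
}


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def lineage_probabilities(logits):
    """Map each root-to-leaf lineage to its joint probability.

    `logits` maps a taxon name to a raw score, standing in for the
    output of a classifier head. At every node the softmax is taken
    only over that node's children, so P(lineage) is a product of
    conditional probabilities, one per level.
    """
    results = {}

    def walk(node, path, prob):
        kids = CHILDREN.get(node)
        if not kids:  # leaf taxon: record the finished lineage
            results[tuple(path)] = prob
            return
        conditional = softmax([logits[k] for k in kids])
        for kid, p in zip(kids, conditional):
            walk(kid, path + [kid], prob * p)

    walk("root", [], 1.0)
    return results


def predict(logits):
    """Return the most probable lineage; valid by construction."""
    probs = lineage_probabilities(logits)
    return max(probs, key=probs.get)


# Made-up scores, not results from any model.
example_logits = {
    "Bacteria": 0.2, "Eukaryota": 1.5,
    "Proteobacteria": 2.0, "Firmicutes": 0.1,
    "Chordata": 0.3, "Arthropoda": 1.1, "Streptophyta": -0.4,
}
example_probs = lineage_probabilities(example_logits)
example_prediction = predict(example_logits)
```

Because every lineage is built by descending through `CHILDREN`, a child can never be paired with the wrong parent, and the lineage probabilities sum to one. In this toy example the highest raw leaf score belongs to Proteobacteria, yet the predicted lineage is Eukaryota then Arthropoda, because the score at the upper level favors Eukaryota.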
Submission Number: 35