Optimizing Genomic Language Models for Efficient Training, Fine-Tuning, and Inference

Published: 02 Mar 2026, Last Modified: 08 May 2026, MLGenX 2026 Poster, CC BY 4.0
Abstract: Genomic language models are widely used as general-purpose encoders for sequences, enabling transfer to tasks such as regulatory-element prediction, variant scoring, and genome-wide annotation. In practice, however, both deployment and adaptation are frequently constrained by GPU memory and throughput, especially for long inputs, heterogeneous sequence lengths, and larger checkpoints. We present a unified, efficiency-oriented implementation for inference and fine-tuning of two encoder-only model families: GENA-LM (110M, 336M) and Nucleotide Transformer (500M, 2.5B). Our framework combines (i) IO-aware attention kernels (FlashAttention) that avoid materializing the full attention matrix, (ii) padding-free variable-length execution via sequence packing, (iii) token-budget batching to stabilize utilization under mixed lengths, and (iv) parameter-efficient adaptation using LoRA. We additionally support optional memory knobs, including activation checkpointing, ZeRO stage-2 CPU offload, and 4-bit/8-bit quantization. Across four model sizes, we validate numerical fidelity via pseudo-perplexity agreement with a reference implementation, characterize inference memory and throughput scaling with sequence length and batch size, and isolate training-time memory contributions in a progressive ablation. On downstream binary classification benchmarks (promoter and enhancer prediction), the efficient implementation yields considerable fine-tuning speedups and peak-memory reductions while maintaining broadly comparable task performance. Finally, we quantify quantization-induced drift by measuring embedding cosine similarity across model sizes, highlighting model-dependent trade-offs.
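As a concrete illustration of the token-budget batching idea mentioned in the abstract, below is a minimal sketch, not the paper's implementation: the function name `token_budget_batches` and the default budget are assumptions for illustration. The idea is to group examples of mixed length so that the padded token count of each batch stays under a fixed budget, stabilizing GPU utilization.

```python
# Hypothetical helper: greedy token-budget batching for variable-length sequences.
from typing import List

def token_budget_batches(lengths: List[int], token_budget: int = 16384) -> List[List[int]]:
    """Group example indices so that (batch size x max length in batch) <= token_budget."""
    # Sort indices by sequence length so similarly sized examples share a batch,
    # minimizing padding when the batch is later collated.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, current = [], []
    for idx in order:
        max_len = max(lengths[i] for i in current + [idx])
        # Adding this example would exceed the budget: close the current batch.
        if current and (len(current) + 1) * max_len > token_budget:
            batches.append(current)
            current = [idx]
        else:
            current.append(idx)
    if current:
        batches.append(current)
    return batches

# Example: six sequences with heterogeneous lengths and an 8k-token budget.
print(token_budget_batches([512, 4096, 1024, 8000, 768, 2048], token_budget=8192))
```

Sorting by length before grouping is a common choice because it keeps each batch's maximum length close to its members' lengths, so little of the budget is spent on padding; whether the paper packs sequences without padding at all (as its padding-free execution suggests) is not shown here.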
Submission Number: 49