Keywords: Genome, Foundation Model, Library
TL;DR: The first integrated Python library for tuning, deploying, and interpreting genomic models.
Abstract: We introduce Genome-Factory, the first integrated Python library for tuning, deploying, and interpreting genomic models.
Our core contribution is to simplify and unify the workflow for genomic model development: data collection, model tuning, inference, benchmarking, and interpretability.
For data collection, Genome-Factory offers an automated pipeline to download genomic sequences and preprocess them.
It also includes quality-control steps such as GC-content normalization.
For model tuning, Genome-Factory supports three approaches: full-parameter, low-rank adaptation, and adapter-based fine-tuning.
It is compatible with a wide range of genomic models.
For inference, Genome-Factory enables both embedding extraction and DNA sequence generation.
For benchmarking, we include two existing benchmarks and provide a flexible interface for users to incorporate additional benchmarks.
For interpretability, Genome-Factory introduces the first open-source biological interpreter based on a sparse auto-encoder.
This module disentangles embeddings into sparse, near-monosemantic latent units and links them to genomic features by regressing on external readouts.
To improve accessibility, Genome-Factory offers a zero-code command-line interface and a user-friendly web interface.
We validate the utility of Genome-Factory across three dimensions:
(i) Compatibility with diverse models and fine-tuning methods;
(ii) Benchmarking downstream performance using two open-source benchmarks;
(iii) Biological interpretation of learned representations with DNABERT-2.
These results highlight its end-to-end usability and practical value for real-world genomic analysis.
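As an illustration of the GC-content quality control mentioned in the abstract, the following is a minimal sketch in plain Python. It is not Genome-Factory's API: the function names `gc_fraction` and `filter_by_gc` and the 0.3 to 0.7 thresholds are hypothetical, and the abstract does not say how the library performs normalization. The sketch only shows one common form of such a check, which is to compute the GC fraction of each sequence and keep those within a chosen range.

```python
from typing import Iterable, List


def gc_fraction(seq: str) -> float:
    """Return the fraction of unambiguous bases (A/C/G/T) that are G or C.

    Ambiguous symbols such as N are ignored; a sequence with no
    unambiguous bases yields 0.0.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def filter_by_gc(seqs: Iterable[str], low: float = 0.3, high: float = 0.7) -> List[str]:
    """Keep sequences whose GC fraction lies in [low, high].

    The default thresholds are illustrative assumptions, not values
    taken from Genome-Factory.
    """
    return [s for s in seqs if low <= gc_fraction(s) <= high]


if __name__ == "__main__":
    sequences = ["AAAA", "ATGC", "GGGG"]
    for s in sequences:
        print(s, gc_fraction(s))
    print(filter_by_gc(sequences))
```

In this sketch, sequences with extreme base composition are dropped rather than reweighted; a pipeline that resamples sequences to match a target GC distribution would be an equally plausible reading of "normalization".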
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 642