Keywords: Plasmid, DNA, Language model, RL, LLM
TL;DR: We introduce a generative DNA language model designed for natural-language plasmid generation.
Abstract: Generative DNA models are typically next-token completers: they extend a sequence but offer no native interface for telling the model what to make. PlasmidLM is a promptable DNA language model for plasmids. A designer supplies a human-readable component specification, for example a high-copy E. coli vector with kanamycin resistance and an EGFP reporter, and the model generates the corresponding multi-kilobase construct in a single autoregressive pass. Prompts are unordered sets of named-part tokens at the granularity of biological shorthand, not learned latent codes or rigid grammars. We evaluate outputs along two axes: a sequence is viable if it is structurally plausible as a plasmid, and faithful if its detected components match the prompt. The fraction of outputs that are both viable and faithful is the useful-plasmid rate, the primary metric we report. On a held-out 1,000-prompt benchmark, the post-trained model achieves a useful-plasmid rate of 48.5% with single-shot decoding and 89.7% under best-of-4 sampling. Verifiable-reward post-training with GRPO against a 660-entry sequence motif registry improves the useful-plasmid rate across all sampling budgets. We release the 19.3M-parameter model, the evaluation suite, and a paired benchmark of prompt-sequence pairs.
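
The useful-plasmid rate described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the released evaluation suite: the `Sample` record, the part names, the reading of "match" as exact set equality between detected and prompted components, and the reading of best-of-k as "at least one of the first k samples is useful".

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One generated sequence, reduced to its two evaluation verdicts (hypothetical record)."""
    viable: bool               # assumed output of a structural-plausibility check
    detected: FrozenSet[str]   # assumed names of components detected in the sequence


def is_useful(sample: Sample, prompt_parts: FrozenSet[str]) -> bool:
    # Faithful is assumed to mean exact set equality with the prompt;
    # a subset-based definition would be an equally plausible reading.
    faithful = sample.detected == prompt_parts
    return sample.viable and faithful


def useful_rate(benchmark: List[Tuple[FrozenSet[str], List[Sample]]], k: int) -> float:
    # A prompt counts as a hit at budget k if any of its first k samples is useful.
    hits = sum(
        1
        for prompt_parts, samples in benchmark
        if any(is_useful(s, prompt_parts) for s in samples[:k])
    )
    return hits / len(benchmark)


# Toy benchmark with made-up part names and verdicts (not real results).
prompt_a = frozenset({"ori_high_copy", "KanR", "EGFP"})
prompt_b = frozenset({"AmpR"})
benchmark = [
    (prompt_a, [
        Sample(viable=True, detected=frozenset({"ori_high_copy", "KanR"})),  # viable, not faithful
        Sample(viable=True, detected=prompt_a),                              # viable and faithful
    ]),
    (prompt_b, [
        Sample(viable=False, detected=prompt_b),  # faithful, not viable
        Sample(viable=False, detected=prompt_b),
    ]),
]

rate_single_shot = useful_rate(benchmark, k=1)
rate_best_of_2 = useful_rate(benchmark, k=2)
```

In this toy data no prompt succeeds on its first sample, while the first prompt succeeds on its second, so the rate rises with the sampling budget, which is the qualitative pattern the abstract reports between single-shot and best-of-4 decoding.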
Submission Number: 218