Token-Only Adaptation of Frozen Self-Supervised Vision Foundation Models for Cross-Species Animal Pose: A Pareto-Frontier Characterization Across Eight Held-Out Mammal Species

Ethan Y Wang; Aayan Alwani

Published: 28 May 2026, Last Modified: 28 May 2026, GenBio 2026 Poster, CC BY 4.0

Keywords: generative AI for biology, vision foundation models, animal pose estimation, cross-species generalization, parameter-efficient fine-tuning, frozen backbone, DINOv2, AP-10K, Pareto frontier, behavioral biology, ethology

TL;DR: Token-only adaptation of a frozen vision model matches decoder fine-tuning on 3/8 held-out species with roughly 10,000× fewer trainable parameters, enabling cheap cross-species animal pose estimation.

Abstract: Markerless animal pose estimation is the core generative-AI-for-biology instrument of quantitative ethology and behavioral neuroscience, with toolkits such as DeepLabCut, SLEAP, and SuperAnimal routinely deployed across dozens of species per lab. The dominant adaptation strategy in the wild, full or partial fine-tuning of a backbone per target species, updates millions of parameters per identity, which limits how cheaply a field lab can stand up a model for a new species in a new ecological setting and forecloses many-species deployment scenarios in cross-species behavioral studies and field surveys. We ask when token-only adaptation of a frozen vision foundation model suffices, using AP-10K (54 mammal species; 17-keypoint shared parent skeleton) as a controlled cross-species testbed in which identity is well-defined (species), the backbone is shared across identities (frozen DINOv2-base, 86 million parameters, never updated), and the downstream task is dense keypoint regression. We propose Identity-Token Adaptation: a per-identity learned token of dimension 768 conditions a small cross-attention decoder over frozen patch features; at inference on a held-out species, the method updates only the per-identity token and, optionally, a small subset of decoder weights. Within-species PCK at 0.05 normalized image diameter is 0.611 with a 7.94 million parameter decoder. On eight held-out species spanning maximum-cosine identity distances from 0.46 to 0.76, each evaluated over 10 random seeds, token-only adaptation produces statistically significant root-mean-squared-error reductions over the no-adapt mean-token baseline; the 95% paired-bootstrap confidence intervals for all eight species exclude zero. On three species (rabbit, fox, panther) the gain meets a pre-registered within-15% threshold against the no-adapt baseline.
On fox, the 768-parameter Identity-Token Adaptation point sits at a 1.03x RMSE ratio against an in-house decoder fine-tune (300 SGD steps, 8.5 million trainable parameters), inside the pre-registered within-15% tie threshold while using approximately 11,118x fewer trainable parameters per held-out species. Far-from-training species recover roughly a third to a half of decoder-fine-tune PCK at 0.05, consistent with a predicted floor effect. The adaptation gain is non-monotone in cosine identity distance: rabbit at cosine 0.55 gives the largest gain, while far-from-training primates show the smallest gains. Two architectural choices are load-bearing under ablation: cross-attention identity injection (the FiLM-only variant gives a null adaptation signal) and a token-utility margin auxiliary loss (without it, the per-coordinate standard deviation of predictions across random tokens collapses to the noise floor). A direct test of the hypothesis that learned-interpolation initialization beats random initialization in the zero- and one-shot regime falsifies it across three head architectures, identifying an information-bottleneck pattern in the frozen-backbone, shared-decoder design at our pilot scale; we report this negative result as is. The intended deployment use is direct: a behavioral or conservation lab can spin up a per-individual or per-species model with a few kilobytes of trainable state (768 parameters), on a laptop, without unfreezing a backbone and without paying cloud compute costs. We do not in this submission run head-to-head comparisons against LoRA, AdaptFormer, BitFit, IA-cubed, or VPT on the same eight-species grid; adding this parameter-efficient-fine-tuning baseline panel is the highest-leverage missing experiment. The full evaluation runs on Mac MPS at zero cloud cost; code, configurations, and trained weights are released to enable in-the-wild deployment in field-lab settings without specialized hardware.
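
The abstract reports 95% paired-bootstrap confidence intervals on the RMSE reduction over the no-adapt baseline. The following is a minimal sketch of one common way to compute such an interval (a percentile bootstrap over paired per-seed differences); it is not the authors' released code. The function name `paired_bootstrap_ci`, the resample count, and the per-seed RMSE values are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, adapted, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (baseline - adapted)."""
    if len(baseline) != len(adapted):
        raise ValueError("paired samples must have equal length")
    # Pairing: each difference comes from the same seed under both conditions.
    diffs = [b - a for b, a in zip(baseline, adapted)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the paired differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Made-up per-seed RMSE values for a single species (10 seeds); NOT results from the paper.
baseline_rmse = [0.120, 0.118, 0.125, 0.121, 0.119, 0.123, 0.122, 0.120, 0.124, 0.117]
adapted_rmse = [0.101, 0.103, 0.108, 0.100, 0.104, 0.105, 0.102, 0.099, 0.107, 0.098]

ci_low, ci_high = paired_bootstrap_ci(baseline_rmse, adapted_rmse)
excludes_zero = ci_low > 0 or ci_high < 0
print(f"95% CI for RMSE reduction: [{ci_low:.4f}, {ci_high:.4f}]")
print(f"interval excludes zero: {excludes_zero}")
```

Under this reading, "the interval excludes zero" means every plausible value of the mean RMSE reduction has the same sign. With only 10 seeds per species, a percentile interval is a rough approximation; the resampling unit actually used in the paper (seeds, images, or keypoints) is not stated in the abstract.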

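The parameter-efficiency and tie-threshold figures quoted for fox can be checked with simple arithmetic. The sketch below uses only numbers stated in the abstract; the helper `is_tie` and its reading of "within 15%" as an RMSE ratio of at most 1.15 are assumptions, since the abstract does not spell out the exact rule.

```python
TOKEN_PARAMS = 768             # per-identity token dimension, from the abstract
DECODER_FT_PARAMS = 8_500_000  # decoder fine-tune trainable parameters, as rounded in the abstract
REPORTED_RATIO = 11_118        # "approximately 11,118x fewer trainable parameters"
FOX_RMSE_RATIO = 1.03          # token-only RMSE divided by decoder-fine-tune RMSE on fox
TIE_THRESHOLD = 0.15           # pre-registered within-15% tie threshold


def is_tie(rmse_ratio, threshold=TIE_THRESHOLD):
    """Assumed rule: a tie means the RMSE ratio is at most 1 + threshold."""
    return rmse_ratio <= 1.0 + threshold


# Ratio computed from the rounded 8.5 million figure.
param_ratio = DECODER_FT_PARAMS / TOKEN_PARAMS
# Decoder parameter count implied by the reported ratio.
implied_ft_params = REPORTED_RATIO * TOKEN_PARAMS

print(f"ratio from rounded count: {param_ratio:,.0f}x")
print(f"parameter count implied by 11,118x: {implied_ft_params:,}")
print(f"fox counts as a tie under the assumed rule: {is_tie(FOX_RMSE_RATIO)}")
```

The rounded 8.5 million figure gives a ratio slightly below the reported one, which suggests the reported ratio was computed from the unrounded decoder parameter count.
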
Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 151
