Track: long paper (up to 8 pages)
Keywords: behavioral cloning, scientific annotation, imitation learning, autoregressive models, multi-task pretraining, synthetic benchmarks
TL;DR: Behavioral cloning can train models to imitate how human experts annotate scientific data, and this paper introduces synthetic tasks to systematically study when and why this approach succeeds or fails.
Abstract: Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies, including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data-efficient but exhibit worse decision-making despite similar placement accuracy. Third, multi-task pretraining enables efficient fine-tuning to new tasks, whereas training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process, such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 27