The Latent-First Laboratory: A Manifesto for Efficient, Audit-Based AI Science

Published: 30 May 2026, Last Modified: 30 May 2026 | ICML2026-AI4Science Poster | Everyone | CC BY 4.0
Track: Track 2: Dataset Proposal Competition
TL;DR: We propose an AI-native scientific data repository that stores data in machine-friendly latent form by default, with a small human-audit subset preserved for verification.
Abstract: High-throughput scientific workflows generate petabytes of visual data, which have historically been stored in fully decompressed, human-readable formats. However, standard "decode-then-classify" downstream analysis introduces severe computational and storage bottlenecks. As scientific discovery increasingly relies on autonomous AI agents, we argue that maintaining entire datasets in pixel space is an unsustainable artifact of human-centric science. In this position paper, we propose the AI-Native Repository, a paradigm in which the default state of scientific data is a compressed latent representation optimized for machine comprehension. Prior work suggests that learned or task-oriented compressed representations can preserve much of the semantic information needed for downstream inference, supporting the feasibility of latent-first pipelines. To address the trust gap between human researchers and autonomous systems, we also propose a Human Audit Slice policy, in which a small fraction of uncompressed data is preserved for manual verification and governance. Ultimately, we advocate a structural shift toward machine-native data architectures, establishing new community standards for efficient, scalable, and verifiable AI-driven science.
Keywords: AI-native repository, latent representations, scientific data management, machine-native workflows, human audit, compressed data, autonomous science
Submission Number: 130