Entrophy: User Interaction Data from Live Enterprise Workflows for Realistic Model Evaluation

ICLR 2026 Conference Submission 21183 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, License: CC BY 4.0
Keywords: Human-computer interactions, Enterprise AI, digital interactions, multimodal data, automation
TL;DR: We introduce a high-fidelity dataset of digital interactions derived from real, complex business workflows, designed to enable more realistic training and evaluation of AI models.
Abstract: AI-driven automation for complex enterprise workflows faces significant hurdles due to the lack of publicly available datasets that realistically capture how business processes unfold, interaction by interaction, within actual production environments. Existing datasets are typically synthetic, confined to sandbox settings, or restricted to short web-based processes, limiting the preparedness of AI models for real-world complexities encountered in finance, legal, HR, and other critical domains. To bridge this gap, we introduce $\textbf{\texttt{ENTROPHY}}$, the first openly available dataset capturing detailed, end-to-end recordings of authentic enterprise processes. Experienced finance, legal, and HR professionals conducted 283 real-world workflow executions, totaling 33 hours of interactive activity across 19 diverse platforms spanning modern SaaS tools, web pages, and legacy desktop software. Each digital interaction is comprehensively logged alongside rich UI context and visual screen captures. Crucially, $\textbf{\texttt{ENTROPHY}}$ captures not just structured process flows (and the overlap between them), but also the authentic, often messy dynamics of human work: multitasking, interruptions, off-process behaviors, and natural variability across users. By emphasizing fine-grained user interactions as a primary data modality, $\textbf{\texttt{ENTROPHY}}$ provides a foundation for building AI systems capable of handling the nuances of real-world work in enterprise environments. As a first application, we benchmark frontier language models on workflow classification and boundary-accurate stream segmentation tasks, both central to enterprise automation, revealing substantial headroom for improvement. We make the dataset available at: https://www.kaggle.com/datasets/94647fd0bb51dff501a463674a2314627cdaf8c76d41b093c333b608459e017e.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 21183