OrgAccess: A Benchmark for Role‑Based Access Control in Organization Scale LLMs

07 Apr 2025 (modified: 30 Oct 2025) · Submitted to NeurIPS 2025 Datasets and Benchmarks Track · CC BY-NC-ND 4.0
Keywords: Organization Scale LLMs, Role‑Based Access Control (RBAC), Permission levels, LLMs
TL;DR: We release the first dataset that captures diverse, organization-level access-control policies for large-scale LLMs.
Abstract: Role-based access control (RBAC) and hierarchical structures are foundational to how information flows and decisions are made within virtually all organizations. As the potential of Large Language Models (LLMs) to serve as unified knowledge repositories and intelligent assistants in enterprise settings becomes increasingly apparent, a critical yet underexplored challenge emerges: can these models reliably understand and operate within the complex, often nuanced, constraints imposed by organizational hierarchies and associated permissions? Evaluating this crucial capability is inherently difficult due to the proprietary and sensitive nature of real-world corporate data and access control policies. To address this significant barrier and provide a realistic testbed, we collaborated extensively with professionals from diverse organizational structures and backgrounds to develop a novel, synthetic yet representative benchmark. This benchmark meticulously defines 40 distinct types of permissions commonly relevant across different organizational roles and levels. It is designed to test LLMs' ability to accurately assess these permissions and generate responses that strictly adhere to the specified hierarchical rules, particularly in scenarios involving users with overlapping or conflicting permissions—a common source of real-world complexity. We rigorously evaluate LLMs of various sizes from multiple providers and report their performance on the benchmark in detail. Surprisingly, our findings reveal that even state-of-the-art LLMs struggle significantly to maintain compliance with role-based structures, even with explicit instructions, and their performance degrades further when navigating interactions involving two or more conflicting permissions. Notably, GPT-4.1 achieves an F1-score of only 0.27 on our hardest benchmark.
This demonstrates a critical limitation in LLMs' complex rule-following and compositional reasoning capabilities beyond standard factual or STEM-based benchmarks, opening up a new paradigm for evaluating their fitness for practical, structured environments. Our benchmark thus serves as a vital tool for identifying weaknesses and driving future research towards more reliable and hierarchy-aware LLMs.
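To make the evaluated setting concrete, the sketch below illustrates the kind of RBAC decision the benchmark probes: a user holding overlapping roles requests a permission, and conflicting grants must be resolved. The role names, permission names, and the deny-overrides conflict policy here are illustrative assumptions, not the benchmark's actual schema.

```python
# Illustrative RBAC check with overlapping roles (hypothetical roles
# and permissions; deny-overrides conflict resolution is an assumption).
ROLE_PERMISSIONS = {
    "engineer":   {"read_design_docs": True,  "read_payroll": False},
    "manager":    {"read_design_docs": True,  "read_payroll": True},
    "contractor": {"read_design_docs": False, "read_payroll": False},
}

def is_allowed(roles, permission):
    """Deny-overrides: any explicit deny among the user's roles wins;
    permissions no role mentions default to deny."""
    decisions = [
        ROLE_PERMISSIONS[r][permission]
        for r in roles
        if r in ROLE_PERMISSIONS and permission in ROLE_PERMISSIONS[r]
    ]
    if not decisions:
        return False  # default deny
    return all(decisions)  # a single False (explicit deny) blocks access

# A user who is both manager and contractor: manager grants payroll
# access, contractor explicitly denies it, so deny-overrides blocks it.
print(is_allowed(["manager", "contractor"], "read_payroll"))    # False
print(is_allowed(["engineer", "manager"], "read_design_docs"))  # True
```

An LLM acting as an enterprise assistant must implicitly perform this kind of resolution from natural-language policy descriptions, which is exactly where the benchmark finds models failing.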
Croissant File: json
Dataset URL: https://huggingface.co/datasets/respai-lab/orgaccess
Code URL: https://github.com/respailab/orgaccess
Supplementary Material: pdf
Primary Area: Datasets & Benchmarks for applications in language modeling and vision language modeling
Submission Number: 50