Trustless Feedback: Preserving Privacy and Integrity in Crowdsourced RLHF via Zero-Knowledge Proofs

Published: 06 Apr 2026, Last Modified: 06 Apr 2026 · ZABAPAD 2026 Poster · CC BY 4.0
Keywords: Reinforcement Learning from Human Feedback, Zero-Knowledge Proofs, Privacy-Preserving Machine Learning, Data Integrity, Verifiable Crowdsourcing
Abstract: Reinforcement Learning from Human Feedback (RLHF) is essential for aligning Large Language Models (LLMs) with human intent. However, in high-stakes domains such as healthcare and precision engineering, expert contributors are often reluctant to share raw data due to strict privacy regulations and intellectual property concerns. Existing decentralized approaches, while preserving privacy, fail to guarantee the computational integrity of the feedback, leaving models vulnerable to data poisoning and lazy annotation. In this work, we propose Trustless Feedback, a novel protocol that leverages Zero-Knowledge Proofs (ZK-SNARKs) to verify the validity of preference labels without revealing the underlying sensitive inputs. We formalize the expert evaluation process as an arithmetic circuit enforcing both hard safety constraints (e.g., drug interactions, geometric collisions) and soft utility scores. We demonstrate the feasibility of our approach through a prototype implementation for engineering design validation, showing that our protocol can mathematically guarantee 100% rejection of invalid data while maintaining tractable proof generation times.
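To make the abstract's formulation concrete, the sketch below illustrates the kind of relation such an arithmetic circuit could enforce: a preference label over two candidate designs is valid only if both satisfy hard safety constraints and the label agrees with the soft utility ordering. All names here (`Design`, `min_clearance`, the utility weights) are illustrative assumptions, not the paper's actual circuit definition, and the check is written as plain Python rather than a ZK-SNARK circuit.

```python
# Illustrative sketch (not the paper's implementation) of the statement
# a ZK circuit could prove without revealing the designs themselves.
from dataclasses import dataclass

MIN_CLEARANCE_MM = 5  # hard-constraint threshold (assumed value)

@dataclass
class Design:
    clearance_mm: int  # smallest gap between parts
    weight_g: int
    strength: int

def passes_hard_constraints(d: Design) -> bool:
    # Hard constraint, e.g. no geometric collision:
    # clearance must stay above the threshold.
    return d.clearance_mm >= MIN_CLEARANCE_MM

def utility(d: Design) -> int:
    # Soft score: reward strength, penalise weight (illustrative weights).
    return 10 * d.strength - d.weight_g

def label_is_valid(a: Design, b: Design, prefer_a: bool) -> bool:
    # The relation the circuit would enforce: both candidates pass the
    # hard checks, and the label matches the soft utility ordering.
    if not (passes_hard_constraints(a) and passes_hard_constraints(b)):
        return False
    return prefer_a == (utility(a) >= utility(b))

good = Design(clearance_mm=8, weight_g=100, strength=50)
weak = Design(clearance_mm=8, weight_g=100, strength=20)
unsafe = Design(clearance_mm=1, weight_g=50, strength=90)

print(label_is_valid(good, weak, prefer_a=True))    # True
print(label_is_valid(good, unsafe, prefer_a=True))  # False: hard check fails
```

In a real deployment the expert would publish only a proof that `label_is_valid` holds for their hidden inputs, which is what lets the aggregator reject invalid labels without ever seeing the designs.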
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 10