OpenReview.net
ICLR 2026 Workshop Trustworthy AI Submissions
Neural Chameleons: Language Models Can Learn to Hide Their Thoughts from Unseen Activation Monitors
Max McGuinness, Alex Serrano, Luke Bailey, Scott Emmons
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Verbosity Tradeoffs and the Impact of Scale on the Faithfulness of LLM Self-Explanations
Noah Y. Siegel, Nicolas Heess, Maria Perez-Ortiz, Oana-Maria Camburu
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Mitigating Reward Hacking with RL Training Interventions
Aria Wong, Joshua Engels, Neel Nanda
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Patching LLMs Like Software: A Lightweight Method for Improving Safety Policies in Large Language Models
Huzaifa Arif, Pin-Yu Chen, Keerthiram Murugesan, Alex Gittens, Payel Das, Ching-Yun Ko
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Always Keep Your Promises: A Model-Agnostic Attribution Algorithm for Neural Networks
Kevin Lee, Duncan Halverson, Pablo Andres Millan Arias
Published: 02 Mar 2026, Last Modified: 11 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Moral Preferences of LLMs Under Directed Contextual Influence
Phil Blandfort, Tushar Karayil, Urja Pawar, Robert Graham, Alex McKenzie, Dmitrii Krasheninnikov
Published: 02 Mar 2026, Last Modified: 05 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
TrustLDM: Benchmarking Trustworthiness in Language Diffusion Model
Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang
Published: 02 Mar 2026, Last Modified: 03 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Expert-guided Clinical Text Augmentation via Query-Based Model Collaboration
Dongkyu Cho, Miao Zhang, Gregory D Lyng, Rumi Chunara
Published: 02 Mar 2026, Last Modified: 04 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Expert Selections In MoE Models Reveal (Almost) As Much As Text
Amir Nuriyev, Gabriel Kulp
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
GuardReasoner-Omni: A Reasoning-based Multi-modal Guardrail for Text, Image, and Video
Zhenhao Zhu, Yue Liu, Yanpei Guo, Wenjie Qu, Cancan Chen, Yufei He, Yibo Li, Yulin Chen, Tianyi Wu, Huiying Xu, Xinzhong Zhu, Jiaheng Zhang
Published: 02 Mar 2026, Last Modified: 12 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Training with Honeypots: Reshaping How LLMs Fail
Samuel Simko, Punya Syon Pandey, Zhijing Jin, Bernhard Schölkopf
Published: 02 Mar 2026, Last Modified: 12 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Tuning Just Enough: Lightweight Backdoor Attacks on Multi-Encoder Diffusion Models
Ziyuan Chen, Yujin Jeong, Tobias Braun, Anna Rohrbach
Published: 02 Mar 2026, Last Modified: 05 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Sparse Circuits of Vision Language Alignment
Huizhen Shu, xuying li
Published: 02 Mar 2026, Last Modified: 05 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Geometry-Aware Crossover for Effective and Efficient Evolutionary Attacks
Hyo Seo Kim, Gang Luo, Can Chen, Binghui Wang, Yue Duan, Ren Wang
Published: 02 Mar 2026, Last Modified: 04 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
LoRA Users Beware: A Few Spurious Tokens Can Manipulate Your Finetuned Model
Praney Goyal, Marcel Mateos Salles, Pradyut Sekhsaria, Hai Huang, Randall Balestriero
Published: 02 Mar 2026, Last Modified: 03 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
AdaptNC: Adaptive Nonconformity Scores for Uncertainty-Aware Autonomous Systems in Dynamic Environments
Renukanandan Tumu, Aditya Singh, Rahul Mangharam
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection
Lin Yulong, Pablo Bernabeu-Perez, Benjamin Arnav, Lennie Wells, Mary Phuong
Published: 02 Mar 2026, Last Modified: 12 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Position: Beyond Reasoning Zombies — AI Reasoning Requires Process Validity
Rachel Lawrence, Jacqueline R. M. A. Maasch
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment
Kundan Krishna, Joseph Yitan Cheng, Charles Maalouf, Leon Alexander Gatys
Published: 02 Mar 2026, Last Modified: 12 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Visual Disentangled Diffusion Autoencoders: Scalable Counterfactual Generation for Foundation Models
Sidney Bender, Marco Morik
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Benchmarking AI Control Protocols for Safety in Medical Question-Answering Tasks
Guido Freire, Agustín E. Martínez-Suñé, Viviana Cotik
Published: 02 Mar 2026, Last Modified: 10 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Bootstrapping-based Regularisation for Reducing Individual Prediction Instability in Clinical Risk Prediction Models
Sara Matijevic, Christopher Yau
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes
Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi
Published: 02 Mar 2026, Last Modified: 02 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Tightening Optimality Gap with Confidence through Conformal Prediction
Miao Li, Michael Klamkin, Russell Bent, Pascal Van Hentenryck
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Simple LLM Baselines are Competitive for Model Diffing
Elias Kempf, Simon Schrodi, Bartosz Cywiński, Thomas Brox, Neel Nanda, Arthur Conmy
Published: 02 Mar 2026, Last Modified: 11 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone