ICLR 2026 Workshop Trustworthy AI Submissions
Test-Time Training Undermines Existing Safety Guardrails
Simone Antonelli, Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski
Published: 02 Mar 2026, Last Modified: 05 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
OmniPatch: A Universal Adversarial Patch for ViT-CNN Cross-Architecture Transfer in Semantic Segmentation
Aarush Aggarwal, Akshat Tomar, Amritanshu Tiwari, Sargam Goyal
Published: 02 Mar 2026, Last Modified: 11 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
The Rogue Scalpel: Activation Steering Compromises LLM Safety
Anton Korznikov, Andrey V. Galichin, Alexey Dontsov, Oleg Rogov, Ivan Oseledets, Elena Tutubalina
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Fairness Failure Modes of Multimodal LLMs
Canyu Chen, Anglin Cai, Joan Nwatu, Jianshu Zhang, Yale Li, Han Liu, Jessica Hullman, Rada Mihalcea, Kathleen McKeown, Manling Li
Published: 02 Mar 2026, Last Modified: 12 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Efficient Test-Time Scaling of Multi-Step Reasoning by Probing Internal States of Large Language Models
Jingwei Ni, Ekaterina Fadeeva, Tianyi Wu, Mubashara Akhtar, Jiaheng Zhang, Elliott Ash, Markus Leippold, Timothy Baldwin, See-Kiong Ng, Artem Shelmanov, Mrinmaya Sachan
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Attention Sinks in Diffusion Language Models
Maximo Eduardo Rulli, Simone Petruzzi, Edoardo Michielon, Fabrizio Silvestri, Simone Scardapane, Alessio Devoto
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
RedBench: A Universal Dataset for Comprehensive Red Teaming of Large Language Models
Quy-Anh Dang, Chris Ngo, Truong-Son Hy
Published: 02 Mar 2026, Last Modified: 04 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Query Circuits: Explaining How Language Models Answer User Prompts
Tung-Yu Wu, Fazl Barez
Published: 02 Mar 2026, Last Modified: 04 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Mitigating Legibility Tax with Decoupled Prover-Verifier Games
Yegon Kim, Juho Lee
Published: 02 Mar 2026, Last Modified: 03 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
MONITORING EMERGENT REWARD HACKING DURING GENERATION VIA INTERNAL ACTIVATIONS
Patrick Wilhelm, Thorsten Wittkopp, Odej Kao
Published: 02 Mar 2026, Last Modified: 04 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Calibrated Predictive Lower Bounds on Time-to-Unsafe-Sampling in LLMs
Hen Davidov, Shai Feldman, Gilad Freidkin, Yaniv Romano
Published: 02 Mar 2026, Last Modified: 02 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Few-Shot Adversarial Low-Rank Fine-Tuning of Vision-Language Models
Sajjad Ghiasvand, Haniyeh Ehsani Oskouie, Mahnoosh Alizadeh, Ramtin Pedarsani
Published: 02 Mar 2026, Last Modified: 03 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
RouterInterp: Superposed Specialisation in MoE Routing
Ilya Lasy, Nora Yinuo Cai, Kola Ayonrinde
Published: 02 Mar 2026, Last Modified: 03 Apr 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
BackFed: A Standardized and Efficient Benchmark Framework for Backdoor Attacks in Federated Learning
Thinh Dao, Thuy Dung Nguyen, Khoa D Doan, Kok-Seng Wong
Published: 02 Mar 2026, Last Modified: 10 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Secure Code Generation via Online Reinforcement Learning with Vulnerability Reward Model
Tianyi Wu, Mingzhe Du, Yue Liu, Chengran Yang, Terry Yue Zhuo, Jiaheng Zhang, See-Kiong Ng
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Efficient Refusal Ablation in LLM through Optimal Transport
geraldin nanfack, Elvis Dohmatob
Published: 02 Mar 2026, Last Modified: 10 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Towards Statistical Verification for Trustworthy AI
Blossom Metevier, Max Springer, Bohdan Turbal, Aleksandra Korolova
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Offline Reinforcement Learning of High-Quality Behaviors Under Robust Style Alignment
Mathieu Petitbois, Rémy Portelas, Sylvain Lamprier
Published: 02 Mar 2026, Last Modified: 03 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Beyond Idealized Patients: Evaluating LLMs under Challenging Patient Behaviors in Medical Consultations
Yahan Li, Xinyi Jie, Wanjia Ruan, Xubei Zhang, Huaijie ZHU, Yicheng Gao, Ruishan Liu
Published: 02 Mar 2026, Last Modified: 12 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
Erin Tan, Judy Hanwen Shen, Irene Y. Chen
Published: 02 Mar 2026, Last Modified: 02 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
VisualScratchpad: Inference-time Visual Concepts Analysis in Vision Language Models
Hyesu Lim, Jinho Choi, Taekyung Kim, Byeongho Heo, Jaegul Choo, Dongyoon Han
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
Beyond Static Truthfulness Benchmarks: Two Truths and One Lie for Multi-Agent Deception and Detection
Jason Kong, Lanxiang Hu, Flavio Ponzina, Tajana Rosing
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
BarrierSteer: LLM Safety via Learning Barrier Steering
Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
The Realignment Problem: When Right becomes Wrong in LLMs
Aakash Sen Sharma, Debdeep Sanyal, Manodeep Ray, Vivek Srivastava, Shirish Karande, Murari Mandal
Published: 02 Mar 2026, Last Modified: 07 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone
DELTA-CROSSCODER: ROBUST CROSSCODER IN NARROW FINE-TUNING REGIMES
Aly M. Kassem, Thomas Jiralerspong, Negar Rostamzadeh, Golnoosh Farnadi
Published: 02 Mar 2026, Last Modified: 06 Mar 2026
ICLR 2026 Trustworthy AI
Readers: Everyone