ICLR 2025 Workshop BuildingTrust Submissions
Enhancing LLM Robustness to Perturbed Instructions: An Empirical Study
Aryan Agrawal, Lisa Alazraki, Shahin Honarvar, Marek Rei
Published: 05 Mar 2025, Last Modified: 03 Apr 2025
BuildingTrust
Readers: Everyone
Measuring In-Context Computation Complexity via Hidden State Prediction
Vincent Herrmann, Róbert Csordás, Jürgen Schmidhuber
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Towards Understanding Distilled Reasoning Models: A Representational Approach
David D. Baek, Max Tegmark
Published: 05 Mar 2025, Last Modified: 25 Mar 2025
BuildingTrust
Readers: Everyone
Understanding (Un)Reliability of Steering Vectors in Language Models
Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, Dmitrii Krasheninnikov
Published: 05 Mar 2025, Last Modified: 31 Mar 2025
BuildingTrust
Readers: Everyone
AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security
Zikui Cai, Shayan Shabihi, Bang An, Zora Che, Brian R. Bartoldson, Bhavya Kailkhura, Tom Goldstein, Furong Huang
Published: 05 Mar 2025, Last Modified: 17 Apr 2025
BuildingTrust
Readers: Everyone
Adaptive Test-Time Intervention for Concept Bottleneck Models
Matthew Shen, Aliyah R. Hsu, Abhineet Agarwal, Bin Yu
Published: 05 Mar 2025, Last Modified: 14 Apr 2025
BuildingTrust
Readers: Everyone
Endive: A Cross-Dialect Benchmark for Fairness and Performance in Large Language Models
Abhay Gupta, Jacob Cheung, Philip Meng, Shayan Sayyed, Austen Liao, Kevin Zhu, Sean O'Brien
Published: 05 Mar 2025, Last Modified: 06 Mar 2025
BuildingTrust
Readers: Everyone
Diagnostic Uncertainty: Teaching Language Models to Describe Open-Ended Uncertainty
Brian Sui, Jessy Lin, Michelle Li, Anca Dragan, Dan Klein, Jacob Steinhardt
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?
Maciej Chrabaszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzcinski
Published: 05 Mar 2025, Last Modified: 09 Apr 2025
BuildingTrust
Readers: Everyone
LM Agents May Fail to Act on Their Own Risk Knowledge
Yuzhi Tang, Tianxiao Li, Elizabeth Li, Chris J. Maddison, Honghua Dong, Yangjun Ruan
Published: 05 Mar 2025, Last Modified: 16 Apr 2025
BuildingTrust
Readers: Everyone
Temporally Sparse Attack for Fooling Large Language Models in Time Series Forecasting
Fuqiang Liu, Sicong Jiang
Published: 05 Mar 2025, Last Modified: 01 Apr 2025
BuildingTrust
Readers: Everyone
Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information
Zhengmian Hu, Gang Wu, Saayan Mitra, Ruiyi Zhang, Tong Sun, Heng Huang, Viswanathan Swaminathan
Published: 05 Mar 2025, Last Modified: 06 Mar 2025
BuildingTrust
Readers: Everyone
A Benchmark for Scalable Oversight Mechanisms
ICLR 2025 Workshop BuildingTrust Submission110 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Top of the CLASS: Benchmarking LLM Agents on Real-World Enterprise Tasks
Michael Wornow, Vaishnav Garodia, Vasilis Vassalos, Utkarsh Contractor
Published: 05 Mar 2025, Last Modified: 16 Apr 2025
BuildingTrust
Readers: Everyone
Copilot Evaluation Harness: Building User Trust in LLMs and LM Agents for IDE Environments
ICLR 2025 Workshop BuildingTrust Submission108 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
SafeMERGE: Preserving Safety Alignment in Fine-Tuned Large Language Models via Selective Layer-Wise Model Merging
Aladin Djuhera, Swanand Kadhe, Farhan Ahmed, Syed Zawad, Holger Boche
Published: 05 Mar 2025, Last Modified: 25 Mar 2025
BuildingTrust
Readers: Everyone
Reliable and Efficient Amortized Model-based Evaluation
Sang T. Truong, Yuheng Tu, Percy Liang, Bo Li, Sanmi Koyejo
Published: 05 Mar 2025, Last Modified: 25 Mar 2025
BuildingTrust
Readers: Everyone
Latent Adversarial Training Improves the Representation of Refusal
Alexandra Abbas, Nora Petrova, Hélios Lyons, Natalia Perez-Campanero
Published: 05 Mar 2025, Last Modified: 06 Mar 2025
BuildingTrust
Readers: Everyone
An Empirical Study on Prompt Compression for Large Language Models
Zhang Zheng, Jinyi Li, Yihuai Lan, Xiang Wang, Hao Wang
Published: 05 Mar 2025, Last Modified: 24 Mar 2025
BuildingTrust
Readers: Everyone
The Fundamental Limits of LLM Unlearning: Complexity-Theoretic Barriers and Provably Optimal Protocols
Aviral Srivastava
Published: 05 Mar 2025, Last Modified: 06 Mar 2025
BuildingTrust
Readers: Everyone
Rethinking Hallucinations: Correctness, Consistency, and Prompt Multiplicity
Prakhar Ganesh, Reza Shokri, Golnoosh Farnadi
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
XtraGPT: LLMs for Human-AI Collaboration on Controllable Scientific Paper Refinement
ICLR 2025 Workshop BuildingTrust Submission101 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei, Shenghua He, Tian Xia, Fei Liu, Andy Wong, Jingyang Lin, Mei Han
Published: 05 Mar 2025, Last Modified: 08 Apr 2025
BuildingTrust
Readers: Everyone
Mechanistic Anomaly Detection for "Quirky" Language Models
David O. Johnston, Arkajyoti Chakraborty, Nora Belrose
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis
Jeffrey Yang Fan Chiang, Seungjae Lee, Jia-Bin Huang, Furong Huang, Yizheng Chen
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone