ICLR 2025 Workshop BuildingTrust Submissions
Why Do Multiagent Systems Fail?
Melissa Z Pan, Mert Cemri, Lakshya A Agrawal, Shuyi Yang, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Kannan Ramchandran, Dan Klein, Joseph E. Gonzalez, Matei Zaharia, Ion Stoica
Published: 05 Mar 2025, Last Modified: 25 Apr 2025
BuildingTrust
Readers: Everyone
Self-Ablating Transformers: More Interpretability, Less Sparsity
Jeremias Lino Ferrao, Luhan Mikaelson, Keenan Pepper, Natalia Perez-Campanero
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
Roman Levin, Valeriia Cherepanova, Abhimanyu Hans, Avi Schwarzschild, Tom Goldstein
Published: 05 Mar 2025, Last Modified: 06 Mar 2025
BuildingTrust
Readers: Everyone
Evaluating Text Humanlikeness via Self-Similarity Exponent
Ilya Pershin
Published: 05 Mar 2025, Last Modified: 10 Apr 2025
BuildingTrust
Readers: Everyone
Towards Unifying Interpretability and Control: Evaluation via Intervention
ICLR 2025 Workshop BuildingTrust Submission65 Authors
10 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Understanding the Relationship between Prompts and Response Uncertainty in Large Language Models
ICLR 2025 Workshop BuildingTrust Submission64 Authors
10 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Antipodal Pairing and Mechanistic Signals in Dense SAE Latents
Alessandro Stolfo, Ben Peng Wu, Mrinmaya Sachan
Published: 05 Mar 2025, Last Modified: 09 Apr 2025
BuildingTrust
Readers: Everyone
Red Teaming for Trust: Evaluating Multicultural and Multilingual AI Systems in Asia-Pacific
Akash Kundu, Adrianna Tan, Theodora Skeadas, Rumman Chowdhury, Sarah Amos
Published: 05 Mar 2025, Last Modified: 09 Apr 2025
BuildingTrust
Readers: Everyone
Evaluation of Large Language Models via Coupled Token Generation
Nina L. Corvelo Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez Rodriguez
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Emotional Manipulation is All You Need: A Framework for Evaluating Healthcare Misinformation in LLMs
ICLR 2025 Workshop BuildingTrust Submission60 Authors
10 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
PRUNING AS A DEFENSE: REDUCING MEMORIZATION IN LARGE LANGUAGE MODELS
Mansi Gupta, Nikhar Waghela, Sarthak Gupta, Shourya Goel, Sanjif Shanmugavelu
Published: 05 Mar 2025, Last Modified: 03 Apr 2025
BuildingTrust
Readers: Everyone
Boosting Adversarial Robustness of Vision-Language Pre-training Models against Multimodal Adversarial attacks
Youze Wang, Wenbo Hu, Qin Li, Richang Hong
Published: 05 Mar 2025, Last Modified: 14 Apr 2025
BuildingTrust
Readers: Everyone
WebGauntlet: Measuring Instruction Following and Robustness for Web Agents
ICLR 2025 Workshop BuildingTrust Submission56 Authors
10 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Analyzing Memorization in Large Language Models through the Lens of Model Attribution
Tarun Ram Menta, Susmit Agrawal, Chirag Agarwal
Published: 05 Mar 2025, Last Modified: 14 Apr 2025
BuildingTrust
Readers: Everyone
VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models
Wenbo Hu, Shishen Gu, Youze Wang, Richang Hong
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Is This Written by AI?
ICLR 2025 Workshop BuildingTrust Submission53 Authors
10 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
No, Of Course I Can! Refusal Mechanisms Can Be Exploited Using Harmless Data
Joshua Kazdan, Lisa Yu, Rylan Schaeffer, Chris Cundy, Sanmi Koyejo, Krishnamurthy Dj Dvijotham
Published: 05 Mar 2025, Last Modified: 06 Mar 2025
BuildingTrust
Readers: Everyone
ExpProof : Operationalizing Explanations for Confidential Models with ZKPs
Chhavi Yadav, Evan Laufer, Dan Boneh, Kamalika Chaudhuri
Published: 05 Mar 2025, Last Modified: 09 Apr 2025
BuildingTrust
Readers: Everyone
Working Memory Attack on LLMs
Bibek Upadhayay, Vahid Behzadan, Amin Karbasi
Published: 05 Mar 2025, Last Modified: 13 Apr 2025
BuildingTrust
Readers: Everyone
Automated Capability Discovery via Model Self-Exploration
Cong Lu, Shengran Hu, Jeff Clune
Published: 05 Mar 2025, Last Modified: 23 Mar 2025
BuildingTrust
Readers: Everyone
Automated Feature Labeling with Token-Space Gradient Descent
Julian Schulz, Seamus Fallows
Published: 05 Mar 2025, Last Modified: 01 Apr 2025
BuildingTrust
Readers: Everyone
Model Evaluations Need Rigorous and Transparent Human Baselines
Kevin Wei, Patricia Paskov, Sunishchal Dev, Michael J Byun, Anka Reuel, Xavier Roberts-Gaal, Rachel Calcott, Evie Coxon, Chinmay Deshpande
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
NLP-EHUGBO: BRIDGING THE FAIRNESS GAP IN LANGUAGE MODELS FOR LOW-RESOURCE AFRICAN DIALECTS
ICLR 2025 Workshop BuildingTrust Submission45 Authors
09 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić, Luze Sun, Jie Zhang, Florian Tramèr
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
StochasTok: Improving Fine-Grained Subword Understanding in LLMs
Anya Sims, Cong Lu, Klara Kaleb, Jakob Nicolaus Foerster, Yee Whye Teh
Published: 05 Mar 2025, Last Modified: 09 Apr 2025
BuildingTrust
Readers: Everyone