ICLR 2024 Workshop SeT LLM Submissions
Preventing Memorized Completions through White-Box Filtering
ICLR 2024 Workshop SeT LLM Submission30 Authors (anonymous)
Published: 04 Mar 2024, Last Modified: 19 Apr 2024
SeT LLM @ ICLR 2024
What's in Your "Safe" Data?: Identifying Benign Data that Breaks Safety
Luxi He, Mengzhou Xia, Peter Henderson
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang, Xiao Wang, Qi Zhang, Linda Ruth Petzold, William Yang Wang, Xun Zhao, Dahua Lin
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Assessing Prompt Injection Risks in 200+ Custom GPTs
Jiahao Yu, Yuhang Wu, Dong Shu, Mingyu Jin, Sabrina Yang, Xinyu Xing
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
CollabEdit: Towards Non-destructive Collaborative Knowledge Editing
Jiamu Zheng, Jinghuai Zhang, Futing Wang, Tianyu Du, Tao Lin
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
On Prompt-Driven Safeguarding for Large Language Models
Chujie Zheng, Fan Yin, Hao Zhou, Fandong Meng, Jie Zhou, Kai-Wei Chang, Minlie Huang, Nanyun Peng
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
BEYOND FINE-TUNING: LORA MODULES BOOST NEAR-OOD DETECTION AND LLM SECURITY
Etienne Salimbeni, Francesco Craighero, Renata Khasanova, Milos Vasic, Pierre Vandergheynst
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Privacy-preserving Fine-tuning of Large Language Models through Flatness
Tiejin Chen, Longchao Da, Huixue Zhou, Pingzhi Li, Kaixiong Zhou, Tianlong Chen, Hua Wei
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Shadowcast: Stealthy Data Poisoning Attacks Against Vision-Language Models
Yuancheng Xu, Jiarui Yao, Manli Shu, Yanchao Sun, Zichu Wu, Ning Yu, Tom Goldstein, Furong Huang
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Watermark Stealing in Large Language Models
Nikola Jovanović, Robin Staab, Martin Vechev
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Leveraging Context in Jailbreaking Attacks
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios Chrysos
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Large Language Model Bias Mitigation from the Perspective of Knowledge Editing
Ruizhe Chen, Yichen Li, Zikai Xiao, Zuozhu Liu
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Self-Alignment of Large Language Models via Social Scene Simulation
Xianghe Pang, Shuo Tang, Rui Ye, Yuxin Xiong, Bolun Zhang, Yanfeng Wang, Siheng Chen
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Steering LLMs Towards Unbiased Responses: A Causality-Guided Debiasing Framework
ICLR 2024 Workshop SeT LLM Submission12 Authors (anonymous)
Published: 04 Mar 2024, Last Modified: 16 Apr 2024
SeT LLM @ ICLR 2024
Can LLMs Separate Instructions From Data? And What Do We Even Mean By That?
Egor Zverev, Sahar Abdelnabi, Mario Fritz, Christoph H. Lampert
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Exploring the Adversarial Capabilities of Large Language Models
Lukas Struppek, Minh Hieu Le, Dominik Hintersdorf, Kristian Kersting
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
On Fairness Implications and Evaluations of Low-Rank Adaptation of Large Models
Ken Liu, Zhoujie Ding, Berivan Isik, Sanmi Koyejo
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin, Ruoxi Chen, Andy Zhou, Yang Zhang, Haohan Wang
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou, Bo Li, Haohan Wang
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, Peter Henderson
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Watermarks in the Sand: Impossibility of Strong Watermarking for Generative Models
Hanlin Zhang, Benjamin L. Edelman, Danilo Francati, Daniele Venturi, Giuseppe Ateniese, Boaz Barak
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Open Sesame! Universal Black-Box Jailbreaking of Large Language Models
Raz Lapid, Ron Langberg, Moshe Sipper
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024