ICLR 2024 Workshop SeT LLM Submissions
Coercing LLMs to do and reveal (almost) anything
Jonas Geiping, Alex Stein, Manli Shu, Khalid Saifullah, Yuxin Wen, Tom Goldstein
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Fight Back Against Jailbreaking via Prompt Adversarial Tuning
ICLR 2024 Workshop SeT LLM Submission72 Authors (anonymous)
Published: 04 Mar 2024, Last Modified: 09 Jun 2024
SeT LLM @ ICLR 2024
Tailoring Self-Rationalizers with Multi-Reward Distillation
Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Safer-Instruct: Aligning Language Models with Automated Preference Data
Taiwei Shi, Kai Chen, Jieyu Zhao
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding
Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B
Simon Lermen, Charlie Rogers-Smith
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
WinoViz: Probing Visual Properties of Objects Under Different States
Woojeong Jin, Tejas Srinivasan, Jesse Thomason, Xiang Ren
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Attacking LLM Watermarks by Exploiting Their Strengths
Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
WatME: Towards Lossless Watermarking Through Lexical Redundancy
Liang CHEN, Yatao Bian, Yang Deng, Deng Cai, Shuaiyi Li, Peilin Zhao, Kam-Fai Wong
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Red Teaming GPT-4V: Are GPT-4V Safe Against Uni/Multi-Modal Jailbreak Attacks?
Shuo Chen, Zhen Han, Bailan He, Zifeng Ding, Wenqian Yu, Philip Torr, Volker Tresp, Jindong Gu
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Calibrating Language Models With Adaptive Temperature Scaling
Johnathan Xie, Annie S Chen, Yoonho Lee, Eric Mitchell, Chelsea Finn
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Decoding Compressed Trust: Scrutinizing the Trustworthiness of Efficient LLMs Under Compression
Junyuan Hong, Jinhao Duan, Chenhui Zhang, Zhangheng LI, Chulin Xie, Kelsey Lieberman, James Diffenderfer, Brian R. Bartoldson, AJAY KUMAR JAISWAL, Kaidi Xu, Bhavya Kailkhura, Dan Hendrycks, Dawn Song, Zhangyang Wang, Bo Li
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
On Trojan Signatures in Large Language Models of Code
Aftab Hussain, Md Rafiqul Islam Rabin, Amin Alipour
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization
Xiaoyu Ye, Hao Huang, Jiaqi An, Yongtao Wang
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
A closer look at adversarial suffix learning for Jailbreaking LLMs
Zhe Wang, Yanjun Qi
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Initial Response Selection for Prompt Jailbreaking using Model Steering
Thien Q. Tran, Koki Wataoka, Tsubasa Takahashi
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Retrieval Augmented Prompt Optimization
Yifan Sun, Jean-Baptiste Tien, Karthik lakshmanan
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Explorations of Self-Repair in Language Model
Cody Rushing, Neel Nanda
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Break it, Imitate it, Fix it: Robustness by Generating Human-Like Attacks
Aradhana Sinha, Ananth Balashankar, Ahmad Beirami, Thi Avrahami, Jilin Chen, Alex Beutel
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Mikayel Samvelyan, Sharath Chandra Raparthy, Andrei Lupu, Eric Hambro, Aram H. Markosyan, Manish Bhatt, Yuning Mao, Minqi Jiang, Jack Parker-Holder, Jakob Nicolaus Foerster, Tim Rocktäschel, Roberta Raileanu
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
PETA: PARAMETER-EFFICIENT TROJAN ATTACKS
Lauren Hong, Ting Wang
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
How many Opinions does your LLM have? Improving Uncertainty Estimation in NLG
Lukas Aichberger, Kajetan Schweighofer, Mykyta Ielanskyi, Sepp Hochreiter
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024
TOFU: A Task of Fictitious Unlearning for LLMs
Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, J Zico Kolter
Published: 04 Mar 2024, Last Modified: 14 Apr 2024
SeT LLM @ ICLR 2024