ICML 2024 Workshop NextGenAISafety Submissions
Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
Francisco Eiras, Aleksandar Petrov, Philip Torr, M. Pawan Kumar, Adel Bibi
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Hummer: Towards Limited Competitive Preference Dataset
Li Jiang, Yusen Wu, Junwu Xiong, Jingqing Ruan, Qingpei Guo, zujie wen, JUN ZHOU, Xiaotie Deng
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Robust CLIP: Unsupervised Adversarial Fine-Tuning of Vision Embeddings for Robust Large Vision-Language Models
Christian Schlarmann, Naman Deep Singh, Francesco Croce, Matthias Hein
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Black-Box Detection of Language Model Watermarks
Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
A statistical framework for weak-to-strong generalization
Seamus Somerstep, Felipe Maia Polo, Moulinath Banerjee, Yaacov Ritov, Mikhail Yurochkin, Yuekai Sun
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Exploiting LLM Quantization
Kazuki Egashira, Mark Vero, Robin Staab, Jingxuan He, Martin Vechev
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Oral
Readers: Everyone
Alignment Calibration: Machine Unlearning for Contrastive Learning under Auditing
Yihan Wang, Yiwei Lu, Guojun Zhang, Franziska Boenisch, Adam Dziedzic, Yaoliang Yu, Xiao-Shan Gao
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Oral
Readers: Everyone
BELLS: A Framework Towards Future Proof Benchmarks for the Evaluation of LLM Safeguards
Diego Dorn, Alexandre Variengien, Charbel-Raphael Segerie, Vincent Corruble
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Oral
Readers: Everyone
Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Enhancing the Resilience of LLMs Against Grey-box Extractions
Hanbo Huang, Yihan Li, Bowen Jiang, Bo Jiang, Lin Liu, Zhuotao Liu, Ruoyu Sun, Shiyu Liang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors
Peter Lorenz, Mario Ruben Fernandez, Jens Müller, Ullrich Koethe
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Generating Potent Poisons and Backdoors from Scratch with Guided Diffusion
Hossein Souri, Arpit Bansal, Hamid Kazemi, Liam H Fowl, Aniruddha Saha, Jonas Geiping, Andrew Gordon Wilson, Rama Chellappa, Tom Goldstein, Micah Goldblum
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Oral
Readers: Everyone
Eliciting Black-Box Representations from LLMs through Self-Queries
Dylan Sam, Marc Anton Finzi
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
DiveR-CT: Diversity-enhanced Red Teaming with Relaxing Constraints
Andrew Zhao, Quentin Xu, Matthieu Lin, Shenzhi Wang, Yong-jin Liu, Zilong Zheng, Gao Huang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Weak-to-Strong Jailbreaking on Large Language Models
Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
AI Alignment with Changing and Influenceable Reward Functions
Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Cascade Reward Sampling for Efficient Decoding-Time Alignment
Bolian Li, Yifan Wang, Ananth Grama, Ruqi Zhang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Towards Safe Large Language Models for Medicine
Tessa Han, Aounon Kumar, Chirag Agarwal, Himabindu Lakkaraju
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone