ICML 2024 Workshop NextGenAISafety Submissions
Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques
Rishika Bhagwatkar, Shravan Nayak, Reza Bayat, Alexis Roger, Daniel Z Kaplan, Pouya Bashivan, Irina Rish
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Fairness through partial awareness: Evaluation of the addition of demographic information for bias mitigation methods
Chung Peng Lee, Rachel Hong, Jamie Heather Morgenstern
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Adversarial Training with Synthesized Data: A Path to Robust and Generalizable Neural Networks
Reza Bayat, Irina Rish
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Why Has Predicting Downstream Capabilities of Frontier AI Models with Scale Remained Elusive?
Rylan Schaeffer, Hailey Schoelkopf, Brando Miranda, Gabriel Mukobi, Varun Madan, Adam Ibrahim, Herbie Bradley, Stella Biderman, Sanmi Koyejo
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Robust Knowledge Unlearning via Mechanistic Localizations
Phillip Huang Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, Gintare Karolina Dziugaite
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Measuring Goal-Directedness
Matt MacDermott, James Fox, Francesco Belardinelli, Tom Everitt
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Is ChatGPT Transforming Academics' Writing Style?
Mingmeng GENG, Roberto Trotta
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
A Sim2Real Approach for Identifying Task-Relevant Properties in Interpretable Machine Learning
Eura Nofshin, Esther Brown, Brian Lim, Weiwei Pan, Finale Doshi-Velez
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Nouha Dziri, Yejin Choi
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies
Brian R. Bartoldson, James Diffenderfer, Konstantinos Parasyris, Bhavya Kailkhura
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Talking Nonsense: Probing Large Language Models' Understanding of Adversarial Gibberish Inputs
Valeriia Cherepanova, James Zou
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Large Language Models as Misleading Assistants in Conversation
Betty Li Hou, Kejian Shi, Jason Phang, James Aung, Steven Adler, Rosie Campbell
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Rule Based Rewards for Fine-Grained LLM Safety
Tong Mu, Alec Helyar, Johannes Heidecke, Joshua Achiam, Andrea Vallone, Ian D Kivlichan, Molly Lin, Alex Beutel, John Schulman, Lilian Weng
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Can Go AIs be adversarially robust?
Tom Tseng, Euan McLean, Kellin Pelrine, Tony Tong Wang, Adam Gleave
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Distillation based Robustness Verification with PAC Guarantees
Patrick Indri, Peter Blohm, Anagha Athavale, Ezio Bartocci, Georg Weissenbacher, Matteo Maffei, Dejan Nickovic, Thomas Gärtner, SAGAR MALHOTRA
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-based Selection
Somrita Ghosh, Yuelin Xu, Xiao Zhang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Certifiably Robust RAG against Retrieval Corruption
Chong Xiang, Tong Wu, Zexuan Zhong, David Wagner, Danqi Chen, Prateek Mittal
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Fairness Through Controlled (Un)Awareness in Node Embeddings
Dennis Vetter, Jasper Forth, Gemma Roig, Holger Dell
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data?
Michael-Andrei Panaitescu-Liess, Zora Che, Bang An, Yuancheng Xu, Pankayaraj Pathmanathan, Souradip Chakraborty, Sicheng Zhu, Tom Goldstein, Furong Huang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
AI Agents with Formal Security Guarantees
Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, Martin Vechev
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Efficient Differentially Private Fine-Tuning of Diffusion Models
Jing Liu, Andrew Lowy, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
OxonFair: A Flexible Toolkit for Algorithmic Fairness
Eoin D. Delaney, Zihao Fu, Sandra Wachter, Brent Mittelstadt, Chris Russell
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
ContextCite: Attributing Model Generation to Context
Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone
Private Attribute Inference from Images with Vision-Language Models
Batuhan Tömekçe, Mark Vero, Robin Staab, Martin Vechev
Published: 28 Jun 2024, Last Modified: 25 Jul 2024
NextGenAISafety 2024 Poster
Readers: Everyone