NeurIPS 2024 Workshop SafeGenAi Submissions
Concept Denoising Score Matching for Responsible Text-to-Image Generation
Silpa Vadakkeeveetil Sreelatha, Sauradip Nag, Serge Belongie, Muhammad Awais, Anjan Dutta
Published: 12 Oct 2024, Last Modified: 20 Nov 2024
SafeGenAi Poster
Readers: Everyone
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Reddy Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Privacy Protection in Personalized Diffusion Models via Targeted Cross-Attention Adversarial Attack
Xide Xu, Muhammad Atif Butt, Sandesh Kamath, Bogdan Raducanu
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Rule-Guided Language Model Alignment for Text Generation Management in Industrial Use Cases
Shunichi Akatsuka, Aman Kumar, Xian Yeow Lee, Lasitha Vidyaratne, Dipanjan Dipak Ghosh, Ahmed K. Farahat
Published: 12 Oct 2024, Last Modified: 19 Nov 2024
SafeGenAi Poster
Readers: Everyone
Towards Safe and Honest AI Agents with Neural Self-Other Overlap
Marc Carauleanu, Michael Vaiana, Judd Rosenblatt, Cameron Berg, Diogo S de Lucena
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Oral
Readers: Everyone
DeepInception: Hypnotize Large Language Model to Be Jailbreaker
Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment
Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Zer0-Jack: A memory-efficient gradient-based jailbreaking method for black box Multi-modal Large Language Models
Tiejin Chen, Kaishen Wang, Hua Wei
Published: 12 Oct 2024, Last Modified: 19 Nov 2024
SafeGenAi Poster
Readers: Everyone
Adversarial Prompt Evaluation: Systematic Benchmarking of Guardrails Against Prompt Input Attacks on LLMs
Giulio Zizzo, Giandomenico Cornacchia, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Beat Buesser, Mark Purcell, Pin-Yu Chen, Prasanna Sattigeri, Kush R. Varshney
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Oral
Readers: Everyone
CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept
YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Auditing Empirical Privacy Protection of Private LLM Adaptations
Lorenzo Rossi, Bartłomiej Marek, Vincent Hanke, Xun Wang, Michael Backes, Adam Dziedzic, Franziska Boenisch
Published: 12 Oct 2024, Last Modified: 26 Nov 2024
SafeGenAi Poster
Readers: Everyone
Can LLMs Verify Arabic Claims? Evaluating the Arabic Fact-Checking Abilities of Multilingual LLMs
Ayushman Gupta, Aryan Singhal, Thomas Law, Veekshith Rao, Evan Duan, Ryan Luo Li
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Rethinking Adversarial Attacks as Protection Against Diffusion-based Mimicry
Haotian Xue, Yongxin Chen
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C Wallace, Zachary Chase Lipton, Jeffrey P. Bigham
Published: 12 Oct 2024, Last Modified: 20 Nov 2024
SafeGenAi Poster
Readers: Everyone
Exploring Memorization and Copyright Violation in Frontier LLMs: A Study of the New York Times v. OpenAI 2023 Lawsuit
Joshua Freeman, Chloe Rippe, Edoardo Debenedetti, Maksym Andriushchenko
Published: 12 Oct 2024, Last Modified: 24 Nov 2024
SafeGenAi Poster
Readers: Everyone
Differential Privacy of Cross-Attention with Provable Guarantee
Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for Large Language Models
Xiaomeng Hu, Pin-Yu Chen, Tsung-Yi Ho
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Anchored Optimization and Contrastive Revisions: Addressing Reward Hacking in Alignment
Karel D'Oosterlinck, Winnie Xu, Chris Develder, Thomas Demeester, Amanpreet Singh, Christopher Potts, Douwe Kiela, Shikib Mehri
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Applying Sparse Autoencoders to Unlearn Knowledge in Language Models
Eoin Farrell, Yeu-Tong Lau, Arthur Conmy
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
A Closer Look at System Message Robustness
Norman Mu, Jonathan Lu, Michael Lavery, David Wagner
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Targeted Unlearning with Single Layer Unlearning Gradient
Zikui Cai, Yaoteng Tan, M. Salman Asif
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
Safe Decision Transformer with Learning-based Constraints
Ruhan Wang, Dongruo Zhou
Published: 12 Oct 2024, Last Modified: 20 Nov 2024
SafeGenAi Poster
Readers: Everyone
HEARTS: A Holistic Framework for Explainable, Sustainable and Robust Text Stereotype Detection
Theo King, Zekun Wu, Adriano Koshiyama, Emre Kazim, Philip Colin Treleaven
Published: 12 Oct 2024, Last Modified: 02 Dec 2024
SafeGenAi Poster
Readers: Everyone
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
Simon Lermen, Mateusz Dziemian, Govind Pimpale
Published: 12 Oct 2024, Last Modified: 14 Nov 2024
SafeGenAi Poster
Readers: Everyone
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan, Udari Madhushani Sehwag, Michael-Andrei Panaitescu-Liess, Furong Huang
Published: 12 Oct 2024, Last Modified: 21 Nov 2024
SafeGenAi Poster
Readers: Everyone