OpenReview.net
ICLR 2024 Workshop SeT LLM Submissions
Toward Robust Unlearning for LLMs
ICLR 2024 Workshop SeT LLM Submission114 Authors
Published: 04 Mar 2024, Last Modified: 06 May 2024 (SeT LLM @ ICLR 2024)
Readers: Everyone
PANDORA: Detailed LLM Jailbreaking via Collaborated Phishing Agents with Decomposed Reasoning
Zhaorun Chen, Zhuokai Zhao, Wenjie Qu, Zichen Wen, Zhiguang Han, Zhihong Zhu, Jiaheng Zhang, Huaxiu Yao
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation
Yixin Wan, Fanyou Wu, Weijie Xu, Srinivasan H. Sengamedu
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Differentially Private Synthetic Data via Foundation Model APIs 2: Text
Chulin Xie, Zinan Lin, Arturs Backurs, Sivakanth Gopi, Da Yu, Huseyin A Inan, Harsha Nori, Haotian Jiang, Huishuai Zhang, Yin Tat Lee, Bo Li, Sergey Yekhanin
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
MARS: Meaning-Aware Response Scoring for Uncertainty Estimation in Generative LLMs
Yavuz Faruk Bakman, Duygu Nur Yaldiz, Baturalp Buyukates, Chenyang Tao, Dimitrios Dimitriadis, Salman Avestimehr
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Quantitative Certification of Knowledge Comprehension in LLMs
Isha Chaudhary, Vedaant V Jain, Gagandeep Singh
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
Aleksandar Makelov, Georg Lange, Neel Nanda
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Group Preference Optimization: Few-Shot Alignment of Large Language Models
Siyan Zhao, John Dang, Aditya Grover
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Source-Aware Training Enables Knowledge Attribution in Language Models
Muhammad Khalifa, David Wadden, Emma Strubell, Honglak Lee, Lu Wang, Iz Beltagy, Hao Peng
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Self-evaluation and self-prompting to improve the reliability of LLMs
Alexandre Piché, Aristides Milios, Dzmitry Bahdanau, Christopher Pal
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
The Effect of Model Size on LLM Post-hoc Explainability via LIME
Henning Heyen, Amy Widdicombe, Noah Yamamoto Siegel, Philip Colin Treleaven, Maria Perez-Ortiz
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Attacks on Third-Party APIs of Large Language Models
Wanru Zhao, Vidit Khazanchi, Haodi Xing, Xuanli He, Qiongkai Xu, Nicholas Donald Lane
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Walk the Talk? Measuring the Faithfulness of Large Language Model Explanations
ICLR 2024 Workshop SeT LLM Submission95 Authors
Published: 04 Mar 2024, Last Modified: 15 Apr 2024 (SeT LLM @ ICLR 2024)
Are Large Language Models Bayesian? A Martingale Perspective on In-Context Learning
Fabian Falck, Ziyu Wang, Christopher C. Holmes
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Backward Chaining Circuits in a Transformer Trained on a Symbolic Reasoning Task
Jannik Brinkmann, Abhay Sheshadri, Victor Levoso, Paul Swoboda, Christian Bartelt
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Single-pass detection of jailbreaking input in large language models
Leyla Naz Candogan, Yongtao Wu, Elias Abad Rocamora, Grigorios Chrysos, Volkan Cevher
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Is Your Jailbreaking Prompt Truly Effective for Large Language Models?
ICLR 2024 Workshop SeT LLM Submission90 Authors
Published: 04 Mar 2024, Last Modified: 19 Apr 2024 (SeT LLM @ ICLR 2024)
I'm not familiar with the name Harry Potter: Prompting Baselines for Unlearning in LLMs
Pratiksha Thaker, Yash Maurya, Virginia Smith
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Enhancing and Evaluating Logical Reasoning Abilities of Large Language Models
Shujie Deng, Honghua Dong, Xujie Si
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
How Susceptible are Large Language Models to Ideological Manipulation?
Kai Chen, Zihao He, Jun Yan, Taiwei Shi, Kristina Lerman
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Bayesian reward models for LLM alignment
Adam X. Yang, Maxime Robeyns, Thomas Coste, Jun Wang, Haitham Bou Ammar, Laurence Aitchison
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Simple Permutations Can Fool LLaMA: Permutation Attack and Defense for Large Language Models
Liang CHEN, Yatao Bian, Li Shen, Kam-Fai Wong
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Character-level robustness should be revisited
Elias Abad Rocamora, Yongtao Wu, Fanghui Liu, Grigorios Chrysos, Volkan Cevher
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
An Assessment of Model-on-Model Deception
Julius Heitkoetter, Michael Gerovitch, Laker Newhouse
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)
Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks
Samyak Jain, Robert Kirk, Ekdeep Singh Lubana, Robert P. Dick, Hidenori Tanaka, Edward Grefenstette, Tim Rocktäschel, David Krueger
Published: 04 Mar 2024, Last Modified: 14 Apr 2024 (SeT LLM @ ICLR 2024)