ICLR 2025 Workshop BuildingTrust Submissions
Finding Sparse Autoencoder Representations Of Errors In CoT Prompting
Justin Theodorus, V Swaytha, Shivani Gautam, Adam Ward, Mahir Shah, Cole Blondin, Kevin Zhu
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
ChunkRAG: A Novel LLM-Chunk Filtering Method for RAG Systems
ICLR 2025 Workshop BuildingTrust Submission151 Authors
14 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Exploring Vision-Language Alignment Under Subtle Contradictions
ICLR 2025 Workshop BuildingTrust Submission149 Authors
14 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
FiDeLiS: Faithful Reasoning in Large Language Models for Knowledge Graph Question Answering
Yuan Sui, Yufei He, Nian Liu, Xiaoxin He, Kun Wang, Bryan Hooi
Published: 05 Mar 2025, Last Modified: 24 Mar 2025
BuildingTrust
Readers: Everyone
Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering
Yuan Sui, Yufei He, Zifeng Ding, Bryan Hooi
Published: 05 Mar 2025, Last Modified: 24 Mar 2025
BuildingTrust
Readers: Everyone
MKA: Leveraging Cross-Lingual Consensus for Model Abstention
Sharad Duwal
Published: 05 Mar 2025, Last Modified: 31 Mar 2025
BuildingTrust
Readers: Everyone
An Afrocentric Perspective on Algorithm Watermarking of AI-generated Content.
ICLR 2025 Workshop BuildingTrust Submission145 Authors
14 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Hidden No More: Attacking and Defending Private Third-Party LLM Inference
Arka Pal, Rahul Krishna Thomas, Louai Zahran, Erica Choi, Akilesh Potti, Micah Goldblum
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Seeds, Contexts, and Tongues: Decoding the Drivers of Hallucination in Language Models
ICLR 2025 Workshop BuildingTrust Submission143 Authors
13 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
A Flexible Large Language Models Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Gabriel Chua, Chan Shing Yee, Shaun Khoo
Published: 05 Mar 2025, Last Modified: 10 Apr 2025
BuildingTrust
Readers: Everyone
Strengthening Robustness to Adversarial Prompts: The Role of Multi-Agent Conversations in Large Language Models
ICLR 2025 Workshop BuildingTrust Submission140 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Achieving Exact Federated Unlearning with Improved Post-Unlearning Performance
ICLR 2025 Workshop BuildingTrust Submission139 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
ORTHOGONAL SAE: FEATURE DISENTANGLEMENT THROUGH COMPETITION-AWARE ORTHOGONALITY CONSTRAINTS
ICLR 2025 Workshop BuildingTrust Submission138 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
UTF: Undertrained Tokens as Fingerprints —— A Novel Approach to LLM Identification
ICLR 2025 Workshop BuildingTrust Submission137 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Benchmarking Intent Awareness in Prompt Injection Guardrail Models
ICLR 2025 Workshop BuildingTrust Submission136 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Data Efficient Subset Training with Differential Privacy
ICLR 2025 Workshop BuildingTrust Submission135 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
PATTERNS AND MECHANISMS OF CONTRASTIVE ACTIVATION ENGINEERING
Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali
Published: 05 Mar 2025, Last Modified: 14 Apr 2025
BuildingTrust
Readers: Everyone
UNLOCKING HIERARCHICAL CONCEPT DISCOVERY IN LANGUAGE MODELS THROUGH GEOMETRIC REGULARIZATION
Ed Li, Junyu Ren
Published: 13 Mar 2025, Last Modified: 16 Apr 2025
BuildingTrust
Readers: Everyone
The Steganographic Potentials of Language Models
Artem Karpov, Tinuade Adeleke, Seong Hah Cho, Natalia Perez-Campanero
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Investigating the Effects of Emotional Stimuli Type and Intensity on Large Language Model (LLM) Behavior
ICLR 2025 Workshop BuildingTrust Submission130 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
Siege: Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou, Ron Arel
Published: 05 Mar 2025, Last Modified: 17 Apr 2025
BuildingTrust
Readers: Everyone
TRUTH DECAY: Quantifying Multi-Turn Sycophancy in Language Models
ICLR 2025 Workshop BuildingTrust Submission127 Authors
11 Feb 2025 (modified: 06 Mar 2025)
Submitted to BuildingTrust
Readers: Everyone
LLM Neurosurgeon: Targeted Knowledge Removal in LLMs using Sparse Autoencoders
Kunal Patil, Dylan Zhou, Yifan Sun, Karthik lakshmanan, Senthooran Rajamanoharan, Arthur Conmy
Published: 05 Mar 2025, Last Modified: 15 Apr 2025
BuildingTrust
Readers: Everyone
Conformal Structured Prediction
Botong Zhang, Shuo Li, Osbert Bastani
Published: 05 Mar 2025, Last Modified: 01 Apr 2025
BuildingTrust
Readers: Everyone
Steering Fine-Tuning Generalization with Targeted Concept Ablation
Helena Casademunt, Caden Juang, Samuel Marks, Senthooran Rajamanoharan, Neel Nanda
Published: 05 Mar 2025, Last Modified: 17 Apr 2025
BuildingTrust
Readers: Everyone