NeurIPS 2025 Workshop LLM Evaluation Submissions
LLMs as Judges for Domain-Specific Text: Evidence from Drilling Reports
Abdallah Benzine, Soumyadipta Sengupta, Sebastiaan Buiting, Imane Khaouja, Yahia Salaheldin Shaaban, Amine EL KHAIR
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Whose Personae? Synthetic Persona Experiments in LLM Research and Pathways to Transparency
Jan Batzner, Volker Stocker, Bingjun Tang, Anusha Natarajan, Qinhao Chen, Stefan Schmid, Gjergji Kasneci
Published: 24 Sept 2025, Last Modified: 06 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
The Social Laboratory: A Psychometric Framework for Multi-Agent LLM Evaluation
Zarreen Reza
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
ChatChecker: A Framework for Dialogue System Testing Through Non-cooperative User Simulation
Roman Mayr, Michel Schimpf, Thomas Bohné
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Prompt Genotyping: Quantifying the Evaluation Gap Between Synthetic Benchmarks and Real LLM Performance
Sohum Mehta, Saaketh Bhojanam
Published: 24 Sept 2025, Last Modified: 25 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
How Many Instructions Can LLMs Follow at Once?
Daniel Jaroslawicz, Brendan Whiting, Parth Shah, Karime Maamari
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Medical AI Consensus: A Multi-Agent Framework for Radiology Report Generation and Evaluation
Ahmed Tamer El Boardy, Ghada Khoriba, Essam Rashed
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Sycophancy Claims about Language Models: The Missing Human-in-the-Loop
Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
MORPHOGEN: A Multilingual Benchmark for Evaluating Gender-Aware Morphological Generation
Aditya Aggarwal, Mehul Agarwal, Arnav Goel, Medha Hira, Anubha Gupta
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Self-Evaluating LLMs for Multi-Step Tasks: Stepwise Confidence Estimation for Failure Detection
Vaibhav Mavi, Shubh Jaroria, Weiqi Sun
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?
Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich
Published: 24 Sept 2025, Last Modified: 28 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
PosterSum: A Multimodal Benchmark for Scientific Poster Summarization
Rohit Saxena, Pasquale Minervini, Frank Keller
Published: 24 Sept 2025, Last Modified: 28 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
LLMs Show Surface-Form Brittleness Under Paraphrase Stress Tests
Juan Miguel Navarro Carranza
Published: 24 Sept 2025, Last Modified: 08 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Oral
Readers: Everyone
Born with a SilverSpoon? Investigating Socioeconomic Bias in LLMs
Smriti Singh, Shuvam Keshari, Vinija Jain, Aman Chadha
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Phase-Transitional Scaling
Kalyan Cherukuri, Aarav Lala
Published: 24 Sept 2025, Last Modified: 05 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
HORIZON: A Benchmark for In-the-wild User Behaviour Modeling
Arnav Goel, Pranjal A Chitale, Bhawna Paliwal, Bishal Santra, Amit Sharma
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Who’s the Impostor? Multi‑Agent Social Deduction for Evaluating LLM Social Reasoning
Xiang Fu
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Probe-Rewrite-Evaluate: A Workflow for Reliable Benchmarks and Quantifying Evaluation Awareness
Lang Xiong, Nishant Bhargava, Jeremy Chang, Jianhang Hong, Haihao Liu, Vasu Sharma, Kevin Zhu
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
A Case for Centaur Evaluations
Andreas Haupt, Erik Brynjolfsson
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Network Dynamics Reasoning: A Novel Benchmark for Evaluating Multi-Step Inference in Large Language Models
Andrew Bae, Saaketh Bhojanam, Laksh Patel
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
SAGE: A Realistic Benchmark for Semantic Understanding
Samarth Goel, Reagan Lee, Kannan Ramchandran
Published: 24 Sept 2025, Last Modified: 20 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
CAVE: Detecting and Explaining Commonsense Anomalies in Visual Environments
Rishika Bhagwatkar, Syrielle Montariol, Angelika Romanou, Beatriz Borges, Irina Rish, Antoine Bosselut
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
LLMs vs. Traditional Sentiment Tools in Psychology: An Evaluation on Belgian-Dutch Narratives
Ratna Kandala, Katie Hoemann
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Evaluating LLM Story Generation through Large-scale Network Analysis on Social Structures
Hiroshi Nonaka, K. E. Perry
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification
Joseph Liu, Yoonsoo Nam, Xinyue Cui, Swabha Swayamdipta
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone