NeurIPS 2025 Workshop LLM Evaluation Submissions
Metrics for Holistic Evaluation of LLM Reasoning about Action, Change, and Planning
Anil B Murthy, Jaron Mink, Lindsay Sanneman
Published: 24 Sept 2025, Last Modified: 25 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Adversarial Behavior in Research Settings: Conducting Control Evaluations with RE-Bench
Harini Rajakumar, Vanessa Nwauwa, Kevin Zhu, Ashwinee Panda, Sunishchal Dev
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Evaluating LLMs for Combinatorial Optimization: One-Phase and Two-Phase Heuristics for 2D Bin-Packing
Syed Mahbubul Huq, Daniel Brito-Pacheco, Daniel Sikar, RAJESH MOJUMDER, Christopher Child, Tillman Weyde
Published: 24 Sept 2025, Last Modified: 01 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
ASCII-Bench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text
Kerry Luo, Joshua Peguero, Anvay Patil, Megan Van Overborg, Ryan Sarmiento, Kevin Zhu
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Agentic Lean Autoformalization (ALA): An LLM collaborative approach to autoformalization in LEAN
Patricio Gallardo, Maziar Raissi, Ke Zhang, Sudhir Murthy
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Beyond Accuracy: A Diagnostic Protocol for Fairly Evaluating Multimodal Reasoning
Shohreh Ghorbani, Chenyu Zhang, Minsol Kim, Jingyao Wu
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Precision Shapes Personality: The Hidden Cost of Quantization in Sub-Billion-LLMs
Soham Sen, Ishaan Gangwani
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
The Price of Progress
Hans Gundlach, Jayson Lynch, Matthias Mertens, Neil Thompson
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Detecting Training Data of Large Language Models via Expectation Maximization
Gyuwan Kim, Yang Li, Evangelia Spiliopoulou, Jie Ma, Miguel Ballesteros, William Yang Wang
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Oral
Readers: Everyone
LaTeXBench: Judge-Only Evaluation of LaTeX Generation, Minimal-Edit Compliance, and Blind Contrast Errors
Ishaan Gangwani, Soham Sen, Aayam Bansal
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
A Multi-Aspect Evaluation of Dialogue in Pythia
Zixun Chen, Petr Babkin, Akshat Gupta, Gopala Anumanchipalli, Xiaomo Liu
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Benchmarking Overton Pluralism in LLMs
Elinor Poole-Dayan, Jiayi Wu, Jiaxin Pei, Michiel A. Bakker
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
YKSBench: Stress-Testing Multimodal Models with Exam-Style Questions
Egemen Sert, Seyda Ertekin
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Evaluating LLMs' Language Confusion in Code-switching Context
Juhyun Oh, Haneul Yoo, Alice Oh
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
OpenGovCorpus: Evaluating LLMs on Citizen Query Tasks
Neil Majithia, Rajat Shinde, Manil Maskey, Elena Simperl
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
R3: Robust Rubric-Agnostic Reward Models
David Anugraha, Zilu Tang, Lester James Validad Miranda, Hanyang Zhao, Shou-Yi Hung, Mohammad Rifqi Farhansyah, Garry Kuwanto, Derry Tanti Wijaya, Genta Indra Winata
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
ChinaTravel: An Open-Ended Benchmark for Language Agents in Chinese Travel Planning
Jie-Jing Shao, Bo-Wen Zhang, Xiao-Wen Yang, Baizhi Chen, Siyu Han, Wen-Da Wei, Guohao Cai, Zhenhua Dong, Lan-Zhe Guo, Yu-Feng Li
Published: 24 Sept 2025, Last Modified: 09 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Husky Hold'em Benchmark: Can LLMs Design Competitive Poker Bots?
Bhavesh Kumar, Hoang Doan Nguyen, Roger Jin, Ryan Teknium, Jeffrey Quesnelle
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
Evaluating LLM-as-a-Judge under Multilingual, Multimodal and Multi-domain Constraints
Shreyansh Padarha, Elizaveta Semenova, Bertie Vidgen, Adam Mahdi, Scott A. Hale
Published: 24 Sept 2025, Last Modified: 29 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
GASLIGHTBENCH: Quantifying LLM Susceptibility to Social Prompting
Lening Nick Cui, Sahil Ghosh, Gareth Lee, Xuanzhe Yao, Swarit Srivastava, William H. Logian, Michael Li, Kevin Zhu, Sunishchal Dev, Michael Saxon, Aaron Sandoval, Sean O'Brien, Ellie Podoshev
Published: 24 Sept 2025, Last Modified: 25 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
GermanPartiesQA: Benchmarking Commercial Large Language Models and AI Companions for Political Alignment and Sycophancy
Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci
Published: 24 Sept 2025, Last Modified: 06 Oct 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
A Protocol-Driven Platform for Agent-Agnostic Evaluation of LLM Agents
Cong Minh Tran, Issam Falih, Hatim CHAHDI, Romain DE LA SOUCHERE
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
MiniCzechBenchmark: A Contamination-Resistant Framework for Rapid LLM Evaluation in Czech
Petr Simecek
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
Sidney Black, Asa Cooper Stickland, Jake Pencharz, Oliver Sourbut, Michael Schmatz, Jay Bailey, Ollie Matthews, Ben Millwood, Alex Remedios, Alan Cooney
Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
Readers: Everyone