Keywords: Multimodal Large Language Model, Benchmark, Dataset, Ancient Manuscript
TL;DR: We introduce MS-Bench, a benchmark for ancient manuscript analysis, revealing how large multimodal models (LMMs) handle philological tasks and highlighting the potential for future human-AI collaboration.
Abstract: Analyzing ancient manuscripts has traditionally been a labor-intensive and time-consuming task for philologists. While recent advancements in large multimodal models (LMMs) have demonstrated their potential across diverse domains, their effectiveness in manuscript study remains underexplored. In this paper, we introduce MS-Bench, the first comprehensive benchmark co-developed with archaeologists, comprising 5,076 high-resolution images dating from the 4th to the 14th century and 9,982 expert-curated questions across nine sub-tasks aligned with archaeological workflows. Using four prompting strategies, we systematically evaluate 32 LMMs on their effectiveness, robustness, and cultural contextualization. Our analysis reveals scale-driven improvements in performance and reliability, the impact of prompting strategies (chain-of-thought prompting has a two-sided effect, while visual retrieval-augmented prompts provide a consistent boost), and task-specific strategy preferences that depend on each LMM's visual capabilities. Although current LMMs are not yet capable of replacing domain expertise, they demonstrate promising potential to accelerate manuscript research through future human–AI collaboration.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/ClawCardMirror/MS-Bench/tree/main
Code URL: https://github.com/ianeong/MS-Bench
Supplementary Material: pdf
Primary Area: AI/ML Datasets & Benchmarks for social sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 1201