Keywords: Multimodal Large Language Model, Benchmark, Dataset, Ancient Manuscript
TL;DR: We introduce MS-Bench, a benchmark for ancient manuscript analysis, revealing how large multimodal models (LMMs) handle philological tasks and highlighting the potential for future human-AI collaboration.
Abstract: Analyzing ancient manuscripts has traditionally been a labor-intensive and time-consuming task for philologists. While recent advancements in large multimodal models (LMMs) have demonstrated their potential across diverse domains, their effectiveness in manuscript study remains underexplored. In this paper, we introduce MS-Bench, the first comprehensive benchmark co-developed with archaeologists, comprising 5,076 high-resolution images dating from the 4th to the 14th century and 9,982 expert-curated questions across nine sub-tasks aligned with archaeological workflows. Using four prompting strategies, we systematically evaluate 32 LMMs on their effectiveness, robustness, and cultural contextualization. Our analysis reveals scale-driven improvements in performance and reliability, the impact of prompting strategies (chain-of-thought prompting has a two-sided effect, while visual retrieval-augmented prompts provide a consistent boost), and task-specific strategy preferences that depend on each LMM's visual capabilities. Although current LMMs are not yet capable of replacing domain expertise, they demonstrate promising potential to accelerate manuscript research through future human–AI collaboration.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/ClawCardMirror/MS-Bench/tree/main
Code URL: https://github.com/ianeong/MS-Bench
Supplementary Material: pdf
Primary Area: AI/ML Datasets & Benchmarks for social sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 1201