ScreenshotLegalBench: A Multimodal Benchmark for Legal Evidence Understanding in Chat Screenshots

ACL ARR 2025 May Submission 4449 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Chat screenshots from platforms such as WeChat are increasingly used as legal evidence in Chinese civil litigation. However, their informal layout, multimodal nature, and lack of structure pose significant challenges for automated understanding. We introduce \textbf{ScreenshotLegalBench}, the first large-scale multimodal benchmark for \textit{Legal Screenshot Evidence Understanding (LSEU)}. It supports two key tasks: (1) structured key information extraction and (2) legal visual question answering (VQA). The dataset contains over 4,600 chat screenshots annotated with 145,044 structured labels, a 143-image evaluation set with 2,678 verified annotations, and 1,176 VQA instances covering evidence relevance, format validity, and legal reasoning. Among the VQA instances, 106 cases involve multi-image cause-of-action scenarios. We benchmark several open-source vision-language models (VLMs), including the InternVL and Qwen-VL families. Experimental results show that current VLMs struggle with layout interpretation and domain-specific reasoning, despite instruction tuning. \textbf{ScreenshotLegalBench} offers a novel and scalable resource at the intersection of vision, language, and law, enabling future research on multimodal legal document understanding in real-world settings. The dataset and code will soon be publicly available at \Github.
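To make the two tasks concrete, the sketch below shows what a single benchmark record could look like: one structured key-information extraction target and one VQA instance for the same screenshot. This is a minimal illustrative example only; the field names (`sender`, `timestamp`, `question_type`, etc.) and values are assumptions and do not reflect the benchmark's actual annotation schema.

```python
# Hypothetical example of the two annotation types described in the abstract.
# Field names and values are illustrative assumptions, not the real schema.

extraction_example = {
    "image": "screenshot_0001.png",
    "platform": "WeChat",                      # chat app shown in the screenshot
    "messages": [                              # structured key information per bubble
        {"sender": "Party A", "timestamp": "2024-03-01 14:02", "text": "I will repay the 5,000 yuan by Friday."},
        {"sender": "Party B", "timestamp": "2024-03-01 14:05", "text": "Okay, please transfer it to my account."},
    ],
}

vqa_example = {
    "image": "screenshot_0001.png",
    "question_type": "evidence_relevance",      # e.g. relevance / format validity / legal reasoning
    "question": "Does this conversation support the claim that Party A acknowledged a debt?",
    "answer": "Yes, Party A explicitly promises to repay 5,000 yuan.",
}
```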
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: Multimodal Learning, Legal NLP, Vision-Language Models, Dataset
Contribution Types: Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chinese
Submission Number: 4449