SCALE: Scaling up the Complexity for Advanced Language Model Evaluation

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: legal, domain-specific, swiss, multilingual, long documents, multi-task, dataset, benchmark, evaluation, large language model
TL;DR: SCALE: A comprehensive multilingual multitask legal benchmark for long document processing
Abstract: Recent strides in Large Language Models (LLMs) have saturated many NLP benchmarks (even professional domain-specific ones), emphasizing the need for more challenging benchmarks to properly assess LLM capabilities. In this work, we introduce a novel NLP benchmark that poses challenges to current LLMs across four key dimensions: processing long documents (up to 50K tokens), using domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), and multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks). Our benchmark contains diverse legal NLP datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual, federal legal system. Despite recent advances, efficient processing of long documents for intensive review and analysis tasks remains an open challenge for LLMs. In addition, comprehensive, domain-specific benchmarks requiring high expertise to develop are rare, as are multilingual benchmarks. This scarcity underscores our contribution's value, considering that most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. Our benchmark enables testing and advancing state-of-the-art LLMs. As part of our study, we evaluate several pre-trained multilingual language models on our benchmark to establish strong baselines as a point of reference. Despite the large size of our datasets (tens to hundreds of thousands of examples), existing publicly available models struggle with most tasks, even after extensive in-domain pre-training. We publish all resources (benchmark suite, pre-trained models, code) under a fully permissive open CC BY-SA license.
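The long-document challenge mentioned in the abstract (inputs of up to 50K tokens, far beyond the context windows of many encoder models) is commonly handled by splitting a document into overlapping windows and aggregating per-window predictions. The sketch below is a generic illustration of that standard technique, not code from this submission; the function name and the window/stride defaults are hypothetical choices:

```python
def chunk_tokens(tokens, window=512, stride=384):
    """Split a long token sequence into overlapping fixed-size windows.

    Illustrative sketch only: `window`/`stride` values are hypothetical,
    not parameters from the SCALE benchmark. Overlap (window - stride
    tokens) lets context near chunk boundaries appear in two windows.
    """
    if len(tokens) <= window:
        return [tokens]
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + window])
        # Stop once the current window reaches the end of the sequence.
        if start + window >= len(tokens):
            break
        start += stride
    return chunks
```

For document-level classification, each window would be scored by the model separately and the window scores pooled (e.g., max or mean) into a single document prediction; this trades global context for tractability, which is one reason such tasks remain hard at 50K tokens.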
Supplementary Material: zip
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6493