The COLIEE 2025 Competition on Legal Information Extraction and Entailment: Overview, Discussion, and Dataset Expansion

Randy Goebel, Yoshinobu Kano, Mi-Young Kim, Calum Kwan, Juliano Rabelo, Ken Satoh, Hiroaki Yamada, Masaharu Yoshioka

Published: 2026, Last Modified: 28 May 2026. Rev. Socionetwork Strateg. 2026. CC BY-SA 4.0

Abstract: We summarize the 12th Competition on Legal Information Extraction and Entailment (COLIEE 2025). This edition included two tasks on case law, two tasks on statute law, and a new pilot task on tort law. The two case law tasks are an information retrieval task (Task 1) and the confirmation of an entailment relation between an existing case and an unseen case (Task 2). The statute law tasks are an information retrieval task (Task 3) and an entailment/question-answering task based on retrieved civil code statutes (Task 4). The new pilot task involves tort prediction and rationale extraction. Participation was open to any group using any approach. We summarize the variety of approaches, provide our official evaluation, and give a summary analysis of our data and submission results. Eight teams submitted a total of 21 runs for Task 1, achieving a top F1 score of 0.3604; the dominant approaches were multi-stage retrieval pipelines combining traditional IR with neural re-ranking methods. In Task 2, the NOWJ team, which used BM25 for retrieval and DeepSeek-V3 and Qwen/QwQ-32B for re-ranking, achieved the best score of 0.3195. Eight teams submitted a total of 22 runs for Task 3. The best-performing system employs a multi-stage retrieval approach: it first retrieves a limited number of candidate articles, then applies an LLM-based cross-encoder for re-ranking, and finally determines the relevant articles using multiple LLMs. This system achieves nearly perfect retrieval performance for questions with a single relevant article; however, it still struggles to retrieve all relevant articles for questions that have multiple relevant articles. Eleven teams submitted a total of 29 runs for Task 4, achieving a top accuracy of 0.9041 with a solution that couples an LLM with prompt engineering. Most teams used LLMs, but their approaches and the models used differed considerably.
The new pilot task received 10 runs from four teams, all of which employed LLMs. The best-performing runs were JAIST-LJPJT25 (accuracy 0.765) by the CAPTAIN team for the tort prediction task and KIS5 (F1 0.712) by the KIS team for the rationale extraction task. Finally, based on the strong performance observed in Tasks 3 and 4 this year, we propose introducing a new task for the next COLIEE, focusing on statute retrieval.
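Illustrative sketch (not taken from any participant's system): the multi-stage pattern described in the abstract, a lexical first stage followed by re-ranking of a small candidate set, can be outlined roughly as below. The BM25 scorer is a plain textbook variant written from scratch, and every name here (`BM25`, `overlap_reranker`, `retrieve_then_rerank`, the parameters `first_stage_k` and `final_k`) is hypothetical. The re-ranker is a simple token-overlap stand-in for the neural or LLM-based re-rankers the teams used, chosen only so the example runs without external models.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer; real systems would use a proper analyzer.
    return text.lower().split()


class BM25:
    """Textbook Okapi BM25 over a small in-memory collection (assumed setup)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.tf = [Counter(d) for d in self.docs]
        self.avgdl = sum(len(d) for d in self.docs) / len(self.docs)
        n = len(self.docs)
        df = Counter(t for d in self.docs for t in set(d))
        # Smoothed, non-negative IDF.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, i):
        dl = len(self.docs[i])
        total = 0.0
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[t] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        order = sorted(range(len(self.docs)),
                       key=lambda i: self.score(query, i), reverse=True)
        return order[:k]


def overlap_reranker(query, doc):
    # Placeholder for a neural/LLM re-ranker: Jaccard overlap of token sets.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if (q | d) else 0.0


def retrieve_then_rerank(query, docs, first_stage_k=3, final_k=1,
                         reranker=overlap_reranker):
    # Stage 1: cheap lexical retrieval narrows the candidate set.
    candidates = BM25(docs).top_k(query, first_stage_k)
    # Stage 2: a (normally more expensive) scorer reorders only the candidates.
    reranked = sorted(candidates, key=lambda i: reranker(query, docs[i]),
                      reverse=True)
    return reranked[:final_k]


docs = [
    "the tenant must pay rent to the landlord",
    "a contract requires offer and acceptance",
    "the landlord may terminate the lease if rent is unpaid",
    "damages for tort require negligence",
]
query = "landlord terminate lease unpaid rent"
result = retrieve_then_rerank(query, docs)
```

Any callable with the signature `reranker(query, doc) -> float` could be passed in place of `overlap_reranker`, for example a wrapper around a cross-encoder. Keeping `first_stage_k` small bounds how often that more expensive scorer is called.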

External IDs: dblp:journals/rss/GoebelKKKRSYY26a