Abstract: This paper presents a controlled comparison of prompting strategies for COLIEE Task 4 (legal entailment) under a unified experimental setting. Using a Japanese instruction-following LLM, we compare zero-shot (ZS), zero-shot Chain-of-Thought (ZS-CoT), and structured Chain-of-Thought (Structured-CoT) prompting. Across seven test years (2018–2024), the overall (sample-weighted) accuracy improves along the progression ZS \(\rightarrow \) ZS-CoT \(\rightarrow \) Structured-CoT, suggesting that prompt structure, rather than model changes, drives the gain in this setting. As a secondary sensitivity analysis, we also report few-shot prompting with similarity-based retrieval and label balancing while varying the number of shots n. Few-shot effects vary by year and by n, and we do not observe a consistent monotonic improvement as n increases.
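To make the term "sample-weighted" concrete, the following is a minimal sketch of how an overall accuracy pooled across test years differs from an unweighted mean of per-year accuracies. The function names, the `(num_correct, num_total)` input layout, and the toy counts are all assumptions made for illustration; they are not taken from the paper and do not reproduce its results.

```python
from typing import Dict, Tuple

# Hypothetical input layout: label -> (num_correct, num_total).
YearCounts = Dict[str, Tuple[int, int]]


def sample_weighted_accuracy(per_year: YearCounts) -> float:
    """Pool all samples, so years with more questions weigh more."""
    total_correct = sum(correct for correct, _ in per_year.values())
    total = sum(n for _, n in per_year.values())
    if total == 0:
        raise ValueError("no samples to score")
    return total_correct / total


def macro_accuracy(per_year: YearCounts) -> float:
    """Unweighted mean of per-year accuracies, shown for contrast."""
    accs = [correct / n for correct, n in per_year.values() if n > 0]
    if not accs:
        raise ValueError("no samples to score")
    return sum(accs) / len(accs)


# Toy counts for illustration only (not results from the paper).
toy = {"year_a": (60, 100), "year_b": (40, 50)}
weighted = sample_weighted_accuracy(toy)  # (60 + 40) / (100 + 50)
macro = macro_accuracy(toy)               # (60/100 + 40/50) / 2
```

The two aggregates diverge whenever test years differ in size, which is why it matters that the overall figure is stated as sample-weighted.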
External IDs: dblp:journals/rss/OnagaK26