LawngNLI: a multigranular, long-premise NLI benchmark for evaluating models’ in-domain generalization from short to long contexts
Abstract: Natural language inference (NLI) has trended, along with NLP more broadly, toward studying reasoning over long contexts, with several datasets moving beyond the sentence level. However, short-sequence models typically perform best despite their sequence limits. Because domain shifts between datasets confound comparisons, it has remained unclear whether long premises are truly needed at fine-tuning time to learn long-premise NLI. We construct LawngNLI, whose premises skew much longer than in existing NLI benchmarks and are multigranular: every long premise has a short version. LawngNLI is built from U.S. legal opinions, with automatic labels of high human-validated accuracy. Evaluating on its long-premise NLI task, we show that top performance is achieved only by fine-tuning on these long premises. Models fine-tuned only on existing datasets, and even models fine-tuned on our short premises (which derive from judge-selected relevant Entail excerpts in the source documents and thus control for domain), underperform considerably. The best results come from short-sequence models preceded by a standard retrieval step that filters each long premise, but even these underperform without fine-tuning on long premises as inputs. LawngNLI is also relevant to the legal community, as NLI is a principal cognitive task in developing cases and advice; models that perform well could double as retrieval or implication-scoring systems for legal cases.
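To make the retrieval-then-NLI setup concrete, below is a minimal sketch of the general technique the abstract describes, not the paper's exact pipeline. It assumes the `rank_bm25` package as the standard retrieval method, the public `roberta-large-mnli` checkpoint as a stand-in for a short-sequence NLI model (the paper's results indicate such a model should further be fine-tuned on LawngNLI's long premises), and a naive period-based sentence split.

```python
# Sketch: filter a long premise with BM25 against the hypothesis, then score
# the filtered pair with a short-sequence NLI model. Component choices
# (rank_bm25, roberta-large-mnli) are illustrative assumptions.
import torch
from rank_bm25 import BM25Okapi
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"  # placeholder short-sequence NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)
model.eval()


def filter_premise(premise: str, hypothesis: str, top_k: int = 5) -> str:
    """Keep the top_k premise sentences most relevant to the hypothesis."""
    sentences = [s.strip() for s in premise.split(".") if s.strip()]
    bm25 = BM25Okapi([s.lower().split() for s in sentences])
    scores = bm25.get_scores(hypothesis.lower().split())
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:top_k])  # restore original document order
    return ". ".join(sentences[i] for i in kept)


def nli_label(premise: str, hypothesis: str) -> str:
    """Run the short-sequence NLI model on the retrieval-filtered premise."""
    short_premise = filter_premise(premise, hypothesis)
    inputs = tokenizer(short_premise, hypothesis, truncation=True,
                       max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax())]
```

A fine-tuned variant of `nli_label` could also serve the legal use cases the abstract mentions: ranking candidate cases by entailment probability turns the same model into a retrieval or implication-scoring system.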
Paper Type: short