Assessing the Difficulty of Inference Types in Natural Language Inference for Clinical Trials

ACL ARR 2025 May Submission 3647 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: LLMs achieve competitive results on Natural Language Inference (NLI) when applied to clinical trials; however, it is not yet clear on which types of inference LLMs perform well. We address this by proposing new supplementary annotations to the existing NLI4CT dataset covering the types of inference observed in clinical trials. Our dataset supplements NLI4CT with a total of 1,145 new annotations across our six inference types. To enhance explainability, we also provide the justifications associated with the labels for a sample of 50 statements. To determine on which inference types LLMs perform better or worse, we prompt Flan-T5, Llama, Mistral, and Qwen and investigate their performance on our newly annotated dataset. We observe that for Flan-T5 and MMed-Llama, the presence of biomedical inference has a positive impact on overall performance; for Mistral and MMed-Llama, common-knowledge inference has a negative impact; and for Flan-T5, numerical and linguistic inference have a negative impact. Our code is publicly available on GitHub, and the dataset on HuggingFace.
Paper Type: Short
Research Area: Semantics: Lexical and Sentence-Level
Research Area Keywords: natural language inference, textual entailment, clinical NLP
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 3647