SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

ACL ARR 2025 July Submission144 Authors

24 Jul 2025 (modified: 28 Aug 2025), ACL ARR 2025 July Submission, CC BY 4.0
Abstract: Traditional evaluation metrics for textual and visual question answering—like ROUGE, METEOR, and Exact Match (EM)—focus heavily on n-gram based lexical similarity, often missing the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM) based evaluators, though powerful, come with drawbacks like high costs, bias, inconsistency, and hallucinations. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Question answering evaluation, language evaluation, evaluation metrics, semantic similarity, lexical similarity
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: We cited all benchmark dataset creators (see Section 5). Additionally, the code developed is original.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: We used publicly available datasets with open licenses. The code developed is original and will be released under an open-source license.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: The benchmark datasets used are publicly available under permissive research licenses.
B4 Data Contains Personally Identifying Info Or Offensive Content: N/A
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Sec 5 (Benchmarks and generator models) describes the datasets used and the domains they cover.
B6 Statistics For Data: Yes
B6 Elaboration: A detailed analysis of the data used for evaluation is provided in Sec 5 of the paper.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Sec 5 describes the models used in the experimental setup, along with their sizes (parameter counts).
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Sec 5 explains the experimental setup in detail, along with the ablation studies conducted to identify the best hyperparameters and configuration settings.
C3 Descriptive Statistics: Yes
C3 Elaboration: We present our results and statistics in tables and figures.
C4 Parameters For Packages: Yes
C4 Elaboration: We mention the packages used along with their configuration details.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Human annotation details are explained in Sec 5, with further details in the Appendix (see Sec B).
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Annotator details are given in Sec 5.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI assistants solely for initial brainstorming and minor rephrasing of non-technical text. No content generation, coding, analysis, or substantive scientific writing was performed by AI. Since the AI's involvement was limited and did not affect the research substance, we did not include this in the main text.
Author Submission Checklist: yes
Submission Number: 144