SMILE: A Composite Lexical-Semantic Metric for Question-Answering Evaluation

ACL ARR 2025 July Submission144 Authors

24 Jul 2025 (modified: 28 Aug 2025), ACL ARR 2025 July Submission, CC BY 4.0

Abstract: Traditional evaluation metrics for textual and visual question answering, such as ROUGE, METEOR, and Exact Match (EM), focus heavily on n-gram-based lexical similarity and often miss the deeper semantic understanding needed for accurate assessment. While measures like BERTScore and MoverScore leverage contextual embeddings to address this limitation, they lack flexibility in balancing sentence-level and keyword-level semantics and ignore lexical similarity, which remains important. Large Language Model (LLM)-based evaluators, though powerful, come with drawbacks such as high cost, bias, inconsistency, and hallucination. To address these issues, we introduce SMILE: Semantic Metric Integrating Lexical Exactness, a novel approach that combines sentence-level semantic understanding with keyword-level semantic understanding and easy keyword matching. This composite method balances lexical precision and semantic relevance, offering a comprehensive evaluation. Extensive benchmarks across text, image, and video QA tasks show SMILE is highly correlated with human judgments and computationally lightweight, bridging the gap between lexical and semantic evaluation.
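
To make the idea of a composite lexical-semantic score concrete, here is a minimal illustrative sketch. It is not the SMILE implementation: the abstract does not give the formula, so the function names, the 0.5 weight, and the way the three components are combined are all assumptions. A bag-of-words cosine stands in for the embedding-based similarity that a real semantic metric would use.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sentence_similarity(prediction, reference):
    """Assumption: bag-of-words cosine as a stand-in for sentence embeddings."""
    return cosine(Counter(tokenize(prediction)), Counter(tokenize(reference)))


def keyword_exact_match(prediction, keyword):
    """Return 1.0 if the keyword's tokens appear contiguously in the prediction."""
    pred, key = tokenize(prediction), tokenize(keyword)
    if not key:
        return 0.0
    n = len(key)
    found = any(pred[i:i + n] == key for i in range(len(pred) - n + 1))
    return 1.0 if found else 0.0


def keyword_similarity(prediction, keyword):
    """Best soft match between the keyword and any same-length prediction window."""
    pred, key = tokenize(prediction), tokenize(keyword)
    if not key or not pred:
        return 0.0
    n = min(len(key), len(pred))
    key_counts = Counter(key)
    return max(
        cosine(Counter(pred[i:i + n]), key_counts)
        for i in range(len(pred) - n + 1)
    )


def composite_score(prediction, reference, keyword, weight=0.5):
    """Hypothetical combination: weighted sentence score plus averaged keyword scores."""
    sentence = sentence_similarity(prediction, reference)
    keyword_part = 0.5 * (
        keyword_similarity(prediction, keyword)
        + keyword_exact_match(prediction, keyword)
    )
    return weight * sentence + (1.0 - weight) * keyword_part
```

In this sketch, `weight` trades off sentence-level agreement against keyword-level agreement; for example, `composite_score("A cat sits on the mat", "The cat is on the mat", "mat")` rewards the prediction both for overlapping with the reference sentence and for containing the answer keyword verbatim.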

Paper Type: Long

Research Area: Resources and Evaluation

Research Area Keywords: Question answering evaluation, language evaluation, evaluation metrics, semantic similarity, lexical similarity

Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency

Languages Studied: English

Reassignment Request Area Chair: This is not a resubmission

Reassignment Request Reviewers: This is not a resubmission

A1 Limitations Section: This paper has a limitations section.

A2 Potential Risks: N/A

B Use Or Create Scientific Artifacts: Yes

B1 Cite Creators Of Artifacts: Yes

B1 Elaboration: We cite the creators of all benchmark datasets (see Section 5). Additionally, the code developed is original.

B2 Discuss The License For Artifacts: Yes

B2 Elaboration: We used publicly available datasets with open licenses. The code developed is original and will be released under an open-source license.

B3 Artifact Use Consistent With Intended Use: Yes

B3 Elaboration: The benchmark datasets used are publicly available under permissive research licenses.

B4 Data Contains Personally Identifying Info Or Offensive Content: N/A

B5 Documentation Of Artifacts: Yes

B5 Elaboration: Section 5 (Benchmarks and generator models) describes the datasets used and the domains they cover.

B6 Statistics For Data: Yes

B6 Elaboration: A detailed analysis of the data used for evaluation is provided in Section 5 of the paper.

C Computational Experiments: Yes

C1 Model Size And Budget: Yes

C1 Elaboration: Section 5 describes the models used in the experimental setup, along with their sizes (parameter counts).

C2 Experimental Setup And Hyperparameters: Yes

C2 Elaboration: Section 5 explains the experimental setup in detail, along with the ablation studies conducted to identify the best hyperparameters and configuration settings.

C3 Descriptive Statistics: Yes

C3 Elaboration: We present our results and statistics in tables and figures.

C4 Parameters For Packages: Yes

C4 Elaboration: We list the packages used, along with their configuration details.

D Human Subjects Including Annotators: Yes

D1 Instructions Given To Participants: Yes

D1 Elaboration: Human annotation details are explained in Section 5, with further detail in the Appendix (see Section B).

D2 Recruitment And Payment: N/A

D3 Data Consent: N/A

D4 Ethics Review Board Approval: N/A

D5 Characteristics Of Annotators: Yes

D5 Elaboration: Annotator details are given in Section 5.

E AI Assistants In Research Or Writing: Yes

E1 Information About Use Of AI Assistants: No

E1 Elaboration: We used AI assistants solely for initial brainstorming and minor rephrasing of non-technical text. No content generation, coding, analysis, or substantive scientific writing was performed by AI. Since the AI's involvement was limited and did not affect the research substance, we did not include this in the main text.

Author Submission Checklist: yes

Submission Number: 144
