Adaptive Originality Filtering: Rejection‑Based Prompting and RiddleScore for Culturally Grounded Multilingual Riddle Generation

ACL ARR 2025 July Submission 1097 Authors

29 Jul 2025 (modified: 03 Sept 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: Language models are increasingly tested on multilingual creativity, which demands culturally grounded, abstract generation. Standard prompting methods often produce repetitive or shallow outputs. We introduce Adaptive Originality Filtering (AOF), a prompting strategy that enforces novelty and cultural fidelity via semantic rejection. To assess quality, we propose RiddleScore, a metric combining novelty, diversity, fluency, and answer alignment. AOF improves Distinct-2 (0.915 in Japanese), reduces Self-BLEU (0.177), and raises RiddleScore (up to +57.1% in Arabic). Human evaluations confirm gains in fluency, creativity, and cultural fit. Improvements vary by language, however: Arabic shows larger RiddleScore gains than Distinct-2 gains, while Japanese shows comparable changes on both metrics. Though focused on riddles, our method may apply to broader creative tasks. Overall, semantic filtering with composite evaluation offers a lightweight path to culturally rich generation without fine-tuning.
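To make the filtering step concrete, the sketch below illustrates rejection-based prompting as the abstract describes it: a candidate riddle is accepted only when its maximum cosine similarity to previously accepted riddles stays below a novelty threshold. This is a minimal sketch, not the paper's implementation; the `generate_riddle` callable, the 0.8 threshold, and the retry budget are illustrative assumptions, while MiniLM is the novelty embedder named in checklist item C4.

```python
# Minimal sketch of Adaptive Originality Filtering (AOF): reject candidates
# that are semantically too close to previously accepted outputs.
# The threshold (0.8), retry budget, and `generate_riddle` are assumptions.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # MiniLM, per checklist item C4

def aof_generate(prompt, generate_riddle, accepted, threshold=0.8, max_tries=5):
    """Return a riddle whose similarity to every accepted riddle is below threshold."""
    for _ in range(max_tries):
        candidate = generate_riddle(prompt)  # hypothetical LLM call
        if not accepted:
            return candidate  # nothing to compare against yet
        cand_emb = embedder.encode(candidate, convert_to_tensor=True)
        acc_embs = embedder.encode(accepted, convert_to_tensor=True)
        if util.cos_sim(cand_emb, acc_embs).max().item() < threshold:
            return candidate  # sufficiently novel: accept
    return None  # every candidate was rejected as a near-duplicate
```

In practice the rejected candidate could also be fed back into the prompt as a negative example; the sketch keeps only the accept/reject core.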
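Likewise, a hedged sketch of a RiddleScore-style composite, assuming four components normalized to [0, 1] and combined with equal weights; the paper's actual weighting and component estimators (e.g., MiniLM for novelty, BERTScore for answer alignment, per checklist item C4) are not reproduced here. Distinct-2 is included as one concrete diversity component.

```python
# Hedged sketch of a RiddleScore-style composite over novelty, diversity,
# fluency, and answer alignment. Equal weights are an assumption; the paper
# defines the actual combination.
def distinct_2(text):
    """Distinct-2: ratio of unique bigrams to total bigrams in a text."""
    tokens = text.split()
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0

def riddle_score(novelty, diversity, fluency, alignment,
                 weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of four components, each assumed to lie in [0, 1]."""
    return sum(w * c for w, c in zip(weights, (novelty, diversity, fluency, alignment)))
```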
Paper Type: Long
Research Area: Multilingualism and Cross-Lingual NLP
Research Area Keywords: Generation, Multilingualism and Cross‑Lingual NLP, Language Modeling, Efficient/Low‑Resource Methods for NLP, Resources and Evaluation, Stylistic Analysis and Argument Mining, Human-Centered NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data analysis
Languages Studied: English, Chinese, French, Arabic, Japanese
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Section 8 (Ethics Statement), under Misuse Risks and Interpretability, discusses the potential misuse of metaphorical riddles for spreading misinformation or cultural manipulation and advises caution in educational or psychological contexts.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Sections 2 and 3 cite creators of datasets (BiRdQA), models (GPT-4o, LLaMA 3.1, DeepSeek), and evaluation tools (MiniLM, BERTScore, Distinct-2, Self-BLEU). Full citations appear in the References section.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 8: Ethics Statement, under Data Privacy and Responsible Fine-Tuning, notes that fine-tuning used OpenAI’s API under terms compliant with licensing and safety guidelines. No proprietary data was included.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 8: Ethics Statement confirms that BiRdQA and models were used strictly for research, within the conditions allowed by the original dataset and APIs.
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Section 8: Ethics Statement, under Data Privacy, states that no personally identifiable or offensive content was used; riddles are anonymized and generalized.
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Appendix N and Section 3 provide documentation of evaluation data, language coverage (EN, ZH, JA, FR, AR), prompt types, and linguistic traits (e.g., metaphor, misdirection).
B6 Statistics For Data: Yes
B6 Elaboration: Appendix N and Section 3.3 report dataset sizes (e.g., 6,614 English and 8,751 Chinese riddles), evaluation splits, and prompt sampling setup.
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4 and Appendix L describe the fine-tuned GPT-4o, though exact GPU hours are not reported due to API use. Pretrained model sizes are standard (e.g., GPT-4o, LLaMA 3.1).
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 3.3 and Appendix L report temperature (0.7), prompt strategies, and tuning hyperparameters.
C3 Descriptive Statistics: Yes
C3 Elaboration: Sections 5–8 and Tables 1–5 provide means, deltas (e.g., percentage improvements in RiddleScore and Self-BLEU), and comparisons across runs.
C4 Parameters For Packages: Yes
C4 Elaboration: Appendix N lists use of packages such as MiniLM (for novelty), GPT-2.5 (for fluency), BERTScore, and spaCy/Stanza (for syntactic validity), along with settings.
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: Appendix P includes rubric prompts and Likert-scale instructions used in human evaluation (fluency, novelty, cultural fit, answerability).
D2 Recruitment And Payment: Yes
D2 Elaboration: Section 5.1 notes that annotators were native or proficient speakers. Payment details are not specified, but evaluators were linguistically qualified.
D3 Data Consent: Yes
D3 Elaboration: Section 8: Ethics Statement, under Human Evaluation and Metric Ethics, confirms that annotators gave consent and data was anonymized.
D4 Ethics Review Board Approval: No
D4 Elaboration: IRB approval was not sought because no sensitive personal data was collected and annotators participated voluntarily using anonymized, non-personal content.
D5 Characteristics Of Annotators: Yes
D5 Elaboration: Section 5 states annotators were native or expert speakers across five languages (EN, ZH, JA, FR, AR), though exact demographics were not recorded.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Use of AI assistants is described in Section 8: Ethics Statement, under Creative Attribution and AI Authorship. AI tools were used for code debugging and manuscript editing (e.g., phrasing, transitions). All substantive contributions were authored by the researchers.
Author Submission Checklist: Yes
Submission Number: 1097