Keywords: Multilingualism and Cross-Lingual NLP, Resources and Evaluation, Semantic Parsing, Language Modeling
Abstract: Translating natural language to first-order logic (NL2FOL) is crucial for logic-based reasoning, but most existing models and benchmarks focus on English, leaving cross-lingual generalization underexplored. In this study, we present the first comprehensive investigation of cross-lingual robustness in NL2FOL. To support this, we introduce Multi-MALLS, the first multilingual benchmark for NL2FOL, which extends the widely used MALLS dataset with high-quality translations into multiple languages. Using Multi-MALLS, we observe that state-of-the-art models suffer a significant performance drop on non-English inputs, highlighting their lack of robustness. To address this, we propose MA-FOL, a multi-agent framework that improves generalization without requiring any training on target languages. By decomposing the task into three modules, a language-agnostic structure generator, a language-specific predicate combiner, and a refinement component, MA-FOL achieves robust zero-shot generalization across diverse linguistic inputs. Additionally, we show that traditional evaluation metrics, such as Exact Match, often fail to assess semantic correctness. To remedy this, we introduce LLM-Judged Semantic Equivalence (SE), a new metric that leverages large language models (LLMs) to evaluate whether generated formulas preserve the intended meaning. Extensive experiments demonstrate that MA-FOL outperforms strong baselines on Multi-MALLS without any multilingual fine-tuning, and the SE metric reveals semantic correctness that traditional metrics miss. Overall, our work provides a benchmark to test the robustness of NL2FOL, a framework to improve it, and a metric to evaluate it more effectively.
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Multilingualism and Cross-Lingual NLP, Resources and Evaluation, Semantic Parsing, Language Modeling
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Data resources
Languages Studied: English, Chinese, Japanese, Russian, French, German
Submission Number: 5754