Grounding Terms from an Ontology for use in Autoformalization: Tokenization is All You Need

Published: 29 Aug 2025, Last Modified: 29 Aug 2025, NeSy 2025 - Phase 2 Poster, CC BY 4.0
Keywords: Autoformalization, Natural Language Processing, Ontology, LLM
TL;DR: Re-mapping how terms from a large ontology are tokenized greatly reduces syntax and type errors when an LLM autoformalizes natural language.
Abstract: Large Language Models (LLMs) have shown strong performance in translating natural language into programming languages such as Python or Java. For niche computer languages with limited training data, however, fine-tuning a base model is often necessary. A key challenge arises when the pretrained embeddings of natural-language terms interfere with the intended syntax and semantics of formal-language terms. This issue is especially pronounced in SUO-KIF, the logical language in which the Suggested Upper Merged Ontology (SUMO) is written. SUMO contains thousands of terms that closely resemble everyday English words, so models often produce syntactic errors or hallucinate non-existent terms due to conflicting embeddings learned during base training. This work introduces a tokenization-based technique to mitigate these issues: by altering how formal terms are tokenized, we decouple their embeddings from similar natural-language words, significantly reducing syntax errors and term hallucinations in the generated formal-language output.
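As an illustrative sketch (not code from the paper), the core idea can be realized with the Hugging Face transformers API: each ontology term is registered as a single new token under a distinguishing prefix, and the embedding matrix is resized so those tokens receive fresh embeddings that are independent of the pretrained subwords for look-alike English words. The `sumo_` prefix, the sample terms, and the GPT-2 base model below are assumptions chosen for illustration only.

```python
# Minimal sketch of term re-mapping via tokenization, assuming the
# Hugging Face transformers library. Details (prefix scheme, base
# model, term list) are hypothetical.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical sample of SUMO terms that collide with everyday English.
sumo_terms = ["Human", "Object", "Process", "instance", "subclass"]

# A prefix such as "sumo_" keeps the new tokens distinct from the
# natural-language vocabulary learned during base training.
new_tokens = [f"sumo_{t}" for t in sumo_terms]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix so each new token gets its own row,
# initialized independently of the pretrained word embeddings.
model.resize_token_embeddings(len(tokenizer))

# Each remapped term now surfaces as a single token rather than being
# split into subwords shared with ordinary English text.
print(tokenizer.tokenize("(sumo_instance JohnDoe sumo_Human)"))
```

Under this scheme, fine-tuning on SUO-KIF data updates the dedicated embeddings for the prefixed terms without interference from the embeddings of their English homographs, which is the decoupling effect the abstract describes.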
Track: Knowledge Graphs, Ontologies and Neurosymbolic AI
Paper Type: Short Paper
Resubmission: No
Software: https://github.com/ontologyportal
Publication Agreement: pdf
Submission Number: 54