Keywords: Autoformalization, Natural Language Processing, Ontology, LLM
TL;DR: Re-mapping terms in a large ontology greatly reduces syntax and type errors during autoformalization of natural language with an LLM.
Abstract: Large Language Models (LLMs) have shown strong performance in translating natural
language into programming languages like Python or Java. However, for niche computer
languages with limited training data, fine-tuning a base model is often necessary.
A key challenge arises when the pretrained embeddings of natural language terms interfere
with the intended syntax and semantics of formal language terms. This issue is especially
pronounced in the logical language of SUO-KIF, which is used in the Suggested Upper
Merged Ontology (SUMO). SUMO contains thousands of terms that closely resemble everyday
English words. As a result, models often produce syntactic errors or hallucinate
non-existent terms due to conflicting embeddings learned during base training.
This work introduces a tokenization-based technique to mitigate these issues. By altering
how formal terms are tokenized, we decouple their embeddings from those of similar
natural language words, significantly reducing syntax errors and term hallucinations in the
generated formal language output.
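A minimal sketch of the tokenization idea described above, assuming a Hugging Face causal LM with a resizable embedding table. The model name, the sample term list, and the "sumo_" re-mapping scheme are illustrative assumptions, not the paper's exact setup: each SUMO term is registered as a dedicated token so the model learns a fresh embedding for it rather than reusing subword embeddings of the English word it resembles.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any causal LM with a resizable embedding table; "gpt2" is a stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical sample of SUMO terms that collide with everyday English words.
sumo_terms = ["Process", "Object", "Human", "instance", "subclass"]

# Re-map each term to an opaque surface form so it tokenizes as a single new
# token instead of as subwords shared with the natural language word.
remapped = [f"sumo_{t}" for t in sumo_terms]
num_added = tokenizer.add_tokens(remapped)

# Grow the embedding matrix; the new rows are learned during fine-tuning,
# decoupled from the pretrained English-word embeddings.
model.resize_token_embeddings(len(tokenizer))
print(f"Registered {num_added} dedicated tokens for SUMO terms.")

At generation time, the re-mapped surface forms would be translated back to the original SUO-KIF term names before the output is checked against the ontology.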
Track: Knowledge Graphs, Ontologies and Neurosymbolic AI
Paper Type: Short Paper
Resubmission: No
Software: https://github.com/ontologyportal
Publication Agreement: pdf
Submission Number: 54