Keywords: Autoformalization, Natural Language Processing, Ontology, LLM
TL;DR: Re-mapping terms in a large ontology greatly reduces syntax and type errors during autoformalization of natural language with an LLM.
Abstract: Large Language Models (LLMs) have shown strong performance in translating natural
language into programming languages like Python or Java. However, for niche computer
languages with limited training data, fine-tuning a base model is often necessary.
A key challenge arises when the pretrained embeddings of natural language terms interfere
with the intended syntax and semantics of formal language terms. This issue is especially
pronounced in the logical language of SUO-KIF, which is used in the Suggested Upper
Merged Ontology (SUMO). SUMO contains thousands of terms that closely resemble everyday
English words. As a result, models often produce syntactic errors or hallucinate
non-existent terms due to conflicting embeddings learned during base training.
This work introduces a tokenization-based technique to mitigate these issues. By altering
how formal terms are tokenized, we decouple their embeddings from those of similar
natural language words, significantly reducing syntax errors and term hallucinations in the
generated formal language output.
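A minimal sketch of the tokenization idea described above, assuming a Hugging Face causal LM with a resizable embedding table. The model name, the sample term list, and the "sumo_" re-mapping scheme are illustrative assumptions, not the paper's exact setup: each SUMO term is registered as a dedicated token so the model learns a fresh embedding for it rather than reusing subword embeddings of the English word it resembles.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: any causal LM with a resizable embedding table; "gpt2" is a stand-in.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical sample of SUMO terms that collide with everyday English words.
sumo_terms = ["Process", "Object", "Human", "instance", "subclass"]

# Re-map each term to an opaque surface form so it tokenizes as a single new
# token instead of as subwords shared with the natural language word.
remapped = [f"sumo_{t}" for t in sumo_terms]
num_added = tokenizer.add_tokens(remapped)

# Grow the embedding matrix; the new rows are learned during fine-tuning,
# decoupled from the pretrained English-word embeddings.
model.resize_token_embeddings(len(tokenizer))
print(f"Registered {num_added} dedicated tokens for SUMO terms.")

At generation time, the re-mapped surface forms would be translated back to the original SUO-KIF term names before the output is checked against the ontology.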
Track: Knowledge Graphs, Ontologies and Neurosymbolic AI
Paper Type: Short Paper
Resubmission: No
Software: https://github.com/ontologyportal
Publication Agreement: pdf
Submission Number: 54