This section reports performance across the three knowledge graph question-answering benchmarks we described in Section~\ref{sec:datasets}: Tri-REx, SimpleQuestions, and WebQSP.
WebQSP is a crucial benchmark, as none of our models were exposed to this dataset during training or validation.
A summary of the results is shown in Figure~\ref{fig:test_results}.




\begin{figure}
    \centering
    \includegraphics[width=1\linewidth]{images/final_results_hit_1-5.pdf}
    \caption{
        Test results for our baselines and models on the three datasets. Shaded and full bars indicate Hit@1 and Hit@5, respectively.
    }
    \label{fig:test_results}
\end{figure}

\subsection{Tri-REx}
\label{subsec:Tri-REx-test}
We first look at the performance on our main training dataset, Tri-REx.
Table~\ref{tab:Tri-REx_test_results} presents the test set performance, comparing it with our selected baselines.
Our approach substantially improves over the original LLM (28.1\% vs 3.8\% Hit@1).
The comparison with \textbf{ConceptFormer} favors our approach at both Hit@1 (28.1\% vs 16.8\%) and Hit@10 (63.5\% vs 36.9\%); from these initial results, it appears that injecting knowledge vectors scales with model size.
On this dataset, the \textbf{textualization} baseline outperforms our approach (33.9\% vs 28.1\% Hit@1).
However, our approach's competitive performance indicates that quantized representations can effectively capture and utilize factual knowledge.
Comparing our model to \textbf{LoRA-only} adaptation, which reaches 0.0\% Hit@1 on the test set, shows that our model generalizes to new knowledge.
The Tri-REx dataset, as discussed in Section~\ref{sec:datasets}, is designed to prevent any factual overlap between the training and test sets.
Models that rely solely on memorizing facts from the training data therefore cannot achieve good performance on the test set.
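
As a rough illustration of the trade-off against textualization, the following back-of-the-envelope calculation uses only the Hit@1 and AvgTokens values of KoRe-base and the textualization baseline in Table~\ref{tab:Tri-REx_test_results}. It assumes that AvgTokens is a fair proxy for per-query input cost, which is a simplification:
\[
\frac{28.1}{33.9} \approx 0.83,
\qquad
\frac{70.4}{252.2} \approx 0.28 .
\]
Under this assumption, KoRe-base retains roughly 83\% of the textualization Hit@1 while using roughly 28\% of its tokens.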
\begin{table}[h]
\centering
\caption{Tri-REx test set results: Performance of best KoRe configurations.}
\label{tab:Tri-REx_test_results}
\begin{tabular}{lccccr}
\toprule
Configuration & Hit@1 & Hit@3 & Hit@5 & Hit@10 & AvgTokens \\
\midrule
\multicolumn{6}{c}{\textbf{Baselines}}\\
Original LLM & 3.8 & 10.4 & 15.6 & 22.3 & 35.9\\
Textualization & \textbf{33.9} & \textbf{61.3} & \textbf{68.9} & \textbf{75.9} & 252.2\\
Only LoRA (Tri-REx) & 0.0 & 0.0 & 0.0 & 0.2 & 35.9 \\
ConceptFormer (20 CF variant) & 16.8 & -- & 31.2 & 36.9 & --\\
\midrule
\multicolumn{6}{c}{\textbf{KoRe}}\\
\textbf{KoRe-base} & \underline{28.1} & \underline{46.7} & \underline{54.7} & \underline{63.5} & 70.4 \\
\textbf{KoRe-QA}   & 24.3 & 40.6 & 47.7 & 56.9 & 70.4\\
\bottomrule
\end{tabular}
\end{table}










\subsection{SimpleQuestions}
\label{subsec:simplequestions-test}
The SimpleQuestions test set results highlight the effectiveness of our discrete knowledge representation approach.
Table~\ref{tab:simpleQA_test_results} presents the test set performance, comparing it with our selected baselines.
In the zero-shot transfer setting, \textbf{KoRe-base} (24.5\% Hit@1) substantially exceeds the original language model baseline (1.8\% Hit@1), suggesting that quantized knowledge graphs can effectively convey factual information without task-specific training.
The QA fine-tuned model performs even better (51.5\% Hit@1), surpassing all baselines at Hit@1, Hit@3, and Hit@5; only at Hit@10 does the textualization baseline remain ahead (82.5\% vs 77.8\%).
ConceptFormer is not present in this evaluation as it was not tested on this dataset in the original work.

A notable pattern emerges in the textualization baseline: Hit@1 is only 6.2\%, yet Hit@3 recovers sharply to 46.7\% and Hit@10 reaches 82.5\%. This wide gap is explainable by the exact-match evaluation we use: when presented with serialized triples, the LLM may produce a conversational preamble before stating the answer, penalizing it at rank 1 even though the correct entity is ranked highly. By contrast, KoRe's LoRA fine-tuning on discrete token prefixes may implicitly encourage more direct answer generation, yielding higher Hit@1 scores from the same underlying factual content. 
We leave a controlled analysis of this generation-style effect to future work.
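
To make the rank-based reading explicit, assume the standard definition of Hit@$k$ (the notation here is illustrative and not taken from our evaluation code): for a question set $Q$, let $r_q$ denote the rank of the first exact-match answer for question $q$. Then
\[
\mathrm{Hit@}k = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\left[ r_q \le k \right],
\qquad
\mathrm{Hit@}3 - \mathrm{Hit@}1 = \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\left[ 2 \le r_q \le 3 \right].
\]
Under this assumption, the textualization gap of $46.7 - 6.2 = 40.5$ percentage points corresponds to the share of SimpleQuestions test questions whose correct answer appears at rank 2 or 3 rather than at rank 1.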


\begin{table}[h]
\centering
\caption{SimpleQuestions test set results: Performance of best KoRe configurations.}
\label{tab:simpleQA_test_results}
\begin{tabular}{lccccr}
\toprule
Configuration & Hit@1 & Hit@3 & Hit@5 & Hit@10 & AvgTokens \\
\midrule
\multicolumn{6}{c}{\textbf{Baselines}}\\
Original LLM & 1.8 & 7.8 & 12.3 & 23.3 & 29.7\\
Textualization & 6.2 & \underline{46.7} & \underline{65.6} & \textbf{82.5} & 248.2 \\
Only LoRA (Tri-REx) & 0.8 & 0.9 & 1.0 & 1.4 & 29.7 \\
\midrule
\multicolumn{6}{c}{\textbf{KoRe}}\\
\textbf{KoRe-base} & \underline{24.5} & 36.1 & 39.9 & 46.9 & 64.2\\
\textbf{KoRe-QA}   & \textbf{51.5} & \textbf{67.1} & \textbf{72.1} & \underline{77.8} & 64.2\\
\bottomrule
\end{tabular}
\end{table}

\subsection{WebQSP}
\label{subsec:WebQSP-test}
The WebQSP evaluation provides valuable insights into the scalability and robustness of our quantized encoding scheme, as our models were never exposed to this dataset during any training stage.
Table~\ref{tab:WebQSP_test_results} reports the test set results for the baselines and our model.
Even without any exposure to WebQSP during training, \textbf{KoRe-base} achieves a Hit@1 of 23.8\%, outperforming all baselines, including a version of ConceptFormer specifically fine-tuned on this dataset (7.6\% Hit@1).
The QA fine-tuned variant of our model (\textbf{KoRe-QA}) reaches 51.4\% Hit@1 and 86.0\% Hit@10, far exceeding ConceptFormer and substantially surpassing the textualization baseline (1.8\% Hit@1, 41.7\% Hit@10).
Given that the textualization baseline consumes on average over \textbf{10$\times$ more tokens} per query (676.5 vs 62.9), these findings further support the token-efficient design of our discrete knowledge integration pipeline.
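
For reference, the token ratio follows directly from the AvgTokens column of Table~\ref{tab:WebQSP_test_results}; the same computation on the Tri-REx and SimpleQuestions tables is shown for comparison:
\[
\frac{676.5}{62.9} \approx 10.8,
\qquad
\frac{252.2}{70.4} \approx 3.6,
\qquad
\frac{248.2}{64.2} \approx 3.9 .
\]
The token overhead of textualization relative to KoRe is thus roughly three times larger on WebQSP than on the other two datasets.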
\begin{table}[h]
\centering
\caption{WebQSP test set results: Performance of best KoRe configurations.}
\label{tab:WebQSP_test_results}
\begin{tabular}{lccccr}
\toprule
Configuration & Hit@1 & Hit@3 & Hit@5 & Hit@10 & AvgTokens \\
\midrule
\multicolumn{6}{c}{\textbf{Baselines}}\\
Original LLM & 0.1 & 2.7 & 6.8 & 19.2 & 17.2 \\
Textualization & 1.8 & 12.4 & 22.8 & 41.7 & 676.5 \\
Only LoRA (Tri-REx) & 0.4 & 1.2 & 1.5 & 2.0 & 17.2 \\
ConceptFormer (QA fine-tuned on WebQSP) & 7.6 & -- & 28.3 & -- & -- \\
\midrule
\multicolumn{6}{c}{\textbf{KoRe}}\\
\textbf{KoRe-base} & \underline{23.8} & \underline{40.7} & \underline{49.1} & \underline{58.5} & 62.9\\
\textbf{KoRe-QA}   & \textbf{51.4} & \textbf{73.2} & \textbf{79.3} & \textbf{86.0} & 62.9\\
\bottomrule
\end{tabular}
\end{table}
