Keywords: robustness, knowledge acquisition, knowledge graphs, factual retrieval, multi-hop reasoning
TL;DR: We introduce RANK, a testbed for evaluating the robustness of knowledge acquisition, and use it to study the extent to which SFT and ICL enable language models to reason over newly acquired knowledge.
Abstract: Language models acquire vast knowledge during pretraining, but adding new knowledge to pretrained models often lacks robustness: models can retrieve individual facts yet struggle with multi-hop reasoning over newly acquired knowledge and its implications. To systematically study this robustness gap, we introduce RANK (Robust Acquisition of New Knowledge), a testbed that uses synthetic knowledge graphs to evaluate knowledge acquisition via $k$-hop reasoning tasks of increasing complexity. Our evaluation of supervised fine-tuning (SFT) and in-context learning (ICL) on RANK reveals that ICL performance degrades with reasoning complexity and knowledge scale, while models fine-tuned only on simple facts fail completely at multi-hop reasoning. However, we find that increasing training data diversity induces a sharp phase transition in fine-tuned models, from memorization to out-of-distribution generalization. More generally, RANK enables controlled experiments that yield insights into the robustness of knowledge acquisition.
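The abstract does not specify how $k$-hop tasks over synthetic knowledge graphs are constructed; the sketch below is a minimal, hypothetical illustration of the general idea (the function names, entity scheme, and relation set are assumptions, not RANK's actual code): a random graph of fictitious entities is generated, and a $k$-hop query is formed by chaining $k$ relations from a start entity.

```python
import random

# Illustrative sketch only (not the RANK implementation): build a random
# synthetic knowledge graph over fictitious entities and sample a k-hop
# reasoning query whose answer requires chaining k newly introduced facts.

def build_knowledge_graph(num_entities=50, relations=("parent", "mentor", "rival"), seed=0):
    rng = random.Random(seed)
    entities = [f"E{i}" for i in range(num_entities)]
    # Atomic facts as a map: (head entity, relation) -> tail entity.
    facts = {}
    for head in entities:
        for rel in relations:
            facts[(head, rel)] = rng.choice(entities)
    return entities, facts

def sample_k_hop_query(entities, facts, k, seed=0):
    rng = random.Random(seed)
    relations = sorted({rel for (_, rel) in facts})
    head = rng.choice(entities)
    path, current = [], head
    for _ in range(k):
        rel = rng.choice(relations)
        current = facts[(current, rel)]  # follow one edge of the graph
        path.append(rel)
    question = f"Starting from {head}, follow {' then '.join(path)}. Which entity do you reach?"
    return question, current  # the answer is the entity reached after k hops

if __name__ == "__main__":
    entities, facts = build_knowledge_graph()
    for k in (1, 2, 3):
        q, a = sample_k_hop_query(entities, facts, k, seed=k)
        print(f"{k}-hop | {q} -> {a}")
```

In this toy setup, 1-hop queries correspond to simple fact retrieval, while larger $k$ requires composing several newly introduced facts, which is the kind of multi-hop robustness the abstract describes.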
Serve As Reviewer: ~Harshay_Shah1
Submission Number: 51