Abstract: Large language models (LLMs) have become powerful tools for document understanding and question answering (QA). Their answers can be grounded consistently in the facts of the given documents by citing those documents in the generated responses. Several approaches to Retrieval Augmented Generation (RAG) have been proposed that incorporate citations to relevant documents to enhance correctness and verifiability. However, evaluating whether a document is cited accurately relies heavily on large generative models used for Natural Language Inference. In this work, we evaluate various models under different evaluation schemes for the citation verification task to provide insights into how these models perform and in which evaluation schemes they excel. Our findings show that the TRUE T5 model performs well in verifying the completeness of citations, but struggles when only partial information is available. We also demonstrate that general LLMs can perform citation verification effectively, although their results when adding citations to an already generated answer as a post-processing step remain suboptimal. We argue that it is important to be mindful of how citation verifiers are used and to understand their strengths and limitations. Furthermore, we trained a small and lightweight model, CiteVerifier, which performs exceptionally well despite being orders of magnitude smaller than other models, making it an ideal solution for low-resource settings.
External IDs: dblp:conf/iccS/WojtasikDP25