Additional Submission Instructions: For the camera-ready version, please include the author names and affiliations, funding disclosures, and acknowledgements.
Track: Track 1: Original Research/Position/Education/Attention Track
Keywords: citation verification, retrieval-augmented systems, scientific information retrieval, BM25, SPECTER2, isotonic calibration, large language models, peer review tooling, retraction signals, OpenAlex, Semantic Scholar, FAISS
Abstract: Accurate citations are essential for reproducibility and cumulative scientific progress, yet citation errors remain common and rarely receive systematic scrutiny in automated reviewing workflows. We introduce CiteGuard, a fast and auditable citation verifier that combines high-coverage retrieval with scientific-domain embeddings and lightweight LLM adjudication. CiteGuard extracts every in-text citation, retrieves candidate sources via a BM25+SPECTER2 fusion, and computes an interpretable alignment score that aggregates DOI agreement, robust title similarity, SPECTER2 semantic similarity, and venue/year compatibility. The score is calibrated to a probability with isotonic regression, and only uncertain cases are escalated to a small language model for a deterministic judgment. Evaluated on RealCitationErrors-500 (500 arXiv/PMC papers; 7,221 citations; 813 errors), CiteGuard achieves paper-level F1=0.95 and citation-level P=0.82, R=0.97, F1=0.89±0.02 (95% cluster bootstrap over papers), outperforming strong retrieval and LLM baselines while maintaining high precision. Median end-to-end latency is 11.7 s per paper, with 18% of citations escalated; median per-review cost is USD 0.0028 under July 2025 small-LLM pricing. In a within-subject user study (n=28), participants preferred reviews augmented with CiteGuard in 72% of blinded comparisons (Wilcoxon signed-rank p=0.007, Cliff’s δ=0.62). An ablation analysis indicates that SPECTER2 and multi-hit retrieval primarily drive recall, while calibrated escalation improves precision. Performance declines on low-resource humanities texts (F1=0.76), motivating domain adaptation.
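The scoring, calibration, and escalation steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the CiteGuard implementation: the abstract does not give the aggregation formula, so the weighted sum, the `WEIGHTS` values, the `difflib`-based title similarity, and the `low`/`high` escalation band are all hypothetical. The isotonic fit is a plain pool-adjacent-violators step function written with the standard library only.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

# Hypothetical weights: the abstract names the four signals but not how they are combined.
WEIGHTS = {"doi": 0.4, "title": 0.3, "semantic": 0.2, "venue_year": 0.1}


def title_similarity(a: str, b: str) -> float:
    """Case- and whitespace-insensitive ratio in [0, 1]; a stand-in for 'robust title similarity'."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


def alignment_score(features: dict) -> float:
    """Weighted sum of per-signal scores, each assumed to lie in [0, 1]."""
    return sum(WEIGHTS[k] * features[k] for k in WEIGHTS)


def fit_isotonic(scores, labels):
    """Pool-adjacent-violators: fit a non-decreasing step function from score to P(citation correct)."""
    # Pool tied scores first so that equal scores always share one fitted value.
    groups = {}
    for s, y in zip(scores, labels):
        total, count = groups.get(s, (0.0, 0))
        groups[s] = (total + y, count + 1)
    blocks = []  # each block: [sum_of_labels, count, max_score_in_block]
    for s in sorted(groups):
        total, count = groups[s]
        blocks.append([total, count, s])
        # Merge backwards while the previous block's mean exceeds the last block's mean.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            t, c, top = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = top
    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def calibrate(score, thresholds, values):
    """Look up the calibrated probability for a raw score (step function, clamped at the top)."""
    idx = min(bisect_left(thresholds, score), len(values) - 1)
    return values[idx]


def decide(prob, low=0.2, high=0.8):
    """Accept, flag, or escalate; the band [low, high] is a hypothetical choice."""
    if prob >= high:
        return "ok"
    if prob <= low:
        return "error"
    return "escalate"  # only uncertain cases would go to the small language model


# Toy usage with made-up scores and labels (1 = citation verified as correct).
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
toy_labels = [0, 0, 1, 0, 0, 1, 1, 1, 1]
thresholds, values = fit_isotonic(toy_scores, toy_labels)
verdict = decide(calibrate(0.45, thresholds, values))
```

In this sketch the width of the escalation band trades LLM cost against accuracy: a narrower band escalates fewer citations but relies more heavily on the calibrated score alone. Library implementations of isotonic regression typically interpolate between fitted points rather than using a pure step function.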
Submission Number: 481