Keywords: Information Retrieval, Explainable AI (XAI), Retrieval-Augmented Generation (RAG), Web Quality, Human-centered NLP
Abstract: Webpages increasingly serve two audiences: *humans*, who judge credibility and usefulness, and *machines*, which surface pages in retrieval-augmented generation (RAG) pipelines. Yet it remains unclear how improving a page for human readers affects its visibility to dense retrievers. To study this question, we introduce WEBQX, a three-part framework built on the *WebQuality* dataset of 60k webpages annotated along five human-centric dimensions. The framework contains: (1) WEBQX-Estimator, which predicts perceived quality from structural HTML features and exposes feature-level weaknesses via SHAP explanations; (2) WEBQX-OptAgent, a two-agent LLM pipeline that performs targeted HTML rewrites guided by these explanations; and (3) WEBQX-RAGEval, a module that measures how SHAP-guided HTML edits affect dense retrievability.
Our experiments show that although SHAP-guided rewrites consistently improve predicted human quality, they systematically \emph{degrade} dense retrieval performance on both page- and index-level metrics.
Together, these results provide the first large-scale evidence of a structural misalignment between human-centered improvements and dense retrievability, highlighting the need for joint optimization strategies in RAG-mediated web access.
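For intuition, the quality-estimation step can be sketched as fitting a model on structural HTML features and reading off per-feature SHAP contributions to locate what drags a page's predicted quality down. The sketch below uses a linear model, for which exact SHAP values reduce to \(\phi_i = w_i (x_i - \mathbb{E}[x_i])\); the feature names and synthetic data are illustrative assumptions, not the paper's actual features or dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical structural HTML features (names are illustrative only).
feature_names = ["num_headings", "img_alt_ratio", "text_length_kb", "broken_link_count"]

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 4))
# Synthetic "perceived quality" signal: headings and alt-text help, broken links hurt.
y = 0.5 * X[:, 0] + 0.8 * X[:, 1] + 0.1 * X[:, 2] - 0.9 * X[:, 3]

model = LinearRegression().fit(X, y)

# For a linear model with independent features, SHAP values are exact:
# phi_i = w_i * (x_i - E[x_i]); no sampling approximation is needed.
page = np.array([2.0, 1.0, 5.0, 8.0])  # a hypothetical low-quality page
phi = model.coef_ * (page - X.mean(axis=0))

# The most negative contribution points to the feature to rewrite first.
weakest = feature_names[int(np.argmin(phi))]
print(weakest)
```

A tree ensemble with `shap.TreeExplainer` would play the same role in practice; the linear case is used here only because its SHAP values have a closed form.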
We will release the code and trained components for reproducibility: https://anonymous.4open.science/r/webqxaisq-B38F/README.md.
Paper Type: Long
Research Area: Information Extraction and Retrieval
Research Area Keywords: Information Retrieval and Text Mining, Human-Centered NLP, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: Chinese, English, French, German, Spanish
Submission Number: 3499