{
    "title": "Decontextualization, Everywhere: A Systematic Audit on PeerQA",
    "core_idea": "Quantify how decontextualization templates (e.g., prepending the paper title to a paragraph or sentence) affect retrieval and the downstream answerability and answer-generation tasks across PeerQA’s domains. Compare gains across BM25, dense retrievers, ColBERT, and cross-encoder rerankers at sentence vs. paragraph granularity.",
    "hypothesis": "Decontextualization consistently improves retrieval effectiveness, but the magnitude varies by domain and granularity; over-decontextualization can introduce lexical drift that hurts answer generation.",
    "why_it_matters": "PeerQA highlights decontextualization as a key lever; a principled audit establishes best-practice templates and prevents overfitting to specific retriever families.",
    "abstract": "PeerQA is a real-world scientific QA dataset derived from peer reviews with three tasks: evidence retrieval, unanswerable question classification, and answer generation. We propose a systematic audit of decontextualization (the practice of prepending structural cues such as titles to sentences or paragraphs) to quantify its effect on retrieval and downstream tasks. Using the official pipelines (BM25 via Pyserini, dense retrievers, ColBERT, and cross-encoder reranking) and PeerQA’s QA, papers, and qrels splits, we benchmark template variants at the sentence and paragraph levels. We further propagate retrieval differences to answerability and generation (ROUGE, AlignScore, Prometheus) and derive recommendations on when and how to decontextualize for long-document scientific QA.",
    "introduction": "Long-document QA in scientific literature requires effective retrieval of small but relevant spans. PeerQA provides author-verified evidence mappings and highlights that decontextualization often boosts retrieval. However, the optimal template, granularity, and retriever family remain unclear, and over-decontextualization may harm generative fidelity. This work delivers what is, to our knowledge, the first comprehensive, dataset-native audit of decontextualization across retrieval families, granularities (sentence vs. paragraph), and domains (ML/NLP vs. others). We also examine how retrieval gains translate to answerability and answer quality, with the aim of establishing decontextualization best practices for PeerQA.",
    "possible_methods": [
        "Template sweep for sentence/paragraph chunks: e.g., 'Title: {title} Sentence: {content}', 'Heading: {last_heading} Paragraph: {content}'",
        "Retriever comparisons: BM25 (Pyserini), dense (Contriever, GTE), ColBERTv2, cross-encoder rerankers (e.g., BGE-reranker)",
        "Downstream propagation to answerability (binary) and generation (free-form) using official scripts",
        "Domain-wise and granularity-wise stratified evaluation; significance testing"
    ],
    "experimental_design": [
        "Data loading: Use Hugging Face datasets to load 'UKPLab/PeerQA' splits: 'qa' (questions/labels), 'papers' or 'papers-all' (extracted text), and 'qrels-sentences'/'qrels-paragraphs' (relevance). Maintain a manifest of question_id ↔ paper_id ↔ qrels indices.",
        "Preprocessing: Generate multiple decontextualized views per unit (sentence/paragraph): minimal (content only), title+content, heading+content, title+heading+content. Persist processed corpora for each template and granularity.",
        "Retrieval: (a) BM25 via Pyserini indexes per template+granularity; (b) Dense retrieval using provided baselines (e.g., Contriever, GTE); (c) ColBERTv2 indexing/search; (d) Cross-encoder reranking on top-100 candidates. Evaluate Recall@{1,5,10,20,50}, nDCG@{10,20}.",
        "Downstream tasks: Feed top-k contexts (k∈{5,10,20,50,100}) into generate.py for (i) answerability prompts (full-text, RAG, gold) and (ii) answer generation prompts (full-text, RAG, gold). Evaluate answerability with generations_evaluate_answerability.py, and evaluate generation with generations_evaluate_rouge_alignscore.py and Prometheus.",
        "Stratified analysis: Report metrics by domain (ML/NLP vs. others), question length bins, and paper length bins. Compare sentence vs. paragraph granularities within each template and retriever.",
        "Significance + cost: Use bootstrap CIs and paired tests to compare templates. Log indexing time, query latency, and memory footprint to provide cost–quality trade-offs.",
        "Ablations: (i) Remove titles or headings to measure their marginal contribution; (ii) Test extreme templates (aggressive title duplication) to detect lexical drift that harms generation; (iii) Mix sentence and paragraph evidence for hybrid retrieval.",
        "Deliverables: Publish a configuration-driven runner (YAML) to reproduce all settings, metric dashboards per template, and a concise 'best practices' guide for PeerQA users."
    ]
}