{
  "query": "What kinds of summarization models are there?",
  "query-id": 60,
  "gold_response": "AI Text Summarization: What kinds of models are there? 🤖\n\nAt a glance:\n- Extractive 🔍: selects important sentences/phrases from the original text. Typical methods: TF‑IDF, TextRank/LexRank, BERTSum.\n- Abstractive ✍️: understands and rewrites the gist in new words. Typical models: Seq2Seq+Attention/Pointer‑Generator, Transformers like BART, T5, PEGASUS (and long‑doc variants like LED/BigBird‑PEGASUS).\n\n1) Extractive summarization 🔍\n- What it does: “smart highlighter” — builds a summary by picking original sentences.\n- How it works:\n  • Frequency scoring (TF‑IDF/BM25): rank sentences by how many distinctive terms they contain; pick top‑k.\n  • Graph‑based ranking (TextRank, LexRank): build a sentence graph using similarity (e.g., cosine of TF‑IDF); PageRank‑style scoring picks central sentences; often add redundancy control (e.g., Maximal Marginal Relevance).\n  • Supervised neural extractors (BERTSum, SummaRuNNer, PacSum): encode the document (often with BERT/Transformers) and classify each sentence as “keep/skip”; then de‑duplicate and order.\n- Pros 👍:\n  • Faithful to source (low hallucination). Fast, simple, and transparent.\n  • Works well even with limited data; good baseline for long articles.\n- Cons 👎:\n  • Can read choppy because it stitches sentences. Cannot paraphrase or fuse dispersed info.\n\n2) Abstractive summarization ✍️\n- What it does: “creative rewriter” — generates new sentences that capture the core meaning.\n- How it works:\n  • Seq2Seq with Attention: an encoder reads the text; a decoder writes a summary token by token while “attending” to key parts. 
Pointer‑Generator variants can copy rare words from the source and reduce repetition via a coverage mechanism.\n  • Transformer‑based models:\n    - BART: denoising autoencoder, pre‑trained to reconstruct corrupted text, that produces fluent rewrites.\n    - T5: “text‑to‑text” model; add a prompt like \"summarize:\" and decode a summary.\n    - PEGASUS: pre‑trained for summarization using gap‑sentence prediction, often strong on news and scientific articles.\n    - Long‑doc options: LED (Longformer Encoder‑Decoder) and BigBird‑PEGASUS handle inputs of thousands of tokens.\n- Pros 👍:\n  • Natural, concise, human‑like summaries; can compress and combine information across sentences.\n- Cons 👎:\n  • May hallucinate or omit facts; heavier compute; slower than extractive for long inputs.\n\n3) Quick chooser 🧭\n- Want speed, faithfulness, and simple setup? Choose Extractive.\n- Want smoother, shorter, human‑style summaries (and can afford compute)? Choose Abstractive.\n- Many real systems use Hybrid pipelines: extract key bits first, then rewrite abstractively.\n\n4) Effective input and preprocessing 🧰\n- Data format (for training or organized datasets):\n  • Document–summary pairs. Commonly stored as JSONL lines like {\"document\": \"...\", \"summary\": \"...\"} or two aligned text files (src.txt, tgt.txt), one example per line.\n- Preprocessing essentials:\n  • Cleaning: remove HTML/boilerplate, ads, headers/footers, duplicate lines; normalize whitespace and quotes.\n  • Sentence segmentation: split clearly into sentences (crucial for extractive/graph methods); keep paragraph breaks if helpful.\n  • Tokenization:\n    - Classical extractive: word tokens; often use lowercasing, stop‑word removal, stemming/lemmatization to improve TF‑IDF/TextRank quality.\n    - Neural/Transformer: use the model's own subword tokenizer (WordPiece, BPE, or SentencePiece). 
Do NOT strip stop‑words or punctuation (these models were pre‑trained on text that includes them).\n  • Length handling:\n    - Truncation/padding to model limits (e.g., 512–4096+ tokens depending on model).\n    - For long texts: chunk into overlapping windows, summarize each chunk, then summarize the summaries (hierarchical). Keep original order and use separators between chunks.\n  • Redundancy control: de‑duplicate near‑identical sentences before extractive ranking; for abstractive decoding, use n‑gram blocking to avoid repeated phrases.\n  • Task hints: for models like T5, prepend a short instruction (e.g., \"summarize in 2 sentences:\"). You can also include desired length (\"~50 words\").\n\n5) Tiny example 🧪\nOriginal (3 sentences): \"Deep learning has rapidly advanced AI. In NLP, tools like summarization and translation are changing daily life. These advances come from models trained on large datasets.\"\n- Extractive (pick important sentences): \"Deep learning has rapidly advanced AI. In NLP, tools like summarization and translation are changing daily life.\"\n- Abstractive (rewrite concisely): \"Deep learning–driven NLP models, trained on massive data, are reshaping everyday life through tools like summarization and translation.\"\n\nThat’s the big picture: extractive (pick from text) vs. abstractive (rewrite the gist), with clear trade‑offs. If you share your use case or input length, I can suggest a concrete model and preprocessing recipe 😊",
  "gold_information": [
    "Summarization approaches fall into extractive and abstractive categories.",
    "Extractive summarization selects sentences or phrases directly from the source.",
    "Extractive frequency-based methods rank sentences by distinctive term usage.",
    "Extractive graph-based methods rank sentences by similarity centrality with redundancy control.",
    "Supervised neural extractive models classify sentences as keep or skip using contextual encoders.",
    "Extractive methods are fast, simple, transparent, and faithful to the source.",
    "Extractive methods can read choppy and cannot paraphrase or fuse dispersed information.",
    "Abstractive summarization generates new sentences that capture the core meaning.",
    "Sequence-to-sequence models with attention perform abstractive summarization.",
    "Copy and coverage mechanisms help abstractive models handle rare words and reduce repetition.",
    "Transformer-based encoder–decoder models are widely used for abstractive summarization.",
    "Long-document transformer variants handle inputs spanning thousands of tokens.",
    "Abstractive methods produce natural, concise summaries and can combine information across sentences.",
    "Abstractive methods may hallucinate or omit facts and require more compute.",
    "Many practical systems use hybrid pipelines that first extract and then rewrite.",
    "Training data typically consists of aligned document–summary pairs.",
    "Preprocessing includes cleaning boilerplate, headers, footers, and duplicate lines.",
    "Sentence segmentation is important for extractive and graph-based methods.",
    "Classical extractive tokenization uses word tokens with optional stop-word removal and stemming.",
    "Neural model tokenization uses subword units and retains stop-words and punctuation.",
    "Long texts can be summarized hierarchically by chunking and then summarizing the summaries.",
    "Redundancy can be reduced by de-duplication for extractive methods and n-gram blocking during decoding.",
    "Short instructions or length hints in the input can guide some models to produce the desired summary."
  ]
}