{
  "query": "I want to know about prompt compression.",
  "query-id": 21,
  "gold_response": "Prompt compression: methods, post-2024 directions, and how to use them\n\nWhy it matters\n- Goal: Reduce input tokens while preserving task-relevant information to cut latency and cost without hurting quality.\n- Where it shines: Retrieval-Augmented Generation (RAG), multi-turn conversations, large docs, tool-using agents with long scratchpads.\n- Measurable effects: compression ratio (CR = original_tokens / compressed_tokens), end-to-end latency, throughput (tokens/sec), cost ($/1K tokens), and task metrics (e.g., EM/F1 on QA, ROUGE for summarization).\n\nMethod families (with core ideas, algorithms, trade-offs, and sources)\n\n1) Information-theoretic/extractive (no model changes)\n- Idea: Drop low-information content given the model’s next-token probabilities. Use self-information I(x) = −log P(x) to rank units (tokens/phrases/sentences) and keep the surprising or query-relevant pieces.\n- Representative methods\n  • Selective Context: Scores tokens or spans by I(x) and prunes low-I content. Often adds query-aware weighting to preserve relevant spans.\n  • LLMLingua / LongLLMLingua (Microsoft): Use a small LM to compute token salience; budget controller allocates different compression rates to instruction/context/question; LongLLMLingua is query-aware and handles very long prompts.\n- Algorithm sketch\n  1) Split prompt into sections (instruction, retrieved context, examples, question).\n  2) For each section, compute token probabilities P(x) with a small LM; compute I(x) = −log P(x).\n  3) Reweight scores by query relevance (e.g., TF-IDF overlap or cross-encoder relevance). 
Optionally penalize stop-words and keep structure tokens.\n  4) Choose target budget B tokens and select top-scoring spans until sum ≤ B; optionally reorder to mitigate “lost in the middle.”\n- Reported effects (author reports; verify on your data)\n  • LongLLMLingua: 2–20× compression; 1.4–2.6× end-to-end speedups on long prompts; up to ~4× fewer tokens; improvements on long-context RAG.\n  • Selective Context: ~2× effective context expansion; ~30% latency and memory savings at ~50% compression in some settings.\n- Pros/cons\n  • Pros: Black-box compatible; no finetuning; predictable; easy to integrate into RAG.\n  • Cons: Extra precompute (compression overhead); may lose subtle discourse cues; quality depends on salience scoring.\n- Sources: LLMLingua/LongLLMLingua (Microsoft Research; paper + GitHub), Selective Context (paper + library).\n\n2) Learned/soft prompts (model-parameter methods)\n- Idea: Compress history into a small set of learned “summary” tokens the model can interpret.\n- Representative methods\n  • GIST: Learn a few “gist tokens” that represent previous context; the model conditions on these instead of raw history.\n  • AutoCompressor: Train virtual summary tokens to store segment information; can be used recurrently across chunks.\n- Objective sketch\n  • Train soft tokens g to minimize downstream loss while replacing large spans of text with g (e.g., minimize cross-entropy of target completion given g, or reconstruct key signals with an information bottleneck penalty).\n- Pros/cons\n  • Pros: Very compact (e.g., tens of tokens in place of thousands); fast at inference after training; good fit for stable domains.\n  • Cons: Requires model access/finetuning; domain shift hurts; less flexible across tasks.\n- Sources: GIST tokens (paper/preprint), AutoCompressor (technical report/blog).\n\n3) Data distillation + compressor training\n- Idea: Use a strong LLM to create (original → compressed) supervision, then train a lightweight compressor.\n- 
Representative methods\n  • LLMLingua-2: Distill keep/drop labels or compressed spans with GPT-4-class models; train a small encoder to predict token retention.\n  • RECOMP: Train extractive and abstractive compressors for RAG; extractive selects salient spans; abstractive fuses multi-doc evidence into compact summaries with citations.\n- Training sketch\n  1) Assemble corpus C and task queries Q.\n  2) Use a high-end LLM to produce compressed targets y* for (x, q). Include rationales/citations if possible.\n  3) Train a small model fθ to predict keep/drop or generate compressed y.\n  4) Deploy fθ as a fast preprocessor in front of your target LLM.\n- Reported effects\n  • LLMLingua-2: Higher quality at same CR vs earlier LLMLingua; strong speed/cost reduction with minimal quality loss.\n  • RECOMP: Improves RAG accuracy at equal or lower token budgets; abstractive compression yields the biggest savings at mild quality cost.\n- Pros/cons\n  • Pros: High quality/CR; query-aware; works with black-box target LLMs at inference.\n  • Cons: Requires data generation and training; maintenance for new domains.\n- Sources: LLMLingua-2 (Microsoft Research), RECOMP (UT Austin/University of Washington research paper, ICLR 2024).\n\n4) Token merging/pruning at inference (architectural/inference-time)\n- Idea: Reduce the number of active tokens processed through the transformer layers.\n- Representative methods\n  • ToMe (Token Merging): Merge similar token representations across layers to reduce compute.\n  • AdapLeR (Adaptive Length Reduction): Dynamically prune less-important tokens during inference.\n- Implementation notes\n  • Often integrated at the model runtime; may require custom kernels or modified inference engines.\n  • Complements prompt-side compression; can be combined with KV-cache optimizations.\n- Pros/cons\n  • Pros: Significant speedups with limited quality drop; orthogonal to prompt techniques.\n  • Cons: Requires architectural/runtime changes; harder to plug into closed APIs.\n- 
Sources: ToMe (ICLR 2023 paper + GitHub), AdapLeR (arXiv/preprint).\n\nPost‑2024 directions (what’s new/strong trends)\nNote: My training data ends in 2024-10, so I can’t cite 2025 papers directly. The most active post‑2024 directions you’ll see in recent preprints, conference talks, and engineering blogs:\n- Retriever–compressor co-training: Jointly train retrieval scoring with extractive/abstractive compressors to optimize end-task metrics under a strict token budget.\n- RL for budgeted compression: Reinforcement learning or bandit controllers that adapt CR per query/document, trading cost vs. accuracy on the fly.\n- Structure-preserving compression: Abstractive compressors that keep citations/IDs and schema (JSON) so answers can attribute sources precisely in RAG.\n- Multi-stage pipelines: Fast extractive shrink → light abstractive fuse → final instruction polish; often beats single-stage approaches at the same budget.\n- Inference-runtime fusion: Combining prompt compression with token pruning/early exit/speculative decoding in one scheduler to minimize tail latency.\n- Domain-specialized compressors: Small, finetuned compressors per domain (code, finance, biomed) that outperform general compressors at the same CR.\n\nHow to implement (ready-to-run)\n1) LLMLingua/LLMLingua‑2 (Python)\n- Install\n  pip install llmlingua\n- Minimal usage\n  from llmlingua import PromptCompressor\n  compressor = PromptCompressor(\n      model_name=\"microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank\",\n      use_llmlingua2=True,\n      device_map=\"cpu\"  # use \"cuda\" if available\n  )\n  text = \"\"\"\n  [instruction]\n  You are an LLM...\n  [context]\n  Long retrieved docs...\n  [question]\n  What are the key risk factors?\n  \"\"\"\n  out = compressor.compress_prompt(text, rate=0.4, force_tokens=[\"\\n\"], drop_consecutive=True)\n  print(out[\"compressed_prompt\"], out[\"origin_tokens\"], out[\"compressed_tokens\"], out[\"ratio\"])\n\n2) LangChain RAG integration 
(query-aware compression)\n- Install\n  pip install langchain langchain_community\n- Usage\n  from langchain.retrievers import ContextualCompressionRetriever\n  from langchain_community.document_compressors import LLMLinguaCompressor\n\n  base_retriever = your_vectorstore.as_retriever(search_kwargs={\"k\": 8})\n  compressor = LLMLinguaCompressor(\n      model_name=\"microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank\",\n      use_llmlingua2=True\n  )\n  compression_retriever = ContextualCompressionRetriever(\n      base_retriever=base_retriever,\n      base_compressor=compressor,\n  )\n  compressed_docs = compression_retriever.get_relevant_documents(\"What’s the mechanism of action?\")\n\n3) Selective Context (information-theoretic)\n- Install\n  pip install selective-context spacy\n  python -m spacy download en_core_web_sm\n- Usage\n  from selective_context import SelectiveContext\n  sc = SelectiveContext(model_type='gpt2', lang='en')\n  text = \"Your long document here...\"\n  context, reduced = sc(text, reduce_ratio=0.5)\n\n4) Evaluation harness (measure tokens, cost, latency)\n- Install\n  pip install tiktoken\n- Usage\n  import time, tiktoken\n\n  enc = tiktoken.get_encoding(\"cl100k_base\")\n  def count_tokens(s):\n      return len(enc.encode(s))\n\n  def evaluate(compressor, prompt, rate=0.4):\n      t0 = time.time()\n      out = compressor.compress_prompt(prompt, rate=rate)\n      t1 = time.time()\n      orig = count_tokens(prompt)\n      comp = count_tokens(out[\"compressed_prompt\"]) if isinstance(out, dict) else count_tokens(out)\n      cr = orig / max(comp, 1)\n      return {\n          \"orig_tokens\": orig,\n          \"compressed_tokens\": comp,\n          \"compression_ratio\": round(cr, 2),\n          \"compression_overhead_ms\": int((t1 - t0) * 1000),\n          \"$-saving@model\": f\"≈ {(orig - comp)/1000:.3f} x $/1K-tokens\"\n      }\n\nTuning tips (to hit SLOs)\n- Start with extractive (LLMLingua‑2) at CR 2–4× for RAG; increase 
gradually while watching exact-match/F1 on held-out queries.\n- Allocate budgets by section: keep instructions and question lightly compressed, compress retrieved context more aggressively.\n- Always preserve: headings, numbers, named entities, citations/URLs, tables-as-JSON; add them to force_tokens or span-preserve lists.\n- Cache compressed contexts per document hash; amortize overhead across many queries.\n- Combine with retrieval pruning (reduce k) before compression for even larger savings.\n\nTrade-offs to expect (rules of thumb)\n- Quality vs. CR: Extractive methods are reliable to ~3–6× CR with small drops on many QA/RAG tasks; abstractive can push further at higher risk.\n- Overhead vs. savings: Compression adds tens to hundreds of milliseconds; net win dominates once prompts exceed a few thousand tokens or you reuse compressed contexts.\n- Human readability: Compressed text is optimized for machines; keep a human-readable trace only for debugging/auditing.\n\nQuick comparison table\n| Family | Examples | Training needed | Works with closed LLMs | Typical use | Notes |\n|---|---|---|---|---|---|\n| Info-theoretic extractive | Selective Context, LLMLingua, LongLLMLingua | No | Yes | RAG, long chats | Fast, query-aware; modest overhead |\n| Learned/soft prompts | GIST, AutoCompressor | Yes | No (needs model access) | Stable domains, agents | Few learned tokens replace long history |\n| Distillation-trained | LLMLingua‑2, RECOMP | Yes (compressor only) | Yes | High-quality RAG | Best quality/CR; setup cost |\n| Token merging/pruning | ToMe, AdapLeR | Runtime mods | No (needs runtime) | Latency-critical serving | Pairs well with prompt compression |\n\nWhat to read (reliable sources)\n- LLMLingua and LongLLMLingua — Microsoft Research (paper + GitHub: microsoft/LLMLingua). Prompt compression with small LMs and budget controllers for long contexts.\n- LLMLingua‑2 — Microsoft Research (paper). 
Data-distilled compressor with improved quality at the same budget.\n- RECOMP — Research paper (ICLR 2024; UT Austin and University of Washington). Retrieve-then-compress for RAG with extractive and abstractive compressors.\n- Selective Context — Research paper and Python library. Information-theoretic pruning using self-information.\n- ToMe (Token Merging) — ICLR 2023 paper + GitHub (facebookresearch/ToMe). Token merging to accelerate vision transformers; variants applied to text models.\n- AdapLeR — arXiv/preprint. Adaptive length reduction during inference.\n\nIf you want, I can tailor a minimal, reproducible benchmark script around your corpus and target model to plot CR vs. accuracy vs. cost, and suggest a safe operating point.",
  "gold_information": [
    "Prompt compression reduces input tokens while preserving task-relevant information to lower latency and cost.",
    "It is especially useful for retrieval-augmented generation, long conversations, long documents, and tool-using agents.",
    "Performance is evaluated by compression ratio, latency, throughput, token cost, and task-specific quality metrics.",
    "Information-theoretic extractive methods drop low-information content using token probability estimates.",
    "These extractive methods score units by self-information and query relevance and select top spans under a token budget.",
    "Extractive methods are black-box compatible and require no finetuning.",
    "Extractive methods introduce preprocessing overhead and can lose subtle discourse cues.",
    "Learned soft prompts compress history into a small set of learned summary tokens.",
    "Soft-prompt approaches deliver very compact representations and fast inference after training.",
    "Soft-prompt approaches require model access and can degrade under domain shift.",
    "Distillation-based compressors are trained using supervision from a stronger model to predict what to keep or generate.",
    "Distilled compressors can be query-aware and achieve high quality at the same compression ratio.",
    "Distillation approaches require data generation, training, and maintenance for new domains.",
    "Inference-time token merging combines similar token representations to reduce compute.",
    "Inference-time adaptive pruning removes less important tokens during decoding.",
    "Inference-time reductions yield speedups but require runtime or architectural modifications.",
    "Retriever and compressor can be jointly trained to optimize accuracy under a token budget.",
    "Reinforcement controllers can adapt compression per query to balance cost and accuracy.",
    "Structure-preserving compression keeps citations, identifiers, and schema for faithful attribution.",
    "Multi-stage pipelines that combine extractive and abstractive steps often outperform single-stage compressors at the same budget.",
    "Integrating compression with early exit, speculative decoding, and token pruning can minimize tail latency.",
    "Domain-specialized compressors tailored to code, finance, or biomedicine outperform general compressors at fixed budgets.",
    "Allocate token budgets by section and compress retrieved context more aggressively than instructions or questions.",
    "Preserve headings, numbers, named entities, citations, URLs, and structured tables during compression.",
    "Cache compressed contexts by document hash to amortize preprocessing overhead.",
    "Reduce retrieval set size before compression to further cut tokens and latency.",
    "Extractive methods remain robust up to moderate compression ratios with small quality loss on many QA tasks.",
    "Abstractive compressors can reach higher compression at increased quality risk.",
    "Compression adds processing overhead but yields net savings for long prompts or reused contexts.",
    "Compressed text favors machine readability and may require separate human-readable traces for debugging."
  ]
}