FRAPPE: Fast RAG-Inspired Prompt Evaporator

Submitted: 26 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · License: CC BY 4.0
Keywords: Compression, Prompt engineering, Efficient LLM Inference, Toxicity reduction, Task-agnostic, Summarization
TL;DR: FRAPPE is a new algorithm that efficiently compresses input prompts for LLMs, reducing cost, latency, and toxicity while maintaining performance.
Abstract: Large Language Models (LLMs) have demonstrated remarkable performance on various tasks such as multi-document QA, summarization, and text classification. This has been achieved in part by recent advancements in prompt engineering and in-context learning (ICL), which enable LLMs to consume tens of thousands of input tokens as supporting context for a given query. However, long contexts incur higher computational costs, longer latency, and potential performance degradation. To address these issues, we propose a task-agnostic and efficient approach called “Fast RAG-Inspired Prompt Evaporator”, or FRAPPE, which significantly reduces LLMs’ latency, memory requirements, and computation by compressing input tokens. Unlike many other proposed approaches for prompt compression, our method does not rely on a large model for computing conditional probabilities, and its data preparation is fast with negligible memory requirements. In particular, our approach first pre-processes the input data, then categorizes and ranks phrases based on their informativeness, and finally selects the highest-ranked phrases to generate a highly compressed, extractive input. We show the efficacy of our approach through a comprehensive set of experiments on public datasets and benchmarks. For instance, on the summarization task of the MeetingBank dataset, at a compression rate of 70%, our approach achieves performance similar to the full context while performing compression up to 4 times faster than contemporary state-of-the-art compression algorithms. We extend FRAPPE to create the Context-Aware FRAPPE algorithm, which incorporates task-specific information when ranking phrases and further improves the performance of downstream tasks on compressed text. Additionally, we demonstrate that FRAPPE can reduce toxicity by close to 50% relative to the original text by removing extraneous vitriolic phrases, in contrast to other compression methods, which often increase toxicity.
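The abstract describes FRAPPE as a three-step extractive pipeline: pre-process the input, rank phrases by informativeness, and keep only the top-ranked phrases. The following minimal Python sketch illustrates that shape of pipeline only; the function name frappe_compress and the inverse word-frequency informativeness score are assumptions made here for illustration, since the submission does not disclose its actual pre-processing or ranking functions.

# Illustrative sketch of the pipeline described in the abstract:
# pre-process -> split into phrases -> rank by informativeness -> keep top phrases.
# The scoring below (average inverse word frequency) is an assumption for
# illustration; the paper does not specify its ranking function.
import re
from collections import Counter

def frappe_compress(text: str, compression_rate: float = 0.7) -> str:
    """Return an extractive compression keeping ~(1 - compression_rate) of phrases."""
    # 1. Pre-process: split the input into candidate phrases on punctuation.
    phrases = [p.strip() for p in re.split(r"[.,;:!?]", text) if p.strip()]
    if not phrases:
        return text

    # 2. Score phrases: rarer words are treated as more informative (assumption).
    word_counts = Counter(w.lower() for p in phrases for w in p.split())
    def informativeness(phrase: str) -> float:
        words = phrase.split()
        return sum(1.0 / word_counts[w.lower()] for w in words) / max(len(words), 1)

    # 3. Select the highest-ranked phrases, then restore original order
    #    so the output stays extractive and readable.
    keep_n = max(1, round(len(phrases) * (1.0 - compression_rate)))
    ranked = sorted(range(len(phrases)),
                    key=lambda i: informativeness(phrases[i]),
                    reverse=True)[:keep_n]
    return ". ".join(phrases[i] for i in sorted(ranked)) + "."

print(frappe_compress("The quarterly meeting covered budget approvals. "
                      "Attendees discussed the new data pipeline architecture. "
                      "Lunch was served at noon.", compression_rate=0.34))

On the toy example, the highest-scoring phrases are emitted in their original order, which matches the extractive behavior the abstract describes; the real method would differ in how phrases are categorized and scored.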
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 7348
