Abstract: This paper introduces SummChat, a novel approach for improving token efficiency in conversational agents using dual LLMs and a virtual context. Targeting multi-round conversations, SummChat inserts a second, inexpensive LLM between the user and the main language model to act as a token-reduction model. This secondary model processes each user prompt before it reaches the main model, eliminating extraneous information while preserving enough context for the more capable main model to answer appropriately. The token-reduced prompt also remains comprehensible to a human observer, which facilitates downstream applications. Token reduction is complemented by a virtual context, which preserves the original user prompts in the conversational history so that the main model can retrieve specific user-provided information when needed. Together, these components preserve response quality across multi-round conversations.
Experimental results indicate an average response-quality degradation of only 2.05% in exchange for a 13.26% reduction in input token usage compared with the state of the art, an improvement of 12.4% in quality per 100 tokens. These results demonstrate SummChat's potential for balancing response quality and cost-effectiveness, providing a technique through which future work can leverage powerful LLMs more cost-efficiently.
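The abstract describes the architecture but not its implementation, so the following is only a minimal sketch of a dual-LLM pipeline with virtual context, under the assumption that the two models are injected as callables. All names here (`SummChatPipeline`, `Turn`, `virtual_context`, the `compress`/`answer` stubs) are illustrative stand-ins, not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Turn:
    original: str    # full user prompt, preserved as virtual context
    compressed: str  # token-reduced prompt actually sent to the main model
    response: str    # main-model answer for this round

@dataclass
class SummChatPipeline:
    compress: Callable[[str], str]      # inexpensive token-reduction LLM
    answer: Callable[[List[str]], str]  # main LLM, called on the running context
    history: List[Turn] = field(default_factory=list)

    def chat(self, user_prompt: str) -> str:
        # 1. The cheap model strips extraneous detail from the raw prompt.
        compressed = self.compress(user_prompt)
        # 2. The main model sees only compressed prompts plus prior
        #    responses, keeping the per-round input token count low.
        context: List[str] = []
        for turn in self.history:
            context.extend([turn.compressed, turn.response])
        context.append(compressed)
        response = self.answer(context)
        # 3. The original prompt is stored alongside the compressed one,
        #    so user-provided specifics are never lost from the history.
        self.history.append(Turn(user_prompt, compressed, response))
        return response

    def virtual_context(self, turn_index: int) -> str:
        # Retrieval hook: recover the uncompressed prompt from an earlier
        # round when the main model needs a detail the summary dropped.
        return self.history[turn_index].original


# Usage with stubs standing in for real LLM calls:
if __name__ == "__main__":
    pipeline = SummChatPipeline(
        compress=lambda p: p[:80],                         # stub token reducer
        answer=lambda ctx: f"({len(ctx)} context items)",  # stub main model
    )
    print(pipeline.chat("Please explain, in detail, how transformers work..."))
    print(pipeline.virtual_context(0))  # recover the full original prompt
```

In this reading, keeping `original` and `compressed` side by side in each turn is what makes the virtual context cheap: nothing is re-summarized, and full prompts are retrieved only on demand.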
Paper Type: long
Research Area: Efficient/Low-Resource Methods for NLP
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English