Abstract:
Recent advances in large language models (LLMs) have opened up new directions for leveraging the collective expertise of multiple LLMs. These methods, such as Mixture-of-Agents, typically employ additional inference steps to generate intermediate outputs, which are then used to produce the final response. While multi-agent inference can enhance response quality, it can significantly increase the time to first token (TTFT), posing a challenge for latency-sensitive applications and hurting user experience. To address this issue, we propose staircase streaming for low-latency multi-agent inference. Instead of waiting for complete intermediate outputs from previous steps, we begin generating the final response as soon as partial outputs from those steps arrive. Experimental results demonstrate that staircase streaming reduces TTFT by up to 93% while maintaining response quality.
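Below is a minimal, illustrative sketch of the staircase idea, not the authors' implementation: simulated proposer agents stream partial outputs into a shared queue, and the aggregator begins emitting the final response on the first chunk rather than after all proposers finish. All names (`proposer`, `staircase_aggregate`), the chunk sizes, and the delays are assumptions made for the example; a real system would replace the simulated pieces with streaming LLM calls.

```python
import queue
import threading
import time

def proposer(name: str, text: str, out_q: queue.Queue,
             chunk_size: int = 16, delay: float = 0.05) -> None:
    """Simulated proposer LLM that streams its intermediate output in chunks."""
    for i in range(0, len(text), chunk_size):
        time.sleep(delay)                       # stand-in for token generation
        out_q.put((name, text[i:i + chunk_size]))
    out_q.put((name, None))                     # sentinel: this proposer is done

def staircase_aggregate(out_q: queue.Queue, num_proposers: int):
    """Start producing the final response as soon as the first partial
    chunk arrives, instead of waiting for all proposers to complete."""
    finished = 0
    while finished < num_proposers:
        name, chunk = out_q.get()
        if chunk is None:
            finished += 1
            continue
        # A real aggregator would condition another LLM call on the partial
        # context; forwarding chunks is enough to illustrate the TTFT gain.
        yield f"[{name}] {chunk}"

if __name__ == "__main__":
    drafts = {"agent-1": "first intermediate draft " * 4,
              "agent-2": "second intermediate draft " * 4}
    q: queue.Queue = queue.Queue()
    for name, text in drafts.items():
        threading.Thread(target=proposer, args=(name, text, q),
                         daemon=True).start()

    start = time.time()
    for i, piece in enumerate(staircase_aggregate(q, len(drafts))):
        if i == 0:
            print(f"TTFT: {time.time() - start:.2f}s")  # first output arrives early
        print(piece)
```

In this toy setup the first output appears after one chunk's delay rather than after every proposer's full draft, which is the latency effect the abstract describes.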
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: inference methods, NLP in resource-constrained settings, multi-modal dialogue systems
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 381