Abstract: Optimization engines for LLM query serving typically focus on workloads with known structure, treating the query itself as a black box.
In this work, we investigate extracting parallelization opportunities from individual queries that contain decomposable subtasks. Using the LMSYS-Chat-1M dataset, we identify three query categories that are amenable to decomposition into parallel LLM calls, and curate a dataset of these queries as a benchmark for this type of within-query parallelization. We develop a prototype system that parallelizes such queries and report initial performance results, showing that parallelization can yield a 5x speedup over serial execution with comparable or even improved generation quality.
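The core idea of within-query parallelization — decomposing a single query into independent subtasks and issuing them as concurrent LLM calls — can be illustrated with a minimal sketch. The `call_llm` function below is a hypothetical stand-in for a real model API (the paper's actual decomposition prompts and serving system are not shown); the simulated latency is an assumption for demonstration only.

```python
import asyncio
import time

async def call_llm(subquery: str) -> str:
    # Hypothetical stand-in for a real LLM API call;
    # the fixed 0.1 s delay simulates generation latency.
    await asyncio.sleep(0.1)
    return f"answer to: {subquery}"

async def answer_parallel(subqueries: list[str]) -> list[str]:
    # Issue all decomposed subtasks concurrently and gather their results.
    return await asyncio.gather(*(call_llm(q) for q in subqueries))

# Example: a query decomposed into three independent subtasks.
subqueries = [
    "summarize document 1",
    "summarize document 2",
    "summarize document 3",
]

start = time.perf_counter()
parallel_answers = asyncio.run(answer_parallel(subqueries))
elapsed = time.perf_counter() - start
```

Serial execution of the three calls would take roughly the sum of their latencies (about 0.3 s here), while the concurrent version finishes in roughly the latency of a single call; the wall-clock gap is the source of the speedup the abstract reports.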
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: inference efficiency, parallelization, query optimization
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6215