Abstract: Optimization engines for LLM query serving typically focus on workloads with known structure, treating the query itself as a black box.
In this work, we investigate extracting parallelization opportunities from individual queries that contain decomposable subtasks. Using the LMSYS-Chat-1M dataset, we identify three query categories that are amenable to decomposition into parallel LLM calls, and curate a dataset of these queries as a benchmark for this type of within-query parallelization. We develop a prototype system that parallelizes such queries and report initial performance results, showing that parallelization can yield a 5x speedup over serial execution with comparable or even improved generation quality.
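The core idea of within-query parallelization — decomposing a single query into independent subtasks and issuing them as concurrent LLM calls — can be illustrated with a minimal sketch. The `call_llm` function below is a hypothetical stand-in for a real model API (the paper's actual decomposition prompts and serving system are not shown); the simulated latency is an assumption for demonstration only.

```python
import asyncio
import time

async def call_llm(subquery: str) -> str:
    # Hypothetical stand-in for a real LLM API call;
    # the fixed 0.1 s delay simulates generation latency.
    await asyncio.sleep(0.1)
    return f"answer to: {subquery}"

async def answer_parallel(subqueries: list[str]) -> list[str]:
    # Issue all decomposed subtasks concurrently and gather their results.
    return await asyncio.gather(*(call_llm(q) for q in subqueries))

# Example: a query decomposed into three independent subtasks.
subqueries = [
    "summarize document 1",
    "summarize document 2",
    "summarize document 3",
]

start = time.perf_counter()
parallel_answers = asyncio.run(answer_parallel(subqueries))
elapsed = time.perf_counter() - start
```

Serial execution of the three calls would take roughly the sum of their latencies (about 0.3 s here), while the concurrent version finishes in roughly the latency of a single call; the wall-clock gap is the source of the speedup the abstract reports.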
Paper Type: Short
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: inference efficiency, parallelization, query optimization
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 6215