Keywords: large language model, parallelization, systems for machine learning
TL;DR: We identify opportunities for parallelization within a single query's execution, contribute a benchmark dataset of such queries, and report initial results from a performance baseline.
Abstract: Optimization engines for LLM query serving typically focus on workloads with known structure, treating the query itself as a black box. In this work, we investigate extracting parallelization opportunities from individual queries that have decomposable subtasks. Using the LMSYS-Chat-1M dataset, we identify three query categories that are amenable to decomposition into parallel LLM calls and curate a dataset of these queries as a benchmark for this type of within-query parallelization. We develop a prototype system to parallelize these queries and report initial performance results, showing that parallelization can yield a 5x speedup over serial execution with comparable or even improved generation quality.
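The prototype itself is not shown on this page, but the core idea (decompose a query into independent subtasks, issue the LLM calls concurrently, then merge the partial answers) can be sketched roughly as follows. All names here, including the stubbed `llm_call`, are illustrative assumptions and not the authors' implementation:

```python
import asyncio

async def llm_call(prompt: str) -> str:
    """Stand-in for an async LLM completion call; here it only
    simulates generation latency so the sketch is runnable."""
    await asyncio.sleep(1.0)
    return f"answer to: {prompt}"

async def answer_decomposable_query(subtasks: list[str]) -> str:
    # Fan out: one concurrent LLM call per subtask, so end-to-end
    # latency approaches the slowest call rather than the sum of all calls.
    partials = await asyncio.gather(*(llm_call(s) for s in subtasks))
    # Fan in: a final call stitches the partial answers together.
    merged = "\n".join(partials)
    return await llm_call(f"Combine these partial answers:\n{merged}")

if __name__ == "__main__":
    subtasks = [
        "List pros of remote work.",
        "List cons of remote work.",
        "Suggest hybrid-work policies.",
    ]
    print(asyncio.run(answer_decomposable_query(subtasks)))
```

With three one-second calls plus a one-second merge, this sketch finishes in about two seconds instead of four, which is the latency effect within-query parallelization aims to exploit.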
Submission Number: 105