Minstrel: Application-Aware SLM Inference Optimization on Edge Devices

Published: 21 May 2025, Last Modified: 21 Jun 2025 · MLArchSys 2025 Oral · CC BY 4.0
Presentation: Virtual
Keywords: Edge inference, Application-aware optimizations, Small language models, Optimization zones
Presenter Full Name: Bakshree Mishra
TL;DR: We propose a prediction framework to systematically evaluate optimization strategies for edge-based small language model inference, considering that prompt and output length distributions may differ from cloud workloads.
Presenter Email: bmishra3@illinois.edu
Abstract: Large language models (LLMs) have permeated different fields of computing, including agentic systems and controllers. Recent literature has introduced smaller language models (SLMs) capable of running on edge hardware, unlocking opportunities to significantly impact human and computer interaction. Following trends in LLM inference optimization for data centers, optimization of SLM inference on edge devices focuses on independently accelerating the prefill or decode phases. However, we expect the tasks targeted for SLM inference not to follow the same input and output length distributions as remote LLM inference, necessitating a reevaluation of options for hardware and software optimizations. Further, previous work does not study the impact of its optimizations in the context of different downstream applications, and the benefits seen in their isolated evaluations are not generalizable. In this work, we present Minstrel, an application-aware optimization framework for SLM inference on edge hardware. Minstrel introduces a hybrid empirical and analytical model to predict the inference latency for an application given an SLM and hardware. Using Minstrel, we divide the application space into a prefill-dominated P-Zone and a decode-dominated D-Zone. Leveraging the two zones, we make the observation that, for a certain range of applications, optimizing the prefill phase is ineffective.
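The P-Zone/D-Zone split described in the abstract can be illustrated with a minimal roofline-style latency sketch: prefill is typically compute-bound (all prompt tokens processed in parallel), while decode is typically memory-bandwidth-bound (the model weights are re-read for each generated token). All function names and constants below are hypothetical illustrations, not Minstrel's actual model.

```python
def predict_latency(prompt_tokens, output_tokens,
                    flops_per_token, peak_flops,
                    bytes_per_token, mem_bw):
    """Toy first-order latency model (not Minstrel's actual model).

    Prefill: compute-bound, so latency scales with prompt FLOPs
    divided by peak compute throughput.
    Decode: memory-bound, so latency scales with bytes of weights
    streamed per generated token divided by memory bandwidth.
    """
    prefill_s = prompt_tokens * flops_per_token / peak_flops
    decode_s = output_tokens * bytes_per_token / mem_bw
    return prefill_s, decode_s


def zone(prompt_tokens, output_tokens, **hw):
    """Classify an application by which phase dominates total latency."""
    prefill_s, decode_s = predict_latency(prompt_tokens, output_tokens, **hw)
    return "P-Zone" if prefill_s > decode_s else "D-Zone"


# Illustrative edge-device parameters (hypothetical values).
hw = dict(flops_per_token=2e9, peak_flops=1e12,
          bytes_per_token=2e9, mem_bw=50e9)

# A long-prompt, short-output task (e.g. classification) is
# prefill-dominated; a short-prompt, long-output task (e.g. story
# generation) is decode-dominated.
print(zone(1000, 10, **hw))   # long prompt, short output
print(zone(10, 1000, **hw))   # short prompt, long output
```

Under this sketch, the abstract's observation follows directly: for applications landing in the D-Zone, accelerating prefill barely moves end-to-end latency.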
Presenter Bio: Bakshree Mishra is a fourth-year PhD student at the University of Illinois Urbana-Champaign, advised by Prof. Sarita Adve. She is interested in understanding the evolving computation and memory requirements in emerging application domains, and in designing flexible and heterogeneous hardware architectures.
Paper Checklist Guidelines: I certify that all co-authors have validated the presented results and conclusions, and have read and commit to adhering to the Paper Checklist Guidelines, Call for Papers and Publication Ethics.
YouTube Link: https://youtu.be/lvjLK4U4ZLQ
YouTube Link Poster: NA
Dataset Release: I certify that all co-authors commit to release the dataset and necessary scripts to reproduce the presented results.
Google Slides: https://docs.google.com/presentation/d/1odXvbe8zM_3i83GBpj6s2alEPjN69kB2dN0kSDRNfYM/edit?usp=sharing
Poster: No
Workshop Registration: Yes, the presenter has registered for the workshop.
YouTube Link Short: TBD
Submission Number: 19