Abstract: Understanding user intents from UI interaction trajectories remains a challenging yet crucial frontier in intelligent agent development. While massive, datacenter-based multi-modal large language models (MLLMs) have the capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference. In this paper, we address these limitations with a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action; second, we apply a fine-tuned intent extraction model to the aggregated summaries. Remarkably, this method enables resource-constrained models not only to improve intent understanding but also to surpass the base performance of large MLLMs.
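The abstract describes a two-stage decomposition but no implementation details; the following is a minimal, illustrative Python sketch of that data flow only. The `UIAction` fields, `summarize_action`, `extract_intent`, and `intent_model` interfaces are assumptions for illustration and are not taken from the submission.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIAction:
    """One step of a UI interaction trajectory (hypothetical schema)."""
    event_type: str    # e.g. "tap", "type", "scroll"
    target_text: str   # visible text of the interacted element, if any


def summarize_action(action: UIAction) -> str:
    """Stage 1 (sketch): structured summary of a single user action.

    In practice this would be produced by a small on-device model; a
    template stands in here to show the intermediate representation.
    """
    return f"{action.event_type} on '{action.target_text}'"


def extract_intent(summaries: List[str], intent_model: Callable[[str], str]) -> str:
    """Stage 2 (sketch): fine-tuned intent extraction over aggregated summaries."""
    prompt = "Infer the user's intent from these actions:\n" + "\n".join(
        f"{i + 1}. {s}" for i, s in enumerate(summaries)
    )
    return intent_model(prompt)  # stand-in for any small-model inference call


def infer_intent(trajectory: List[UIAction], intent_model: Callable[[str], str]) -> str:
    """Decomposed pipeline: summarize each action, then extract the intent."""
    summaries = [summarize_action(a) for a in trajectory]
    return extract_intent(summaries, intent_model)
```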
Paper Type: Long
Research Area: Generation
Research Area Keywords: efficient models, model architectures, inference methods, UI-to-text generation
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 1418