Abstract: Understanding user intents from UI interaction trajectories remains a challenging, yet crucial, frontier in intelligent agent development.
While massive, datacenter-based, multi-modal large language models (MLLMs) possess greater capacity to handle the complexities of such sequences, smaller models, which can run on-device to provide a privacy-preserving, low-cost, and low-latency user experience, struggle with accurate intent inference.
We address these limitations by introducing a novel decomposed approach: first, we perform structured interaction summarization, capturing key information from each user action; second, we perform intent extraction using a fine-tuned model operating on the aggregated summaries.
This method improves intent understanding in resource-constrained models, even surpassing the base performance of large MLLMs.
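A minimal sketch of the decomposed pipeline described in the abstract is given below. The helper names, prompt wording, and the generic `generate` callable are illustrative assumptions, not the paper's actual implementation or prompts.

```python
# Illustrative sketch of the two-stage pipeline (hypothetical helpers; not the
# paper's actual implementation or prompts).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class UIAction:
    """One step in a UI interaction trajectory."""
    screen_text: str   # visible text / accessibility labels on the screen
    action_type: str   # e.g. "tap", "type", "scroll"
    target: str        # description of the UI element acted on


def summarize_action(action: UIAction, generate: Callable[[str], str]) -> str:
    """Stage 1: structured interaction summarization of a single action."""
    prompt = (
        "Summarize this UI step in one sentence.\n"
        f"Screen: {action.screen_text}\n"
        f"Action: {action.action_type} on {action.target}\n"
    )
    return generate(prompt)


def extract_intent(summaries: List[str], generate: Callable[[str], str]) -> str:
    """Stage 2: intent extraction over the aggregated step summaries."""
    prompt = "Given these steps, state the user's overall intent:\n" + "\n".join(
        f"- {s}" for s in summaries
    )
    return generate(prompt)


def infer_intent(trajectory: List[UIAction], generate: Callable[[str], str]) -> str:
    """Run the decomposed pipeline end to end: summarize each step, then infer intent."""
    summaries = [summarize_action(a, generate) for a in trajectory]
    return extract_intent(summaries, generate)
```

In practice, `generate` would wrap the fine-tuned on-device model; the callable interface keeps the sketch independent of any particular inference library.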
Paper Type: Long
Research Area: Generation
Research Area Keywords: efficient models, model architectures, inference methods, UI-to-text generation
Contribution Types: Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 6521