23.7 BROCA: A 52.4-to-559.2mW Mobile Social Agent System-on-Chip with Adaptive Bit-Truncate Unit and Acoustic-Cluster Bit Grouping

Wooyoung Jo, Seongyon Hong, Jiwon Choi, Beomseok Kwon, Haoyang Sang, Dongseok Im, Sangyeob Kim, Sangjin Kim, Taekwon Lee, Hoi-Jun Yoo

Published: 2025, Last Modified: 27 May 2026, Venue: ISSCC 2025, License: CC BY-SA 4.0
Abstract: On-device artificial intelligence (AI) enables human-like conversations with memory-constrained personal devices, or personalized mobile agents [1]. Users can collaborate with AI devices that comprehend the user's multimodal states, including utterances and emotions based on facial images. These devices generate responses with appropriate emotions and conversational context as their voice feedback. Figure 23.7.1 shows the proposed four-stage personalized mobile social agent system. The user perception (UP) stage converts the user's multimodal inputs into the user state using a multimodal encoder [2], [3]. Retrieval-augmented generation (RAG) [4], [5] is used to retrieve dialogue context relevant to the user states. The response generation (RG) stage then generates human-level text responses by employing a transformer-based language model [6], [7]. The emotion generation (EG) stage identifies the agent's emotion from the generated text response [11]. Finally, the agent feedback (AF) stage synthesizes the agent's audio feedback using a vocoder [12], based on the generated agent's text response and emotion.
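The dataflow described in the abstract can be sketched in a few lines of Python. This is only an illustrative outline of how the stages hand data to one another, not the authors' implementation: every function name, the `UserState` type, and the toy word-overlap retrieval are assumptions introduced here, and each stage body is a placeholder for the corresponding model (multimodal encoder, language model, emotion classifier, vocoder).

```python
from dataclasses import dataclass


@dataclass
class UserState:
    """Hypothetical container for the output of the user perception stage."""
    utterance: str
    emotion: str


def user_perception(utterance: str, face_emotion: str) -> UserState:
    # UP stage: placeholder for the multimodal encoder.
    return UserState(utterance=utterance, emotion=face_emotion)


def retrieve_context(state: UserState, memory: list[str]) -> list[str]:
    # RAG step: toy retrieval (an assumption) that keeps memory entries
    # sharing at least one word with the user's utterance.
    words = set(state.utterance.lower().split())
    return [m for m in memory if words & set(m.lower().split())]


def response_generation(state: UserState, context: list[str]) -> str:
    # RG stage: placeholder for the transformer-based language model.
    return f"reply to '{state.utterance}' using {len(context)} context item(s)"


def emotion_generation(response: str) -> str:
    # EG stage: placeholder that always returns a fixed label.
    return "neutral"


def agent_feedback(response: str, emotion: str) -> dict:
    # AF stage: placeholder for the vocoder; returns its two inputs
    # instead of a synthesized waveform.
    return {"text": response, "emotion": emotion}


def run_agent(utterance: str, face_emotion: str, memory: list[str]) -> dict:
    state = user_perception(utterance, face_emotion)
    context = retrieve_context(state, memory)
    response = response_generation(state, context)
    emotion = emotion_generation(response)
    # AF consumes both the generated text and the generated emotion.
    return agent_feedback(response, emotion)
```

The point of the sketch is the dependency order: retrieval is conditioned on the user state, the agent's emotion is derived from the generated text rather than from the user's input, and the final stage takes both the text and the emotion.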