FPGA Implementation of PoolFormer Network Using Python-Driven High-Level Synthesis Framework for Edge-AIoT Speech Recognition

Tiancheng Cao, Zhongyi Zhang, Wei Soon Ng, Wang Ling Goh, Yuan Gao

Published: 2026, Last Modified: 06 Apr 2026 | IEEE Trans. Very Large Scale Integr. Syst. 2026 | CC BY-SA 4.0

Abstract: This brief presents an edge-AIoT speech recognition system based on a new spiking feature extraction (SFE) method and a PoolFormer (PF) neural network optimized for implementation on field-programmable gate array (FPGA) hardware. A Python-driven high-level synthesis (HLS) flow is adopted to accelerate software-to-hardware conversion for fast validation, demonstrating the potential of FPGA-based solutions in edge applications. This work provides an end-to-end solution for ultralow-power speech recognition, leveraging HLS to bridge the gap between software and hardware development. Implemented on a Xilinx PYNQ-Z2 FPGA board, the optimized PF model achieves a speech recognition accuracy of 95.41% on the 35-class Google Commands dataset with a parameter count of 39k.
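
The abstract does not describe the PF layers themselves. As a rough illustration of why a PoolFormer-style model can stay small, the sketch below shows the parameter-free pooling token mixer of the generic PoolFormer design (average pooling minus the input), adapted here to a 1-D sequence of feature frames. The function names, the 1-D frame layout, and the pool size of 3 are assumptions made for illustration, not the authors' implementation. Normalization and the channel MLP are omitted.

```python
def pool_token_mixer(x, pool_size=3):
    """Average-pool along time (stride 1, same length) and subtract the input.

    x: list of frames, each frame a list of channel values (assumed layout).
    Border windows average only over frames that exist (no zero padding).
    The mixer has no learnable parameters.
    """
    n = len(x)
    half = pool_size // 2
    out = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        window = x[lo:hi]
        mixed = [
            sum(frame[c] for frame in window) / len(window) - x[t][c]
            for c in range(len(x[t]))
        ]
        out.append(mixed)
    return out


def mix_with_residual(x, pool_size=3):
    """Residual connection around the mixer: x + (pool(x) - x)."""
    mixed = pool_token_mixer(x, pool_size)
    return [[a + b for a, b in zip(fx, fm)] for fx, fm in zip(x, mixed)]


if __name__ == "__main__":
    frames = [[0.0], [3.0], [6.0]]  # three frames, one channel (toy data)
    print(pool_token_mixer(frames))
    print(mix_with_residual(frames))
```

Because the mixer only averages neighbouring frames, all learnable weights in such a block sit in the normalization and channel MLP layers, which is consistent with a small parameter budget.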