Abstract: This paper presents BM1684X, an AI processor from SOPHGO designed to meet the demanding requirements of a broad range of AI applications. First, the TPU adopts a SIMD architecture with a very large data width, which reduces the area ratio of the instruction unit and greatly improves computing power density. Second, special acceleration instructions customized within the EU enable dynamic pipeline execution, reducing both the total number of instructions and the execution time; this improves TPU performance on RQ and DQ operations, which are crucial for AI computations. Third, the CUBE array within the TPU performs the multiplication and addition of 64 pairs of INT8 operands along the channel dimension of the feature map. Using an addition tree instead of a conventional adder significantly reduces both area and power consumption. Additionally, BM1684X incorporates a 64-input, 64-output, 8-bit crossbar within the lane, enabling high-performance data gathering and efficient data manipulation within the TPU. Furthermore, BM1684X offers three distinct memory access modes, which let the processor address a wide range of AI processing needs and optimize DRAM utilization across tasks and workloads. Finally, we design the TPU-MLIR toolchain, whose features include unified processing of multiple frameworks, hierarchical design of model abstractions, correctness guarantees, and traceability of each transformation step. BM1684X provides high-performance computing for a variety of AI models, including large models, as demonstrated through comprehensive evaluations against industry-leading peers.
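As a rough illustration of the CUBE operation described above, the following Python sketch multiplies 64 pairs of INT8 operands and reduces the products with a pairwise addition tree. It is a behavioral model under stated assumptions, not the BM1684X hardware design: the function names (`check_int8`, `adder_tree_sum`, `cube_dot`), the unbounded Python integer used as the accumulator, and the balanced shape of the tree are all hypothetical choices made for clarity.

```python
from typing import List, Sequence

LANES = 64  # number of INT8 operand pairs, taken from the abstract


def check_int8(x: int) -> int:
    """Return x unchanged if it fits in signed 8 bits, else raise."""
    if not -128 <= x <= 127:
        raise ValueError(f"{x} is outside the INT8 range [-128, 127]")
    return x


def adder_tree_sum(values: Sequence[int]) -> int:
    """Sum values level by level, adding neighbours pairwise.

    For n inputs this takes ceil(log2(n)) levels, which mirrors the
    depth of a balanced adder tree (an assumption about the layout).
    """
    level: List[int] = list(values)
    if not level:
        return 0
    while len(level) > 1:
        if len(level) % 2:  # pad odd-sized levels with a zero
            level.append(0)
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def cube_dot(activations: Sequence[int], weights: Sequence[int]) -> int:
    """Multiply 64 INT8 pairs and reduce the products with the tree."""
    if len(activations) != LANES or len(weights) != LANES:
        raise ValueError(f"expected exactly {LANES} operands per input")
    products = [check_int8(a) * check_int8(w)
                for a, w in zip(activations, weights)]
    return adder_tree_sum(products)


# Usage: 64 channel values of one feature-map position against 64 weights.
acts = [(i % 5) - 2 for i in range(LANES)]
wts = [(i % 3) - 1 for i in range(LANES)]
result = cube_dot(acts, wts)
```

The tree produces the same sum as a sequential accumulation; the difference in hardware would be in adder count, depth, and intermediate bit widths, which this model does not capture.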
External IDs: dblp:conf/micro/GaoLWCSHQW24