A NUMA-Aware Compiler Framework for Large-Scale Mathematical Reasoning Inference on PCIe-Based Multi-Accelerator Systems
Keywords: NUMA, Compiler, Large-Scale
TL;DR: A NUMA-aware compiler framework that accelerates large-scale inference on PCIe-based multi-accelerator systems by co-optimizing partitioning, placement, and collective schedules
Abstract: Mathematical reasoning workloads (proof search, program verification, equation solving, code-as-proof traces, and tool-augmented LLM pipelines) demand long-context decoding, speculative/beam search, and mixture-of-experts execution, which induce frequent collectives under model/tensor parallelism. On commodity dual-socket NUMA servers with PCIe interconnects, non-uniform link bandwidth/latency and host-mediated cross-socket routes make these collectives the bottleneck, inflating the end-to-end latency that matters for interactive theorem proving and education at scale.
We present a NUMA-aware compiler framework for large-scale mathematical reasoning inference. The system profiles compute and memory paths, learns a latency-bandwidth cost model for hierarchical collectives, and jointly optimizes data, model, and tensor partitioning along with device memory placement under static feasibility constraints. Using MLIR/TOSA templates, it emits host and accelerator code with explicit communication-compute overlap and schedule shaping via ring, tree, and hybrid schemes, without relying on vendor-specific fabrics. We target math-AI pipelines such as LLMs with solver or tool use and prover-trace generation, and we outline ablations that isolate how profiling, static analysis, schedule choice, and overlap affect throughput and p95 latency.
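To make the cost-model and schedule-selection idea concrete, the sketch below is a minimal, illustrative Python example (not the authors' implementation): it uses standard alpha-beta estimates for ring and tree allreduce over per-link latency/bandwidth parameters, with separate assumed values for intra-socket and host-mediated cross-socket PCIe paths. All names, constants, and functions (`Link`, `pick_schedule`, the example bandwidths) are assumptions for illustration; the described framework would fit these parameters from profiling and consider hybrid schedules as well.

```python
# Illustrative sketch only: an alpha-beta cost model for hierarchical collectives
# on a dual-socket PCIe system, used to choose a ring vs. tree allreduce schedule.
# Link constants below are assumed example numbers, not measured values.
import math
from dataclasses import dataclass

@dataclass
class Link:
    alpha_s: float          # per-message latency (seconds)
    beta_s_per_byte: float  # inverse bandwidth (seconds per byte)

# Assumed example links: same-socket PCIe switch vs. cross-socket via the host.
INTRA_SOCKET = Link(alpha_s=5e-6,  beta_s_per_byte=1 / 25e9)
CROSS_SOCKET = Link(alpha_s=15e-6, beta_s_per_byte=1 / 10e9)

def ring_allreduce_time(n_bytes: float, p: int, link: Link) -> float:
    """Standard alpha-beta estimate for ring allreduce over p devices:
    2(p-1) steps, each moving n/p bytes."""
    steps = 2 * (p - 1)
    chunk = n_bytes / p
    return steps * (link.alpha_s + chunk * link.beta_s_per_byte)

def tree_allreduce_time(n_bytes: float, p: int, link: Link) -> float:
    """Alpha-beta estimate for reduce + broadcast along a binary tree:
    2*ceil(log2 p) hops, each moving the full message."""
    depth = math.ceil(math.log2(p))
    return 2 * depth * (link.alpha_s + n_bytes * link.beta_s_per_byte)

def pick_schedule(n_bytes: float, p: int, link: Link) -> str:
    """Return the cheaper schedule under the model; a real compiler pass
    would use profiled alpha/beta values per link instead of constants."""
    t_ring = ring_allreduce_time(n_bytes, p, link)
    t_tree = tree_allreduce_time(n_bytes, p, link)
    return "ring" if t_ring <= t_tree else "tree"

if __name__ == "__main__":
    # Small, latency-bound message: tree wins; large, bandwidth-bound: ring wins.
    print(pick_schedule(8 * 1024, 8, CROSS_SOCKET))           # -> "tree"
    print(pick_schedule(256 * 1024 * 1024, 8, CROSS_SOCKET))  # -> "ring"
```

Under this kind of model, latency-bound collectives (small messages over high-alpha cross-socket routes) favor tree schedules, while bandwidth-bound collectives favor ring schedules, which is the trade-off the schedule-shaping pass would exploit.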
Submission Number: 253