Bridging Non-Intrusive Tracing and Fine-Grained Cross-Layer Representations for LLM Inference Diagnosis

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM profiling; Anomaly detection; Performance diagnosis; LLM observability; LLM tracing
Abstract: LLM inference spans the inference engine, compute backend, host operators, and GPU kernels, where asynchrony and concurrency make request-level end-to-end observability and diagnosis challenging. We present $\textbf{Truffld}$, a non-intrusive and cross-layer framework that provides fine-grained representations for diagnosis in large-scale LLM inference. For data collection, $\textbf{Truffld}$ activates NVTX markers and CUPTI callbacks to capture raw events from vertical (intra-node stack execution) and horizontal (cross-node communication) perspectives. We then propose a call-chain merging algorithm that aligns these events on a unified time base and reconstructs a per-request call-chain tree preserving both structural and temporal semantics. For anomaly detection, $\textbf{Truffld}$ adopts a two-stage pipeline: a Gaussian Mixture Model captures multi-modal normal behavior and produces calibrated numeric confidences, while a large language model applies structure- and context-aware reasoning to generate step-level decisions and operator-level localization. Experiments on a multi-node GPU cluster running Qwen3-8B inference with both online and offline workloads demonstrate near-perfect step-level detection and superior operator-level performance compared to multiple baselines, with low deployment overhead and no modification to binaries. $\textbf{Truffld}$ provides a practical end-to-end solution for observability and diagnosis in large-scale LLM inference.
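The call-chain merging idea in the abstract (align events on one time base, then rebuild a per-request tree) can be illustrated with a minimal sketch. This is not the paper's algorithm: the `Event` type, the `build_call_chain` function, the per-source `clock_offsets`, and the nesting rule (interval containment) are all assumptions made for illustration. In particular, asynchronous GPU kernels often finish after their host-side launch returns, so a real system would likely need an explicit link such as a correlation ID rather than containment alone.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    source: str   # illustrative label, e.g. "nvtx" or "cupti"
    start: float  # timestamp in the source's own clock
    end: float
    children: list = field(default_factory=list)


def build_call_chain(events, clock_offsets):
    """Shift events onto one time base, then nest them by interval containment.

    clock_offsets maps each source label to the value added to its timestamps
    (a hypothetical, pre-computed alignment).
    """
    aligned = [
        Event(e.name, e.source,
              e.start + clock_offsets[e.source],
              e.end + clock_offsets[e.source])
        for e in events
    ]
    # Earlier start first; on ties the longer interval comes first (it is the parent).
    aligned.sort(key=lambda e: (e.start, -e.end))

    root = Event("request", "root", float("-inf"), float("inf"))
    stack = [root]
    for ev in aligned:
        # Pop every open span that ends before this event does: it cannot contain it.
        while stack[-1].end < ev.end:
            stack.pop()
        stack[-1].children.append(ev)
        stack.append(ev)
    return root


def render(node, depth=0):
    """Return the tree as indented lines, one per event."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    raw = [
        Event("decode_step", "nvtx", 0.0, 10.0),
        Event("attention", "nvtx", 1.0, 6.0),
        Event("attn_kernel", "cupti", 102.0, 105.0),  # different clock
        Event("mlp", "nvtx", 6.5, 9.5),
    ]
    tree = build_call_chain(raw, {"nvtx": 0.0, "cupti": -100.0})
    print("\n".join(render(tree)))
```

Events that only partially overlap an open span are attached as siblings in this sketch; how such cases should be resolved depends on the tracing semantics and is left open here.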
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 6342