Mechanistic Interpretability for Neural TSP Solvers

Published: 28 Nov 2025, Last Modified: 30 Nov 2025, NeurIPS 2025 Workshop MLxOR, CC BY 4.0
Keywords: Mechanistic interpretability, Sparse autoencoders, Neural combinatorial optimization, Traveling Salesman Problem, Transformers, Reinforcement learning
TL;DR: We apply mechanistic interpretability to a neural TSP solver, finding meaningful learned features in the model's neurons.
Abstract: Neural networks have advanced combinatorial optimization, with Transformer-based solvers often outperforming classical algorithms on the Traveling Salesman Problem (TSP). However, these models remain black boxes, limiting our understanding of what optimization strategies they discover. We apply sparse autoencoders (SAEs) to interpret neural TSP solvers. To our knowledge, this is the first work to bring mechanistic interpretability from deep learning models to operations research. Our analysis reveals interpretable features, naturally developed by these solvers, that mirror fundamental TSP-solving concepts: boundary detection, spatial clustering, and geometric separations. These discoveries reveal how neural solvers approach combinatorial problems and suggest new directions for hybrid approaches that combine algorithmic transparency with neural performance. Interactive feature explorer: https://reubennarad.github.io/TSP_interp/
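The abstract's core technique is training a sparse autoencoder on a solver's internal activations and inspecting the resulting sparse features. Below is a minimal sketch of that idea, assuming a PyTorch setup; the layer width, dictionary size, sparsity coefficient, and the random tensor standing in for cached solver activations are illustrative assumptions, not the authors' configuration.

```python
# Minimal sparse autoencoder (SAE) sketch for interpreting solver activations.
# Assumptions (not from the paper): PyTorch, a 128-dim activation space, an 8x
# overcomplete feature dictionary, and random tensors standing in for
# activations cached from a Transformer-based TSP solver.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))   # sparse, non-negative feature codes
        x_hat = self.decoder(f)           # reconstruction of the activation
        return x_hat, f

d_model, d_hidden, l1_coeff = 128, 1024, 1e-3
sae = SparseAutoencoder(d_model, d_hidden)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for activations cached at one layer of the TSP solver:
# shape (num_tokens, d_model), one row per city/node embedding.
activations = torch.randn(4096, d_model)

for step in range(100):
    batch = activations[torch.randint(0, activations.shape[0], (256,))]
    x_hat, f = sae(batch)
    recon_loss = (x_hat - batch).pow(2).mean()
    sparsity_loss = f.abs().mean()        # L1 penalty encourages sparse codes
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, individual feature directions (rows of the decoder) can be ranked by how strongly they fire on particular inputs, which is how properties like boundary or cluster membership would be surfaced and visualized.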
Submission Number: 162