BlockCert: Certified Blockwise Extraction of Transformer Mechanisms

TMLR Paper6567 Authors

19 Nov 2025 (modified: 17 Jan 2026) | Under review for TMLR | CC BY 4.0

Abstract: Mechanistic interpretability aspires to reverse-engineer neural networks into explicit algorithms, while model editing seeks to modify specific behaviours without retraining. Both areas are typically evaluated with informal evidence and ad-hoc experiments, with few explicit guarantees about how far an extracted or edited model can drift from the original on relevant inputs. We introduce BLOCKCERT, a framework for certified blockwise extraction of transformer mechanisms, and outline how a lightweight extension can support certified local edits. Given a pre-trained transformer and a prompt distribution, BLOCKCERT extracts structured surrogate implementations for residual blocks together with machine-checkable certificates that bound approximation error, record coverage metrics, and hash the underlying artifacts. We formalize a simple Lipschitz-based composition theorem in Lean 4 that lifts these local guarantees to a global deviation bound. Empirically, we apply the framework to GPT-2 small, TinyLlama-1.1B-Chat, and Llama-3.2-3B. Across these models we obtain high per-block coverage and small residual errors on the evaluated prompts, and in the TinyLlama setting we show that a fully stitched model matches the baseline perplexity within $\approx 6\times 10^{-5}$ on stress prompts. Our results suggest that blockwise extraction with explicit certificates is feasible for real transformer language models and offers a practical bridge between mechanistic interpretability and formal reasoning about model behaviour.
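The Lipschitz-based composition step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's Lean 4 formalization; it is the standard telescoping argument under stated assumptions: block i of the original model is L_i-Lipschitz, and its surrogate deviates from it by at most eps_i on the inputs the stitched model actually feeds it. Under those assumptions the deviation after block i satisfies delta_i <= L_i * delta_{i-1} + eps_i with delta_0 = 0. The function name and arguments below are hypothetical.

```python
from typing import Sequence


def global_deviation_bound(eps: Sequence[float], lip: Sequence[float]) -> float:
    """Upper bound on the output deviation of a stack of surrogate blocks.

    Assumptions (illustrative, not taken from the paper):
      eps[i] bounds the error of surrogate block i on the inputs it receives,
      lip[i] is a Lipschitz constant of original block i,
      blocks are listed in the order they are applied.
    """
    if len(eps) != len(lip):
        raise ValueError("eps and lip must have the same length")
    delta = 0.0  # identical inputs, so no deviation before the first block
    for e, L in zip(eps, lip):
        if e < 0 or L < 0:
            raise ValueError("error bounds and Lipschitz constants must be non-negative")
        # Earlier deviation is amplified by at most L, then this block adds at most e.
        delta = L * delta + e
    return delta


if __name__ == "__main__":
    # Three blocks, each 2-Lipschitz: 0.5 * 4 + 0.25 * 2 + 0.125 = 2.625
    print(global_deviation_bound([0.5, 0.25, 0.125], [2.0, 2.0, 2.0]))
```

Unrolling the recursion gives the closed form sum_i eps_i * prod_{j>i} L_j, so errors in early blocks are amplified by every later Lipschitz constant. When all constants equal 1 the bound reduces to the plain sum of the per-block errors.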

Submission Type: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: We clarify the positioning of BlockCert as a tooling and standardization pipeline; improve the workflow figure and block-definition details; add larger-scale experiments on Llama-2-7B; introduce low-rank surrogate and ablation baselines; and extend BlockCert with a refusal/safety evaluation on Llama-2-7B-Chat.

Assigned Action Editor: ~Martha_Lewis1

Submission Number: 6567
