Variant-specific crosscoder features are seed-stable but not detectably task-causal in a GRPO-LoRA math setting

Published: 11 Jun 2026, Last Modified: 11 Jun 2026 | Mech Interp Workshop ICML 2026 Spotlight | Everyone | CC BY 4.0
Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions), Benchmarking Interpretability
Other Keywords: crosscoders, model diffing, GRPO, reinforcement learning, negative result
TL;DR: High-ν crosscoder features in a Qwen3-4B vs. GRPO-LoRA pair are non-lexically seed-stable but not detectably task-causal; base-vs-base controls show the gate responds to between-side difference, not task-causal RL computation.
Abstract: We test whether variant-specific features identified by joint-norm pairwise crosscoders, trained on activation pairs from a base LLM and its RL fine-tune, correspond to task-causal mechanisms underlying the observed fine-tuned behavior. In a Qwen3-4B vs. GRPO-LoRA math setting, high-$\nu$ features pass a non-lexical cross-seed reproducibility check but fail task-causal specificity under $n=100$ paired ablation against a magnitude-matched random control. Two complementary base-vs-base controls diagnose what the gate is responding to: under paired-identity ($A_t = B_t$ exactly) it produces zero high-$\nu$ features in all 12 trained crosscoders, while under disjoint halves it produces 50–200$\times$ as many features as base/GRPO with similar non-lexical seed-stability. The gate therefore responds to systematic between-side distributional difference, and a large high-$\nu$ population can arise from unpaired-pair reconstruction asymmetry; in the paired base/GRPO setting, the remaining high-$\nu$ population is much smaller, is consistent with model-pair distributional drift, and is not detectably task-causal under our ablations. A high-$\nu$ gate, even combined with non-lexical seed stability, is insufficient evidence for task-causal mechanisms underlying the observed fine-tuned behavior.
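As a rough illustration of what a high-$\nu$ gate might look like, the sketch below assumes $\nu$ is the relative decoder norm commonly used in crosscoder model diffing, $\nu = \lVert d_B \rVert / (\lVert d_A \rVert + \lVert d_B \rVert)$, where $d_A$ and $d_B$ are a feature's decoder vectors for the base and fine-tuned sides. The abstract does not state this definition, so the formula, the function names `relative_norm` and `high_nu_features`, and the threshold of 0.9 are all assumptions made for illustration, not the authors' implementation.

```python
import math


def relative_norm(dec_a, dec_b):
    """Assumed definition: nu = ||d_B|| / (||d_A|| + ||d_B||) for one feature.

    dec_a, dec_b: decoder vectors (lists of floats) for side A and side B.
    Returns 0.5 when both norms are zero, so dead features stay neutral.
    """
    norm_a = math.sqrt(sum(x * x for x in dec_a))
    norm_b = math.sqrt(sum(x * x for x in dec_b))
    total = norm_a + norm_b
    return 0.5 if total == 0.0 else norm_b / total


def high_nu_features(decoders_a, decoders_b, threshold=0.9):
    """Indices of features whose assumed nu meets the (hypothetical) threshold."""
    return [
        i
        for i, (da, db) in enumerate(zip(decoders_a, decoders_b))
        if relative_norm(da, db) >= threshold
    ]


# Toy decoders for three features (made-up numbers, not data from the paper).
decoders_a = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0]]
decoders_b = [[3.0, 4.0], [0.0, 2.0], [0.0, 1.0]]
selected = high_nu_features(decoders_a, decoders_b)
```

Under this assumed definition, a feature whose two decoder vectors have equal norm sits at $\nu = 0.5$ and is never selected, so the gate only fires when one side's decoder dominates. The gate itself says nothing about why the two sides differ, which is the distinction the base-vs-base controls are designed to probe.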
Submission Number: 98