Variant-specific crosscoder features are seed-stable but not detectably task-causal in a GRPO-LoRA math setting

Nozomu Fujisawa; Masaaki Kondo

Published: 11 Jun 2026, Last Modified: 11 Jun 2026
Venue: Mech Interp Workshop ICML 2026 (Spotlight)
License: CC BY 4.0

Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions), Benchmarking Interpretability

Other Keywords: crosscoders, model diffing, GRPO, reinforcement learning, negative result

TL;DR: High-ν crosscoder features in a Qwen3-4B vs. GRPO-LoRA pair are non-lexically seed-stable but not detectably task-causal; base-vs-base controls show the gate responds to between-side difference, not task-causal RL computation.

Abstract: We test whether variant-specific features identified by joint-norm pairwise crosscoders, trained on activation pairs from a base LLM and its RL fine-tune, correspond to task-causal mechanisms underlying the observed fine-tuned behavior. In a Qwen3-4B vs. GRPO-LoRA math setting, high-$\nu$ features pass a non-lexical cross-seed reproducibility check but fail task-causal specificity under $n=100$ paired ablation against a magnitude-matched random control. Two complementary base-vs-base controls diagnose what the gate is responding to: under paired-identity ($A_t = B_t$ exactly) it produces zero high-$\nu$ features in all 12 trained crosscoders, while under disjoint halves it produces 50–200$\times$ as many features as base/GRPO with similar non-lexical seed-stability. The gate therefore responds to systematic between-side distributional difference, and a large high-$\nu$ population can arise from unpaired-pair reconstruction asymmetry; in the paired base/GRPO setting, the remaining high-$\nu$ population is much smaller, is consistent with model-pair distributional drift, and is not detectably task-causal under our ablations. A high-$\nu$ gate, even combined with non-lexical seed stability, is insufficient evidence for task-causal mechanisms underlying the observed fine-tuned behavior.

Submission Number: 98
