Causal Semantic Steering from Numeric Feature Injection: A REAL–SHUFFLE Evaluation of Metaphor Ground Generation
Keywords: interpretable feature injection; counterfactual evaluation; semantic steering; controllable generation; metaphor ground generation
Abstract: Numeric feature injection is increasingly used to steer generation, yet its apparent effects can be confounded by prompt templates and marginal value statistics. We propose a simple counterfactual evaluation that isolates the causal contribution of correct instance–value alignment. For each input, we compare REAL generations conditioned on the aligned value vector to a SHUFFLE control that preserves the same prompt format and the same marginal distribution of values while breaking alignment, and define the effect as VALUE = REAL − SHUFFLE. To quantify whether VALUE concentrates movement in the intended semantic subspace, we project embedding differences onto matched bipolar semantic axes and report two planned primary metrics: AbsDir, a polarity-robust magnitude measure of matched-axis attraction, and EnergyAboveChance, which measures matched-subspace energy beyond a chance baseline. Across 12 planned tests (3 injection families × 2 embedding spaces × 2 metrics) with paired inference and BH-FDR control, we find strong family-by-space heterogeneity. Injection A yields reliable matched-subspace attraction in sentence space on both metrics, while its word-space effects are weak. Injection B shows the clearest attraction in word space, with negligible effects in sentence space. In contrast, Injection C exhibits dispersion in both spaces, producing significantly negative VALUE on both metrics. These results demonstrate that alignment-driven steering can be detected causally, but its success depends on the injection family and the representational space used for evaluation.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: counterfactual/contrastive explanations; data shortcuts/artifacts; probing; robustness; explanation faithfulness
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: Chinese
Submission Number: 7198
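The abstract's evaluation logic (a SHUFFLE control that keeps the marginal value distribution but breaks instance–value alignment, a paired VALUE = REAL − SHUFFLE effect, and BH-FDR control across the planned tests) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the choice to force every instance onto a different instance's value vector, and the default alpha of 0.05 are assumptions made for illustration. The abstract does not define AbsDir or EnergyAboveChance, so per-instance metric scores are taken as given inputs.

```python
import numpy as np


def shuffle_control(values, rng):
    """Reassign value vectors across instances so that no instance keeps its own.

    The rows are only permuted, so the marginal distribution of values is
    unchanged. (Assumption: the paper may use a plain random permutation.)
    """
    n = values.shape[0]
    if n < 2:
        raise ValueError("need at least two instances to break alignment")
    # Walk a random cycle through all instances: each one receives the
    # value vector of its predecessor in the cycle, never its own.
    order = rng.permutation(n)
    perm = np.empty(n, dtype=int)
    perm[order] = np.roll(order, 1)
    return values[perm]


def value_effect(real_scores, shuffle_scores):
    """Per-instance paired effect, VALUE = REAL - SHUFFLE."""
    return np.asarray(real_scores, dtype=float) - np.asarray(shuffle_scores, dtype=float)


def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared against alpha * i / m.
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        # Reject every hypothesis up to the largest rank that passes.
        k = np.max(np.nonzero(passed)[0])
        reject[order[: k + 1]] = True
    return reject
```

In this sketch, `benjamini_hochberg` would be applied once to the p-values of all 12 planned tests, so that the false discovery rate is controlled jointly across injection families, embedding spaces, and metrics.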