Causal Semantic Steering from Numeric Feature Injection: A REAL–SHUFFLE Evaluation of Metaphor Ground Generation

ACL ARR 2026 January Submission7198 Authors

06 Jan 2026 (modified: 20 Mar 2026) | ACL ARR 2026 January Submission | Readers: Everyone | License: CC BY 4.0

Keywords: interpretable feature injection; counterfactual evaluation; semantic steering; controllable generation; metaphor ground generation

Abstract: Numeric feature injection is increasingly used to steer generation, yet its apparent effects can be confounded by prompt templates and marginal value statistics. We propose a simple counterfactual evaluation that isolates the causal contribution of correct instance–value alignment. For each input, we compare REAL generations conditioned on the aligned value vector to a SHUFFLE control that preserves the same prompt format and the same marginal distribution of values while breaking alignment, and define the effect as VALUE = REAL − SHUFFLE. To quantify whether VALUE concentrates movement in the intended semantic subspace, we project embedding differences onto matched bipolar semantic axes and report two planned primary metrics: AbsDir, a polarity-robust magnitude measure of matched-axis attraction, and EnergyAboveChance, which measures matched-subspace energy beyond a chance baseline. Across 12 planned tests (3 injection families × 2 embedding spaces × 2 metrics) with paired inference and BH-FDR control, we find strong family-by-space heterogeneity. Injection A yields reliable matched-subspace attraction in sentence space on both metrics, while its word-space effects are weak. Injection B shows the clearest attraction in word space, with negligible effects in sentence space. In contrast, Injection C exhibits dispersion in both spaces, producing significantly negative VALUE on both metrics. These results demonstrate that alignment-driven steering can be detected causally, but its success depends on the injection family and the representational space used for evaluation.
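As a rough illustration of the evaluation described in the abstract, the sketch below shows three pieces under stated assumptions: building a SHUFFLE control by permuting value vectors across instances, projecting VALUE = REAL − SHUFFLE embedding differences onto a bipolar axis, and applying Benjamini-Hochberg FDR control to a set of planned tests. The function names, the choice of a derangement (no instance keeps its own vector), and the axis construction as a normalized pole difference are assumptions made for this sketch. The abstract does not give the exact definitions of AbsDir or EnergyAboveChance, so neither metric is implemented here.

```python
import numpy as np


def shuffle_values(values, rng):
    """Permute value vectors across instances (hypothetical helper).

    The multiset of vectors is unchanged, so marginal value statistics are
    preserved. Assumption: we require a derangement so that every instance
    is misaligned; the paper may construct its control differently.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two instances to break alignment")
    identity = np.arange(n)
    perm = rng.permutation(n)
    while np.any(perm == identity):  # reject permutations with fixed points
        perm = rng.permutation(n)
    return values[perm], perm


def axis_projection(real_emb, shuffle_emb, pos_pole, neg_pole):
    """Project per-instance VALUE = REAL - SHUFFLE differences onto one axis.

    Assumption: the bipolar axis is the unit vector from neg_pole to pos_pole.
    """
    axis = pos_pole - neg_pole
    axis = axis / np.linalg.norm(axis)
    return (real_emb - shuffle_emb) @ axis


def benjamini_hochberg(pvals, alpha=0.05):
    """Standard BH step-up procedure; returns a boolean rejection mask."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject
```

In a setup like the one the abstract outlines, `benjamini_hochberg` would be applied once to the p-values of all 12 planned tests, while `axis_projection` would be computed per instance and then summarized with a paired statistic.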

Paper Type: Long

Research Area: Interpretability and Analysis of Models for NLP

Research Area Keywords: counterfactual/contrastive explanations; data shortcuts/artifacts; probing; robustness; explanation faithfulness

Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis

Languages Studied: Chinese

Submission Number: 7198
