Can you steer Whisper with steering vectors from GPT-2 XL?

Published: 28 Apr 2026, Last Modified: 28 Apr 2026 | MSLD 2026 Poster | CC BY 4.0
Keywords: Large Language Models, Mechanistic Interpretability
TL;DR: We find evidence of partially shared structure across modalities by transferring steering vectors from a text-only LLM (GPT-2 XL) to an audio language model (Whisper).
Abstract: Recent work on activation steering demonstrates that internal representations in large language models (LLMs) encode manipulable semantic directions such as sentiment, style, and topical emphasis. This paper investigates whether such semantic directions can transfer across modalities. Specifically, we ask: can steering vectors derived from the text-only model GPT-2 XL be applied to the speech-to-text model Whisper? Our motivation stems from the hypothesis that large-scale models trained on human language in different modalities may converge toward partially shared conceptual structures. Although Whisper processes audio and GPT-2 XL processes text, Whisper's decoder functions as a conditional language model, suggesting potential overlap in latent semantic representations. Cross-model transfer is nevertheless nontrivial: the models differ in architecture and dimensionality, and similar concepts may be encoded differently across models.

We construct steering vectors in GPT-2 XL using Contrastive Activation Addition (CAA). For each concept, we compute activation differences at layer 20 between semantically opposed word pairs (e.g., love − hate, anger − calm). To support dimensionality reduction, we augment these pairs via multilingual translation, generating over 400 contrastive samples per concept. This produces high-dimensional steering vectors (1600 features) that must be adapted to Whisper's encoder representation (384 features, sequence length 1500). We evaluate three feature-reduction strategies for mapping GPT-2 XL vectors into Whisper's activation space: (1) principal component analysis (PCA), (2) an autoencoder trained to compress 1600-dimensional vectors into 384 dimensions, and (3) Gaussian random projection. Each reduced vector is applied additively across all four of Whisper's encoder layers during inference. We test each steering concept on ten audio clips (five per concept) and vary the steering strength.

Results show partial but inconsistent cross-modal steering. In some cases, sentiment shifts occur in the expected direction. For example, "I saw the movie and felt bad" becomes "I saw the movie and felt mad" under anger steering. Similarly, negative statements occasionally soften under love steering. However, many outputs remain unchanged, and higher-magnitude steering frequently induces incoherence, including repeated tokens or degenerate outputs. Semantic shifts often involve near-synonyms, raising the possibility that changes reflect local lexical proximity rather than robust conceptual transfer. Comparative analysis reveals that PCA-based reductions produce the most stable behavior at moderate steering magnitudes. Cosine similarity analysis shows that reduced vectors from different methods are highly aligned, with PCA nearly identical to the other reductions in representation space. Additionally, transferred vectors are not orthogonal to Whisper's internal steering directions derived from multilingual embeddings, suggesting partial alignment between GPT-2 XL's semantic axes and Whisper's latent structure.

Overall, our findings suggest that cross-modal steering is fragile. While GPT-2 XL steering vectors can influence Whisper outputs, they do not reliably transfer their intended semantic content. GPT-2 XL is generative, whereas Whisper performs conditional transcription; this functional asymmetry may limit transfer fidelity. This work contributes to interpretability research by probing the portability of internal representations across models and modalities.
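To make the pipeline concrete, the following is a minimal sketch of the extraction and reduction steps, assuming the HuggingFace transformers and scikit-learn APIs. The placeholder pair list, last-token pooling, and the uncentred PCA projection are illustrative assumptions, not the exact implementation.

```python
# Minimal sketch: CAA steering-vector extraction from GPT-2 XL and reduction
# to Whisper's 384-dim encoder width. Assumes HuggingFace transformers and
# scikit-learn; the tiny `pairs` list stands in for the 400+ translated pairs.
import numpy as np
import torch
from sklearn.decomposition import PCA
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2Model.from_pretrained("gpt2-xl", output_hidden_states=True)
model.eval()

LAYER = 20  # layer whose activations define the steering direction

def last_token_activation(text: str) -> np.ndarray:
    """Return the LAYER hidden state of the final token (1600-dim)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # hidden_states[0] is the embedding output, so index 20 is block 20.
    return out.hidden_states[LAYER][0, -1].numpy()

# Contrastive Activation Addition: mean of positive-minus-negative activation
# differences. A real run would use the 400+ multilingual augmentations.
pairs = [("love", "hate"), ("adore", "despise"), ("affection", "hatred")]
diffs = np.stack(
    [last_token_activation(p) - last_token_activation(n) for p, n in pairs]
)
steering_1600 = diffs.mean(axis=0)

# PCA reduction 1600 -> 384. With 400+ samples all 384 components fit; the
# min() only keeps this placeholder runnable. We project onto the components
# without PCA's mean-centring, because the mean of the diffs *is* the
# steering vector and transform() would map it to ~0.
pca = PCA(n_components=min(384, diffs.shape[0])).fit(diffs)
steering_384 = pca.components_ @ steering_1600
```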
These findings raise foundational questions about whether large models converge toward shared conceptual geometries and how such geometries can be aligned. Future work should explore learned cross-model translation layers, evaluate larger and more diverse datasets, and extend the experiments to non-OpenAI models. Bridging representational spaces across modalities may provide deeper insight into the structure and universality of learned concepts.
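As a companion sketch, the following shows how a reduced vector could be injected additively into Whisper's encoder during transcription, assuming HuggingFace's WhisperForConditionalGeneration with the whisper-tiny configuration (d_model = 384, four encoder layers, sequence length 1500). The hook mechanics, the zero-vector and silent-audio placeholders, and the alpha value are illustrative assumptions.

```python
# Companion sketch: additively steer Whisper's encoder during transcription.
# Assumes HuggingFace WhisperForConditionalGeneration with whisper-tiny
# (d_model=384, four encoder layers); alpha is the steering-strength knob.
import numpy as np
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
whisper = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
whisper.eval()

# Placeholder: substitute the reduced vector (steering_384) from the sketch
# above; zeros make this snippet a runnable no-op on its own.
steer = torch.zeros(384)

def make_hook(alpha: float):
    def hook(module, inputs, output):
        # Encoder layers return a tuple; output[0] is (batch, 1500, 384).
        return (output[0] + alpha * steer,) + output[1:]
    return hook

# Add the same vector to the output of all four encoder layers.
handles = [
    layer.register_forward_hook(make_hook(alpha=4.0))
    for layer in whisper.model.encoder.layers
]

# Placeholder silent clip; substitute one of the ten 16 kHz test recordings.
audio = np.zeros(16000 * 5, dtype=np.float32)
features = processor(audio, sampling_rate=16000,
                     return_tensors="pt").input_features
with torch.no_grad():
    ids = whisper.generate(input_features=features)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

for h in handles:
    h.remove()  # detach hooks to restore the unsteered model
```

Because forward hooks leave the model weights untouched, varying alpha between runs reproduces the steering-strength sweep without reloading the model.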
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 74