Robustness and Interpretability of Hybrid Quantum NLP Models

ACL ARR 2026 January Submission 6358 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Quantum NLP, hybrid quantum-classical models, quantum neural networks, explainable AI, Interface Grad-CAM, text classification
Abstract: Hybrid quantum-classical models offer theoretical advantages in expressivity and robustness, yet their practical utility in natural language processing (NLP) remains understudied. This paper examines how variational quantum circuits behave when applied to highly compressed text representations. A hybrid model is proposed in which a frozen DistilBERT encoder converts each sentence into a fixed eight-dimensional representation. This compact representation is then passed to either a classical multilayer perceptron or a variational quantum head, with both options having a comparable number of trainable parameters. To interpret these models, the paper defines Interface Grad-CAM, a mechanism that attributes importance at the shared interface and maps saliency back to tokens. On SST-$2$, AG~News and Yelp Polarity, the quantum head consistently matches or slightly outperforms the classical head under the same eight-dimensional bottleneck. More importantly, a Quantum Shield effect is observed: on SST-$2$, the synonym-based attack success rate drops from about $47\%$ for the classical head to about $17\%$ for the quantum head, with concurrent reductions on the other datasets. Gradient-norm diagnostics at the interface indicate that this robustness does not arise from gradient masking. An entanglement analysis further reveals a modest negative correlation between global quantum entanglement and the entropy of token-level importance scores, providing preliminary evidence that more highly entangled states may be associated with sharper, more focused explanations in the compressed feature space.
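The variational quantum head described in the abstract can be sketched as a tiny statevector simulation: interface features are angle-encoded with RY rotations, an entangling CNOT is applied, and a trainable RY layer precedes a Pauli-Z readout. This is a minimal two-qubit illustration of the general pattern; the qubit count, gate layout, and parameter values are assumptions for exposition, not the authors' actual circuit.

```python
import math

def ry(theta):
    # Single-qubit RY rotation matrix (real-valued).
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -s], [s, c]]

def apply_single(state, gate, qubit):
    # Apply a 1-qubit gate to a 2-qubit statevector (qubit 0 = most significant bit).
    new = [0.0] * 4
    for idx in range(4):
        bit = (idx >> (1 - qubit)) & 1
        for new_bit in (0, 1):
            j = idx ^ ((bit ^ new_bit) << (1 - qubit))
            new[j] += gate[new_bit][bit] * state[idx]
    return new

def cnot(state):
    # Control = qubit 0, target = qubit 1: swaps |10> and |11>.
    return [state[0], state[1], state[3], state[2]]

def vq_head(features, weights):
    # Angle-encode features, entangle, apply trainable rotations, read out <Z> on qubit 0.
    state = [1.0, 0.0, 0.0, 0.0]              # start in |00>
    for q, f in enumerate(features):           # encoding layer
        state = apply_single(state, ry(f), q)
    state = cnot(state)                        # entangling layer
    for q, w in enumerate(weights):            # trainable layer
        state = apply_single(state, ry(w), q)
    # <Z> on qubit 0 = P(qubit 0 is |0>) - P(qubit 0 is |1>)
    return (state[0]**2 + state[1]**2) - (state[2]**2 + state[3]**2)
```

With zero angles the circuit leaves |00> untouched and the readout is exactly +1; rotating qubit 0 by pi flips it (and, through the CNOT, qubit 1) and drives the readout to -1, illustrating how the head maps encoded features to a bounded classification score.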
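The Interface Grad-CAM idea of attributing importance at the shared eight-dimensional interface can be illustrated with a gradient-times-activation score, as in standard Grad-CAM. The sketch below assumes a plain linear head, for which the logit gradient is available in closed form; the weights and interface values are made-up numbers, not the paper's trained model.

```python
def interface_grad_cam(z, w, target=0):
    """Grad-CAM-style attribution at the interface for a linear head.

    For logit_k = sum_i w[k][i] * z[i], the gradient d(logit_target)/dz_i
    is simply w[target][i], so each interface dimension i scores
    relu(z_i * w[target][i]), normalized to sum to 1.
    """
    grads = w[target]                                   # analytic logit gradient
    scores = [max(0.0, zi * gi) for zi, gi in zip(z, grads)]
    total = sum(scores) or 1.0                          # avoid division by zero
    return [s / total for s in scores]

# Toy 8-dim interface vector and a 2-class linear head (illustrative values).
z = [0.9, -0.2, 0.4, 0.0, -0.7, 0.1, 0.3, -0.5]
w = [[0.5, 0.1, -0.3, 0.2, 0.4, 0.0, 0.6, -0.1],
     [-0.5, -0.1, 0.3, -0.2, -0.4, 0.0, -0.6, 0.1]]

attr = interface_grad_cam(z, w, target=0)
```

In the full pipeline these per-dimension scores would then be propagated back through the frozen encoder to token-level saliency; the entropy of such token-level importance distributions is what the abstract correlates with entanglement.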
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Interpretability and Analysis of Models for NLP, Machine Learning for NLP, NLP Applications, Special Theme Track
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 6358