Abstract: Transformer-based LLMs achieve strong results but demand substantial computational and memory resources. We propose a hybrid quantum-classical approach that embeds variational quantum circuits into transformers for model compression. By replacing portions of the feed-forward and attention sub-layers with compact quantum modules, we reduce the parameter count while largely preserving perplexity. Theoretical analysis shows that these quantum circuits can approximate large transformations with fewer parameters, and experiments on LLaMA and Qwen confirm memory savings and faster inference. We also discuss quantum hardware feasibility and GPU-based simulation. Overall, our method offers a promising avenue for deploying LLMs in resource-constrained environments.
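To make the abstract's core idea concrete, the sketch below shows one plausible way a feed-forward sub-layer could be swapped for a compact variational quantum circuit module. This is only an illustration under assumptions not stated in the abstract: the module name `QuantumFFN`, the down-/up-projection layers, the choice of 4 qubits, and the use of PennyLane's `AngleEmbedding`, `BasicEntanglerLayers`, and `TorchLayer` are all hypothetical design choices, not the authors' actual architecture.

```python
# Minimal sketch (not the paper's implementation): a variational quantum
# circuit standing in for a transformer feed-forward sub-layer.
import pennylane as qml
import torch
import torch.nn as nn

n_qubits = 4  # assumed circuit width; the paper does not specify this
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    # Angle-encode the compressed classical features into qubit rotations.
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    # Trainable entangling layers: the variational (parameterized) part.
    qml.BasicEntanglerLayers(weights, wires=range(n_qubits))
    # Read out one Pauli-Z expectation value per qubit.
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

class QuantumFFN(nn.Module):
    """Hypothetical drop-in replacement for a feed-forward sub-layer."""
    def __init__(self, d_model: int, n_layers: int = 2):
        super().__init__()
        # Compress the hidden state into a handful of rotation angles.
        self.down = nn.Linear(d_model, n_qubits)
        weight_shapes = {"weights": (n_layers, n_qubits)}
        self.q_layer = qml.qnn.TorchLayer(circuit, weight_shapes)
        # Expand the measured expectation values back to model width.
        self.up = nn.Linear(n_qubits, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        b, s, d = x.shape
        z = torch.tanh(self.down(x)).reshape(b * s, n_qubits)
        q = self.q_layer(z)              # (b * s, n_qubits) expectation values
        return self.up(q.float()).reshape(b, s, d)
```

The intended parameter saving is that the quantum block carries only `n_layers * n_qubits` trainable angles plus two small projections, in place of the two dense `d_model x 4*d_model` matrices of a standard transformer FFN; whether this matches the paper's actual layer replacement is an assumption.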
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Quantum neural network
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English
Submission Number: 6834