Quantum-RAG and PunGPT2: Advancing Low-Resource Language Generation and Retrieval for the Punjabi Language
Keywords: Low-resource languages, PunGPT2, RAG, Pun-RAG, Pun-Instruct, QLoRA
TL;DR: We introduce PunGPT2, the first open-source Punjabi LLM suite with quantum-inspired retrieval, outperforming multilingual baselines and advancing NLP for low-resource languages.
Abstract: Despite rapid advances in large language models (LLMs), low-resource languages remain largely excluded from NLP, limiting digital access for millions. We present PunGPT2, the first fully open-source Punjabi generative model suite, trained on a 35GB corpus covering literature, religious texts, news, and social discourse. PunGPT2 captures Punjabi’s syntactic and morphological richness through a tokenizer optimized for Gurmukhi and Shahmukhi scripts. We introduce Pun-RAG, a retrieval-augmented framework integrating PunGPT2 with a FAISS retriever over a curated Punjabi knowledge base, and Pun-Instruct, an instruction-tuned variant using QLoRA for robust zero-shot summarization, translation, and question answering. Our key innovation, Quantum-RAG, fuses sparse, dense, and quantum kernel embeddings for efficient, context-aware retrieval with low memory overhead, marking the first practical quantum-inspired retrieval in a low-resource LLM. Our models outperform multilingual baselines (mBERT, mT5, MuRIL, BLOOM) on FLORES-200, IndicGenBench, and a new PunjabiEval suite. This work advances inclusive NLP and offers a scalable framework for under-represented languages.
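The abstract describes Quantum-RAG as fusing sparse, dense, and quantum kernel similarity signals for retrieval. A minimal sketch of such a fusion is shown below; the function names, the fixed fusion weights, and the fidelity-style kernel (|⟨q|d⟩|² over L2-normalized embeddings, a common quantum-inspired similarity) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sparse_score(query_terms, doc_terms):
    # Term-overlap proxy for a sparse (BM25-style) lexical signal.
    q, d = set(query_terms), set(doc_terms)
    return len(q & d) / max(len(q), 1)

def dense_score(q_vec, d_vec):
    # Cosine similarity between dense embeddings.
    return float(np.dot(q_vec, d_vec)
                 / (np.linalg.norm(q_vec) * np.linalg.norm(d_vec)))

def quantum_kernel_score(q_vec, d_vec):
    # Fidelity-style kernel |<q|d>|^2 on unit vectors: squares the cosine,
    # sharpening the contrast between near and far documents (assumed form).
    q = q_vec / np.linalg.norm(q_vec)
    d = d_vec / np.linalg.norm(d_vec)
    return float(np.dot(q, d) ** 2)

def fused_score(query_terms, doc_terms, q_vec, d_vec, w=(0.3, 0.4, 0.3)):
    # Convex combination of the three signals; weights are hypothetical.
    return (w[0] * sparse_score(query_terms, doc_terms)
            + w[1] * dense_score(q_vec, d_vec)
            + w[2] * quantum_kernel_score(q_vec, d_vec))
```

In a retrieval loop, candidates from a FAISS index would be re-ranked by `fused_score`; the paper's reported low memory overhead would come from keeping the quantum-kernel computation classical (kernel values only), not from running on quantum hardware.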
Primary Area: generative models
Submission Number: 20713