Abstract: Recent multimodal foundation models are primarily trained on English or high-resource European language data, which limits their applicability to other medium- and low-resource languages, such as Indian languages. To address this limitation, we introduce Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM) specifically targeting the rich linguistic diversity and visual reasoning of 10 prominent Indian languages. Our model effectively integrates a state-of-the-art (SOTA) multilingual Large Language Model (LLM) with a vision module, trained primarily on multilingual image-text data. Furthermore, we introduce BharatBench, a comprehensive framework for evaluating VLMs across various low-resource languages, ultimately contributing to more diverse and effective AI systems. Our model achieves SOTA results on benchmarks across Indian languages while retaining its efficiency in English. Through our research, we aim to set new standards in multilingual-multimodal capabilities, offering substantial improvements over existing models and establishing a foundation for future advancements in this arena.