LLaVA-PlantDiag: Integrating Large-scale Vision-Language Abilities for Conversational Plant Pathology Diagnosis
Abstract: Human-machine conversational systems and multi-turn generative AI models have achieved impressive results, making language generation and conversation highly interactive. Owing to their versatility, these models have been applied in diverse fields, including botany and plant science. Such advances have aided phytologists in identifying plant diseases and extracting other vital information. Current methods typically rely on traditional Convolutional Neural Networks (CNNs), which not only fail to capture temporal information but also limit the user's interactivity with the model. In this paper, we propose LLaVA-PlantDiag, a visual question-answering assistant that analyzes multimodal data and answers open-ended questions about plant pathology in a conversational manner. The core idea is to construct a VQA image-description dataset from the PlantVillage dataset, use GPT-3.5 to generate open-ended, caption-grounded question-answer pairs, and fine-tune a large general-domain vision-language model on this custom dataset. The results demonstrate that LLaVA-PlantDiag significantly outperforms state-of-the-art models such as GPT-4 Vision, Gemini, and other open-source models on two key tasks: phytopathological multi-turn VQA and classification. LLaVA-PlantDiag achieves a relative score of 64.7 on vision-language tasks, surpassing GPT-4 Vision's 48.7. It also attains 96% classification accuracy, compared with 85% for the second-best model, IDEFICS.
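The dataset-construction step is described only at a high level; the sketch below illustrates how GPT-3.5 might be prompted to turn a PlantVillage-style image caption into open-ended, caption-grounded question-answer pairs. The prompt wording, function name, and example caption are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: generating open-ended QA pairs from a PlantVillage-style
# caption with GPT-3.5. Prompt text and helper names are assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def caption_to_qa_pairs(caption: str, n_pairs: int = 3) -> str:
    """Ask GPT-3.5 for open-ended QA pairs grounded in the given caption."""
    prompt = (
        "You are a plant-pathology expert. Given this image description:\n"
        f'"{caption}"\n'
        f"Write {n_pairs} open-ended question-answer pairs a grower might ask "
        "about the disease, its symptoms, and its management. Use only "
        "information supported by the description."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content

# Example with a hypothetical caption:
print(caption_to_qa_pairs(
    "A tomato leaf showing concentric brown lesions typical of early blight "
    "(Alternaria solani)."
))
```

The returned pairs would then be attached to the corresponding image to form the (image, question, answer) triples used for fine-tuning the vision-language model.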