Cross-Modal Knowledge Distillation for Efficient Material Recognition: Aligning Language Descriptions with Tactile Image Models

Published: 01 Oct 2024, Last Modified: 03 Dec 2024
Venue: BoB Workshop 2024
License: CC BY 4.0
Keywords: Cross-Domain Distillation; VBTS; Material Recognition
TL;DR: We present a cross-modal knowledge distillation approach that significantly improves material recognition accuracy by transferring knowledge from a language model to a vision-based tactile model, validated through real-world experiments with a UR10 manipulator.
Abstract: Material recognition is critical in robotics and automation, enabling systems to accurately identify and classify materials for tasks such as manipulation and sorting. In this paper, we introduce a novel approach that leverages cross-modal knowledge distillation, in which a language-based teacher model distills knowledge into a vision-based student model trained on tactile images. Using the pre-trained Bidirectional and Auto-Regressive Transformer (BART) model as the teacher, which processes language descriptions of tactile properties, and a Vision Transformer (ViT) as the student, we align tactile and language representations through a knowledge distillation framework. Our distilled ViT model achieved significantly higher material recognition accuracy (74.70%) than a non-distilled ViT model (57.83%), demonstrating the value of integrating language-based knowledge for enhanced tactile material recognition. We also conduct real-world experiments in which a UR10 manipulator performs a material recognition task.
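To make the distillation framework concrete, the sketch below shows what such a language-to-tactile alignment objective could look like in PyTorch. The paper does not include code, so everything here is an illustrative assumption rather than the authors' implementation: the checkpoints (`facebook/bart-base`, `google/vit-base-patch16-224`), the linear projection head, the mean-pooled teacher embedding, the cosine alignment term, the number of material classes, and the loss weight `alpha` are all placeholders.

```python
# Minimal sketch of a cross-modal distillation objective. Checkpoints, the
# projection head, pooling choices, class count, and loss weighting are
# illustrative assumptions, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import BartTokenizer, BartModel, ViTModel

teacher_tok = BartTokenizer.from_pretrained("facebook/bart-base")
teacher = BartModel.from_pretrained("facebook/bart-base").eval()   # frozen language teacher
student = ViTModel.from_pretrained("google/vit-base-patch16-224")  # student for tactile images

proj = nn.Linear(student.config.hidden_size, teacher.config.d_model)  # map ViT space to BART space
clf = nn.Linear(student.config.hidden_size, 10)                       # e.g. 10 material classes (assumed)

def distill_loss(tactile_pixels, descriptions, labels, alpha=0.5):
    """Material-classification cross-entropy plus alignment to the teacher's text embedding."""
    with torch.no_grad():  # the teacher only supplies targets; no gradients flow into it
        t_in = teacher_tok(descriptions, return_tensors="pt",
                           padding=True, truncation=True)
        # Mean-pool BART encoder states over tokens: one language embedding per sample.
        t_emb = teacher(**t_in).encoder_last_hidden_state.mean(dim=1)
    s_emb = student(pixel_values=tactile_pixels).last_hidden_state[:, 0]  # ViT [CLS] embedding
    ce = F.cross_entropy(clf(s_emb), labels)                              # task loss
    align = 1.0 - F.cosine_similarity(proj(s_emb), t_emb).mean()          # cross-modal alignment loss
    return alpha * ce + (1.0 - alpha) * align
```

Under this setup, only the distilled ViT student (plus its classification head) would be needed at inference time, which is what makes the approach efficient: the language teacher is discarded after training.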
Submission Number: 6