Abstract: Customizing Large Language Models (LLMs) for specific tasks demands high-quality, domain-specific datasets. Existing solutions often struggle to extract meaningful, structured data from unstructured video content, leading to inefficiencies and limitations in LLM training. This paper is motivated by the need to address these pain points and to develop a more effective method for generating high-quality datasets. We present a data generation pipeline that transforms unstructured video content into a structured format better suited to LLM training. Our approach begins with video processing techniques such as object detection, speech-to-text transcription, and sentiment analysis to extract crucial information. This information is then refined into customized datasets optimized for LLM input. Further stages adapt this structured data into different formats aligned with LLM architectures, enabling flexibility in data utilization. The final phase focuses on fine-tuning LLMs for specialized applications in both software environments and hardware integrations. We also demonstrate that our pipeline significantly enhances LLM performance in these applications. Our findings emphasize the potential of video-based datasets to augment LLM capabilities, suggesting a scalable method that improves the efficiency of artificial intelligence training and expands the applicability of LLMs in current and future technological landscapes. Compared to traditional methods, our solution offers improved data quality, versatility in data formats, and superior model performance across diverse applications.
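The structured intermediate representation described in the abstract can be sketched as follows. This is a minimal illustration only: the segment schema (detected objects, transcript, sentiment) and the JSONL prompt/response output format are assumptions for clarity, not the paper's actual implementation.

```python
import json
from dataclasses import dataclass, field


@dataclass
class VideoSegment:
    """One analyzed span of video: detected objects, transcript, sentiment."""
    start_s: float
    end_s: float
    objects: list = field(default_factory=list)  # labels from object detection
    transcript: str = ""                         # speech-to-text output
    sentiment: str = "neutral"                   # coarse sentiment label


def to_instruction_record(seg: VideoSegment) -> str:
    """Serialize a segment as one JSONL line of instruction-style training data.

    The prompt/response layout here is a hypothetical example of 'adapting
    structured data into formats that align with LLM architectures'.
    """
    prompt = (f"Describe the scene from {seg.start_s:.1f}s to {seg.end_s:.1f}s "
              f"containing: {', '.join(seg.objects)}.")
    response = f"({seg.sentiment}) {seg.transcript}"
    return json.dumps({"prompt": prompt, "response": response})


seg = VideoSegment(0.0, 4.2, objects=["person", "laptop"],
                   transcript="Let's get started with the demo.",
                   sentiment="positive")
line = to_instruction_record(seg)
```

Each video segment thus becomes one self-contained training example, and swapping `to_instruction_record` for a different serializer would retarget the same structured data at a different model input format.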