High-Throughput Ingestion for Video Warehouse: Comprehensive Configuration and Effective Exploration

Zepeng Li, Baiyan Zhang, Dongxiang Zhang, Huan Li, Kian-Lee Tan, Gang Chen

Published: 17 Jun 2025, Last Modified: 26 Jul 2025, OpenReview Archive Direct Upload, Everyone, CC BY 4.0

Abstract: The innovative concept of Video Extract-Transform-Load (V-ETL), recently proposed in Skyscraper, reinterprets large-scale video analytics as a data warehousing problem. In this study, we aim to enable real-time, high-throughput ingestion of hundreds of video streams while maximizing overall accuracy, by constructing a suitable ingestion plan for each video stream. To this end, we construct a comprehensive configuration space that covers the configurable components of the entire ingestion pipeline, including numeric parameters and categorical options such as visual inference model selection. The new space is 1 × 10^7 times larger than those of existing approaches, which become sub-optimal points in our space. To explore this huge and heterogeneous configuration space effectively, we devise an accuracy-aware search strategy based on graph embedding and reinforcement learning to establish the runtime-quality Pareto frontier. To reduce the configuration exploration cost across all video streams, we cluster video streams with similar contexts and adopt mixed integer programming to maximize the overall ingestion accuracy while meeting the real-time ingestion requirement. In an experimental evaluation on a single NVIDIA GeForce RTX 4090 GPU, our system, Hippo, supports real-time ingestion of 300 video streams and achieves an ingestion accuracy that exceeds that of its competitors by more than 30%.
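
The two optimization steps named in the abstract, building a runtime-quality Pareto frontier and then choosing one plan per stream cluster under a real-time budget, can be illustrated with a small sketch. This is not Hippo's implementation: the function names, the (runtime, accuracy) tuple layout, the budget units, and the sample numbers are all hypothetical, and an exhaustive search stands in for the mixed integer program so that the example stays self-contained.

```python
from itertools import product


def pareto_frontier(configs):
    """Return the (runtime, accuracy) points that no other point dominates.

    A point dominates another if it is no slower and strictly more accurate.
    `configs` is a hypothetical list of (runtime, accuracy) measurements.
    """
    frontier = []
    best_acc = float("-inf")
    # Scan from fastest to slowest; on equal runtime, the more accurate first.
    for runtime, acc in sorted(configs, key=lambda c: (c[0], -c[1])):
        if acc > best_acc:
            frontier.append((runtime, acc))
            best_acc = acc
    return frontier


def assign_plans(frontiers, stream_counts, budget):
    """Pick one frontier point per cluster to maximize total accuracy.

    Assumed model: a cluster with n streams that uses a plan costing
    `runtime` consumes n * runtime of the shared budget and contributes
    n * accuracy to the objective. Exhaustive search replaces the MIP
    solver here, so this only scales to a handful of clusters.
    """
    best_score, best_choice = float("-inf"), None
    for choice in product(*frontiers):
        cost = sum(n * rt for n, (rt, _) in zip(stream_counts, choice))
        if cost > budget:
            continue  # violates the real-time (budget) constraint
        score = sum(n * acc for n, (_, acc) in zip(stream_counts, choice))
        if score > best_score:
            best_score, best_choice = score, list(choice)
    return best_choice, best_score


# Hypothetical measurements for one cluster: (runtime cost, accuracy).
SAMPLE_CONFIGS = [(1.0, 0.5), (2.0, 0.7), (2.5, 0.6), (4.0, 0.9)]
```

Calling `pareto_frontier(SAMPLE_CONFIGS)` drops the point `(2.5, 0.6)`, since `(2.0, 0.7)` is both faster and more accurate. Passing the resulting frontier for each cluster to `assign_plans`, together with per-cluster stream counts and a budget, returns the feasible combination with the highest stream-weighted accuracy, or `None` if no combination fits the budget.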