Hide-JEPA: A Joint Embedding Predictive Architecture for Cultural Cognition in Chinese Classical Gardens

19 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Cultural cognition; joint embedding prediction; Hide-JEPA; multimodal learning; YAIG dataset
TL;DR: This paper introduces a dataset and the Hide-JEPA model to help AI understand classical Chinese gardens, achieving ~80% accuracy and advancing cultural representation in computer vision and digital humanities.
Abstract: Chinese classical gardens, with their unique "form-meaning-atmosphere" cultural connotations, pose a significant challenge for cultural heritage identification. Enabling deep learning models to learn this complex "cultural grammar" from limited data is a core interdisciplinary challenge. This paper addresses it by constructing a reproducible experimental pipeline based on 11,804 real-world images of Chinese classical gardens, overcoming the limitations of existing AI visual systems. The images were sourced from public web crawling (6,421), museum scans (3,865), and author photography (1,518), and underwent rigorous quality control. We created the YAIG-mini dataset with a three-level annotation system: L1 (6 types), L2 (35 types), and L3 (object detection boxes for buildings, water, rocks, plants, etc.). To ensure quality, we implemented a three-stage process of automatic annotation, double-blind cross-review, and expert final review, achieving high consistency (L1: 0.96, L2: 0.88, L3 mAP@0.5: 0.79). On this dataset, we propose Hide-JEPA, a joint embedding predictive architecture that integrates self-supervised learning and multimodal feature fusion for deep semantic analysis. The experimental pipeline validates the ImageNet baseline, the gains from self-supervised I-JEPA pre-training, and the model's effectiveness in cultural consistency discrimination. Hide-JEPA achieves strong performance on cultural cognition tasks, with a classification accuracy of approximately 80%, providing a reproducible foundation for research in this interdisciplinary field.
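The abstract's core mechanism, predicting target-region embeddings from context-region embeddings rather than reconstructing pixels, can be sketched in a few lines. This is a minimal illustrative toy, not the paper's model: the patch shapes, the linear stand-ins for the context encoder, target encoder, and predictor, and the mean-pooled prediction are all assumptions made for brevity.

```python
# Toy sketch of an I-JEPA-style joint-embedding predictive objective.
# All names/shapes (W_ctx, W_tgt, W_pred, patch counts) are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
D_IN, D_EMB, N_CTX, N_TGT = 32, 16, 8, 4  # toy patch and embedding sizes

# A toy "image": visible context patches and masked-out target patches.
context_patches = rng.normal(size=(N_CTX, D_IN))
target_patches = rng.normal(size=(N_TGT, D_IN))

# Linear stand-ins for the context encoder, target encoder, and predictor.
W_ctx = rng.normal(size=(D_IN, D_EMB)) * 0.1
W_tgt = rng.normal(size=(D_IN, D_EMB)) * 0.1   # in practice an EMA copy, no gradients
W_pred = rng.normal(size=(D_EMB, D_EMB)) * 0.1

def jepa_loss() -> float:
    ctx_emb = context_patches @ W_ctx          # encode the visible context
    tgt_emb = target_patches @ W_tgt           # encode the masked targets
    pooled = ctx_emb.mean(axis=0)              # pool context to one vector (D_EMB,)
    pred = np.tile(pooled @ W_pred, (N_TGT, 1))  # predict each target embedding
    # Key property: the loss lives in embedding space, not pixel space.
    return float(np.mean((pred - tgt_emb) ** 2))

print(f"embedding-prediction loss: {jepa_loss():.4f}")
```

In a real training loop the encoders would be vision transformers, the target encoder would be an exponential-moving-average copy held fixed per step, and the predictor would condition on target patch positions; only the embedding-space regression objective is shown here.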
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 16472