Abstract: Highlights
• Proposes a cross-scale alignment framework without scale constraints.
• Generates scale-adaptable semantic units by adaptive semantic aggregation.
• Ensures semantic consistency with Position- and Co-occurrence-aware subsequences.
• Filters out weak semantic associations by adaptive semantic filter.
• Learns accurate image-text similarity by semantic unit alignment.
External IDs: dblp:journals/pr/LiuXGC25