# DKP

# Overview
In this paper, we propose DKP, a lightweight yet effective paradigm for image-text retrieval (ITR) that tackles parameter inefficiency and semantic misalignment in CLIP-based vision-language models (VLMs). Through empirical attention analysis, we identify a semantically crystallized key layer and build a KPA strategy around it, which substantially reduces training overhead while preserving cross-modal alignment quality. To mitigate semantic drift and strengthen intra-modal coherence, we further design a self-supervised SCD mechanism that leverages the model’s inherent relational structure without relying on external knowledge.
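As a rough illustration of how training only around a single key layer can cut the number of trainable parameters, the sketch below selects parameters by layer index from their names. It is not the authors' KPA implementation: the function `select_trainable`, the Hugging Face-style parameter names, and the choice of layer 7 are all assumptions made for the example.

```python
import re

# Assumption: parameter names follow a "...layers.<index>..." layout,
# as in Hugging Face-style CLIP checkpoints. The names below are illustrative.
LAYER_PATTERN = re.compile(r"\.layers\.(\d+)\.")


def select_trainable(param_names, key_layer):
    """Return the parameter names that belong to the given layer index."""
    trainable = []
    for name in param_names:
        match = LAYER_PATTERN.search(name)
        if match and int(match.group(1)) == key_layer:
            trainable.append(name)
    return trainable


example_names = [
    "vision_model.embeddings.position_embedding.weight",
    "vision_model.encoder.layers.6.mlp.fc1.weight",
    "vision_model.encoder.layers.7.self_attn.q_proj.weight",
    "text_model.encoder.layers.7.mlp.fc2.bias",
    "text_model.encoder.layers.11.layer_norm1.weight",
]

# Only the two layer-7 entries are selected; everything else would stay frozen.
print(select_trainable(example_names, key_layer=7))
```

With a PyTorch model, the selected names could then be used to set `requires_grad` while iterating over `model.named_parameters()`, so that the optimizer only updates the chosen layer.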


![DKP framework](model/framework.png)


# Setup

Python >= 3.9

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118

pip install transformers sentence-transformers tqdm scikit-learn ftfy
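A quick way to confirm the environment before launching a run is to check the interpreter version and whether each package can be located. The script below is a minimal sketch using only the standard library; the helper names `missing_modules` and `python_ok` are hypothetical and not part of this repository.

```python
import sys
from importlib.util import find_spec

# Import names for the packages installed above. Note that scikit-learn is
# imported as "sklearn" and sentence-transformers as "sentence_transformers".
REQUIRED_MODULES = [
    "torch",
    "torchvision",
    "torchaudio",
    "transformers",
    "sentence_transformers",
    "tqdm",
    "sklearn",
    "ftfy",
]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


def python_ok(version_info=sys.version_info, minimum=(3, 9)):
    """Check that the interpreter meets the minimum (major, minor) version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    print("Python >= 3.9:", python_ok())
    print("Missing modules:", missing_modules() or "none")
```

The check only confirms that the packages are discoverable; it does not verify the pinned versions or that CUDA is available.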


# Task

## For COCO:
torchrun --nproc_per_node=4 --master-port 15160 retrieval.py --config "./configs/vitb32/coco/kl_7.yaml"

torchrun --nproc_per_node=4 --master-port 15160 retrieval.py --config "./configs/vitb16/coco/kl_7.yaml"


## For Flickr30k:
torchrun --nproc_per_node=4 --master-port 15160 retrieval.py --config "./configs/vitb32/flickr/kl_7.yaml"

torchrun --nproc_per_node=4 --master-port 15160 retrieval.py --config "./configs/vitb16/flickr/kl_7.yaml"
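
The four commands above differ only in backbone and dataset, so they can be generated from one template. The following is a small sketch, assuming the same flags, port, and config layout as shown above; `build_command` is a hypothetical helper, not a script shipped with this repository.

```python
import shlex


def build_command(backbone, dataset, nproc=4, port=15160):
    """Build the torchrun argument list for one backbone/dataset pair."""
    config = f"./configs/{backbone}/{dataset}/kl_7.yaml"
    return [
        "torchrun",
        f"--nproc_per_node={nproc}",
        "--master-port",
        str(port),
        "retrieval.py",
        "--config",
        config,
    ]


# Print the command line for every combination used in this README.
for dataset in ("coco", "flickr"):
    for backbone in ("vitb32", "vitb16"):
        print(shlex.join(build_command(backbone, dataset)))
```

To launch the runs one after another instead of printing them, each list can be passed to `subprocess.run(cmd, check=True)`. Adjust `nproc` to the number of GPUs available on the machine.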
