The CSV files contain the captions of the extended ROCOcap dataset.

The first and second columns of each CSV file correspond to the figures and their paired original ROCO captions. The subsequent columns contain the enriched captions produced by MLLM captioning (e.g., $T^1_{aug}, T^2_{aug}, T^3_{aug}, T^4_{aug}$ in Table 4 refer to the third, fourth, fifth, and sixth columns of the CSV file, respectively).
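As a rough illustration of this column layout, the sketch below splits each row into a figure entry, the original ROCO caption, and the list of enriched captions. It rests on several assumptions not stated above: that the first column holds a figure identifier and the second holds the caption text, that the files have no header row, and that the sample row is representative. The function names `split_row` and `read_captions` and the sample values are hypothetical; replace the in-memory sample with an opened CSV file from this repository.

```python
import csv
import io


def split_row(row):
    """Split one CSV row into (figure, original_caption, enriched_captions)."""
    figure = row[0]      # column 1: figure (assumed to be an identifier or file name)
    original = row[1]    # column 2: original ROCO caption (assumed)
    enriched = row[2:]   # columns 3 onward: enriched captions (T^1_aug, T^2_aug, ...)
    return figure, original, enriched


def read_captions(handle):
    """Read all rows from an open CSV handle, skipping rows with fewer than two columns."""
    return [split_row(row) for row in csv.reader(handle) if len(row) >= 2]


# Hypothetical in-memory sample standing in for one of the CSV files.
sample = io.StringIO(
    "fig_001.jpg,original caption,aug one,aug two,aug three,aug four\n"
)
records = read_captions(sample)

figure, original, enriched = records[0]
t1_aug, t2_aug, t3_aug, t4_aug = enriched[:4]  # third to sixth columns
```

If the files do include a header row, skip it with `next(reader)` before iterating, or use `csv.DictReader` and the actual column names.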

The full enriched dataset (ROCOcap) will be shared via Hugging Face upon acceptance.
