ReVision: A Dataset and Baseline VLM for Privacy-Preserving Task-Oriented Visual Instruction Rewriting
Abstract: Efficient and privacy-preserving multimodal interaction is essential as AR, VR, and modern smartphones with powerful cameras become primary interfaces for human-computer communication. Existing large vision-language models (VLMs) that enable multimodal interaction often rely on cloud-based processing, raising significant concerns about (1) visual privacy, since sensitive vision data is transmitted to servers, and (2) limited real-time, on-device usability. This paper explores **Visual Instruction Rewriting**, a novel approach that transforms multimodal instructions into text-only commands, allowing seamless integration of lightweight on-device instruction rewriter VLMs **(250M parameters)** with existing conversational AI systems and enhancing vision data privacy. To achieve this, we present a dataset of over 39,000 examples across 14 domains and develop a compact VLM, pretrained on image captioning datasets and fine-tuned for instruction rewriting. Experimental results, evaluated through NLG metrics such as BLEU, METEOR, and ROUGE, along with semantic parsing analysis, demonstrate that even a quantized version of the model (<500MB storage footprint) can achieve effective instruction rewriting, thus enabling privacy-focused, multimodal AI applications.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: vision language navigation; cross-modal pretraining; image text matching; cross-modal content generation; vision question answering; cross-modal application
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Previous URL: https://openreview.net/forum?id=KMIlx513aT
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: Yes, I want a different set of reviewers
Justification For Not Keeping Action Editor Or Reviewers: Despite our proactive and thorough efforts during the rebuttal phase, which included prompt and detailed responses to reviewer concerns and multiple follow-ups, there has been no engagement from the reviewers. This includes two reviewers (td5w and ps93) whose reviews contained demonstrable misunderstandings (e.g., misreading our privacy framing, requesting comparisons with post-submission work, and overlooking experimental results that directly addressed their questions). Without any acknowledgment or revision from the reviewers, our clarifications are rendered moot. We find it disheartening that the process has left no mechanism for author input to be meaningfully considered. We raised this issue earlier, hoping for oversight or intervention, but received no response. Therefore, we respectfully request a change of reviewers.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Limitations
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: 1, 2
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: 3
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: 1,2,3
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: 3, Ethical Considerations
B5 Documentation Of Artifacts: Yes
B5 Elaboration: 3
B6 Statistics For Data: Yes
B6 Elaboration: 3
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 3, 4, 5
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: 4, 6
C3 Descriptive Statistics: Yes
C3 Elaboration: 3, 4, 6
C4 Parameters For Packages: Yes
C4 Elaboration: 3, 4, 5, 6
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: Yes
D1 Elaboration: 3
D2 Recruitment And Payment: Yes
D2 Elaboration: 3
D3 Data Consent: Yes
D3 Elaboration: 3
D4 Ethics Review Board Approval: N/A
D4 Elaboration: We adhered to Amazon Mechanical Turk's policies, terms, and conditions.
D5 Characteristics Of Annotators: N/A
D5 Elaboration: We adhered to Amazon Mechanical Turk's policies, terms, and conditions.
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: Ethical Considerations
Author Submission Checklist: yes
Submission Number: 875