Abstract: Despite the unparalleled capabilities of recently developed large language models (LLMs), controllable natural language generation (NLG), which aims to satisfy specific constraints during generation to facilitate practical applications, remains a challenging and important problem. Moreover, LLMs have several severe limitations compared to small language models, including applicability, responsiveness, and environmental impact, that prevent their use in many practical scenarios. Therefore, in this work, we explore Image-guided Story Ending Generation (IgSEG) with small language models, a task that requires the agent to respond to a multimodal environment and generate text following visual controls. Accordingly, we propose the Vision-Controllable Language Model (VCLM), which can integrate cross-modal inference ability into both trained-from-scratch and pretrained language models. First, we devise a multimodal-contextual cloud knowledge retrieval module, which periodically retrieves textual knowledge from a cloud LLM based on a multimodal query. Second, we design a multimodal prototype condensation module to condense key information in multimodal dependent representations. Third, text-only LM layers are interleaved with cross-modal intervention layers to infuse multimodal information. Finally, we propose vision-controlled reinforcement learning to explicitly constrain our model to follow visual controls. Extensive automatic and human evaluation results on two benchmark datasets indicate significant performance improvements of VCLM over state-of-the-art methods. Our code is available at https://github.com/LivXue/VCNLG.
External IDs: doi:10.1109/tmm.2026.3679122