Abstract: Research groups such as Microsoft Research and Google DeepMind have highlighted limitations of GPTs' next-word prediction approach, including poor planning, memory, and reasoning. Because GPTs generate text locally, without a global understanding of the task, they struggle with complex logic and with generating unseen code, as confirmed by our code comprehension studies.
We propose a new heterogeneous image paradigm for code encoding, inspired by diffusion techniques used in image and protein-structure generation. The approach encodes code globally as a combination of image-like and protein-like structures, avoiding the constraints of autoregressive generation. A CLIP-inspired text-to-code encoder then maps text to code across a variety of tasks.
Trained on 456,360 text-code pairs with self-supervised learning, the model achieves zero-error predictions, bridging the text and code encoding spaces. This work paves the way for diffusion-based code generation that overcomes the limitations of GPTs.
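To make the training objective concrete, below is a minimal sketch of a CLIP-style contrastive (symmetric InfoNCE) loss over paired text and code embeddings. The encoder architectures, embedding dimension, batch size, and temperature here are illustrative assumptions, not the authors' exact configuration.

```python
# A minimal sketch of a CLIP-style contrastive objective over paired text and
# code embeddings (assumed setup; not the authors' exact implementation).
import torch
import torch.nn.functional as F

def clip_style_loss(text_emb: torch.Tensor, code_emb: torch.Tensor,
                    temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss for a batch of matched text/code embeddings."""
    # L2-normalize so the dot product is a cosine similarity.
    text_emb = F.normalize(text_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)

    # Pairwise similarity matrix: entry (i, j) compares text i with code j.
    logits = text_emb @ code_emb.t() / temperature

    # Matched pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (text -> code and code -> text).
    loss_t2c = F.cross_entropy(logits, targets)
    loss_c2t = F.cross_entropy(logits.t(), targets)
    return (loss_t2c + loss_c2t) / 2

if __name__ == "__main__":
    # Random embeddings standing in for encoder outputs.
    batch, dim = 8, 256
    loss = clip_style_loss(torch.randn(batch, dim), torch.randn(batch, dim))
    print(loss.item())
```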
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: next-word prediction, CLIP, CLCP, global information, transferable
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: programming language
Submission Number: 6935