COGGEN: BRIDGING VISUAL MIMICRY AND COGNITIVE ALIGNMENT IN FRONTEND CODE GENERATION VIA GAZE-ATTENTION DIFFUSION
Abstract: Contemporary frontend code generation paradigms predominantly rely on Multimodal Large Language Models (MLLMs)
to map static visual artifacts to Document Object Model
(DOM) structures. While effective at visual imitation, these
approaches suffer from "interaction blindness": they generate code that is visually faithful but functionally brittle or cognitively taxing for end users. In this paper, we propose
CogGen, a neuro-symbolic framework that redefines interface synthesis as a trajectory optimization problem within
a latent user-intent manifold. Unlike direct pixel-to-code
translation, CogGen introduces a Gaze-Attention Diffusion
Bridge that hallucinates temporal interaction heatmaps prior
to code generation, effectively predicting user focus flow before syntax construction. We further propose a differentiable
"Cognitive Load Loss" function, trained on a massive dataset of simulated eye-tracking and cursor dynamics, which penalizes generated abstract syntax trees (ASTs) that induce high friction or accessibility violations, even if they satisfy the visual prompt. By integrating a lightweight, differentiable
visual prompt. By integrating a lightweight, differentiable
rendering engine directly into the gradient loop, CogGen
optimizes for interaction ergonomics rather than mere pixel
reconstruction error. Experiments across the WebBench-2026
suite demonstrate that CogGen achieves a 42% reduction in
predicted user interaction latency and spontaneously corrects
"dark patterns" in UI designs, significantly outperforming
state-of-the-art MLLMs in functional robustness while maintaining high visual fidelity. This work establishes a new
frontier in human-centric program synthesis, shifting the
objective from visual mimicry to cognitive alignment.
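To make the "Cognitive Load Loss" idea concrete, the following is a minimal illustrative sketch, not the paper's implementation: it assumes a predicted gaze-attention distribution over UI elements and hypothetical per-element friction scores, and penalizes both high expected interaction cost and scattered user focus (via an entropy term). All names and weights here are assumptions for illustration.

```python
import numpy as np

def cognitive_load_loss(attention, friction, entropy_weight=0.1):
    """Illustrative sketch of a cognitive-load-style loss (hypothetical).

    attention: predicted gaze probabilities over UI elements (sums to 1),
               e.g. from a gaze-attention heatmap pooled per element.
    friction:  assumed per-element interaction cost (e.g. small tap
               targets or low-contrast text would score higher).

    Loss = expected friction under the predicted gaze distribution,
    plus an entropy penalty that discourages layouts scattering focus.
    """
    attention = np.asarray(attention, dtype=float)
    friction = np.asarray(friction, dtype=float)
    # Expected interaction cost weighted by where the user is likely to look.
    expected_friction = float(np.dot(attention, friction))
    # Shannon entropy of the attention distribution (focus dispersion).
    entropy = float(-np.sum(attention * np.log(attention + 1e-12)))
    return expected_friction + entropy_weight * entropy

# A layout that concentrates attention on a low-friction element scores
# lower than one that scatters attention across high-friction elements.
focused = cognitive_load_loss([0.9, 0.05, 0.05], [0.1, 1.0, 1.0])
scattered = cognitive_load_loss([1/3, 1/3, 1/3], [0.1, 1.0, 1.0])
```

In the full framework this quantity would be computed on differentiably rendered outputs so that gradients flow back to the code generator; the NumPy version above only demonstrates the scalar objective.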