Abstract: The rapid development of generative models has significantly
advanced font generation. However, little attention has been devoted to the evaluation and interpretability of graphical fonts. Existing quality assessment models provide only basic visual analyses, such as judging clarity and brightness, without offering in-depth explanations. To address these
limitations, we first constructed a large-scale multimodal
dataset named the Diversity Font Dataset (DFD), comprising 135,000 font-text pairs. This dataset encompasses a wide
range of generated font types and annotations, including
language descriptions and quality assessments, thus providing a robust foundation for training and evaluating font
analysis models. Based on this dataset, we developed Font-Agent, an agent built upon a Vision-Language Model (VLM) that aims to enhance font quality assessment and offer interpretable question-answering capabilities. Alongside the VLM's original visual encoder, we integrated an Edge-Aware Traces
(EAT) module to capture detailed edge information of font
strokes and components. Furthermore, we introduced a Dynamic Direct Preference Optimization (D-DPO) strategy to
facilitate efficient model fine-tuning. Experimental results
demonstrate that Font-Agent achieves state-of-the-art performance on the proposed dataset. To further evaluate the
generalization ability of our algorithm, we conducted additional experiments on several public datasets. The results
highlight the notable advantage of Font-Agent in both assessing the quality of generated fonts and comprehending
their content.
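
The abstract only names the Edge-Aware Traces (EAT) module without describing its internals. As a rough illustration of what an edge-aware branch running alongside a VLM's visual encoder could look like, the minimal PyTorch sketch below extracts stroke edges with fixed Sobel kernels and projects them into token embeddings. Every design detail here (Sobel filtering, the patch-style projection, the tensor shapes) is an assumption for illustration, not the paper's actual architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EdgeAwareTraces(nn.Module):
        """Hypothetical edge branch: extracts stroke edges with fixed
        Sobel filters, then projects edge maps into the visual
        encoder's embedding space as extra tokens."""
        def __init__(self, embed_dim=768, patch=14):
            super().__init__()
            sobel_x = torch.tensor([[-1., 0., 1.],
                                    [-2., 0., 2.],
                                    [-1., 0., 1.]])
            sobel_y = sobel_x.t()
            # Fixed (non-learned) edge kernels, stored as buffers.
            self.register_buffer("kx", sobel_x.view(1, 1, 3, 3))
            self.register_buffer("ky", sobel_y.view(1, 1, 3, 3))
            # Learned projection from the edge map to token embeddings.
            self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch, stride=patch)

        def forward(self, images):                       # images: (B, 3, H, W)
            gray = images.mean(dim=1, keepdim=True)      # (B, 1, H, W)
            gx = F.conv2d(gray, self.kx, padding=1)
            gy = F.conv2d(gray, self.ky, padding=1)
            edges = torch.sqrt(gx ** 2 + gy ** 2 + 1e-6) # edge magnitude
            tokens = self.proj(edges)                    # (B, D, H/p, W/p)
            return tokens.flatten(2).transpose(1, 2)     # (B, N, D)

How these edge tokens are fused with the encoder's patch tokens (concatenation, addition, or cross-attention) is likewise not specified in the abstract; any fusion scheme paired with this sketch is an assumption.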
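
Similarly, D-DPO is only named, not described. For orientation, the sketch below implements the standard Direct Preference Optimization loss, with a hypothetical per-sample weight w standing in for whatever "dynamic" mechanism the paper actually uses; the weight, its name, and the default beta are all assumptions.

    import torch
    import torch.nn.functional as F

    def dpo_loss(logp_pi_chosen, logp_pi_rejected,
                 logp_ref_chosen, logp_ref_rejected,
                 beta=0.1, w=None):
        """Standard DPO objective over log-probs of chosen/rejected
        responses under the policy and a frozen reference model.
        `w` (optional) rescales each pair's preference margin."""
        margin = ((logp_pi_chosen - logp_ref_chosen)
                  - (logp_pi_rejected - logp_ref_rejected))
        if w is not None:        # hypothetical dynamic per-sample weighting
            margin = w * margin
        return -F.logsigmoid(beta * margin).mean()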