Abstract: Highlights
• A novel end-to-end multimodal transformer framework is proposed for aesthetics prediction.
• A multimodal fusion layer is proposed to reflect the complex relationships among multimodal features.
• A new aesthetically oriented attention block is proposed for the image transformer.
• A new dataset of aesthetic comments on Western painting is presented.