Keywords: computational creativity, model alignment
Abstract: We explore the potential for creative neural network models to learn from each other. To enable this, we train two policy models with exactly the same architecture, configuration, and data, but with different random seeds. We then obtain a judge model that automatically rates performance. We draw samples from each policy model, rate them, and optimize each model by maximizing the probability of the better sample and minimizing the probability of the worse sample, applying a popular model alignment technique, Direct Preference Optimization (DPO), in an online manner. The results show that our approach effectively improves model performance on three distinct, open-ended creative tasks: symbolic music generation, lyric generation, and lyric translation. However, it shows minimal benefit for a closed-ended task, Georgian-to-English machine translation.
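To make the preference-optimization step concrete, below is a minimal sketch of the DPO objective the abstract refers to, assuming the standard formulation (Rafailov et al., 2023): the judge-preferred sample is treated as "chosen" and the other as "rejected". The tensor names and dummy values are illustrative, not taken from the submission.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss: raise the policy's likelihood of the judge-preferred
    (chosen) sample and lower it for the dispreferred (rejected) sample,
    both measured relative to a frozen reference model."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy usage with dummy sequence log-probabilities for one preference pair
# (in the online setting, the pair would come from the two seed-variant
# policies and be ordered by the judge model's ratings).
policy_chosen = torch.tensor([-12.3], requires_grad=True)
policy_rejected = torch.tensor([-11.8], requires_grad=True)
ref_chosen = torch.tensor([-12.5])
ref_rejected = torch.tensor([-11.9])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # gradients push the chosen sample up and the rejected one down
```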
Track: Paper Track
Confirmation: Paper Track: I confirm that I have followed the formatting guideline and anonymized my submission.
Submission Number: 27