Abstract: This paper proposes using both audio input and subject information to predict a listener's personalized preference between two audio segments that have the same content but different quality. A siamese network compares the two inputs and predicts the preference, and several structures for each side of the network are investigated. The baseline structure, which uses only audio information, consists of a pretrained audio encoder followed by fully connected layers. Among the proposed structures, concatenating subject information with the audio embedding before feeding it into the fully connected layers yields the largest improvement over the baseline, raising overall accuracy from 77.56% to 78.04%. Experimental results also show that using the complete set of subject information (age, gender, and headphone/earphone specifications such as impedance, frequency response range, and sensitivity) is more effective than using a subset of it.
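
The concatenation structure described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. Several things here are assumptions made only for illustration: the pretrained audio encoder is replaced by precomputed embedding vectors, the layer sizes and random weights are arbitrary, the two branch scores are compared with a sigmoid of their difference, and all function names and subject feature values are hypothetical.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def make_branch(audio_dim, subject_dim, hidden_dim, seed=0):
    """Parameters for one branch; both sides of the siamese network share them."""
    rng = random.Random(seed)
    return {
        "hidden": init_layer(audio_dim + subject_dim, hidden_dim, rng),
        "out": init_layer(hidden_dim, 1, rng),
    }


def branch_score(params, audio_embedding, subject_info):
    """Concatenate the audio embedding with subject information, then apply
    the fully connected layers to obtain a scalar score."""
    x = list(audio_embedding) + list(subject_info)
    h = relu(linear(x, *params["hidden"]))
    return linear(h, *params["out"])[0]


def preference_probability(params, emb_a, emb_b, subject_info):
    """Probability that the subject prefers segment A over segment B.
    Assumption: the comparison is a sigmoid of the score difference."""
    s_a = branch_score(params, emb_a, subject_info)
    s_b = branch_score(params, emb_b, subject_info)
    return 1.0 / (1.0 + math.exp(-(s_a - s_b)))


# Hypothetical normalized subject features:
# age, gender, impedance, frequency response range, sensitivity.
subject = [0.3, 1.0, 0.32, 0.2, 0.97]
# Stand-ins for the pretrained encoder's embeddings of the two segments.
emb_a = [0.1, -0.2, 0.4, 0.0]
emb_b = [0.0, 0.3, -0.1, 0.2]

params = make_branch(audio_dim=4, subject_dim=5, hidden_dim=8)
p_ab = preference_probability(params, emb_a, emb_b, subject)
p_ba = preference_probability(params, emb_b, emb_a, subject)
```

Because both segments pass through the same branch parameters, swapping the inputs mirrors the prediction (`p_ab + p_ba` equals 1 up to rounding), and identical segments give a probability of 0.5. An audio-only baseline in this sketch would correspond to passing an empty subject vector with `subject_dim=0`.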