Highlights
• Novel large-scale real-world dataset of music listening records.
• Debiasing yields slight improvements in the fairness of recommendation algorithms.
• Formalizing and measuring the extent to which recommendation algorithms compound data biases.

Abstract
Although recommender systems (RSs) play a crucial role in our society, previous studies have revealed that the performance of RSs may differ considerably between groups of individuals with different characteristics or demographics. In this sense, an RS is considered unfair when it does not perform equally well for different groups of users. Given the importance of RSs in the distribution and consumption of musical content worldwide, a careful evaluation of fairness in the context of music RSs is crucial. To this end, we first introduce LFM-2b, a novel large-scale real-world dataset of music listening records, which includes a subset for investigating the bias of RSs with respect to users' demographics. We then define a notion of fairness based on the performance gap of an RS between users with different demographics, and evaluate a variety of collaborative filtering algorithms in terms of accuracy and beyond-accuracy metrics to explore the fairness of the RS results toward a specific gender group. We observe significant discrepancies (unfairness) in the performance of the algorithms between male and female user groups. Based on these discrepancies, we explore to what extent recommender algorithms intensify the underlying population bias in the final results. We also study the effect of a resampling strategy, commonly used as a debiasing method, which yields slight improvements in the fairness measures of various algorithms while maintaining their accuracy and beyond-accuracy performance.
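The abstract describes fairness as a performance gap between user groups but does not give the exact formula. As a minimal sketch, assuming the gap is the difference between the mean per-user metric values of two groups, it could be computed as follows. The function name `group_performance_gap`, the group labels `"m"` and `"f"`, and the scores are hypothetical and chosen only for illustration; they are not taken from the paper or from LFM-2b.

```python
from statistics import mean


def group_performance_gap(scores, groups, group_a="m", group_b="f"):
    """Return mean(score | group_a) - mean(score | group_b).

    Assumption: the gap is a plain difference of group means; the
    paper's actual definition may differ.
    """
    by_group = {group_a: [], group_b: []}
    for score, group in zip(scores, groups):
        # Users outside the two compared groups are ignored.
        if group in by_group:
            by_group[group].append(score)
    if not by_group[group_a] or not by_group[group_b]:
        raise ValueError("both groups need at least one user")
    return mean(by_group[group_a]) - mean(by_group[group_b])


# Made-up per-user metric values (e.g. an accuracy metric), not real results.
user_scores = [0.5, 0.25, 0.75, 0.25]
user_groups = ["m", "m", "f", "f"]
gap = group_performance_gap(user_scores, user_groups)
```

Under this assumed definition, a gap of zero means the algorithm performs equally well on average for both groups, and the sign shows which group is better served.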