A reproducibility study of “User-item fairness tradeoffs in recommendations”

TMLR Paper4303 Authors

21 Feb 2025 (modified: 16 Apr 2025) · Under review for TMLR · CC BY 4.0
Abstract: Recommendation systems are necessary to filter the abundance of information presented in our everyday lives. A recommendation system could exclusively recommend the items that users prefer the most, potentially resulting in certain items never being recommended. Conversely, an exclusive focus on including all items could hurt overall recommendation quality. This gives rise to the challenge of balancing user and item fairness. The paper “User-item fairness tradeoffs in recommendations” by Greenwood et al. (2024) explores this tradeoff by developing a theoretical framework that optimizes recommendations under user and item fairness constraints. Their theoretical framework suggests that the cost of item fairness is low when users have diverse preferences, and may be high for users whose preferences are misestimated. They empirically measured these phenomena by building their own recommendation system on arXiv preprints, and confirmed that the cost of item fairness is low for users with diverse preferences. However, contrary to their theoretical expectations, misestimated users did not incur a higher cost of item fairness. This study investigates the reproducibility of their research by replicating the empirical study. Additionally, we extend their research in two ways: (i) verifying the generalizability of their findings on a different dataset (Amazon books reviews), and (ii) analyzing the tradeoffs when recommending multiple items to a user instead of a single item. Our results further validate the claims made in the original paper. We conclude that the claims also hold when recommending multiple items, with the cost of item fairness decreasing as more items are recommended.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would like to thank the reviewers for their valuable feedback. We have incorporated their suggestions into our revised manuscript as follows:
- We clarified how the embeddings for the Amazon dataset were created.
- We clearly stated in Section 3.3.3 that the recommendation policy represents a distribution rather than a deterministic vector.
- We reduced ambiguity in the fairness definitions.
- We changed "system" to "framework" in Section 3.1.
- We added graphs in the Appendix showing the distribution of subcategories for both samples of the test set used in our experiments.
- We clarified how the similarity scores were stored in the original code base and which changes we made to the code.
- We added limitations of the original paper to the discussion and conclusion.
- In the introduction, we clarified why the Amazon books review dataset is a valuable dataset for evaluating the generalizability of the original work in other domains. In the discussion and conclusion, we now explicitly mention extrapolating this method to even more datasets as a direction for future research.
- We better structured the pipeline for the arXiv dataset experiment and elaborated more thoroughly on the embedding preparation, similarity scoring, and logistic regression process.
- We added a paragraph to the discussion and conclusion section reflecting on the discrepancy between the theoretical claim and the empirical result.
- We added a paragraph to the discussion and conclusion section describing how our findings relate to earlier research.
- After the first submission, we received a response from the authors of the original paper, which we included in the final remarks.
Assigned Action Editor: ~Dennis_Wei1
Submission Number: 4303
