Reproducibility study of FACTER: Fairness-Aware Conformal Thresholding and Prompt Engineering

TMLR Paper9343 Authors

31 May 2026 (modified: 03 Jun 2026) · Under review for TMLR · CC BY 4.0

Abstract: We present a reproducibility study of FACTER, a post-hoc framework that combines conformal thresholding with iterative prompt engineering to mitigate demographic bias in black-box LLM-based recommender systems. Using the released codebase and the experimental setting of the original paper, we evaluate FACTER on MovieLens-1M and Amazon Movies & TV with various LLM backbones. We assess fairness using the reported violation-based criterion together with group and counterfactual metrics (SNSR, CFR), and we measure recommendation quality via catalog-mapped ranking metrics (NDCG@10, Recall@10) alongside the validity rate of generated items (Valid@10). Across datasets and supported backbones, we reproduce FACTER’s key qualitative behavior: fairness violations decrease sharply and converge within a small number of calibration rounds. However, unlike the original study, we observe a collapse in recommendation quality, largely driven by the low validity of movie titles produced under open-vocabulary generation, inaccurate item mapping, and evaluation assumptions. We further identify and resolve multiple implementation and reproducibility issues in the released code, providing a cleaner and easier-to-run codebase to support future replication. Overall, our findings support FACTER’s effectiveness in reducing measured fairness violations, but at a higher cost in utility than originally reported.
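To make the quality metrics named in the abstract concrete, the following is a minimal sketch of how Valid@10, Recall@10, and NDCG@10 could be computed for a single user. It is an illustration under stated assumptions, not the FACTER codebase: the function names, the choice to divide Valid@k by k, and the binary-relevance form of NDCG are all hypothetical and may differ from the definitions used in the paper.

```python
import math

def valid_at_k(generated, catalog, k=10):
    """Share of the top-k generated titles found in the catalog.

    Assumption: the denominator is k, so short lists are penalized.
    """
    return sum(1 for title in generated[:k] if title in catalog) / k

def recall_at_k(ranked, relevant, k=10):
    """Share of the relevant items that appear in the top-k ranked list."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG@k with a log2 position discount."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(pos + 2)
        for pos, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 2) for pos in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (made-up data): one of three generated titles is not in the catalog.
catalog = {"Toy Story", "Heat", "Casino"}
generated = ["Toy Story", "Made Up Film", "Heat"]
held_out = {"Heat", "Casino"}

# Catalog-mapped list: titles that cannot be mapped are dropped before ranking metrics.
mapped = [title for title in generated if title in catalog]

valid = valid_at_k(generated, catalog, k=3)
recall = recall_at_k(mapped, held_out, k=10)
ndcg = ndcg_at_k(mapped, held_out, k=10)
```

In this sketch a title that fails catalog mapping can never count as a hit, so a low validity rate caps the ranking metrics regardless of how the fairness calibration behaves. This is one way the quality collapse described in the abstract could arise.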

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Jingyan_Wang1

Submission Number: 9343
