Abstract: Debt collection is used for risk control after credit card delinquency. Existing rule-based methods tend to be myopic and non-adaptive because feedback is delayed. Reinforcement learning (RL) has an inherent advantage in such tasks and can learn policies end-to-end. However, employing RL here remains difficult because the interaction process differs from that of standard RL and because of the notorious problem of overoptimistic value estimation in the offline setting. To tackle these challenges, we first propose an Alternating Q-Learning (AQL) framework that maps the debt collection process onto a comparable RL procedure. Based on AQL, we further develop Adversarial Conservative Alternating Q-Learning (ACAQL) to address overoptimistic value estimation. Specifically, an adversarial conservative value regularization is proposed to balance optimism and conservatism in the Q-values of out-of-distribution actions. Furthermore, ACAQL employs counterfactual action stitching to mitigate overestimation by augmenting the behavior data. Finally, we evaluate ACAQL on a real-world dataset constructed from the Bank of Shanghai. Offline experimental results show that our approach outperforms state-of-the-art methods and effectively alleviates the overestimation issue. Moreover, we conduct online A/B tests at the bank, where ACAQL achieves at least a 6% improvement in the debt recovery rate, yielding tangible economic benefits.
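The abstract does not spell out the exact form of the adversarial conservative value regularization; as a rough illustration of the idea of balancing conservatism on out-of-distribution actions against optimism on logged actions, the following sketch shows a generic CQL-style penalty with a tunable weight. All names (`q_net`, `alpha`, the batch shapes) are hypothetical and not taken from the paper.

```python
# Illustrative sketch only, not the paper's ACAQL loss. It shows a generic
# conservative value regularizer: push down Q-values of actions sampled from
# the current policy (potentially out-of-distribution) while pushing up
# Q-values of the logged behavior actions, with a weight `alpha` that could,
# e.g., be adapted adversarially to balance optimism and conservatism.
import torch


def conservative_regularizer(q_net, states, behavior_actions, sampled_actions, alpha):
    """states: (B, ds); behavior_actions: (B, da); sampled_actions: (B, N, da)."""
    B, N, da = sampled_actions.shape
    # Q-values of N candidate actions per state (possibly out-of-distribution).
    flat_states = states.unsqueeze(1).expand(B, N, states.shape[-1]).reshape(B * N, -1)
    q_ood = q_net(flat_states, sampled_actions.reshape(B * N, da)).view(B, N)
    # Q-values of the logged (in-distribution) actions.
    q_data = q_net(states, behavior_actions).squeeze(-1)
    # Conservative term: soft-maximum over candidate actions minus the logged
    # action's value; `alpha` controls how strongly OOD actions are penalized.
    return alpha * (torch.logsumexp(q_ood, dim=1) - q_data).mean()
```

In practice such a regularizer would be added to the standard TD loss of the Q-network; setting `alpha` too high makes the learned values overly pessimistic, while setting it too low leaves the overestimation problem unaddressed, which is the trade-off the abstract refers to.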