Pareto Efficiency of Learning-Forgetting Trade-Off in Neural Language Model Adaptation

Jerome R. Bellegarda

Published: 2023, Last Modified: 06 May 2026, ASRU 2023, CC BY-SA 4.0

Abstract: To achieve effective on-device textual prediction, language modeling must account for changing user style, emerging idiosyncratic expressions, and other evolving linguistic events. But typical domain adaptation techniques are not designed to preserve domain-independent behavior, leading to a data-dependent trade-off between learning and forgetting. This paper studies the Pareto efficiency of this compromise within an adversarial adaptation framework under plausible real-world conditions. The idea is that adversarial constraints placed on the (sparse) on-device distribution prevent the adapted distribution from straying too far from the (dense) initial distribution resulting from pre-training. Cross-domain experiments confirm that such distribution-level constraints help retain generic linguistic knowledge acquired during pre-training. Prediction candidates therefore suffer less from unpredictable forgetting, while being better aligned with every user’s fine-grained personal interests.

External IDs: dblp:conf/asru/Bellegarda23