A very simple IPR example is to model the user preferences. In an IPR task, we can have two different IPR systems, each minimizing different loss functions. Assuming that none of those loss functions minimizes the interactions with the user directly, there is no way to know which is better in practice. Then, we can use reinforcement learning strategies on the fly in order to infer which the best system is while the system is interacting with the user. A possible approach, would be to use the system A for a while and then change to the system B for another period of time. After that sampling, we could have recorded some statistics, such as number of corrections by the user, or number of proposed hypotheses. It is important to highlight that this is an exploration phase. Once we have a proper model of the environment, the best system can be used for a while in an exploitation phase. Note that none of both strategies can be held forever if we want to minimize the regret.
