Keywords: Posterior Sampling, Foundation Model, Online Learning, Sequential Decision Making Under Uncertainty
TL;DR: Ensemble++ approximates Thompson sampling via shared-factor updates, reducing the required ensemble size while maintaining performance in both linear bandits and GPT-based tasks.
Abstract: Thompson Sampling is a principled uncertainty-driven method for active exploration, but its real-world adoption is impeded by the high computational overhead of posterior maintenance in large-scale or non-conjugate settings. Ensemble-based approaches offer partial remedies, but often require a large ensemble size. This paper proposes Ensemble++, a scalable agent that sidesteps these limitations through a shared-factor ensemble update architecture and a random linear combination scheme. We prove that in linear bandits, the Ensemble++ agent needs an ensemble size of only $\Theta(d \log T)$ to achieve regret guarantees comparable to exact Thompson Sampling. Further, to handle nonlinear rewards and complex environments, we introduce a neural extension that replaces fixed features with a learnable representation, preserving the same underlying objective via gradient-based updates. Empirical results confirm that the Ensemble++ agent excels in both sample efficiency and computational scalability across linear and nonlinear environments, including GPT-based contextual bandits for automated content moderation -- a safety-critical foundation model online decision-making task.
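The abstract's core mechanism -- shared sufficient statistics plus a random linear combination over ensemble factors to draw approximate posterior samples -- can be illustrated with a minimal toy sketch. This is not the paper's implementation: the statistic names (`A`, `b`), the perturbation-factor matrix `E`, and the noise scaling are illustrative assumptions for a linear-bandit setting.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 4, 8          # feature dim, ensemble size (theory suggests Theta(d log T))
A = np.eye(d)        # shared Gram matrix (ridge-regularized)
b = np.zeros(d)      # shared reward-weighted feature sum
# perturbation factors, one column per ensemble member (assumed form)
E = rng.normal(size=(d, m)) / np.sqrt(m)

def sample_theta():
    """Draw an approximate posterior sample as the regularized
    least-squares mean plus a random linear combination of factors."""
    zeta = rng.normal(size=m)          # random combination weights
    return np.linalg.solve(A, b + E @ zeta)

def update(x, r):
    """Shared-factor update: all members reuse the same statistics;
    per-member randomness enters only through the factor matrix E."""
    global A, b, E
    A += np.outer(x, x)
    b += r * x
    E += np.outer(x, rng.normal(size=m)) / np.sqrt(m)

# one round of a toy 5-armed linear bandit
arms = rng.normal(size=(5, d))
theta = sample_theta()
chosen = arms[np.argmax(arms @ theta)]
update(chosen, r=float(chosen @ np.ones(d)))
```

The point of the sketch: sampling costs one `m`-dimensional Gaussian draw and a linear solve on shared statistics, rather than maintaining `m` independent posteriors.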
Submission Number: 17