On-line Learning in Tree MDPs by Treating Policies as Bandit Arms
📰 ArXiv cs.AI
Learn to apply bandit algorithms to tree MDPs for online learning and regret minimization in sequential games
Action Steps
- Formulate a tree MDP to model a sequential game with perfect recall and stationary opponents
- Treat policies as bandit arms to apply bandit algorithms for online learning
- Evaluate the learner under a PAC or regret-minimization criterion, depending on whether you need a near-optimal policy or low cumulative loss
- Balance exploration and exploitation when choosing which policy (arm) to play in each round
- Analyze the regret bounds of the proposed algorithm to ensure efficient learning
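The steps above can be sketched in a few lines. The toy below is entirely hypothetical (the tree, the reward means, and the policy encoding are not from the paper): a depth-one tree MDP with two decision nodes, where each deterministic policy fixes one action per node, and each such policy is treated as a single bandit arm. A stationary opponent corresponds to fixed Bernoulli reward means, so a standard bandit algorithm such as UCB1 applies directly.

```python
import math
import random
from itertools import product

random.seed(0)

# Hypothetical toy setup: two decision nodes, two actions each.
# A deterministic policy assigns an action to every node, giving
# 2 * 2 = 4 policies; each policy is one bandit arm.
# A stationary opponent means each policy has a fixed reward mean.
REWARD_MEANS = {  # assumed Bernoulli means, for illustration only
    (0, 0): 0.3, (0, 1): 0.5, (1, 0): 0.6, (1, 1): 0.8,
}
policies = list(product([0, 1], repeat=2))  # the arms

def play(policy):
    """One episode: follow the policy through the tree, observe a reward."""
    return 1.0 if random.random() < REWARD_MEANS[policy] else 0.0

def ucb1(rounds=5000):
    """Run UCB1 over policies-as-arms; return the empirically best policy."""
    counts = {p: 0 for p in policies}
    sums = {p: 0.0 for p in policies}
    for t in range(1, rounds + 1):
        untried = [p for p in policies if counts[p] == 0]
        if untried:
            # Pull every arm once before using the UCB index.
            arm = untried[0]
        else:
            # UCB1 index: empirical mean plus an exploration bonus that
            # shrinks as an arm is pulled more often.
            arm = max(policies, key=lambda p: sums[p] / counts[p]
                      + math.sqrt(2 * math.log(t) / counts[p]))
        reward = play(arm)
        counts[arm] += 1
        sums[arm] += reward
    return max(policies, key=lambda p: sums[p] / counts[p])

best = ucb1()
print(best)  # with enough rounds, typically concentrates on the highest-mean policy
```

The catch the paper addresses is scale: the number of deterministic policies grows exponentially in the tree size, so naive enumeration as above only works on toy trees; the regret analysis in the paper concerns how to exploit the tree structure rather than treating arms as unrelated.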
Who Needs to Know This
Researchers and engineers working on sequential games, decision making, and reinforcement learning can use this reduction to bring well-understood bandit tools and regret guarantees to their models
Key Insight
💡 Treating policies as bandit arms enables the application of well-known bandit algorithms to tree MDPs for efficient online learning
Share This
🤖 Learn to apply bandit algorithms to tree MDPs for online learning and regret minimization in sequential games! 📈
DeepCamp AI