articleMar 1, 2010Closed access
Near-optimal Regret Bounds for Reinforcement Learning
Abstract
For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s ′ there is a policy which moves from s to s ′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS √ AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω ( √ DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the…
Citation impact
711
total citations
- FWCI
- 39.10
- Percentile
- 100%
- References
- 26
Citations per year
Authors
3Topics & keywords
Topics
Keywords
- Regret
- Markov decision process
- Reinforcement learning
- Logarithm
- Upper and lower bounds
- Mathematical optimization
- Sample complexity
- Mathematics
UN Sustainable Development Goals
- Peace, Justice and strong institutions
No related works found for this paper.