articleMar 1, 2010Closed access

Near-optimal Regret Bounds for Reinforcement Learning

Abstract

For undiscounted reinforcement learning in Markov decision processes (MDPs) we consider the total regret of a learning algorithm with respect to an optimal policy. In order to describe the transition structure of an MDP we propose a new parameter: An MDP has diameter D if for any pair of states s,s ′ there is a policy which moves from s to s ′ in at most D steps (on average). We present a reinforcement learning algorithm with total regret Õ(DS √ AT) after T steps for any unknown MDP with S states, A actions per state, and diameter D. A corresponding lower bound of Ω ( √ DSAT) on the total regret of any learning algorithm is given as well. These results are complemented by a sample complexity bound on the…

Citation impact

711
total citations
FWCI
39.10
Percentile
100%
References
26
Citations per year

Authors

3

Topics & keywords

Keywords
  • Regret
  • Markov decision process
  • Reinforcement learning
  • Logarithm
  • Upper and lower bounds
  • Mathematical optimization
  • Sample complexity
  • Mathematics
UN Sustainable Development Goals
  • Peace, Justice and strong institutions
No related works found for this paper.