Abstract
There have been many recent advances in provably efficient reinforcement learning for problems with rich observation spaces and general function classes. Unfortunately, all such approaches rely on a realizability assumption, which requires the function class to contain the optimal value function of the true MDP model, an assumption that rarely holds in real-world settings. In this work, we consider the more realistic setting of agnostic reinforcement learning with a policy class (which may not contain any near-optimal policy). We provide an algorithm for this setting and prove instance-dependent regret bounds when the MDP has small rank $d$. Our bounds scale exponentially with the rank $d$ in the worst case but, importantly, are polynomial in the horizon, the number of actions, and the logarithm of the number of policies. We further show, through a nearly matching lower bound, that this exponential dependence on the rank is unavoidable.