Model-free reinforcement learning in non-stationary Markov Decision Processes
Mao, Weichao
Permalink
https://hdl.handle.net/2142/110451
Description
- Title
- Model-free reinforcement learning in non-stationary Markov Decision Processes
- Author(s)
- Mao, Weichao
- Issue Date
- 2021-04-09
- Director of Research (if dissertation) or Advisor (if thesis)
- Başar, Tamer
- Department of Study
- Electrical and Computer Engineering
- Discipline
- Electrical and Computer Engineering
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- reinforcement learning
- Markov decision process
- non-stationarity
- Abstract
- Reinforcement learning (RL) studies the problem where an agent maximizes its cumulative reward through sequential interactions with an initially unknown environment, usually modeled by a Markov Decision Process (MDP). The classical RL literature typically assumes that the state transition functions and the reward functions of the MDP are time-invariant. Such a stationary model, however, cannot capture the dynamic nature of many sequential decision-making problems in practice. In this thesis, we consider the problem of reinforcement learning in \emph{non-stationary} MDPs. In our setting, both the reward functions and the state transition distributions are allowed to vary over time, either gradually or abruptly, as long as their cumulative variation magnitude does not exceed certain budgets. We propose an algorithm, named Restarted Q-Learning with Upper Confidence Bounds (RestartQ-UCB), for this setting, which adopts a simple restarting strategy and an extra optimism term. We theoretically show that RestartQ-UCB outperforms existing solutions in terms of dynamic regret, a notion commonly used to measure the performance of an online learning algorithm in a non-stationary environment. Specifically, RestartQ-UCB with Freedman-type bonus terms achieves a dynamic regret bound of $\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions, respectively, $\Delta>0$ is the variation budget, $H$ is the number of time steps per episode, and $T$ is the total number of time steps. We further show that our algorithm is nearly optimal by establishing an information-theoretic lower bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, which, to the best of our knowledge, is the first impossibility result that characterizes the fundamental limits of non-stationary RL in general. RestartQ-UCB is also, to our knowledge, the first model-free algorithm for non-stationary RL. Compared with model-based solutions, our algorithm is more time- and space-efficient, more flexible, and compatible with model-free deep RL architectures. We empirically evaluate RestartQ-UCB on RL tasks with both abrupt and gradual types of non-stationarity. Simulation results validate the advantages of RestartQ-UCB in terms of cumulative rewards and computational efficiency. We further demonstrate the power of our results through a ``learning to collaborate'' example in the context of multi-agent RL, where non-stationarity is a key challenge. (An illustrative sketch of the restarting idea follows this description.)
- Graduation Semester
- 2021-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/110451
- Copyright and License Information
- Copyright 2021 Weichao Mao
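Illustrative sketch of the restarting strategy
The abstract describes RestartQ-UCB as optimistic Q-learning paired with a simple restarting strategy, so that estimates gathered before the environment may have drifted are periodically discarded. The Python sketch below is a minimal illustration of that structure only, under assumptions of our own: the environment interface (`env.reset()`, `env.step(...)`), the fixed restart schedule (in the thesis the schedule and bonuses depend on problem parameters such as the variation budget $\Delta$), and the Hoeffding-style bonus (the thesis analyzes sharper Freedman-type bonuses) are placeholders, not the thesis's exact algorithm.

```python
import numpy as np

def restart_q_ucb(env, S, A, H, T, num_restarts, c=1.0):
    """Sketch of optimistic Q-learning with periodic restarts.

    Assumed (hypothetical) environment interface:
      env.reset() -> initial state index in [0, S)
      env.step(h, s, a) -> (reward, next_state)
    """
    K = T // H                                # number of episodes
    episodes_per_epoch = max(1, K // max(1, num_restarts))
    total_reward = 0.0

    for k in range(K):
        if k % episodes_per_epoch == 0:
            # Restart: forget estimates that may reflect an outdated MDP.
            Q = np.full((H, S, A), float(H))    # optimistic initialization
            N = np.zeros((H, S, A), dtype=int)  # visit counts

        s = env.reset()
        for h in range(H):
            a = int(np.argmax(Q[h, s]))       # act greedily w.r.t. optimistic Q
            r, s_next = env.step(h, s, a)
            total_reward += r

            N[h, s, a] += 1
            n = N[h, s, a]
            alpha = (H + 1) / (H + n)         # step size common in optimistic Q-learning
            bonus = c * np.sqrt(H ** 3 / n)   # Hoeffding-style exploration bonus (illustrative)
            v_next = 0.0 if h == H - 1 else min(float(H), Q[h + 1, s_next].max())
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_next + bonus)
            s = s_next

    return total_reward
```

Restarting addresses non-stationarity by keeping the learner from relying on statistics collected under transition and reward functions that may have since changed; how often to restart trades off this forgetting against the cost of re-exploration.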
Owning Collections
Graduate Dissertations and Theses at Illinois (primary)
Dissertations and Theses - Electrical and Computer Engineering