Maximum entropy on-policy reinforcement learning with monotonic policy improvement
Kapadia, Mustafa
Permalink
https://hdl.handle.net/2142/121384
Description
Title
Maximum entropy on-policy reinforcement learning with monotonic policy improvement
Author(s)
Kapadia, Mustafa
Issue Date
2023-07-21
Director of Research (if dissertation) or Advisor (if thesis)
Salapaka, Srinivasa M
Department of Study
Mechanical Sci & Engineering
Discipline
Mechanical Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Entropy Maximization
Deep Reinforcement Learning
Natural Policy Gradient Methods
Combinatorial Optimization
Abstract
This thesis focuses on using the maximum entropy framework to train policies, an approach known for superior exploration and robustness even in the presence of model and estimation errors. Our work encompasses the development of a theoretical foundation and a sample-based on-policy reinforcement learning algorithm based on the Maximum Entropy Principle (MEP). This algorithm ensures consistent, monotonic improvement of policies across iterations, regardless of the initial policy. Furthermore, our theoretical advancements provide a framework for extending the solution of Parameterized Markov Decision Processes (ParaMDP) to state and action spaces that were previously considered intractably large. We establish the criteria for a well-posed maximum-entropy reinforcement learning problem in scenarios with an extensive number of states and actions, as well as for infinite-horizon MDPs without a cost-free termination state. By incorporating the entropy over state-action trajectories (or paths) into the objective function, we derive performance-estimation error bounds under MEP. This analysis involves drawing parallels with, and extending, existing methods for on-policy reinforcement learning to cases where entropy maximization is added to the objective of the underlying optimization problem. We also introduce and analyze an ideal conservative policy iteration algorithm under MEP, and derive a practical sample-based algorithm that guarantees monotonic improvement. To evaluate the learning performance of our proposed algorithm, we conduct experiments on both continuous-control and discrete-control benchmark problems. We observe that the resulting algorithms improve monotonically with iterations and that the training curves exhibit O(1/T) behavior, where T is the number of iterations.
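For orientation, the following is a minimal sketch of an entropy-regularized objective of the kind the abstract describes, written in standard maximum-entropy RL notation. The discount factor gamma, the temperature 1/beta, and the reduction of the policy-dependent part of the trajectory entropy to a discounted sum of per-step policy entropies are assumptions made here for illustration, not the thesis's exact formulation.

\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;+\; \tfrac{1}{\beta}\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s).

Here the entropy bonus discourages premature collapse onto a single action and keeps exploration alive; as beta grows, the objective approaches the standard expected-return criterion.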