Maximum entropy on-policy reinforcement learning with monotonic policy improvement
Kapadia, Mustafa
Permalink
https://hdl.handle.net/2142/121384
Description
Title
Maximum entropy on-policy reinforcement learning with monotonic policy improvement
Author(s)
Kapadia, Mustafa
Issue Date
2023-07-21
Director of Research (if dissertation) or Advisor (if thesis)
Salapaka, Srinivasa M
Department of Study
Mechanical Sci & Engineering
Discipline
Mechanical Engineering
Degree Granting Institution
University of Illinois at Urbana-Champaign
Degree Name
M.S.
Degree Level
Thesis
Keyword(s)
Entropy Maximization
Deep Reinforcement Learning
Natural Policy Gradient Methods
Combinatorial Optimization
Abstract
This thesis focuses on using the maximum entropy framework to train policies, an approach known for superior exploration and robustness even in the presence of model and estimation errors. Our work encompasses the development of a theoretical foundation and a sample-based on-policy reinforcement learning algorithm based on the Maximum Entropy Principle (MEP). This algorithm ensures consistent, monotonic improvement of policies across iterations, regardless of the initial policy. Furthermore, our theoretical advancements provide a framework for extending the solution of Parameterized Markov Decision Processes (ParaMDP) to state and action spaces that were previously considered intractably large. We establish the criteria for a well-posed maximum-entropy reinforcement learning problem in scenarios with an extensive number of states and actions, as well as for infinite-horizon MDPs without a cost-free termination state. By incorporating the entropy over state-action trajectories (or paths) into the objective function, we derive performance-estimation error bounds under MEP. This analysis involves drawing parallels with, and extending, existing methods for on-policy reinforcement learning to cases where entropy maximization is added to the objective of the underlying optimization problem. We also introduce and analyze an ideal conservative policy iteration algorithm under MEP, and derive a practical sample-based algorithm that guarantees monotonic improvement. To evaluate the learning performance of our proposed algorithm, we conduct experiments on both continuous-control and discrete-control benchmark problems. We observe that the resulting algorithms improve monotonically with iterations and that the training curves exhibit O(1/T) behavior, where T is the number of iterations.
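For orientation, the following is a minimal sketch of an entropy-regularized objective of the kind the abstract describes, written in standard maximum-entropy RL notation. The discount factor gamma, the temperature 1/beta, and the reduction of the policy-dependent part of the trajectory entropy to a discounted sum of per-step policy entropies are assumptions made here for illustration, not the thesis's exact formulation.

\pi^{*} \;=\; \arg\max_{\pi}\; \mathbb{E}_{\tau \sim \pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\Big( r(s_t, a_t) \;+\; \tfrac{1}{\beta}\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big) \right],
\qquad
\mathcal{H}\big(\pi(\cdot \mid s)\big) \;=\; -\sum_{a} \pi(a \mid s)\, \log \pi(a \mid s).

Here the entropy bonus discourages premature collapse onto a single action and keeps exploration alive; as beta grows, the objective approaches the standard expected-return criterion.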