Value learning through Bellman residuals and neural function approximations in deterministic systems
Saleh, Ehsan
Permalink
https://hdl.handle.net/2142/110566
Description
- Title
- Value learning through Bellman residuals and neural function approximations in deterministic systems
- Author(s)
- Saleh, Ehsan
- Issue Date
- 2021-04-27
- Director of Research (if dissertation) or Advisor (if thesis)
- Bretl, Timothy M
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Bellman Residual Minimization, Deterministic Systems, Reinforcement Learning, Function Approximation, Temporal Differences, TD-Learning
- Abstract
- Deterministic Bellman residual minimization (DBRM) is the problem of learning values in a Markov Decision Process (MDP) by optimizing the Bellman residuals directly, without any heuristic surrogates. Compared to traditional Approximate Dynamic Programming (ADP) methods, this approach has both advantages and disadvantages. One advantage is that the underlying optimization problem provably converges to the desired solution with better theoretical sample efficiency guarantees, whereas ADP heuristics are prone to divergence and carry weaker theoretical sample efficiency. The disadvantage of Bellman residual minimization (BRM) is that it requires two independent future samples, a.k.a. the double sampling requirement, to form an unbiased estimate of the Bellman residual. Although some versions of BRM have superior theoretical properties, that superiority relies on the double sampling trick, which limits their applicability to simulator environments with a state resetting functionality; the algorithms cannot be applied to non-simulator environments where state resetting is unavailable. This requirement can trivially be waived under deterministic dynamics, since any combination of observations and actions is guaranteed to generate identical future observations. Because of the double sampling requirement, BRM methods have tended to be overlooked in the literature, and this work evaluates the efficacy of the approach compared to the more common ADP heuristics. We make the simple observation that BRM can be applied without state resetting if the environment is deterministic, which Baird hinted at in his original 1995 paper on residual algorithms. The resulting algorithm is simple, convergent, and works well on benchmark control problems; its implementation can be as simple as changing a single line of code in its ADP counterpart (see the sketch following this record). We compare Q-learning to its DBRM version theoretically, confirm the theoretical predictions in experiments, and also uncover some surprising empirical behaviors, for which we provide explanations.
- Graduation Semester
- 2021-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/110566
- Copyright and License Information
- Copyright by Ehsan Saleh 2021. All rights reserved.
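
The abstract's "single line of code" remark can be illustrated with a minimal sketch, not taken from the thesis itself: assuming a PyTorch Q-network that maps a state batch to per-action values and a batch of transitions (s, a, r, s', done), the only difference between the semi-gradient Q-learning loss and its deterministic BRM version is whether the bootstrap target is detached from the gradient. All function and variable names below are illustrative.

```python
# Minimal sketch (illustrative, not the thesis code) contrasting the
# semi-gradient Q-learning update with its deterministic Bellman residual
# minimization (DBRM) version. Assumes q_net(states) -> [batch, n_actions].
import torch

def td_losses(q_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    next_max = q_net(s_next).max(dim=1).values              # max_a' Q(s', a')
    target = r + gamma * (1.0 - done) * next_max            # Bellman backup

    # Semi-gradient Q-learning (ADP heuristic): the bootstrap target is
    # treated as a constant, so no gradient flows through Q(s', .).
    qlearning_loss = ((q_sa - target.detach()) ** 2).mean()

    # Deterministic BRM: the one-line change is dropping the detach, so the
    # squared Bellman residual is minimized with gradients through both
    # Q(s, a) and the bootstrap term. Under deterministic dynamics this
    # objective is unbiased without double sampling.
    dbrm_loss = ((q_sa - target) ** 2).mean()

    return qlearning_loss, dbrm_loss
```

Dropping the detach makes the update a true gradient of the squared Bellman residual, which is the source of the convergence behavior contrasted with ADP heuristics in the abstract.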
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Computer Science