Value learning through Bellman residuals and neural function approximations in deterministic systems
Saleh, Ehsan
Permalink
https://hdl.handle.net/2142/110566
Description
- Title
- Value learning through Bellman residuals and neural function approximations in deterministic systems
- Author(s)
- Saleh, Ehsan
- Issue Date
- 2021-04-27
- Director of Research (if dissertation) or Advisor (if thesis)
- Bretl, Timothy M
- Department of Study
- Computer Science
- Discipline
- Computer Science
- Degree Granting Institution
- University of Illinois at Urbana-Champaign
- Degree Name
- M.S.
- Degree Level
- Thesis
- Keyword(s)
- Bellman Residual Minimization, Deterministic Systems, Reinforcement Learning, Function Approximation, Temporal Differences, TD-Learning
- Abstract
- Deterministic Bellman residual minimization (DBRM) is the problem of learning values in a Markov Decision Process (MDP) by optimizing the Bellman residuals directly, without any heuristic surrogates. Compared to traditional Approximate Dynamic Programming (ADP) methods, this approach has both advantages and disadvantages. One advantage is that the underlying optimization problem provably converges to the desired solution with better theoretical sample efficiency guarantees, whereas ADP heuristics are prone to divergence and carry weaker theoretical sample efficiency. The disadvantage of Bellman residual minimization (BRM) is that it requires two independent future samples, a.k.a. the double sampling requirement, to form an unbiased estimate of the Bellman residual. Although some versions of BRM have superior theoretical properties, that superiority relies on the double sampling trick, which limits their applicability to simulator environments with a state resetting functionality; the algorithms cannot be applied to non-simulator environments where state resetting is unavailable. This requirement can trivially be waived under deterministic dynamics, since any combination of observations and actions is guaranteed to generate identical future observations. Because of the double sampling requirement, BRM methods have tended to be overlooked in the literature, and this work evaluates the efficacy of the approach compared to the more common ADP heuristics. We make the simple observation that BRM can be applied without state resetting if the environment is deterministic, which Baird hinted at in his original 1995 paper on residual algorithms. The resulting algorithm is simple, convergent, and works well on benchmark control problems; its implementation can be as simple as changing a single line of code in its ADP counterpart (see the sketch following this record). We compare Q-learning to its DBRM version theoretically, confirm the theoretical predictions in experiments, and also uncover some surprising empirical behaviors, for which we provide explanations.
- Graduation Semester
- 2021-05
- Type of Resource
- Thesis
- Permalink
- http://hdl.handle.net/2142/110566
- Copyright and License Information
- Copyright by Ehsan Saleh 2021. All rights reserved.
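
The abstract's "single line of code" remark can be illustrated with a minimal sketch, not taken from the thesis itself: assuming a PyTorch Q-network that maps a state batch to per-action values and a batch of transitions (s, a, r, s', done), the only difference between the semi-gradient Q-learning loss and its deterministic BRM version is whether the bootstrap target is detached from the gradient. All function and variable names below are illustrative.

```python
# Minimal sketch (illustrative, not the thesis code) contrasting the
# semi-gradient Q-learning update with its deterministic Bellman residual
# minimization (DBRM) version. Assumes q_net(states) -> [batch, n_actions].
import torch

def td_losses(q_net, s, a, r, s_next, done, gamma=0.99):
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a)
    next_max = q_net(s_next).max(dim=1).values              # max_a' Q(s', a')
    target = r + gamma * (1.0 - done) * next_max            # Bellman backup

    # Semi-gradient Q-learning (ADP heuristic): the bootstrap target is
    # treated as a constant, so no gradient flows through Q(s', .).
    qlearning_loss = ((q_sa - target.detach()) ** 2).mean()

    # Deterministic BRM: the one-line change is dropping the detach, so the
    # squared Bellman residual is minimized with gradients through both
    # Q(s, a) and the bootstrap term. Under deterministic dynamics this
    # objective is unbiased without double sampling.
    dbrm_loss = ((q_sa - target) ** 2).mean()

    return qlearning_loss, dbrm_loss
```

Dropping the detach makes the update a true gradient of the squared Bellman residual, which is the source of the convergence behavior contrasted with ADP heuristics in the abstract.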
Owning Collections
Graduate Dissertations and Theses at Illinois (PRIMARY)
Dissertations and Theses - Computer Science