Hierarchical Self-Imitation Learning in Single-Agent Sparse Reward Environments
Chakraborty, Neeloy
Permalink
https://hdl.handle.net/2142/110312
Description
Title
Hierarchical Self-Imitation Learning in Single-Agent Sparse Reward Environments
Author(s)
Chakraborty, Neeloy
Contributor(s)
Driggs-Campbell, Katherine
Issue Date
2021-05
Keyword(s)
reinforcement learning
sparse/delayed rewards
self-imitation learning
hierarchical learning
Abstract
Reinforcement learning problems with sparse and delayed rewards are challenging to solve because algorithms must explore the environment extensively before gaining experience from high-performing rollouts. Classical methods of encouraging exploration during training, such as epsilon-greedy and noise-based exploration, are not adequate on their own to cover large state spaces (Fortunato et al., 2018). Self-imitation learning (SIL) has been shown to allow an agent to mimic its own high-performing, long-horizon trajectories, but SIL is heavily reliant on exploration to find such trajectories in the first place (Oh et al., 2018). Hierarchical learning (HL), on the other hand, may be unstable during training, but the noise and subgoal failures it introduces explore the environment effectively, and it can learn tasks with higher sample efficiency (Levy et al., 2019). This thesis presents a single-agent reinforcement learning algorithm that combines the strengths of SIL and HL: Generative Adversarial Self-Imitation Learning + Hierarchical Actor-Critic (GASIL+HAC). GASIL+HAC represents the policy as multiple trainable levels of Deep Deterministic Policy Gradient (DDPG) optimizers from Lillicrap et al. (2016), where the higher-level policies set waypoints that guide the lower-level policies toward the highest cumulative return. The highest-level policy of the hierarchy is trained with GASIL on the sparse environment reward to set goals that imitate past well-performing trajectories, while the lower levels are trained on an artificial reward signal to set intermediate goals and achieve the desired high-level path. We perform experiments in OpenAI's Multi-Agent Particle Environment in sparse and delayed-reward stochastic scenarios to identify the benefits and hindrances of GASIL+HAC compared to DDPG, GASIL, and HAC in terms of sample efficiency, generalizability, exploration, and goal reachability. Through these experiments, we find that GASIL+HAC has the potential to increase sample efficiency in stochastic tasks and to increase the number of states explored during training. However, training hierarchical methods is inherently less stable, and SIL-based methods remain highly dependent on exploration to find high-return trajectories. Further experiments over more random seeds must be run to reach a complete conclusion on the effectiveness of the proposed algorithm.
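
To make the hierarchical structure described in the abstract concrete, below is a minimal Python sketch of the two-level idea: a high-level policy proposes subgoals, a low-level policy acts toward them under an artificial goal-reaching reward, and the sparse environment return accumulates for the high level. This is an illustrative sketch under assumed interfaces, not the thesis implementation; the DDPG actor-critics at each level and the GASIL discriminator that imitates past high-return trajectories are replaced here by stand-ins (e.g., RandomPolicy, goal_reward, hierarchical_rollout are hypothetical names).

    # Illustrative sketch only: a two-level goal-conditioned rollout with an
    # artificial low-level reward and a sparse high-level return. The real
    # GASIL+HAC uses DDPG learners per level and a GASIL buffer/discriminator.
    import numpy as np

    class RandomPolicy:
        """Stand-in for a trainable (e.g., DDPG) actor: maps an observation to an action."""
        def __init__(self, action_dim, rng):
            self.action_dim = action_dim
            self.rng = rng

        def act(self, obs):
            return self.rng.uniform(-1.0, 1.0, size=self.action_dim)

    def goal_reward(state, goal, tol=0.1):
        """Artificial low-level reward: 0 when the subgoal is reached, -1 otherwise."""
        return 0.0 if np.linalg.norm(state - goal) < tol else -1.0

    def hierarchical_rollout(env_step, init_state, high, low, horizon=5, low_horizon=10):
        """`high` sets a subgoal every `low_horizon` steps; `low` acts toward it and
        collects the artificial reward; the sparse reward accrues to the high level."""
        state = init_state
        high_return, low_transitions = 0.0, []
        for _ in range(horizon):
            goal = state + high.act(state)  # subgoal expressed as an offset from the state
            for _ in range(low_horizon):
                action = low.act(np.concatenate([state, goal]))
                next_state, sparse_r = env_step(state, action)
                low_transitions.append((state, goal, action, goal_reward(next_state, goal)))
                high_return += sparse_r
                state = next_state
        return high_return, low_transitions

    if __name__ == "__main__":
        rng = np.random.default_rng(0)

        # Toy point-mass environment: sparse reward only near the origin.
        def env_step(state, action):
            next_state = state + 0.1 * action
            return next_state, 1.0 if np.linalg.norm(next_state) < 0.2 else 0.0

        high = RandomPolicy(action_dim=2, rng=rng)
        low = RandomPolicy(action_dim=2, rng=rng)
        ret, transitions = hierarchical_rollout(env_step, np.ones(2), high, low)
        print(f"sparse return: {ret:.1f}, low-level transitions: {len(transitions)}")

In the full algorithm, the collected low-level transitions would drive DDPG updates with the artificial reward, while the high-level trajectories and their sparse returns would populate the GASIL buffer of past well-performing rollouts that the top-level policy learns to imitate.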