HIERARCHICAL SELF-IMITATION LEARNING IN SINGLE-AGENT SPARSE-REWARD ENVIRONMENTS
Chakraborty, Neeloy
Permalink
https://hdl.handle.net/2142/124962
Description
- Title
- HIERARCHICAL SELF-IMITATION LEARNING IN SINGLE-AGENT SPARSE-REWARD ENVIRONMENTS
- Author(s)
- Chakraborty, Neeloy
- Issue Date
- 2021-05-01
- Keyword(s)
- reinforcement learning; sparse/delayed rewards; self-imitation learning; hierarchical learning
- Abstract
- Reinforcement learning problems with sparse and delayed rewards are challenging to solve because algorithms must explore the environment to gain experience from high-performing rollouts. Classical methods of encouraging exploration during training, such as ϵ-greedy and noise-based exploration, are not adequate on their own to explore large state spaces (Fortunato et al., 2018). Self-imitation learning (SIL) has been shown to allow an agent to learn to mimic high-performing, long-horizon trajectories, but SIL is heavily reliant on exploration to find such trajectories (Oh et al., 2018). On the other hand, hierarchical learning (HL) may be unstable during training, but it incorporates noise and failures that effectively explore the environment and may learn tasks with higher sample efficiency (Levy et al., 2019). This thesis presents a single-agent reinforcement learning algorithm that combines the effects of SIL and HL: Generative Adversarial Self-Imitation Learning + Hierarchical Actor-Critic (GASIL+HAC). GASIL+HAC represents the policy as multiple trainable levels of Deep Deterministic Policy Gradient (DDPG) optimizers from Lillicrap et al. (2016), where the higher-level policies set waypoints that guide the lower-level policies toward the highest cumulative return. The highest-level policy of the hierarchy is trained with GASIL on the sparse environment reward to set goals that imitate past well-performing trajectories, while the lower levels are trained on an artificial reward signal to set intermediate goals and achieve the desired high-level path (a minimal sketch of this two-level structure is given after the description fields below). We perform experiments in OpenAI’s Multi-Agent Particle Environment in sparse- and delayed-reward stochastic scenarios to identify benefits and hindrances of GASIL+HAC compared to DDPG, GASIL, and HAC in sample efficiency, generalizability, exploration, and goal reachability. Through these experiments, we find that GASIL+HAC has the potential to increase sample efficiency in stochastic tasks and to increase the number of explored states during training. However, hierarchical methods carry an inherent increase in training instability, and SIL-based methods remain highly dependent on exploration to find high-return trajectories. Further experiments over several more seeds must be run to reach a complete conclusion on the effectiveness of the proposed algorithm.
- Type of Resource
- text
- Language
- eng
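The abstract describes a two-level control structure: a high-level policy proposes waypoints (subgoals) and is trained against the sparse environment reward, while a low-level policy acts to reach the current waypoint and is trained on an artificial goal-reaching reward. The sketch below illustrates that structure only; it is not the thesis code, and the network sizes, the subgoal horizon k, and the distance-based low-level reward are assumptions made for illustration.

```python
# Illustrative sketch of a two-level goal-conditioned policy in the spirit of
# GASIL+HAC. Not the thesis implementation; layer widths, the subgoal horizon
# k, and the distance-based low-level reward are assumed for illustration.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Deterministic DDPG-style actor mapping an input vector to a bounded output."""
    def __init__(self, in_dim, out_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim), nn.Tanh(),  # outputs in [-1, 1]
        )

    def forward(self, x):
        return self.net(x)


class TwoLevelPolicy:
    """High level proposes a subgoal every k steps; low level acts toward it."""
    def __init__(self, state_dim, goal_dim, action_dim, k=10):
        # High level: trained on the sparse environment reward (via GASIL in the thesis).
        self.high = Actor(state_dim, goal_dim)
        # Low level: conditioned on (state, subgoal), trained on an artificial reward.
        self.low = Actor(state_dim + goal_dim, action_dim)
        self.k, self.t, self.subgoal = k, 0, None

    def act(self, state):
        state = torch.as_tensor(state, dtype=torch.float32)
        if self.t % self.k == 0:  # refresh the waypoint every k steps
            self.subgoal = self.high(state).detach()
        self.t += 1
        action = self.low(torch.cat([state, self.subgoal])).detach()
        return action.numpy(), self.subgoal.numpy()


def low_level_reward(agent_position, subgoal):
    """Artificial dense reward: negative distance from the agent to the current subgoal."""
    diff = (torch.as_tensor(agent_position, dtype=torch.float32)
            - torch.as_tensor(subgoal, dtype=torch.float32))
    return -float(torch.linalg.norm(diff))


# Example usage with assumed dimensions:
policy = TwoLevelPolicy(state_dim=4, goal_dim=2, action_dim=2)
action, subgoal = policy.act([0.0, 0.0, 0.0, 0.0])
```

In this reading, only the top level ever sees the sparse environment return, so imitation of past high-return trajectories (GASIL) happens in subgoal space, while the dense goal-reaching reward keeps the lower level trainable between environment rewards.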
Owning Collections
Senior Theses - Electrical and Computer Engineering (primary collection)
The best of ECE undergraduate research