Hierarchical Self-Imitation Learning in Single-Agent Sparse Reward Environments
Chakraborty, Neeloy
Permalink
https://hdl.handle.net/2142/110312
Description
Title
Hierarchical Self-Imitation Learning in Single-Agent Sparse Reward Environments
Author(s)
Chakraborty, Neeloy
Contributor(s)
Driggs-Campbell, Katherine
Issue Date
2021-05
Keyword(s)
reinforcement learning
sparse/delayed rewards
self-imitation learning
hierarchical learning
Abstract
Reinforcement learning problems with sparse and delayed rewards are challenging to solve because algorithms must explore the environment extensively before gaining experience from high-performing rollouts. Classical methods of encouraging exploration during training, such as epsilon-greedy and noise-based exploration, are not adequate on their own to cover large state spaces (Fortunato et al., 2018). Self-imitation learning (SIL) has been shown to allow an agent to mimic its own high-performing, long-horizon trajectories, but SIL is heavily reliant on exploration to find such trajectories in the first place (Oh et al., 2018). Hierarchical learning (HL), on the other hand, may be unstable during training, but the noise and subgoal failures it introduces explore the environment effectively, and it can learn tasks with higher sample efficiency (Levy et al., 2019). This thesis presents a single-agent reinforcement learning algorithm that combines the strengths of SIL and HL: Generative Adversarial Self-Imitation Learning + Hierarchical Actor-Critic (GASIL+HAC). GASIL+HAC represents the policy as multiple trainable levels of Deep Deterministic Policy Gradient (DDPG) optimizers from Lillicrap et al. (2016), where the higher-level policies set waypoints that guide the lower-level policies toward the highest cumulative return. The highest-level policy of the hierarchy is trained with GASIL on the sparse environment reward to set goals that imitate past well-performing trajectories, while the lower levels are trained on an artificial reward signal to set intermediate goals and achieve the desired high-level path. We perform experiments in OpenAI's Multi-Agent Particle Environment in sparse and delayed-reward stochastic scenarios to identify the benefits and hindrances of GASIL+HAC compared to DDPG, GASIL, and HAC in terms of sample efficiency, generalizability, exploration, and goal reachability. Through these experiments, we find that GASIL+HAC has the potential to increase sample efficiency in stochastic tasks and to increase the number of states explored during training. However, training hierarchical methods is inherently less stable, and SIL-based methods remain highly dependent on exploration to find high-return trajectories. Further experiments over more random seeds must be run to reach a complete conclusion on the effectiveness of the proposed algorithm.
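
To make the hierarchical structure described in the abstract concrete, below is a minimal Python sketch of the two-level idea: a high-level policy proposes subgoals, a low-level policy acts toward them under an artificial goal-reaching reward, and the sparse environment return accumulates for the high level. This is an illustrative sketch under assumed interfaces, not the thesis implementation; the DDPG actor-critics at each level and the GASIL discriminator that imitates past high-return trajectories are replaced here by stand-ins (e.g., RandomPolicy, goal_reward, hierarchical_rollout are hypothetical names).

    # Illustrative sketch only: a two-level goal-conditioned rollout with an
    # artificial low-level reward and a sparse high-level return. The real
    # GASIL+HAC uses DDPG learners per level and a GASIL buffer/discriminator.
    import numpy as np

    class RandomPolicy:
        """Stand-in for a trainable (e.g., DDPG) actor: maps an observation to an action."""
        def __init__(self, action_dim, rng):
            self.action_dim = action_dim
            self.rng = rng

        def act(self, obs):
            return self.rng.uniform(-1.0, 1.0, size=self.action_dim)

    def goal_reward(state, goal, tol=0.1):
        """Artificial low-level reward: 0 when the subgoal is reached, -1 otherwise."""
        return 0.0 if np.linalg.norm(state - goal) < tol else -1.0

    def hierarchical_rollout(env_step, init_state, high, low, horizon=5, low_horizon=10):
        """`high` sets a subgoal every `low_horizon` steps; `low` acts toward it and
        collects the artificial reward; the sparse reward accrues to the high level."""
        state = init_state
        high_return, low_transitions = 0.0, []
        for _ in range(horizon):
            goal = state + high.act(state)  # subgoal expressed as an offset from the state
            for _ in range(low_horizon):
                action = low.act(np.concatenate([state, goal]))
                next_state, sparse_r = env_step(state, action)
                low_transitions.append((state, goal, action, goal_reward(next_state, goal)))
                high_return += sparse_r
                state = next_state
        return high_return, low_transitions

    if __name__ == "__main__":
        rng = np.random.default_rng(0)

        # Toy point-mass environment: sparse reward only near the origin.
        def env_step(state, action):
            next_state = state + 0.1 * action
            return next_state, 1.0 if np.linalg.norm(next_state) < 0.2 else 0.0

        high = RandomPolicy(action_dim=2, rng=rng)
        low = RandomPolicy(action_dim=2, rng=rng)
        ret, transitions = hierarchical_rollout(env_step, np.ones(2), high, low)
        print(f"sparse return: {ret:.1f}, low-level transitions: {len(transitions)}")

In the full algorithm, the collected low-level transitions would drive DDPG updates with the artificial reward, while the high-level trajectories and their sparse returns would populate the GASIL buffer of past well-performing rollouts that the top-level policy learns to imitate.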