The uncoordinated spectrum access problem is studied using a multi-player multi-armed bandit framework. We consider a decentralized multi-player stochastic multi-armed bandit model in which the players cannot communicate with each other and can observe only their own actions and rewards. Furthermore, the environment may appear differently to different players, i.e., the reward distributions for a given arm may vary across players. Knowledge of the time horizon T is not assumed. Under these conditions, we consider two settings: zero and non-zero reward on collision (i.e., when more than one player plays the same arm). For the zero-reward-on-collision setting, we present a policy that achieves expected regret of O(\log T) over a time horizon of duration T. While settings with non-zero rewards on collisions and with reward distributions that vary across players have been considered separately in prior work, to the best of our knowledge a model allowing for both has not been studied previously. For this combined setting, we present a policy that achieves expected regret of order O(\log^{2+\delta} T) for some 0 < \delta < 1.
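For concreteness, one standard formalization of expected regret in such heterogeneous multi-player settings (introduced here for illustration; the paper's exact definition may differ) compares the players' cumulative reward to that of the best fixed one-to-one assignment of players to arms:

R(T) = T \max_{\pi \in \Pi} \sum_{k=1}^{K} \mu_{k, \pi(k)} - \mathbb{E}\left[ \sum_{t=1}^{T} \sum_{k=1}^{K} r_k(t) \right],

where K is the number of players, \mu_{k,j} is the mean reward of arm j as seen by player k, \Pi is the set of one-to-one assignments of players to arms, and r_k(t) is the reward received by player k at time t (zero on collision in the first setting). The symbols K, \mu_{k,j}, \Pi, and r_k(t) are not taken from the abstract and serve only to make the regret notion explicit.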