How Do Humans Handle the Dilemma of Exploration and Exploitation in Sequential Decision Making?

8th International Conference on Bio-inspired Information and Communications Technologies (formerly BIONETICS)
Naoya Namiki1, Kuratomo Oyo1, Tatsuji Takahashi1
1: Tokyo Denki University

    In an uncertain environment, decision making faces two opposing demands: exploring for new information and exploiting information already acquired. This opposition has long been called the exploration-exploitation dilemma. In brain science, it is known that the human brain evaluates options comparatively, and that average behavior correlates with the Softmax action selection rule. Softmax chooses among options at random, with a selection probability that is a monotonic function of each option's estimated value. It therefore requires something like a pseudo-random number generator in the human mind. In cognitive psychology, however, it has been shown that human recognition and generation of random sequences are strongly biased and generally far from truly random. Is it then possible that humans adopt the Softmax policy when they are so poor at generating and recognizing random numbers? In this study, we analyzed how humans behave when facing the exploration-exploitation dilemma in experiments with N-armed bandit problems, and compared several policies commonly used in reinforcement learning modeling from the viewpoint of whether humans really choose options randomly.
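To make the Softmax rule described in the abstract concrete, the following is a minimal sketch of Softmax action selection on an N-armed bandit. It is an illustration under stated assumptions, not the model or experimental setup used in the study: the Bernoulli reward scheme, the sample-average value update, the `temperature` parameter and its default values, and all function names are hypothetical choices made for this example.

```python
import math
import random


def softmax_probabilities(values, temperature=1.0):
    """Map estimated values to selection probabilities.

    A higher estimated value gives a higher probability (monotonic),
    and a lower temperature makes the choice greedier.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    highest = max(values)
    weights = [math.exp((v - highest) / temperature) for v in values]
    total = sum(weights)
    return [w / total for w in weights]


def softmax_choose(values, temperature=1.0, rng=random):
    """Draw one arm index at random according to the Softmax probabilities."""
    probs = softmax_probabilities(values, temperature)
    return rng.choices(range(len(values)), weights=probs, k=1)[0]


def run_bandit(reward_probs, steps=1000, temperature=0.1, seed=0):
    """Play a Bernoulli N-armed bandit with a Softmax policy.

    Assumption for illustration: each arm pays 1 with a fixed probability,
    and values are estimated as the running average of observed rewards.
    """
    rng = random.Random(seed)
    n_arms = len(reward_probs)
    estimates = [0.0] * n_arms
    counts = [0] * n_arms
    for _ in range(steps):
        arm = softmax_choose(estimates, temperature, rng)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental sample-average update of the chosen arm's estimate.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts
```

Calling `run_bandit([0.2, 0.5, 0.8])` returns the final value estimates and the number of times each arm was pulled. Note that every choice here depends on a pseudo-random draw (`rng.choices`), which is exactly the requirement the abstract questions for human decision makers.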