Stochastic MuZero: Planning in Stochastic Environments with a Learned Model
In the paper “Planning in Stochastic Environments with a Learned Model” [1], the authors introduce Stochastic MuZero, an extension of the MuZero algorithm. Its main contribution is the ability to learn and plan with a stochastic model. Earlier instantiations of the approach were limited to deterministic models, which restricts their performance in environments that are inherently stochastic.
Stochastic MuZero builds on the concept of afterstates: hypothetical states of the environment after an action has been applied but before the environment has transitioned to a true state. This lets the model separate the deterministic effect of the action from the chance transition that follows it. The factored model is trained end-to-end to maintain value equivalence for both the state value function and the action value function, and a stochastic planning method is applied to the model.
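To make the afterstate idea concrete, here is a minimal sketch on a single row of a 2048-like game. It is a hand-written toy environment, not the learned latent model from the paper: the function names (`apply_action`, `chance_outcomes`, `step`) and the one-row board are assumptions made for illustration. The action (slide left) deterministically produces the afterstate, and only then does a random tile spawn produce the next true state.

```python
import random


def apply_action(row):
    """Deterministic part: slide tiles left and merge equal neighbours.

    The returned row is the afterstate: the action has been applied,
    but the environment has not yet added its random tile.
    """
    tiles = [t for t in row if t != 0]
    merged = []
    i = 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # merge a pair into one tile
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (len(row) - len(merged))


def chance_outcomes(afterstate):
    """Stochastic part: list (probability, next_state) pairs.

    Assumed spawn rule for this toy: a uniformly chosen empty cell
    receives a 2 with probability 0.9 or a 4 with probability 0.1.
    """
    empty = [i for i, t in enumerate(afterstate) if t == 0]
    outcomes = []
    for i in empty:
        for tile, p in ((2, 0.9), (4, 0.1)):
            nxt = list(afterstate)
            nxt[i] = tile
            outcomes.append((p / len(empty), nxt))
    return outcomes


def step(row, rng):
    """Full transition: state -> afterstate -> sampled next state."""
    afterstate = apply_action(row)
    outcomes = chance_outcomes(afterstate)
    if not outcomes:  # board is full, nothing can spawn
        return afterstate, afterstate
    probs = [p for p, _ in outcomes]
    states = [s for _, s in outcomes]
    next_state = rng.choices(states, weights=probs, k=1)[0]
    return afterstate, next_state


if __name__ == "__main__":
    rng = random.Random(0)
    afterstate, next_state = step([2, 2, 4, 0], rng)
    print("afterstate:", afterstate)
    print("next state:", next_state)
```

The afterstate is the same every time the same action is taken in the same state; only the second stage varies. This is the split that the factored model is meant to capture in latent space.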
The algorithm consists of a learned stochastic transition model combined with a variant of Monte Carlo tree search (MCTS). It uses a representation function, afterstate dynamics function, dynamics function, prediction function, and afterstate prediction function. The stochastic model is unrolled and trained in an end-to-end fashion similar to MuZero, optimizing two losses: a MuZero loss and a chance loss for learning the stochastic dynamics of the model.
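The way the five functions fit together during an unroll can be sketched as follows. This is a toy with hand-written stand-ins for the networks, not the authors' implementation: the container `StochasticModel`, the `unroll` helper, the exact input and output signatures, and the numeric stubs are all assumptions chosen for illustration, and the chance codes are simply supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[float, ...]
Afterstate = Tuple[float, ...]


@dataclass
class StochasticModel:
    # observation -> latent state
    representation: Callable[[Sequence[float]], State]
    # latent state -> (policy, value)
    prediction: Callable[[State], Tuple[List[float], float]]
    # (latent state, action) -> latent afterstate
    afterstate_dynamics: Callable[[State, int], Afterstate]
    # latent afterstate -> (distribution over chance codes, Q-value)
    afterstate_prediction: Callable[[Afterstate], Tuple[List[float], float]]
    # (latent afterstate, chance code) -> (next latent state, reward)
    dynamics: Callable[[Afterstate, int], Tuple[State, float]]


def unroll(model: StochasticModel, observation: Sequence[float],
           actions: Sequence[int], codes: Sequence[int]) -> List[Dict]:
    """Unroll the model along given actions and chance codes.

    Each step alternates decision -> afterstate -> chance -> next state
    and records the predictions a training loss would be applied to.
    """
    state = model.representation(observation)
    trace = []
    for action, code in zip(actions, codes):
        policy, value = model.prediction(state)
        afterstate = model.afterstate_dynamics(state, action)
        sigma, q_value = model.afterstate_prediction(afterstate)
        state, reward = model.dynamics(afterstate, code)
        trace.append({"policy": policy, "value": value,
                      "sigma": sigma, "q_value": q_value,
                      "reward": reward})
    return trace


# Hypothetical numeric stubs standing in for neural networks.
toy_model = StochasticModel(
    representation=lambda obs: (float(sum(obs)),),
    prediction=lambda s: ([0.5, 0.5], s[0]),
    afterstate_dynamics=lambda s, a: (s[0] + a,),
    afterstate_prediction=lambda a_s: ([0.9, 0.1], a_s[0]),
    dynamics=lambda a_s, c: ((a_s[0] + c,), float(c)),
)

if __name__ == "__main__":
    for step_info in unroll(toy_model, [1, 2], actions=[1, 0], codes=[0, 1]):
        print(step_info)
```

In this sketch the policy, value, and reward predictions are the quantities a MuZero-style loss would target, while the chance distribution `sigma` and the afterstate Q-value are what a chance loss would target.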
Stochastic MuZero extends MCTS by introducing chance nodes and chance values into the search. Chance nodes and decision nodes are interleaved along the depth of the tree, with each chance node corresponding to a latent afterstate. In the reported experiments, the algorithm matched or exceeded the state of the art in stochastic environments such as the puzzle game 2048 and the classic two-player game of backgammon, while maintaining the superhuman performance of standard MuZero in the deterministic board game of Go.
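The interleaving of decision and chance nodes can be illustrated with a small value backup. Note that this is a deliberate simplification: it is an exhaustive expectimax-style backup over a tiny hand-built tree, not the sampling-based MCTS with learned priors used by the algorithm, and the class and function names are hypothetical. Decision nodes take the best action, while chance nodes average over outcomes weighted by their probabilities.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DecisionNode:
    """A state where the agent picks an action."""
    value: float = 0.0  # value estimate used when the node is a leaf
    children: Dict[int, "ChanceNode"] = field(default_factory=dict)


@dataclass
class ChanceNode:
    """An afterstate where the environment picks a chance outcome."""
    q_estimate: float = 0.0  # Q estimate used when the node is a leaf
    # Each outcome is (probability, reward, resulting decision node).
    outcomes: List[Tuple[float, float, DecisionNode]] = field(default_factory=list)


def decision_value(node: DecisionNode, discount: float = 1.0) -> float:
    """Best value over actions; falls back to the leaf estimate."""
    if not node.children:
        return node.value
    return max(chance_value(child, discount) for child in node.children.values())


def chance_value(node: ChanceNode, discount: float = 1.0) -> float:
    """Expected value over chance outcomes; falls back to the leaf estimate."""
    if not node.outcomes:
        return node.q_estimate
    return sum(p * (r + discount * decision_value(child, discount))
               for p, r, child in node.outcomes)


# Tiny example tree: action 0 leads to a 50/50 chance event,
# action 1 leads to an unexpanded afterstate with Q estimate 5.0.
root = DecisionNode(children={
    0: ChanceNode(outcomes=[
        (0.5, 0.0, DecisionNode(value=10.0)),
        (0.5, 0.0, DecisionNode(value=2.0)),
    ]),
    1: ChanceNode(q_estimate=5.0),
})

if __name__ == "__main__":
    print(decision_value(root))  # 0.5 * 10 + 0.5 * 2 = 6.0 beats 5.0
```

A deterministic planner would have to commit to a single successor for action 0; the chance node instead keeps both outcomes and weighs them, which is the property the stochastic search relies on.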
In conclusion, Stochastic MuZero learns and plans with a stochastic model, removing the restriction to deterministic models that limited earlier methods. It matches or exceeds the state of the art in 2048 and backgammon while retaining the results of standard MuZero in the deterministic setting of Go.
Reference: [1] I. Antonoglou, J. Schrittwieser, S. Ozair, T. K. Hubert, and D. Silver, “Planning in Stochastic Environments with a Learned Model,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=X6D9bAHhBQ1