Update 2021-07-27: Prediction, Surprise, and Noisy TVs
Note that Carle’s Game is now accepting submissions until the IEEE CoG competitions session on 2021 August 20. Submit your experiment before then and I’ll do my best to mention your contribution in the session, and, for the first 10 participants (but face it, if you enter you’ll be one of them), I’ll send a free t-shirt to an address you provide that Threadless ships to.
Rewarding Play and Exploration
Play and exploration seem to be among the best ways for real-world, embodied agents like humans and animals to learn. In reinforcement learning, formulating a task in a self-play framework is a tried and true way to generate impressive results in board games (DeepMind’s AlphaZero), simulated hide-and-seek (OpenAI’s hide-and-seek agents), StarCraft II (AlphaStar), and Dota 2 (OpenAI Five). DeepMind’s recently announced XLand, a multi-task “galaxy of games,” is a substantial bet on the promise of play. Although XLand is a promising step toward learning based on play and open-endedness, the environment still provides its own gamification, i.e. each game in XLand has a specified reward based on a pre-determined objective. What about learning in a game that has no (gamified) rules, where defining reward is part of the challenge? This is the sort of exploration I set out to investigate (along with your help, hopefully) with the Carle’s Game challenge and the CARLE learning environment, and today we’ll take a look at a new pair of reward proxy wrappers based on the twin ideas of prediction and surprise.
I’ll just speak from personal experience when I say that one of the principal ways yours truly interacts with and learns from the environment is by prediction. By making a guess about what will happen next based on my current observations (and perhaps a perturbation or two), I can form a hypothesis and learn about its (lack of) validity by watching what happens next. I suspect many other humans have a similar strategy ;)
In John Conway’s Game of Life, one of the many Life-like cellular automata we can experiment with in CARLE, this might look something like the glider animation in Figure 1.

Figure 1: Glider in Conway's Life.
The reward curve on the right is generated by a reward wrapper called PredictionBonus. This reward proxy is meant to capture the predictability of a pattern sequence, so the reward is based on a convolutional model’s ability to predict the current state of the cellular automaton universe from its state some number of steps in the past (in this case, 5). We can see that as the model begins to gain prediction ability for this glider, the reward increases and eventually saturates. Now, if we add a perturbation in the form of the acorn Methuselah pattern, we’ll see some changes in the prediction reward (Figure 2).

Figure 2: Glider + Methuselah acorn pattern in Conway's Life. This animation is based on every 5th frame.
In Figure 2 the introduction of this long-lived, chaotic pattern reduces the ability of the model used by PredictionBonus to predict what the CA grid will look like based on past frames. We can probably relate to this, as the pattern sequence mostly looks like a mess, and although it would not be technically difficult to go through the grid cell-by-cell to predict the next frame at each step, it would be tedious. Eventually the pattern settles down into a field of still lifes and oscillators, becoming very predictable and generating a high, stable prediction bonus.

Figure 3: Prediction bonus for a field of still lifes and oscillators.
From casual inspection of just a few frames of Figure 3, we can observe that the grid oscillates between only two states. This makes for an easy prediction problem, and the prediction bonus reward is correspondingly higher than during the chaotic growth period. If we remove all the still lifes and oscillators and put in another glider, however, we’ll see that the prediction bonus is even higher.

Figure 4: Prediction bonus for a single glider, again.
I propose two perspectives for explaining why the single glider produces a higher prediction bonus than a field of still lifes and oscillators. The first is an anthropomorphic view, and doesn’t reflect the workings of the prediction model: it’s much easier to take a pen and paper and reproduce each step of the glider than the still life/oscillator field, because there are far fewer dynamic cells to keep track of. It might be easy to draw out the next steps for any of the patterns individually, but reproducing future steps of the entire field together would be a (mostly memory-based) challenge. A more technical perspective, more reflective of the actual way the prediction model works, is that it’s easy to predict a spread of empty cells, as they will remain inactive with values of 0.0. Even an untrained model (with no biases) would have perfect prediction power for a field of all inactive cells. Predictions for cells on or near active cells, however, will probably be off by a bit, even when they would round up or down to the correct 0.0 or 1.0 value. The current implementation of PredictionBonus uses mean squared error loss, and doesn’t apply a decision threshold to the values used for calculating reward (this might be a good feature to add).
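To make that mechanism a little more concrete, here’s a minimal toy sketch of a prediction-bonus-style reward proxy in PyTorch. This is not the actual PredictionBonus code from CARLE; the names (PredictorConv, ToyPredictionBonus, delay) are my own, and the real wrapper’s interface and model differ. The idea is the same, though: a small convolutional model is trained online to predict the current grid from the grid several steps earlier, and low prediction error translates into high reward.

```python
import torch
import torch.nn as nn


class PredictorConv(nn.Module):
    """Small convolutional model mapping a past grid state to a guess
    at the current grid state, with outputs squashed into [0, 1]."""

    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class ToyPredictionBonus:
    """Illustrative reward proxy (not CARLE's PredictionBonus): train the
    predictor online and reward low prediction error, i.e. predictability."""

    def __init__(self, delay: int = 5, lr: float = 1e-3):
        self.delay = delay                 # how many steps back to predict from
        self.model = PredictorConv()
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=lr)
        self.loss_fn = nn.MSELoss()
        self.buffer = []                   # the last `delay` grid states

    def _update(self, grid: torch.Tensor):
        """Advance the frame buffer, take one training step on the predictor,
        and return the prediction loss (None while the buffer is still filling)."""
        if len(self.buffer) < self.delay:
            self.buffer.append(grid.detach())
            return None
        past = self.buffer.pop(0)
        self.buffer.append(grid.detach())
        prediction = self.model(past)
        loss = self.loss_fn(prediction, grid.detach())
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()
        return loss.item()

    def reward(self, grid: torch.Tensor) -> float:
        # grid: tensor of shape (1, 1, height, width) with cell values 0.0 or 1.0
        loss = self._update(grid)
        if loss is None:
            return 0.0
        # low prediction error (predictable dynamics) -> reward close to 1.0
        return 1.0 / (1.0 + loss)
```

The 1 / (1 + loss) mapping here is just one convenient way of turning “low loss” into “high reward”; any function that decreases monotonically with the loss would serve the same purpose.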
Surprise!
Of course, it might seem counterintuitive that, in the examples above, an agent would get maximal reward from just watching the same glider pattern indefinitely traveling around the toroid that is the CA universe, always taking the same two forms. Most humans I’ve heard of would eventually get bored of this. The flip side of a reward based on pattern predictability is one based on surprise. If we want to analogize with the human experience, seeking surprise drives exploration while kudos for predictability drives hypothesis testing. For that I’ve made a mirrored version of PredictionBonus, called SurpriseBonus. We can see how the surprise bonus responds to glider and acorn patterns, similar to the above examples, in Figures 5 and 6.

Figure 5: Surprise bonus for a single glider.

Figure 6: Surprise bonus for a glider and acorn Methuselah pattern.
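Continuing the toy sketch from above (again with made-up names, not CARLE’s actual API), the mirrored version only needs to flip the mapping from prediction error to reward:

```python
class ToySurpriseBonus(ToyPredictionBonus):
    """Mirror image of ToyPredictionBonus: high prediction error
    (surprising dynamics) yields high reward."""

    def reward(self, grid: torch.Tensor) -> float:
        loss = self._update(grid)
        if loss is None:
            return 0.0
        # high prediction error (surprising dynamics) -> high reward
        return loss
```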
Perhaps some combination of predictability and surprise would be the best approach, e.g. the reward proxy wrappers could be used along with decaying or growing weights to encourage early exploration and later experimentation. We’ll leave that to future experiments for now.
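As a rough illustration of what such a schedule might look like (just a sketch with invented names and a linear schedule, nothing implemented in CARLE), one could decay the surprise weight over training while growing the predictability weight:

```python
def combined_bonus(surprise: float, predictability: float,
                   step: int, total_steps: int = 10_000) -> float:
    """Blend the two proxies: mostly surprise early in training
    (exploration), mostly predictability later (hypothesis testing)."""
    w_surprise = max(0.0, 1.0 - step / total_steps)   # decays from 1 to 0
    w_predict = 1.0 - w_surprise                       # grows from 0 to 1
    return w_surprise * surprise + w_predict * predictability
```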
There are a few other details that may be useful for experimenters looking to work with prediction and surprise wrappers. Next-state prediction rewards are notoriously vulnerable to what’s known as “the noisy TV problem.” This is the phenomenon by which any source of unpredictability in the environment, usually some sort of stochasticity, can become an irresistible draw for agents under a prediction-based reward. While Life-like cellular automata are fully deterministic and thus predictable, in practice a chaotic scene will produce higher prediction losses, and therefore higher rewards, when SurpriseBonus is used. Another phenomenon occurs where SurpriseBonus generates higher rewards for fields with more activations, even when they are predictable or entirely static. In Figure 7 the surprise bonus curve remains high for quite a while after the pattern has become static. The prediction model used in Figure 7 has a batch size of 16, and I noticed that for a batch size of 2 the reward never drops off. This is useful to keep in mind when setting up a surprise- or prediction-based experiment, as the wrong hyperparameters in the prediction model could lead to strange behavior like learning to stare endlessly at a static pattern, or seeking out noisy, mostly random patterns.

Figure 7: Surprise bonus for a coral growth pattern. Reward remains high for an extended period even after the pattern becomes static. This animation is shown at 5X speed.
Running Experiments
I hope the discussion above has piqued your interest in testing predictions, exploring for surprises, and running experiments to contribute to the Carle’s Game challenge. If you want to get started with an experiment based on SurpriseBonus, you might try something similar to the command below (for commit c691c4b8 of Carle’s Game and commit e1cd60a3 of [CARLE](https://github.com/rivesunder/carle)):
python -m game_of_carle.experiment -mg 128 -ms 256 -p 32 -sm 1 -v 1 -d cuda:1 -dim 128 -s 13 42 1337 -a ConvGRNN -w SurpriseBonus -tr B3/S23 -vr B3/S23 -tag _convgrnn_parsimony_moving
That’s an experiment with a maximum of 128 generations, a maximum of 256 steps per run, a population size of 32, selection mode 1 (tournament selection), vectorization of 1, the cuda device at index 1 (cuda:1), environment dimensions of 128 by 128, random seeds of 13, 42, and 1337, a convolutional gated recurrent neural network agent (ConvGRNN), the SurpriseBonus reward wrapper, the B3/S23 ruleset (Conway’s Life) for both training and validation, and a convenience tag of “_convgrnn_parsimony_moving”. For more information about the challenge and the contest, check out the README. Remember that to be a part of the Carle’s Game contest as part of the IEEE Conference on Games, you can submit a contribution up until the IEEE CoG competitions session on 2021 August 20. Submit your experiment before the session and I’ll do my best to mention your contribution in my 5-minute talk, and, for the first 10 participants (but face it, if you enter you’ll be one of them), I’ll send a free t-shirt to an address you provide that Threadless ships to.