Reinforcement Learning in Classic Games¶
Warning
Please note that this notebook is still a work in progress, so some sections may behave unexpectedly or provide incorrect information.
Why are classic games such a meaningful area to apply Reinforcement Learning to? The main reason is that these games usually have quite simple rules, but very deep concepts. They also serve as a meaningful IQ test and encapsulate real-world issues, which makes them a really interesting field for Reinforcement Learning.
Game Theory¶
In order to get a sense of optimality in multiplayer games, the field of Game Theory is often used. The aim is to find the optimal policy \(\pi^i\) for the \(i\)-th player. If all other players fix their policies (\(\pi^{-i}\)), the best response \(\pi_*^i(\pi^{-i})\) is the optimal policy against those policies.
There is a problem with this: a best response is only optimal against one particular, fixed set of opponent policies. Instead, the goal is to find a policy that is optimal regardless of which policies the opponents play. The most well-known solution concept for this is the Nash equilibrium, a joint policy for all players.
In a Nash equilibrium, every player’s policy is a best response to the others, so no player would choose to deviate from it.
Best Response is a solution of a single-agent RL problem.
Other players become part of the environment, and the game is reduced to an MDP. The best response is then an optimal policy of this MDP.
Nash equilibrium is a fixed-point of self-play RL.
Experience is generated by playing games between agents \(a_1 \sim \pi^1, a_2 \sim \pi^2, ...\). Each agent learns the best response to other players. One player’s policy determines another player’s environment. This means all players are adapting to each other.
To keep things relatively simple, this chapter restricts itself to two player, zero-sum games. Zero-sum games have equal and opposite rewards \(R^1 + R^2 = 0\). Methods for finding Nash equilibria in these games are discussed.
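To make these concepts concrete, here is a minimal sketch (not from the original text) using rock-paper-scissors as the zero-sum game; the payoff matrix and the `best_response` helper are illustrative assumptions. It shows that a best response to a fixed opponent policy is a simple single-agent optimization, while the uniform policy is a Nash equilibrium.

```python
import numpy as np

# Payoff matrix for rock-paper-scissors from player 1's perspective
# (rows: player 1's action, columns: player 2's action).
# Zero-sum: player 2's payoff is the negation of this matrix.
payoff = np.array([
    [ 0, -1,  1],   # rock
    [ 1,  0, -1],   # paper
    [-1,  1,  0],   # scissors
])

def best_response(opponent_policy):
    """Best response to a fixed opponent policy: a single-agent problem."""
    expected_returns = payoff @ opponent_policy     # value of each of our actions
    return np.eye(3)[np.argmax(expected_returns)]   # deterministic best response

# Against a biased opponent, the best response exploits the bias ...
print(best_response(np.array([0.5, 0.3, 0.2])))     # -> play paper

# ... whereas against the uniform policy every action is equally good,
# so no player gains by deviating: a Nash equilibrium.
print(payoff @ np.array([1/3, 1/3, 1/3]))           # -> [0, 0, 0]
```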
Minimax Search¶
In these types of games, the value function defines the expected total reward given joint policies \((\pi^1, \pi^2)\): \(v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]\).
A minimax value function maximizes player 1’s expected return while minimizing player 2’s. The optimal value function becomes \(v_*(s) = \max_{\pi^1} \min_{\pi^2} v_\pi(s)\).
A minimax policy \((\pi^1, \pi^2)\) is a joint policy which achieves the minimax values. For two-player zero-sum games, there exists a unique minimax value function. In this case, a minimax policy is a Nash equilibrium.
There is a problem though: the search tree grows exponentially at every level, so it is impractical to search all the way to the end of the game. A simple solution uses a value function approximation \(v(s, w) \approx v_\pi(s)\) at the leaf nodes of the search tree. Then, it is possible to run minimax to some fixed depth with respect to the leaf values.
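As a sketch of this idea, the following depth-limited minimax assumes a hypothetical `game` object exposing `is_terminal(state)`, `reward(state)` (player 1’s return at terminal states) and `successors(state)`, plus a `value_fn` approximating \(v(s, w)\); none of these names come from the original text.

```python
def minimax(state, depth, value_fn, game, maximizing):
    """Depth-limited minimax with a value function approximation at the leaves."""
    if game.is_terminal(state):
        return game.reward(state)          # true outcome at terminal states
    if depth == 0:
        return value_fn(state)             # v(s, w) at the leaf nodes
    values = [minimax(s, depth - 1, value_fn, game, not maximizing)
              for s in game.successors(state)]
    return max(values) if maximizing else min(values)
```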
Self-Play Reinforcement Learning¶
Similar to the fixed function approximation of the last section, it is possible to apply value-based RL algorithms to games of self-play. Any algorithm such as \(MC, TD(0), TD(\lambda), ...\) can be used. For deterministic games, it is sufficient to estimate \(v_*(s)\), because the action values follow directly from the deterministic successor states: \(q_*(s, a) = v_*(\operatorname{succ}(s, a))\).
Actions are then selected by maximizing (player 1) or minimizing (player 2) the value of the next state: \(A_t = \arg\max_a v_*(\operatorname{succ}(S_t, a))\) for player 1 and \(A_t = \arg\min_a v_*(\operatorname{succ}(S_t, a))\) for player 2.
This improves the joint policy for both players.
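A minimal self-play TD(0) sketch, under the same hypothetical `game` interface as above, with a tabular value function stored in a dict; the \(\epsilon\)-greedy exploration and the specific interface are assumptions, not part of the original text.

```python
import random

def td_selfplay_episode(game, V, alpha=0.1, epsilon=0.1):
    """One self-play episode of TD(0) on a shared tabular value function V."""
    state, maximizing = game.initial_state(), True
    while not game.is_terminal(state):
        successors = game.successors(state)
        if random.random() < epsilon:                    # occasional exploration
            next_state = random.choice(successors)
        else:                                            # greedy in v: max for P1, min for P2
            pick = max if maximizing else min
            next_state = pick(successors, key=lambda s: V.get(s, 0.0))
        target = (game.reward(next_state) if game.is_terminal(next_state)
                  else V.get(next_state, 0.0))
        V[state] = V.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
        state, maximizing = next_state, not maximizing
    return V
```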
Combining RL and Minimax Search¶
The ideas behind both methods can be combined; this section presents several ways of doing so.
Simple TD¶
As previously seen, the idea of TD is to update the value function towards the successor value. The idea of Simple TD is to first learn the value function with TD learning. After learning, this value function is used in minimax search (without any further learning).
An important question arises when using this approach: can the value function be learned directly from minimax search values? This could speed up learning, since without search it can be hard to reach a terminal state.
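Putting the two previous sketches together gives a hypothetical Simple TD pipeline: learn \(v\) by self-play TD first, then freeze it and use it only as the leaf evaluation inside minimax.

```python
def simple_td(game, episodes=10_000, depth=4):
    """Simple TD: learn v by self-play TD, then play via minimax with v frozen."""
    # Phase 1: learn the value function by self-play TD (no search involved).
    V = {}
    for _ in range(episodes):
        td_selfplay_episode(game, V)

    # Phase 2: play with fixed-depth minimax, using the learned values at
    # the leaves; the value function is no longer updated.
    def play_move(state, maximizing):
        value_fn = lambda s: V.get(s, 0.0)
        pick = max if maximizing else min
        return pick(game.successors(state),
                    key=lambda s: minimax(s, depth - 1, value_fn, game, not maximizing))

    return play_move
```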
TD Root¶
The idea of TD Root is to update the value towards the successor search value. Imagine the search tree rooted at \(S_t\). After taking an action, \(S_{t+1}\) is reached. Plain TD would update the value of \(S_t\) towards the value of \(S_{t+1}\); TD Root instead uses the minimax search value of \(S_{t+1}\), which equals the value of its leaf, \(v(S_{t+1}, w) = v(l(S_{t+1}), w)\), where \(l(s)\) is the leaf node achieving the minimax value from \(s\). The backup is therefore \(v(S_t, w) \leftarrow v(l(S_{t+1}), w)\).
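A sketch of the TD Root backup, reusing the hypothetical `minimax` and tabular `V` from the earlier sketches; the `maximizing` flag indicates whose turn it is at \(S_{t+1}\).

```python
def td_root_update(V, S_t, S_t1, game, depth, maximizing, alpha=0.1):
    """TD root: update v(S_t, w) towards the minimax search value of S_{t+1},
    i.e. towards v(l(S_{t+1}), w), the value of the leaf reached by the search."""
    value_fn = lambda s: V.get(s, 0.0)
    target = minimax(S_t1, depth, value_fn, game, maximizing)
    V[S_t] = V.get(S_t, 0.0) + alpha * (target - V.get(S_t, 0.0))
```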
TD Root was in fact the first ever TD learning algorithm: it was used in a checkers program that learned by self-play and was able to defeat an amateur human player. The algorithm was later reused and extended to perform better.
TD Leaf¶
One idea is to consider the whole search. Instead of updating the value of \(S_t\), the value of the leaf \(l(S_t)\) should be updated, because that is the node that contributed the minimax value \(v(S_t)\). This results in the backup \(v(l(S_t), w) \leftarrow v(l(S_{t+1}), w)\), instead of \(v(S_t, w) \leftarrow v(l(S_{t+1}), w)\) as seen in TD Root.
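A sketch of the TD Leaf backup under the same assumed interface; `minimax_with_leaf` is an illustrative variant of the earlier `minimax` sketch that also returns the leaf achieving the minimax value.

```python
def minimax_with_leaf(state, depth, value_fn, game, maximizing):
    """Like the minimax sketch above, but also returns the leaf l(state)
    whose value is propagated up as the minimax value of `state`."""
    if game.is_terminal(state):
        return game.reward(state), state
    if depth == 0:
        return value_fn(state), state
    results = [minimax_with_leaf(s, depth - 1, value_fn, game, not maximizing)
               for s in game.successors(state)]
    return (max if maximizing else min)(results, key=lambda r: r[0])

def td_leaf_update(V, S_t, S_t1, game, depth, maximizing, alpha=0.1):
    """TD leaf: back up v(l(S_t), w) <- v(l(S_{t+1}), w)."""
    value_fn = lambda s: V.get(s, 0.0)
    _, leaf_t = minimax_with_leaf(S_t, depth, value_fn, game, maximizing)
    target, _ = minimax_with_leaf(S_t1, depth, value_fn, game, not maximizing)
    V[leaf_t] = V.get(leaf_t, 0.0) + alpha * (target - V.get(leaf_t, 0.0))
```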
KnightCap used TD leaf in chess. It started out knowing only the standard piece values and learned the remaining weights with TD leaf, using \(\alpha\beta\)-pruning to speed up the minimax search. It achieved master-level play after a small number of games. However, it was not effective in self-play or without starting from good weights. In checkers, however, a similar program, Chinook, was able to play at superhuman level.
TreeStrap¶
There is still something to optimize: the previous algorithms use the values obtained from the minimax search only once. TreeStrap instead updates search values towards deeper search values. Minimax search values are computed at all nodes in the search tree, and the backup \(\forall s \in nodes(S_t): v(s, w) \leftarrow v(l(s), w)\) is performed.
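A TreeStrap-style sketch under the same assumptions: every interior node visited by the search is updated towards its own deeper search value.

```python
def treestrap_update(V, S_t, game, depth, maximizing, alpha=0.1):
    """TreeStrap: for all nodes s in the search tree from S_t,
    back up v(s, w) <- v(l(s), w), the node's own minimax search value."""
    value_fn = lambda s: V.get(s, 0.0)

    def search(state, d, to_max):
        if game.is_terminal(state):
            return game.reward(state)
        if d == 0:
            return value_fn(state)
        values = [search(s, d - 1, not to_max) for s in game.successors(state)]
        search_value = max(values) if to_max else min(values)
        # update the interior node towards its own search value
        V[state] = V.get(state, 0.0) + alpha * (search_value - V.get(state, 0.0))
        return search_value

    search(S_t, depth, maximizing)
```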
Meep, a chess program using TreeStrap, was able to beat 13 of 15 international masters starting from random initial weights. It is effective both in self-play and when starting from random weights.
Simulation-Based Search¶
As seen above, self-play reinforcement learning can replace search. Besides minimax, it is also possible to simulate games of self-play from the root state \(S_t\) and then apply RL to this simulated experience.
Previously, a search method based on Monte-Carlo Control, called MCTS, was discussed. The most effective variant of this search uses the UCT algorithm to balance exploration and exploitation in every node. It turns out that self-play UCT converges on the minimax values in two-player, zero-sum games.
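A sketch of the UCT selection rule at a tree node; the `children`, `visits` and `q` attributes are assumed names for the node statistics, not taken from the original text.

```python
import math

def uct_select(node, c=math.sqrt(2)):
    """UCT: pick the child maximizing the empirical value plus an
    exploration bonus that shrinks as the child is visited more often."""
    total_visits = sum(child.visits for child in node.children)
    def ucb(child):
        if child.visits == 0:
            return float("inf")            # always try unvisited actions first
        return child.q + c * math.sqrt(math.log(total_visits) / child.visits)
    return max(node.children, key=ucb)
```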
Reinforcement Learning in Imperfect-Information Games¶
In imperfect information games, players have different information states and therefore separate search trees. In this tree, there is one node for each information state. These nodes summarize what a player knows (e.g. the cards they have seen).
There may be many real states that share the same information state, but there may also be aggregate states (e.g. those with a similar value).
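A small illustrative sketch of an information-state key for a poker-like game; the state layout is a made-up assumption, but it shows how many real states collapse onto one information-state node.

```python
def information_state(real_state, player):
    """Project a real state onto what `player` can observe: their own
    cards, the public cards and the betting history, but not the
    opponent's hidden cards."""
    return (player,
            tuple(sorted(real_state["hole_cards"][player])),
            tuple(real_state["public_cards"]),
            tuple(real_state["history"]))

# Two real states that differ only in the opponent's hidden cards ...
real_a = {"hole_cards": {0: ["Ah", "Kd"], 1: ["2c", "7h"]},
          "public_cards": ["7s"], "history": ["raise", "call"]}
real_b = {"hole_cards": {0: ["Ah", "Kd"], 1: ["Qs", "Js"]},
          "public_cards": ["7s"], "history": ["raise", "call"]}

# ... collapse onto the same information-state node for player 0.
assert information_state(real_a, 0) == information_state(real_b, 0)
```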
There are multiple ways to solve for these information-state game trees.
Iterative forward-search methods
An example of this is Counterfactual Regret Minimization
Self-play Reinforcement Learning
Here, a variant of MCTS called Smooth UCT is discussed briefly. Applying the method to the game of Poker resulted in three silver medals in two- and three-player Poker, outperforming massive-scale forward-search agents.
Smooth UCT¶
Using Monte-Carlo Tree Search in fully observable games has shown promise. However, improvements are needed for it to perform well in imperfect information games. In Poker, for example, naive MCTS diverges, since a plain best-response approach is not well suited to these types of games.
Therefore, a variant of UCT was developed, inspired by the game-theoretic concept of Fictitious Play. The idea is that the agent learns against, and responds to, the opponents’ average behaviour. This gives a much more robust approach.
The average strategy can be extracted from the nodes’ action counts, \(\pi_{avg}(a \mid s) = \frac{N(s, a)}{N(s)}\). Then, at each node, actions are picked according to \(A \sim \text{UCT}(S)\) with probability \(\eta\), and \(A \sim \pi_{avg}(\cdot \mid S)\) with probability \(1 - \eta\).
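A sketch of this selection rule, reusing the hypothetical node interface and `uct_select` from the earlier sketch; in the full algorithm \(\eta\) is typically decayed over time rather than kept fixed.

```python
import random

def smooth_uct_select(node, eta=0.1):
    """Smooth UCT: with probability eta pick the UCT action, otherwise
    sample from the average strategy pi_avg(a|s) = N(s, a) / N(s)."""
    total = sum(child.visits for child in node.children)
    if total == 0 or random.random() < eta:
        return uct_select(node)                  # UCT branch (or no counts yet)
    weights = [child.visits / total for child in node.children]
    return random.choices(node.children, weights=weights, k=1)[0]
```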
Empirically, in variants of Poker, Smooth UCT converged to the Nash equilibrium while naive MCTS diverged.