Reinforcement Learning in Classic Games

Warning

Please note that this notebook is still a work in progress, so some sections may behave unexpectedly or provide incorrect information.

Why are classic games such a meaningful area to apply Reinforcement Learning to? The main reason is that these games usually have simple rules but very deep concepts. They also serve as a meaningful IQ test and encapsulate real-world issues, which makes them a very interesting domain for Reinforcement Learning.

Game Theory

To define optimality in multiplayer games, the field of Game Theory is commonly used. The aim is to find the optimal policy \(\pi^i\) for the \(i\)-th player. If all other players fix their policies \(\pi^{-i}\), the best response \(\pi_*^i(\pi^{-i})\) is the optimal policy against those fixed policies.

There is a problem with this: a best response is only optimal against one particular fixed set of opponent policies, whereas the goal is a policy that performs well against whatever policies the opponents may choose. The best-known solution concept is the Nash equilibrium, a joint policy for all players satisfying

\[ \pi^i = \pi_*^i(\pi^{-i}) \quad \forall i \]

Here, every player’s policy is a best response. This means no player would choose to deviate from this equilibrium.

  • Best Response is the solution to a single-agent RL problem.

    Other players become part of the environment, and the game is reduced to an MDP. The best response is then an optimal policy of this MDP.

  • Nash equilibrium is a fixed point of self-play RL.

    Experience is generated by playing games between agents \(a_1 \sim \pi^1, a_2 \sim \pi^2, ...\). Each agent learns the best response to other players. One player’s policy determines another player’s environment. This means all players are adapting to each other.

To keep things relatively simple, this chapter restricts itself to two-player, zero-sum games. Zero-sum games have equal and opposite rewards, \(R^1 + R^2 = 0\). The remainder of this chapter discusses methods for finding Nash equilibria in such games.
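As a minimal illustration of these ideas (not from the original text; the payoff matrix, the `best_response` helper and the use of NumPy are assumptions), the sketch below treats rock-paper-scissors as a two-player zero-sum matrix game. Computing a best response to a fixed opponent policy is a single-agent optimization, and at the uniform Nash equilibrium neither player can gain by deviating.

```python
import numpy as np

# Payoff matrix for player 1 in rock-paper-scissors (zero-sum: R1 + R2 = 0).
# Rows: player 1's action (R, P, S); columns: player 2's action.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)

def best_response(payoff, opponent_policy):
    """Best response to a fixed opponent policy: the opponent becomes part
    of the environment, so this reduces to a one-step optimization."""
    expected = payoff @ opponent_policy        # expected payoff of each action
    br = np.zeros(len(expected))
    br[np.argmax(expected)] = 1.0              # deterministic best response
    return br, expected.max()

# Against a fixed, exploitable opponent the best response is a pure strategy.
opponent = np.array([0.6, 0.2, 0.2])           # opponent over-plays rock
br, value = best_response(A, opponent)
print(br, value)                               # paper is the best response

# At the Nash equilibrium (uniform play) no deviation helps either player:
nash = np.ones(3) / 3
_, v1 = best_response(A, nash)                 # best player 1 can do vs. nash
_, v2 = best_response(-A.T, nash)              # best player 2 can do vs. nash
print(np.isclose(v1, 0.0), np.isclose(v2, 0.0))
```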

Self-Play Reinforcement Learning

Similar to the fixed value function approximation from the previous section, it is possible to apply value-based RL algorithms to games of self-play. Any algorithm such as \(MC, TD(0), TD(\lambda), ...\) can be used. For deterministic games, it is sufficient to estimate \(v_*(s)\), which follows from the short derivation below

\[\begin{split} q_*(s, a) & = \sum_{s'} P^a_{ss'} v_*(s')\\ & = v_*(\text{succ}(s, a))\end{split}\]

where the second step uses the fact that in a deterministic game each action \(a\) leads to exactly one successor state \(\text{succ}(s, a)\).

Actions are then selected by maximizing or minimizing the value of this successor state

\[\begin{split} A_t & = \arg\max_a v_*(\text{succ}(S_t, a)) \text{ for the maximizing player}\\ A_t & = \arg\min_a v_*(\text{succ}(S_t, a)) \text{ for the minimizing player}\end{split}\]

This improves the joint policy for both players.
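As a rough sketch of this self-play scheme (not the notebook's own code; the game of Nim, the tabular value table and all hyperparameters are illustrative assumptions), the snippet below runs TD(0) self-play on a tiny deterministic game and selects actions by maximizing or minimizing the value of the successor state, exactly as in the rule above.

```python
import random
from collections import defaultdict

# Self-play TD(0) on a tiny deterministic game (Nim: take 1-3 stones,
# taking the last stone wins). Player +1 maximizes v, player -1 minimizes v.
N_STONES = 10
ACTIONS = (1, 2, 3)
V = defaultdict(float)          # shared value estimate v(s) used by both players
alpha, epsilon = 0.1, 0.1       # illustrative hyperparameters

def successor(state, action):
    stones, player = state
    return (stones - action, -player)

def legal_actions(state):
    return [a for a in ACTIONS if a <= state[0]]

def greedy_action(state):
    # Maximizing player picks argmax over successor values,
    # minimizing player picks argmin -- the selection rule from the text.
    _, player = state
    values = [(V[successor(state, a)], a) for a in legal_actions(state)]
    return max(values)[1] if player == +1 else min(values)[1]

for episode in range(20000):
    state = (N_STONES, +1)
    while state[0] > 0:
        if random.random() < epsilon:
            a = random.choice(legal_actions(state))   # occasional exploration
        else:
            a = greedy_action(state)
        nxt = successor(state, a)
        if nxt[0] == 0:                    # current player took the last stone
            reward = state[1]              # +1 if the maximizer won, -1 otherwise
            V[state] += alpha * (reward - V[state])
        else:
            V[state] += alpha * (V[nxt] - V[state])
        state = nxt

# Positions with stones % 4 == 0 are lost for the player to move, so this
# should print a clearly negative value followed by a clearly positive one.
print(round(V[(8, +1)], 2), round(V[(7, +1)], 2))
```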

Reinforcement Learning in Imperfect-Information Games

In imperfect-information games, players have different information states and therefore separate search trees. Each tree contains one node per information state, summarizing what that player knows (e.g. the cards they have seen).

Many real states may share the same information state, and states may additionally be aggregated, e.g. by grouping states with a similar value.
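To make this concrete, here is a tiny hypothetical example (a Kuhn-poker-like deal, not from the original text): several real states that differ only in the opponent's hidden card collapse into a single information state for player 1.

```python
from itertools import permutations

# Hypothetical one-card deal: each player gets one card from {J, Q, K}.
# A "real" state contains both hidden cards; player 1's information state
# contains only what player 1 knows (their own card and the public history).
deals = list(permutations(["J", "Q", "K"], 2))   # (player 1 card, player 2 card)

def info_state_p1(deal, history):
    own_card, _ = deal            # the opponent's card is hidden from player 1
    return (own_card, history)

history = ("bet",)                # public betting history so far
real_states = [(deal, history) for deal in deals]
info_states = {info_state_p1(deal, history) for deal, history in real_states}

print(len(real_states), "real states ->", len(info_states), "information states")
# 6 real states -> 3 information states (one per card player 1 can hold)
```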

There are multiple ways to solve these information-state game trees.

  • Iterative forward-search methods

    An example of this is Counterfactual Regret Minimization; a regret-matching sketch follows after this list.

  • Self-play Reinforcement Learning

    Here, a variant of MCTS called Smooth UCT is discussed briefly. Applying the method to the game of Poker resulted in 3 silver medals in two- and three-player Poker, outperforming massive-scale forward-search agents.
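Counterfactual Regret Minimization builds on regret matching, where each player plays actions in proportion to their accumulated positive regret. The sketch below (an illustrative, single-decision-point example on rock-paper-scissors, not the full tree-based CFR algorithm) shows regret matching in self-play; the average strategies approach the uniform Nash equilibrium.

```python
import numpy as np

# Regret matching in self-play on rock-paper-scissors -- the building block
# of Counterfactual Regret Minimization, shown on a single decision point.
A = np.array([[ 0, -1,  1],
              [ 1,  0, -1],
              [-1,  1,  0]], dtype=float)      # row player's payoff

def strategy_from_regrets(regrets):
    # Play in proportion to positive cumulative regret; uniform if none.
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    return positive / total if total > 0 else np.ones(len(regrets)) / len(regrets)

regrets = [np.zeros(3), np.zeros(3)]           # cumulative regrets per player
strategy_sum = [np.zeros(3), np.zeros(3)]      # accumulates the average strategy

for t in range(20000):
    s1 = strategy_from_regrets(regrets[0])
    s2 = strategy_from_regrets(regrets[1])
    strategy_sum[0] += s1
    strategy_sum[1] += s2
    # Regret of each pure action = its payoff against the opponent's current
    # mix minus the payoff of the current mixed strategy.
    u1 = A @ s2                                # row player action values
    u2 = -A.T @ s1                             # column player action values
    regrets[0] += u1 - s1 @ u1
    regrets[1] += u2 - s2 @ u2

avg1 = strategy_sum[0] / strategy_sum[0].sum()
avg2 = strategy_sum[1] / strategy_sum[1].sum()
print(np.round(avg1, 2), np.round(avg2, 2))    # both approach [0.33, 0.33, 0.33]
```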

Smooth UCT

Monte-Carlo Tree Search has shown promising results in fully observable games. However, improvements are needed for it to perform well in imperfect-information games. In Poker, for example, naive MCTS diverges, since its approach is not well suited to this type of game.

Smooth UCT is a variant of UCT inspired by the game-theoretic concept of Fictitious Play. The idea is that the agent learns to respond to the opponents' average behavior, which gives a much more robust approach.

The average strategy can be extracted from nodes’ action counts, \(\pi_{avg}(a | s) = \frac{N(s, a)}{N(s)}\). Then, at each node, pick actions according to

\[\begin{split} A \sim \begin{cases} UCT(S), & \text{ with probability } \epsilon\\ \pi_{avg}(.|S), & \text{ with probability } 1 - \epsilon \end{cases} \end{split}\]

Empirically, in variants of Poker, Smooth UCT converged towards a Nash equilibrium while naive MCTS diverged.
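The following is a minimal sketch of the node-level action selection only (tree construction, information-state handling and any annealing schedule for \(\epsilon\) are omitted; the Node class, constants and method names are assumptions for illustration).

```python
import math
import random

# Node-level action selection of Smooth UCT (sketch only): with probability
# epsilon act like plain UCT, otherwise sample from the average strategy
# pi_avg(a|s) = N(s, a) / N(s). Node layout and constants are assumptions.
class Node:
    def __init__(self, actions, c_uct=1.0):
        self.actions = list(actions)
        self.c = c_uct
        self.N = {a: 0 for a in actions}       # visit counts N(s, a)
        self.Q = {a: 0.0 for a in actions}     # action-value estimates

    def uct_action(self):
        total = sum(self.N.values()) + 1
        return max(self.actions,
                   key=lambda a: self.Q[a]
                   + self.c * math.sqrt(math.log(total) / (self.N[a] + 1)))

    def average_strategy_action(self):
        total = sum(self.N.values())
        if total == 0:                          # unvisited node: play uniformly
            return random.choice(self.actions)
        weights = [self.N[a] / total for a in self.actions]
        return random.choices(self.actions, weights=weights)[0]

    def smooth_uct_action(self, epsilon=0.1):
        if random.random() < epsilon:
            return self.uct_action()            # explore/optimize like UCT
        return self.average_strategy_action()   # play the average strategy
```

In a complete Smooth UCT agent the mixing probability is typically reduced as a node's visit count grows, so that play gradually shifts towards the average strategy; the constant \(\epsilon\) used here is a simplification.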