The Reinforcement Learning Problem

Warning

Please note that this notebook is still a work in progress; some sections may behave unexpectedly or provide incorrect information.

Reinforcement Learning (RL) is an area of Machine Learning concerned with deciding on a sequence of actions in an unknown environment in order to maximize cumulative reward.

To give an idea of this, imagine you are somewhere in a 2D maze. At each point you can move left, right, up, or down, and the goal is to find your way out of the maze, which corresponds to obtaining a positive reward upon completing it. Using Reinforcement Learning, you can figure out the optimal way to behave in this environment.
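To make this concrete, here is a minimal Python sketch of such a maze as an environment (the names `MazeEnv`, `reset`, and `step` are made up for illustration, not a fixed interface): the agent picks one of the four moves and receives a reward of 1 only when it reaches the exit.

```python
import random

class MazeEnv:
    """A tiny 2D grid maze: start at (0, 0), exit at the bottom-right corner."""

    ACTIONS = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}

    def __init__(self, height=4, width=4):
        self.height, self.width = height, width
        self.goal = (height - 1, width - 1)
        self.reset()

    def reset(self):
        self.pos = (0, 0)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.height - 1)  # stay inside the grid
        c = min(max(self.pos[1] + dc, 0), self.width - 1)
        self.pos = (r, c)
        done = self.pos == self.goal
        reward = 1.0 if done else 0.0  # positive reward only on completing the maze
        return self.pos, reward, done


# A random agent interacting with the environment until the maze is solved.
env = MazeEnv()
state, done = env.reset(), False
while not done:
    state, reward, done = env.step(random.choice(list(MazeEnv.ACTIONS)))
```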

The Reinforcement Learning Problem

What makes reinforcement learning different from other machine learning paradigms?

  • There is no supervisor, only a reward signal

  • Feedback is delayed, not instantaneous

  • Time really matters (sequential, non i.i.d. data)

  • Agent’s actions affect the subsequent data it receives

Definition 1

The Reward Hypothesis states that all imaginable goals can be described by the maximization of an expected cumulative reward function.

A reward is a scalar feedback signal. Reinforcement Learning is built on the Reward Hypothesis, which is assumed to hold. Some problems can be difficult to solve because actions can have long-term consequences and reward can be delayed.

Definition 2

A stochastic process \(\{X_t\}_{t=0}^{T}\), taking values \(X_0, X_1, \dotsc, X_T\), is said to have the Markov property if and only if

\[ \mathbb{P}(X_{t+1} | X_t) = \mathbb{P}(X_{t+1} | X_0, \dotsc, X_t) \]

Let \(S_t^a\) and \(S_t^e\) denote the state of the agent and of the environment at time \(t\). If the environment is fully observable, then \(S_t^a = S_t^e\). Reinforcement learning environments can be seen as stochastic processes \(\{S_t\}_t\).

For an environment, having the Markov property means that the probability of reaching any next state \(S_{t+1}\) depends only on the current state \(S_t\), and not on any previous states \(S_{t-1}, \dotsc, S_0\). In other words, the current state is a sufficient statistic of the future; the history does not matter.
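As a quick illustration (not part of the formal development), the Python snippet below simulates a two-state Markov chain and checks empirically that conditioning on extra history does not change the prediction of the next state. The transition matrix is an arbitrary choice for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A two-state Markov chain with an (arbitrary) fixed transition matrix.
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Sample a long trajectory: the next state is drawn using only the current state.
states = [0]
for _ in range(100_000):
    states.append(rng.choice(2, p=P[states[-1]]))
x = np.array(states)

# Empirically, conditioning on additional history does not change the prediction:
p_given_current = np.mean(x[2:][x[1:-1] == 0] == 1)                    # P(X_{t+1}=1 | X_t=0)
p_given_more    = np.mean(x[2:][(x[1:-1] == 0) & (x[:-2] == 1)] == 1)  # P(X_{t+1}=1 | X_t=0, X_{t-1}=1)
print(round(p_given_current, 3), round(p_given_more, 3))  # both are close to P[0, 1] = 0.1
```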

This will prove to be a key property in Reinforcement Learning, and it will be useful in many scenarios in the book. But what happens when the Markov property does not hold?

Partially observable environments do not have the Markov property. Here, the agent only indirectly observes the environment, which means it may not have all the information needed to know what happens next. Now, \(S_t^a \neq S_t^e\). Since the history is now important for predicting the future, the agent must construct its own state representation \(S_t^a\). For example:

  • Complete history: \(S_t^a = H_t = (O_0, O_1, \dotsc, O_t)\)

  • Beliefs of environment state: \(S_t^a = (\mathbb{P}[S_t^e = s^{(1)}], \dotsc,\mathbb{P}[S_t^e = s^{(n)}])\)

  • Recurrent Neural Network: \(S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)\)

Here, \(O_t\) is the observed state at time \(t\). These problems are typically much more difficult to solve, since the Markov property cannot be exploited.
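As an illustration of the last option, the following Python sketch implements the recurrent update \(S_t^a = \sigma(S_{t-1}^a W_s + O_t W_o)\) with randomly initialised weights. The dimensions and weight values are arbitrary; in practice the weights would be learned.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

obs_dim, state_dim = 3, 8
rng = np.random.default_rng(0)

# Randomly initialised weights; in practice these would be learned.
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))

def update_agent_state(prev_state, observation):
    """One recurrent step: S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)."""
    return sigmoid(prev_state @ W_s + observation @ W_o)

# Fold a stream of observations into a fixed-size agent state.
agent_state = np.zeros(state_dim)
for obs in rng.normal(size=(5, obs_dim)):  # five dummy observations O_0, ..., O_4
    agent_state = update_agent_state(agent_state, obs)
```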

Components of a Reinforcement Learning Agent

An RL agent may include one or more of these components:

  • Policy: agent’s behaviour function

  • Value function: how good is each state and/or action

  • Model: agent’s representation of the environment

A policy describes the agent’s behaviour: it maps states to actions. Policies can be deterministic (\(a = \pi(s)\)) or stochastic (\(\pi(a | s) = \mathbb{P}(A_t = a | S_t = s)\)). The symbol \(\pi\) is commonly used to denote a policy.
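As a small illustration (the states, actions, and probabilities below are made up), a deterministic policy can be represented as a plain lookup table, while a stochastic policy stores a distribution over actions for each state:

```python
import random

# Deterministic policy: a direct mapping a = pi(s), here as a lookup table.
deterministic_pi = {"s1": "left", "s2": "right"}

def act_deterministic(state):
    return deterministic_pi[state]

# Stochastic policy: pi(a | s) gives a probability for each action in a state.
stochastic_pi = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.1, "right": 0.9},
}

def act_stochastic(state):
    actions, probs = zip(*stochastic_pi[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("s1"), act_stochastic("s2"))
```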

A value function is a prediction of the future reward obtainable from a given state. It can be used to judge whether a state is good or bad, and therefore to select actions. It is defined as \(v_\pi(s) = \mathbb{E}_\pi(G_t | S_t = s)\), where \(G_t\) is the return (or discounted cumulative reward). The return is defined as \(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dotsb = \sum_{i=t+1}^\infty\gamma^{i-t-1}R_i\) for some \(\gamma \in [0, 1]\). Here \(\gamma\) is the discount factor, which controls how strongly future rewards influence the return. Discounting is useful because the representation of the environment may be imperfect; if it is, rewards far in the future should not influence the return as much as nearby ones, so they are discounted.
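For a finite episode, the return can be computed recursively via \(G_t = R_{t+1} + \gamma G_{t+1}\). A minimal sketch, with an arbitrary reward sequence and \(\gamma = 0.9\):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ... for a finite episode."""
    g = 0.0
    for reward in reversed(rewards):  # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = reward + gamma * g
    return g

# Rewards observed after time t; a smaller gamma makes later rewards matter less.
rewards = [0.0, 0.0, 1.0, 5.0]
print(discounted_return(rewards, gamma=0.9))  # 0 + 0.9*0 + 0.81*1 + 0.729*5 = 4.455
```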

Finally, a model predicts what the environment will do next. We let \(P_{ss'}^a = \mathbb{P}(S_{t+1} = s' | S_t = s, A_t = a)\) and \(R_{s}^a = \mathbb{E}(R_{t+1} | S_t = s, A_t = a)\). \(P\) (the transition model) gives the probability of transitioning to a next state given the current state and action, while \(R\) (the reward model) gives the expected immediate reward when taking an action in some state.
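A tabular model can be stored directly as these two quantities. The sketch below (states, actions, and numbers are made up) shows how such a model supports a one-step look-ahead:

```python
# A tabular model of a small environment: transition probabilities P_{ss'}^a
# and expected rewards R_s^a, stored as dictionaries keyed by (state, action).
transition_model = {
    ("s1", "go"): {"s1": 0.2, "s2": 0.8},  # P(S_{t+1} = s' | S_t = s1, A_t = go)
    ("s2", "go"): {"s2": 1.0},
}
reward_model = {
    ("s1", "go"): 0.0,                     # E[R_{t+1} | S_t = s1, A_t = go]
    ("s2", "go"): 1.0,
}

def expected_next_value(state, action, value):
    """Use the model to look one step ahead: E[R_{t+1} + v(S_{t+1})] (no discounting here)."""
    return reward_model[(state, action)] + sum(
        p * value[s_next] for s_next, p in transition_model[(state, action)].items()
    )

print(expected_next_value("s1", "go", value={"s1": 0.0, "s2": 2.0}))  # 0.0 + 0.8 * 2.0 = 1.6
```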

Table 2 Types of Reinforcement Learning agents

  Category        Properties
  Value based     No policy (implicit), value function
  Policy based    Policy, no value function
  Actor Critic    Policy, value function
  Model free      No model of the environment
  Model based     Model of the environment

RL agents can be categorized as listed in Table 2. These categories can require different approaches, which will be discussed throughout the book.

There are two fundamental problems in sequential decision making.

  • Reinforcement Learning

    • The environment is initially unknown

    • The agent interacts with the environment

    • The agent improves its policy

  • Planning (e.g. deliberation, reasoning, introspection, pondering, thought, search)

    • A model of the environment is known

    • The agent performs computations with its model (without any external interaction)

    • The agent improves its policy

An agent also has to trade off exploration against exploitation. Depending on how this trade-off is made, the agent will be more or less flexible and may or may not discover better actions to perform (a common strategy is sketched after the list below).

  • Exploration finds more information about the environment

  • Exploitation exploits known information to maximize reward
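A common way to handle this trade-off is \(\epsilon\)-greedy action selection: with a small probability \(\epsilon\) the agent explores by picking a random action, and otherwise it exploits its current estimates. A minimal sketch (the action-value numbers are made up):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit the best-known action."""
    if random.random() < epsilon:
        return random.choice(list(q_values))   # exploration: try something at random
    return max(q_values, key=q_values.get)     # exploitation: pick the highest-valued action

action_values = {"left": 0.2, "right": 0.7, "up": 0.1, "down": 0.4}
print(epsilon_greedy(action_values, epsilon=0.1))  # usually "right", occasionally random
```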

Finally, it is possible to differentiate between prediction and control. Prediction is about evaluating the future given a certain policy, while control is about finding the best policy to optimize the future.