Model-Free Control
Warning
Please note that this notebook is still a work in progress, so some sections may behave unexpectedly or provide incorrect information.
In this chapter, several methods for Model-Free control will be discussed. Algorithms like Monte-Carlo and Temporal-Difference learning are often used for prediction, but how can these concepts be used for control? There are two different ways of learning when sampling:
On-policy learning
The goal is to learn about policy $\pi$ from experience that is being sampled from $\pi$ itself.
Off-policy learning
The goal is to learn about policy $\pi$ from experience that is being sampled from a different policy $\mu$, with $\mu \neq \pi$.
From generalized policy iteration, any valid policy evaluation algorithm can be followed by any valid policy improvement algorithm and iterated in order to find the optimal policy. The first question that should come up in your mind is “Can we just use the algorithms from the previous chapter for the policy evaluation step?”. Initially, there are two problems with this.
1. Greedy policy improvement over $V(s)$ requires a model of the MDP, since $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}^a_s + \mathcal{P}^a_{ss'} V(s') \right)$. We do not have access to the rewards and the transition probabilities. We know this is equal to $\arg\max_{a \in \mathcal{A}} Q(s, a)$, so learning the q-value function instead of the value function will be model-free.
2. Being greedy using sampling methods does not ensure we cover the entire state space.
Exploration vs. exploitation is a whole problem on its own in Reinforcement Learning and will be discussed in a later chapter. For now, we consider the easiest way to ensure continual exploration: instead of being greedy, let's be $\epsilon$-greedy. The policy becomes the following

$$
\pi(a|s) = \begin{cases} \frac{\epsilon}{m} + 1 - \epsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\ \frac{\epsilon}{m} & \text{otherwise} \end{cases}
$$

This means that there is a $1 - \epsilon + \frac{\epsilon}{m}$ probability to choose the greedy action and an $\frac{\epsilon}{m}$ probability to choose any other action, with $m$ being the number of actions. Now, all that is left to be done is to prove that any $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ actually is an improvement over $\pi$. The proof is rather simple:

$$
\begin{aligned}
q_\pi(s, \pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(a|s) \, q_\pi(s, a) \\
&= \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) \\
&\geq \frac{\epsilon}{m} \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a|s) - \frac{\epsilon}{m}}{1 - \epsilon} \, q_\pi(s, a) \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \, q_\pi(s, a) = v_\pi(s)
\end{aligned}
$$

Then, from the policy improvement theorem, $v_{\pi'}(s) \geq v_\pi(s)$.
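As a small illustration, here is a minimal sketch of $\epsilon$-greedy action selection over a tabular $Q$; the `[n_states, n_actions]` array layout and the NumPy `rng` argument are assumptions made for this example. With probability $\epsilon$ a uniformly random action is drawn, which gives the greedy action exactly the $1 - \epsilon + \frac{\epsilon}{m}$ probability from the formula above.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng):
    """Pick an action for `state` from a tabular Q of shape [n_states, n_actions]."""
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: pick any action uniformly at random (this includes the greedy one,
        # which is how the greedy action ends up with probability 1 - eps + eps/m).
        return int(rng.integers(n_actions))
    # Exploit: pick the greedy action.
    return int(np.argmax(Q[state]))
```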
We have now tackled the problems that prevented us from using the idea of generalized policy iteration. The methods we have previously seen can now be applied to solve the control problem.
On-Policy methods
In the last chapter, Monte-Carlo and TD-Learning were discussed for policy evaluation. This section will show how to apply these algorithms to control tasks.
Monte-Carlo Control

Fig. 5 Monte-Carlo Control algorithm
The idea of Fig. 5 is similar to policy iteration. In order to speed up convergence, it is not necessary to evaluate the policy until $Q$ has fully converged to $q_\pi$; instead, the policy can be evaluated and improved after every single episode, using the latest estimate $Q \approx q_\pi$ for the $\epsilon$-greedy improvement step.
There is a property called Greedy in the Limit with Infinite Exploration (GLIE), which ensures convergence to the optimal policy (which is always greedy). The following two rules must apply:

1. All state-action pairs are explored infinitely many times: $\lim_{k \rightarrow \infty} N_k(s, a) = \infty$
2. The policy converges to a greedy policy: $\lim_{k \rightarrow \infty} \pi_k(a|s) = \mathbf{1}\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$

Here, the $\epsilon$-greedy policy satisfies GLIE if $\epsilon$ is reduced to zero over time, for example with $\epsilon_k = \frac{1}{k}$ for episode $k$.
Algorithm 6 One iteration of GLIE Monte-Carlo Control
Sample the $k$-th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$
for each $S_t$, $A_t$ in the episode do
 $N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$
 $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$
end for
$\epsilon \leftarrow 1/k$, $\pi \leftarrow \epsilon$-greedy($Q$)
return $Q$, $\pi$
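Below is a minimal sketch of one such iteration for a tabular problem, re-using the `epsilon_greedy_action` helper from above. The `env` interface (a `reset()` returning the initial state and a `step(action)` returning `(next_state, reward, done)`) is an assumption made for this example.

```python
def glie_mc_iteration(env, Q, N, k, gamma, rng):
    """Run one episode with an epsilon-greedy policy (epsilon = 1/k) and update Q.

    Q, N: arrays of shape [n_states, n_actions] (action values and visit counts).
    """
    epsilon = 1.0 / k
    episode = []  # list of (state, action, reward) tuples
    state, done = env.reset(), False
    while not done:
        action = epsilon_greedy_action(Q, state, epsilon, rng)
        next_state, reward, done = env.step(action)
        episode.append((state, action, reward))
        state = next_state

    # Every-visit MC update: walk backwards and accumulate the return G_t.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[state, action] += 1
        Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q, N
```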
TD-Control (SARSA)
TD-Learning has several advantages over MC, such as lower variance, online updating, and the ability to learn from incomplete sequences. It is therefore a natural idea to use TD instead of MC in the control loop that was previously presented. This method is referred to as SARSA and yields the following algorithm
Algorithm 7 One iteration of SARSA
Require: $\pi \leftarrow \epsilon$-greedy($Q$)
Initialize $S$, choose $A \sim \pi(\cdot|S)$
for each step of the episode do
 Take action $A$, observe $R$, $S'$; choose $A' \sim \pi(\cdot|S')$
 $Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma Q(S', A') - Q(S, A) \right)$
 $S \leftarrow S'$, $A \leftarrow A'$
end for
$\pi \leftarrow \epsilon$-greedy($Q$)
return $Q$, $\pi$
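As a sketch, one SARSA episode under the same assumed tabular setup, `epsilon_greedy_action` helper, and `env` interface as before:

```python
def sarsa_episode(env, Q, epsilon, alpha, gamma, rng):
    """Run one episode of SARSA, updating Q online after every step."""
    state, done = env.reset(), False
    action = epsilon_greedy_action(Q, state, epsilon, rng)
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy_action(Q, next_state, epsilon, rng)
        # The TD target uses the action actually selected by the epsilon-greedy policy.
        target = reward + (0.0 if done else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
    return Q
```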
SARSA converges to the optimal action-value function under the following conditions:

- GLIE sequence of policies $\pi_t(a|s)$
- Robbins-Monro sequence of step-sizes $\alpha_t$, i.e. $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$

In practice, however, these conditions are only very rarely taken into account (for example, a constant learning rate $\alpha$ is commonly used), and SARSA often still works well.
Just like TD($\lambda$), SARSA can be extended to use $n$-step returns. The $n$-step Q-return is

$$
q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})
$$

$n$-step SARSA updates $Q(S_t, A_t)$ towards this return. (Forward View) SARSA($\lambda$) combines all $n$-step Q-returns with weights $(1 - \lambda)\lambda^{n-1}$ into the $q^\lambda$ return and updates towards it:

$$
q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}, \qquad Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( q_t^{\lambda} - Q(S_t, A_t) \right)
$$

Like the forward view of TD($\lambda$), this cannot be computed online, which is why the backward view uses eligibility traces.
Algorithm 8 One iteration of (backward view) SARSA($\lambda$)
Require: $\pi \leftarrow \epsilon$-greedy($Q$), $\lambda$
Initialize $E(s, a) = 0$ for all $s, a$; initialize $S$, choose $A \sim \pi(\cdot|S)$
for each step of the episode do
 Take action $A$, observe $R$, $S'$; choose $A' \sim \pi(\cdot|S')$
 $\delta \leftarrow R + \gamma Q(S', A') - Q(S, A)$
 $E(S, A) \leftarrow E(S, A) + 1$
 for all unique previously occurred pairs $(s, a)$ do
  $Q(s, a) \leftarrow Q(s, a) + \alpha \delta E(s, a)$
  $E(s, a) \leftarrow \gamma \lambda E(s, a)$
 end for
 $S \leftarrow S'$, $A \leftarrow A'$
end for
return $Q$
The eligibility traces for the backward view algorithm are, just like the value function, now maintained for all state-action pairs instead of for states only:

$$
E_0(s, a) = 0, \qquad E_t(s, a) = \gamma \lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)
$$

Fig. 6 The role of the $\lambda$ parameter
In Fig. 6, you can see the role of the $\lambda$ parameter. When a reward is received (in the image only at the end), the combination of the recency and visitation heuristics built into the eligibility trace ensures that the reward is propagated backwards with a decay. This is all one iteration of the algorithm.
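Below is a minimal sketch of backward-view SARSA($\lambda$) with accumulating traces, again assuming the tabular setup, the `epsilon_greedy_action` helper, and the `env` interface used in the earlier sketches.

```python
import numpy as np

def sarsa_lambda_episode(env, Q, epsilon, alpha, gamma, lam, rng):
    """Run one episode of backward-view SARSA(lambda) with accumulating traces."""
    E = np.zeros_like(Q)  # one eligibility trace per state-action pair
    state, done = env.reset(), False
    action = epsilon_greedy_action(Q, state, epsilon, rng)
    while not done:
        next_state, reward, done = env.step(action)
        next_action = epsilon_greedy_action(Q, next_state, epsilon, rng)
        delta = reward + (0.0 if done else gamma * Q[next_state, next_action]) - Q[state, action]
        E[state, action] += 1.0  # visitation heuristic: bump the trace of the visited pair
        # Broadcast the TD error to all pairs in proportion to their eligibility,
        # then decay every trace (recency heuristic).
        Q += alpha * delta * E
        E *= gamma * lam
        state, action = next_state, next_action
    return Q
```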
Off-Policy methods
The goal of off-policy learning is to evaluate the target policy $\pi(a|s)$, computing $v_\pi(s)$ or $q_\pi(s, a)$, while following a different behaviour policy $\mu(a|s)$. This makes it possible, for example, to re-use experience generated by old policies, or to learn about the optimal policy while following an exploratory one.
Importance Sampling is a way of estimating the expectation of a different distribution. The underlying idea is the following

$$
\mathbb{E}_{X \sim P}[f(X)] = \sum P(X) f(X) = \sum Q(X) \frac{P(X)}{Q(X)} f(X) = \mathbb{E}_{X \sim Q}\left[ \frac{P(X)}{Q(X)} f(X) \right]
$$

From this derivation, notice that it is possible to estimate an expectation under distribution $P$ while sampling from a different distribution $Q$, as long as every sample is re-weighted by the ratio $P(X)/Q(X)$.
Importance Sampling for Off-Policy MC / TD
The idea of this simple ratio can be used in the Monte-Carlo return and the TD target. This means we could follow the behaviour policy $\mu$ and correct the return $G_t$ towards the target policy $\pi$ by multiplying with the importance sampling ratio over the whole episode:

$$
G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t
$$

The value will then be updated towards the corrected return:

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^{\pi/\mu} - V(S_t) \right)
$$

For TD, the update rule becomes

$$
V(S_t) \leftarrow V(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \left( R_{t+1} + \gamma V(S_{t+1}) \right) - V(S_t) \right)
$$

The benefit of the TD version is that it has much lower variance than the MC approach: the policies only need to be similar over a single step instead of over the whole episode.
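A minimal sketch of this importance-sampled TD(0) update for a tabular value function; the `[n_states, n_actions]` probability tables `pi` and `mu` are assumptions made for this example.

```python
def off_policy_td_update(V, state, action, reward, next_state, alpha, gamma, pi, mu):
    """One importance-sampled TD(0) update of V while acting according to mu.

    pi, mu: arrays of shape [n_states, n_actions] holding the action probabilities
    of the target and behaviour policies respectively.
    """
    rho = pi[state, action] / mu[state, action]  # single-step importance ratio
    target = rho * (reward + gamma * V[next_state])
    V[state] += alpha * (target - V[state])
    return V
```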
Q-Learning
Now let's consider off-policy learning for action-values $Q(s, a)$; importantly, this requires no importance sampling. The next action is chosen using the behaviour policy, $A_{t+1} \sim \mu(\cdot|S_t)$, but we also consider an alternative successor action $A' \sim \pi(\cdot|S_{t+1})$ and update $Q(S_t, A_t)$ towards the value of that alternative action:

$$
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A') - Q(S_t, A_t) \right)
$$

Now allow both policies to improve. The target policy $\pi$ is greedy with respect to $Q(s, a)$, i.e. $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$, while the behaviour policy $\mu$ is, for example, $\epsilon$-greedy with respect to $Q(s, a)$. The Q-Learning target then simplifies to

$$
R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
$$

The Q-Learning algorithm is then
Algorithm 9 One iteration of Q-Learning
Require: $\mu \leftarrow \epsilon$-greedy($Q$)
Initialize $S$
for each step of the episode do
 Choose $A \sim \mu(\cdot|S)$, take action $A$, observe $R$, $S'$
 $Q(S, A) \leftarrow Q(S, A) + \alpha \left( R + \gamma \max_{a'} Q(S', a') - Q(S, A) \right)$
 $S \leftarrow S'$
end for
return $Q$
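Finally, a minimal sketch of one Q-Learning episode under the same assumed tabular setup, `epsilon_greedy_action` helper, and `env` interface as in the earlier sketches. Note the difference from SARSA: the agent acts with the $\epsilon$-greedy behaviour policy, but bootstraps from the greedy (target-policy) action.

```python
import numpy as np

def q_learning_episode(env, Q, epsilon, alpha, gamma, rng):
    """Run one episode of Q-Learning: act epsilon-greedily, bootstrap from the greedy action."""
    state, done = env.reset(), False
    while not done:
        action = epsilon_greedy_action(Q, state, epsilon, rng)  # behaviour policy
        next_state, reward, done = env.step(action)
        # The target uses the max over actions: the greedy target policy, not the behaviour policy.
        target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
    return Q
```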