Model-Free Control

Warning

Please note that this notebook is still a work in progress, so some sections may behave unexpectedly or provide incorrect information.

In this chapter, several methods for model-free control will be discussed. Algorithms like Monte-Carlo and Temporal-Difference learning are often used for prediction, but how can these concepts be used for control? There are two different ways of learning when sampling:

  • On-policy learning

    The goal is to learn about policy π from experience that is being sampled by π

  • Off-policy learning

    The goal is to learn about policy π from experience that is being sampled by a policy μ, with μ ≠ π.

From generalized policy iteration, any valid policy evaluation algorithm can be followed by any valid policy improvement algorithm and iterated in order to find the optimal policy. The first question that should come up in your mind is “Can we just use the algorithms from the previous chapter for the policy evaluation step?”. Initially, there are two problems with this.

  1. Greedy policy improvement over $V(s)$ requires a model of the MDP, since $\pi'(s) = \arg\max_{a \in \mathcal{A}} \left( \mathcal{R}_s^a + \mathcal{P}_{ss'}^a V(s') \right)$. We do not have access to the rewards and the transition probabilities.

    We know this is equal to $\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$. So learning the q-value function instead of the value function will be model-free.

  2. Being greedy using sampling methods does not ensure we cover the entire state space.

    Exploration vs. exploitation is a whole problem on its own in Reinforcement Learning, which will be discussed in a later chapter. For now, we consider the easiest way to ensure continual exploration: instead of being greedy, let's be ϵ-greedy. The policy becomes the following

    $$\pi(a|s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a = \arg\max_{a' \in \mathcal{A}} Q(s, a') \\ \epsilon/m & \text{otherwise} \end{cases}$$

    This means that with probability 1 − ϵ the greedy action is chosen, and with probability ϵ an action is chosen uniformly at random, with m being the number of actions (a small code sketch of this rule is given after this list).

    Now, all that is left to do is to prove that an ϵ-greedy policy π′ with respect to q_π is actually an improvement over π. The proof is rather simple:

    $$\begin{aligned} q_\pi(s, \pi'(s)) &= \sum_{a \in \mathcal{A}} \pi'(a|s)\, q_\pi(s, a) \\ &= \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \max_{a \in \mathcal{A}} q_\pi(s, a) \\ &\geq \epsilon/m \sum_{a \in \mathcal{A}} q_\pi(s, a) + (1 - \epsilon) \sum_{a \in \mathcal{A}} \frac{\pi(a|s) - \epsilon/m}{1 - \epsilon}\, q_\pi(s, a) \\ &= \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s, a) = v_\pi(s) \end{aligned}$$

    Then, from the policy improvement theorem, $v_{\pi'}(s) \geq v_\pi(s)$.
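To make the selection rule concrete, here is a minimal sketch of ϵ-greedy action selection over a tabular Q-function in Python. The array layout (one row of action-values per state) and the `rng` argument are assumptions made for illustration.

```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, rng):
    """Sample an action from the epsilon-greedy policy derived from Q.

    Q is assumed to be an array of shape (n_states, n_actions).
    """
    m = Q.shape[1]                      # number of actions
    if rng.random() < epsilon:
        return int(rng.integers(m))     # explore: uniform random action
    return int(np.argmax(Q[state]))     # exploit: greedy action

# With this scheme the greedy action is picked with probability
# 1 - epsilon + epsilon/m and every other action with probability epsilon/m,
# matching the policy defined above.
```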

We have now tackled the problems that prevented us from using the idea of generalized policy iteration. The methods we have previously seen can now be applied to solve the control problem.

On-Policy methods

In the last chapter, Monte-Carlo and TD-learning were discussed for policy evaluation. This section will show how to apply these algorithms to control tasks.

Monte-Carlo Control

../../_images/MC-Control.png

Fig. 5 Monte-Carlo Control algorithm

The idea of Fig. 5 is similar to policy iteration. To speed up convergence, it is not necessary to evaluate the policy all the way to q_π. A better approach is to perform Monte-Carlo policy evaluation for a single episode, and then perform an ϵ-greedy policy improvement step with respect to the computed q-values.

There is a property called Greedy in the Limit with Infinite Exploration (GLIE), which ensures convergence to the optimal policy (which is always greedy). The following two conditions must hold

  1. $\lim_{k \to \infty} N_k(s, a) = \infty$

  2. $\lim_{k \to \infty} \pi_k(a|s) = \mathbf{1}\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$

Here, the ϵ-greedy policy would reduce to greedy when there is infinite experience. For example, ϵ-greedy is GLIE if ϵ reduces to zero at the rate $\epsilon_k = \frac{1}{k}$. Let's construct an algorithm based on this knowledge.

Algorithm 6 One iteration of GLIE Monte-Carlo Control

for $t \Leftarrow 0, \dots, T$ do

$A_t, R_{t+1}, S_{t+1} \sim \pi$

$N(S_t, A_t) \Leftarrow N(S_t, A_t) + 1$

$Q(S_t, A_t) \Leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)} \left( G_t - Q(S_t, A_t) \right)$

end for

$\epsilon \Leftarrow 1/k$, $\pi \Leftarrow \epsilon$-greedy($Q$)

return $Q, \pi$
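A minimal Python sketch of one such iteration is given below. The Gymnasium-style environment interface (`env.reset()`, `env.step()`), the array-based tables, and the every-visit return computation are assumptions made for illustration, not part of the original algorithm.

```python
import numpy as np

def glie_mc_iteration(env, Q, N, k, gamma=1.0, rng=None):
    """One iteration of GLIE Monte-Carlo control (Algorithm 6):
    sample an episode with an epsilon-greedy policy (epsilon = 1/k),
    then update Q towards the observed returns with incremental means.
    Q and N are assumed to be (n_states, n_actions) arrays."""
    rng = rng or np.random.default_rng()
    epsilon, m = 1.0 / k, Q.shape[1]

    # Sample one episode A_t, R_{t+1}, S_{t+1} ~ pi (epsilon-greedy in Q).
    episode, done = [], False
    state, _ = env.reset()
    while not done:
        if rng.random() < epsilon:
            action = int(rng.integers(m))           # explore
        else:
            action = int(np.argmax(Q[state]))       # exploit
        next_state, reward, terminated, truncated, _ = env.step(action)
        episode.append((state, action, reward))
        state, done = next_state, terminated or truncated

    # Walk the episode backwards so G accumulates the discounted return G_t,
    # then update Q(S_t, A_t) with an incremental mean.
    G = 0.0
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        N[state, action] += 1
        Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q, N
```

The next iteration then acts ϵ-greedily with respect to the updated Q with ϵ = 1/(k+1), which is the policy improvement step.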

TD-Control (SARSA)

TD-learning has several advantages over MC, such as lower variance, online updating, and the ability to learn from incomplete sequences. A natural idea is therefore to use TD instead of MC in the control loop that was previously presented; this method is referred to as SARSA. It yields the following algorithm

Algorithm 7 One iteration of SARSA

Require: $\pi \Leftarrow \epsilon$-greedy($Q$)

for $t \Leftarrow 0, \dots, T$ do

$A_t, R_{t+1}, S_{t+1}, A_{t+1} \sim \pi$

$Q(S_t, A_t) \Leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$

end for

$\epsilon \Leftarrow 1/k$, $\pi \Leftarrow \epsilon$-greedy($Q$)

return $Q, \pi$
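A sketch of one SARSA episode under the same assumptions as before (Gymnasium-style environment, tabular Q as a NumPy array); here ϵ and α are simply passed in rather than derived from an iteration counter.

```python
import numpy as np

def sarsa_episode(env, Q, alpha, epsilon, gamma=1.0, rng=None):
    """One episode of SARSA (Algorithm 7): after every step, update
    Q(S_t, A_t) towards R_{t+1} + gamma * Q(S_{t+1}, A_{t+1})."""
    rng = rng or np.random.default_rng()

    def act(s):
        # epsilon-greedy action selection from Q
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    state, _ = env.reset()
    action = act(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = act(next_state)
        # On-policy target: uses the action that will actually be taken next.
        target = reward + (0.0 if done else gamma * Q[next_state, next_action])
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
    return Q
```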

SARSA converges to the optimal action-value function under the following conditions:

  1. A GLIE sequence of policies $\pi_t(a|s)$

  2. A Robbins-Monro sequence of step-sizes $\alpha_t$: $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$

In practice, however, these conditions are rarely taken into account (for example, a constant learning rate α is often used), and this does not seem to cause much of a problem.
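For example, the step-size schedule $\alpha_t = \frac{1}{t}$ satisfies both Robbins-Monro conditions:

$$\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty,$$

whereas a constant step-size α satisfies the first condition but not the second.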

Just like n-step TD and TD(λ), there exist n-step SARSA and SARSA(λ). SARSA(λ) also has the same forward- and backward-view algorithms. The concepts are exactly the same; the only difference is that SARSA works with action-values. For this reason, I will just give you the formulas for it (a small sketch of the n-step target follows the list). The intuition is the same as in the previous chapter.

  1. n-step SARSA

    $$Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left[ q_t^{(n)} - Q(S_t, A_t) \right]$$

    $$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$$

  2. (Forward View) SARSA(λ)

    $$Q(S_t, A_t) = Q(S_t, A_t) + \alpha \left[ q_t^{\lambda} - Q(S_t, A_t) \right]$$

    $$q_t^{\lambda} = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$$
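As a small illustration, the n-step target $q_t^{(n)}$ could be computed as below; the `rewards` list holding $R_{t+1}, \dots, R_{t+n}$ and the tabular Q array are assumptions made for this sketch.

```python
def n_step_sarsa_target(rewards, Q, state_n, action_n, gamma):
    """Compute q_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n}
    + gamma^n Q(S_{t+n}, A_{t+n}) from the n observed rewards."""
    n = len(rewards)
    q_n = sum(gamma ** i * r for i, r in enumerate(rewards))
    return q_n + gamma ** n * Q[state_n, action_n]
```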

Algorithm 8 Iteration of (backward view) SARSA($\lambda$)

Require: $\pi \Leftarrow \epsilon$-greedy($Q$), $Q$

$E(s, a) = 0$

Initialize $S_0, A_0$

for $t \Leftarrow 0, \dots, T$ do

$R_{t+1}, S_{t+1}, A_{t+1} \sim \pi$

$\delta_t \Leftarrow R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$

$E(S_t, A_t) \Leftarrow E(S_t, A_t) + 1$

for all unique previously occurred $(s, a)$ pairs do

$Q(s, a) \Leftarrow Q(s, a) + \alpha \delta_t E(s, a)$

$E(s, a) \Leftarrow \gamma \lambda E(s, a)$

end for

end for

return $Q$

The eligibility traces in the backward-view algorithm are, just like the value function, now maintained for all state-action pairs rather than just for states.

../../_images/lambda-role-sarsa.png

Fig. 6 The role of λ in SARSA(λ)

In Fig. 6, you can see the role of the λ parameter. When a reward is received (in the image only at the end), the combination of the recency and frequency heuristics built into the eligibility trace ensures that the reward is propagated backwards with a decay. All of this happens within one iteration of the algorithm.
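Below is a Python sketch of one episode of backward-view SARSA(λ) with accumulating traces, under the same tabular and Gymnasium-style assumptions as the earlier sketches.

```python
import numpy as np

def sarsa_lambda_episode(env, Q, alpha, lam, epsilon, gamma=1.0, rng=None):
    """One episode of backward-view SARSA(lambda) (Algorithm 8) with
    accumulating eligibility traces over all state-action pairs."""
    rng = rng or np.random.default_rng()

    def act(s):
        # epsilon-greedy action selection from Q
        if rng.random() < epsilon:
            return int(rng.integers(Q.shape[1]))
        return int(np.argmax(Q[s]))

    E = np.zeros_like(Q)                 # eligibility traces, E(s, a) = 0
    state, _ = env.reset()
    action = act(state)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        next_action = act(next_state)
        # TD error: delta_t = R_{t+1} + gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)
        delta = reward + (0.0 if done else gamma * Q[next_state, next_action]) \
                - Q[state, action]
        E[state, action] += 1.0          # accumulate the trace of the visited pair
        Q += alpha * delta * E           # update every (s, a) in proportion to its trace
        E *= gamma * lam                 # decay all traces
        state, action = next_state, next_action
    return Q
```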

Off-Policy methods

The goal of off-policy learning is to evaluate the target policy π(a|s) to compute v_π(s) or q_π(s,a), while following a different policy μ(a|s), referred to as the behaviour policy. The most well-known use of this is learning about the optimal policy while following an exploratory policy.

Importance sampling is a way of estimating an expectation under one distribution while sampling from another. The underlying idea is the following

$$\mathbb{E}_{X \sim P}[f(X)] = \sum f(X) P(X) = \sum f(X) \frac{P(X)}{Q(X)} Q(X) = \mathbb{E}_{X \sim Q}\left[ \frac{P(X)}{Q(X)} f(X) \right]$$

From this derivation, notice that it is possible to estimate an expectation under distribution P while sampling from Q, using a simple ratio $\frac{P(X)}{Q(X)}$.
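A quick numerical check of this identity, with made-up distributions P and Q over three outcomes and $f(X) = X$:

```python
import numpy as np

rng = np.random.default_rng(0)
values = np.array([0.0, 1.0, 2.0])     # f(X) = X for three outcomes
P = np.array([0.2, 0.5, 0.3])          # target distribution
Q = np.array([0.4, 0.4, 0.2])          # behaviour / sampling distribution

# Sample from Q, then reweight each sample by P(X)/Q(X) to estimate E_{X~P}[f(X)].
idx = rng.choice(len(values), size=100_000, p=Q)
estimate = np.mean(values[idx] * P[idx] / Q[idx])

print(estimate)   # close to the true value sum(P * values) = 1.1
```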

Importance Sampling for Off-Policy MC / TD

The idea of this simple ratio can be used in the Monte-Carlo return and the TD target. This means we could follow policy μ, while learning about policy π, by multiplying by the ratio π/μ at every time-step (as seen in the derivation above).

$$G_t^{\pi/\mu} = \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \frac{\pi(A_{t+1}|S_{t+1})}{\mu(A_{t+1}|S_{t+1})} \cdots \frac{\pi(A_T|S_T)}{\mu(A_T|S_T)} G_t$$

The value will then be updated towards the corrected return: $V_{k+1}(S_t) = V_k(S_t) + \alpha \left( G_t^{\pi/\mu} - V_k(S_t) \right)$. There are several downsides to using this method. The first one is that it cannot be used if $\mu(a|s) = 0$ while $\pi(a|s) \neq 0$. More importantly, it can dramatically increase variance, which is of course something to avoid when possible.

For TD, the update rule becomes

$$V_{k+1}(S_t) = V_k(S_t) + \alpha \left( \frac{\pi(A_t|S_t)}{\mu(A_t|S_t)} \left( R_{t+1} + \gamma V_k(S_{t+1}) \right) - V_k(S_t) \right)$$

The benefit is that this has much lower variance than the MC approach, since the policies only need to be similar over a single step instead of over the whole episode.
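A sketch of a single importance-sampled TD(0) update as written above; the tabular layout, with `pi` and `mu` holding the action probabilities of the target and behaviour policies, is an assumption made for illustration.

```python
def off_policy_td_update(V, pi, mu, s, a, r, s_next, alpha, gamma):
    """One importance-sampled TD(0) update: correct the TD target with the
    single-step ratio pi(a|s) / mu(a|s). V is a 1-D array of state values;
    pi[s, a] and mu[s, a] are the probabilities of taking a in s."""
    rho = pi[s, a] / mu[s, a]          # requires mu[s, a] > 0
    target = rho * (r + gamma * V[s_next])
    V[s] += alpha * (target - V[s])
    return V
```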

Q-Learning

Now let's consider off-policy learning of action-values Q(s,a). The idea of Q-learning is to choose the next action using the behaviour policy, $A_{t+1} \sim \mu(\cdot|S_t)$, but to also consider an alternative successor action $A' \sim \pi(\cdot|S_t)$. $Q(S_t, A_t)$ is then updated towards the value of the alternative action: $Q_{k+1}(S_t, A_t) = Q_k(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q_k(S_{t+1}, A') - Q_k(S_t, A_t) \right)$.

Now allow both policies to improve: the target policy π is greedy with respect to Q, i.e. $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$, while the behaviour policy μ is ϵ-greedy with respect to Q. In this case, the Q-learning target simplifies to the following

$$R_{t+1} + \gamma Q(S_{t+1}, A') = R_{t+1} + \gamma Q\left(S_{t+1}, \arg\max_{a'} Q(S_{t+1}, a')\right) = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$$

The algorithm of Q-Learning is then

Algorithm 9 One iteration of Q-Learning

Require: $\pi \Leftarrow \epsilon$-greedy($Q$), $Q$

for $t \Leftarrow 0, \dots, T$ do

$A_t, R_{t+1}, S_{t+1} \sim \pi$

$Q(S_t, A_t) \Leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right]$

end for

return $Q, \pi$
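Finally, a Python sketch of one Q-learning episode under the same tabular, Gymnasium-style assumptions as the previous sketches:

```python
import numpy as np

def q_learning_episode(env, Q, alpha, epsilon, gamma=1.0, rng=None):
    """One episode of Q-learning (Algorithm 9): behave epsilon-greedily,
    but bootstrap from the greedy (max) action in the successor state."""
    rng = rng or np.random.default_rng()
    state, _ = env.reset()
    done = False
    while not done:
        # Behaviour policy mu: epsilon-greedy with respect to Q.
        if rng.random() < epsilon:
            action = int(rng.integers(Q.shape[1]))
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # Target policy pi: greedy, hence the max over successor actions.
        target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
        Q[state, action] += alpha * (target - Q[state, action])
        state = next_state
    return Q
```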