Notes from David Silver’s RL lecture series
2. Markov Decision Processes
Markov processes
a.k.a. Markov chain.
\[\mathcal{P}_{ss'} = \mathbb{P} [ S_{t+1} = s' \mid S_t = s ]\]Markov reward processes
A Markov process with value judgements
We get reward, \(\mathcal{R}_{s} = \mathbb{E} [ R_{t+1} \mid S_t = s ]\), when we leave state \(s\).
The total return from leaving \(s\) is: \(\; G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2R_{t+3} + \dots\)
The state-value function, the long-term value of being in \(s\), is:
\[v(s) = \mathbb{E}[ G_t \mid S_t = s ]\]The Bellman expectation equation for MRPs is (if \(v\) is a valid state function):
\[\begin{aligned} v(s) &= \mathbb{E}[ R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s ] \\ &= \mathcal{R}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'} \, v(s') \end{aligned}\] \[v = \mathcal{R} + \gamma \mathcal{P} v\]We can evaluate \(v\) directly with \(v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}\)
Markov decision processes
A Markov reward process with decisions
\[\mathcal{P^a_{ss'}} = \mathbb{P}[ S_{t+1} = s' \mid S_t = s, A_t = a ]\] \[\mathcal{R}^a_{s} = \mathbb{E} [ R_{t+1} \mid S_t = s, A_t = a ]\]A policy \(\pi\) is defined as: \(\; \pi(a|s) = \mathbb{P}[A_t = a \mid S_t = s]\)
A Markov decision process can be flattened into a Markov reward process:
\[\mathcal{P}^{\pi}_{ss'} = \sum_{a \in \mathcal{A}} \pi (a|s) \mathcal{P}^a_{ss'}\] \[\mathcal{R}^{\pi}_s = \sum_{a \in \mathcal{A}} \pi (a|s) \mathcal{R}^a_s\]The state-value function:
\[v_{\pi}(s) = \mathbb{E}_{\pi}[G_t \mid S_t = s]\]The action-value function:
\[q_{\pi}(s, a) = \mathbb{E}_{\pi}[G_t \mid S_t = s, A_t = a]\]Bellman again:
\[v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) q_{\pi}(s,a)\] \[q_{\pi}(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} v_{\pi}(s')\] \[v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \left( \mathcal{R}^a_s + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} v_{\pi}(s') \right)\] \[q_{\pi}(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} \sum_{a' \in \mathcal{A}} \pi(a'|s') q_{\pi}(s',a')\]The optimal value functions are simply:
\[v_{*}(s) = \max_{\pi} v_{\pi}(s)\] \[q_{*}(s, a) = \max_{\pi} q_{\pi}(s, a)\]There is a partial ordering for policies:
\[\pi \geq \pi' \iff v_{\pi}(s) \geq v_{\pi'}(s), \forall s\]There is a theorem that says there is at least one deterministic policy \(v_*\) where:
\[v_* \geq v_{\pi}, \forall \pi\]The Bellman Optimality Equations:
\[v_*(s) = \max_a q_*(s, a)\] \[q_*(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} v_*(s')\] \[v_*(s) = \max_a \left( \mathcal{R}^a_s + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} v_*(s') \right)\] \[q_*(s,a) = \mathcal{R}^a_s + \gamma \sum_{s' \in S} \mathcal{P}^a_{ss'} \max_{a'} q_*(s',a')\]3. Planning by Dynamic Programming
An optimization method for sequential problems, given perfect knowledge of the environment
Policy evaluation (prediction, using the Bellman expectation equation)
How much return from a policy? What is the \(v_{\pi}\)?
We repeat this step on the flattened MDP:
\[\vec{v} \gets \vec{\mathcal{R}} + \gamma \vec{\mathcal{P} v}\]Policy iteration (solving an MDP)
Finding the best policy. What is the \(v_*\)?
We repeat these steps:
\[v_{\pi} \gets evaluate(\pi) \qquad \text{iteratively}\] \[\pi \gets greedy(v_{\pi})\]Here is the greedy step:
\[\pi'(s) \gets \mathop{\mathrm{argmax}}_{a \in \mathcal{A}} \; q_{\pi}(s,a)\]Value iteration (solving an MDP)
Applying the Bellman equation without an explicit policy
In the policy iteration (above), the \(evaluate(\pi)\) stage contained an inner iteration of its own. If we apply that inner iteration just once, we get value iteration.
Value iteration just repeats this step:
\[v \gets \max_a \left( \vec{\mathcal{R}}^a + \gamma \vec{\mathcal{P}^a v} \right)\]4. Model-Free Prediction (no actions yet)
Evaluating an MDP without a model
Monte-Carlo Learning (unbiased)
\[v_{\pi} = \mathbb{E}_{\pi} [ G_t \mid S_t = s]\]First-visit MC policy evaluation only considers the first visit to a state in a given episode.
Every-visit MC policy evaluation only considers all visits.
Means can be computed incrementally:
\[\mu_k = \mu_{k-1} + \frac{1}{k} (x_k - \mu_{k-1})\] \[V(S_t) \; \text{ += } \; \frac{1}{N(S_t)} (G_t - V({S_t}))\]For non-stationary problems or evolving policies, this is better:
\[V(S_t) \; \text{ += } \; \alpha (G_t - V({S_t}))\]Temporal-Difference Learning (for model-free prediction) - TD(0)
Updates a guess towards a guess. Lower variance and faster, but biased. TD(0) can converge incorrectly when using function approximation. More sensitive to the initial value.
\[V(S_t) \; \text{ += } \; \alpha ( \underbrace{ \underbrace{ R_{t+1} + \gamma V(S_{t+1})} _{\text{TD Target}} - V({S_t}) } _{\text{TD error}} )\]Or more colloquially:
\[\text{Nudge } \, V(S_t) \, \text{ towards, the target } \, R_{t+1} + \gamma V(S_{t+1})\]A,0,B,0
B,1
B,1
B,0
MC says V(A) = 1, but TD says V(A) = 0.5.
\[\begin{array}{c|cc} & samples & fullwidth \\ \hline bootstraps & TD & DP \\ & MC & exhaustive \end{array}\]TD(λ)
A spectrum of sampling methods
One-step TD learning (n=1), TD(0):
\[G^{(1)}_t = R_{t+1} + \gamma V(S_{t+1})\]Two-step TD learning (n=2):
\[G^{(2)}_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})\]N-step TD learning:
\[\text{Nudge } \, V(S_t) \, \text{ towards } \, G^{(n)}_t\]TD(λ) learning:
\[\text{Nudge } \, V(S_t) \, \text{ towards } \, G^{\lambda}_t, \text{where}\] \[G^{\lambda}_t = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G^{(n)}_t\]The geometric decay of the \(G^{(n)}_t\) coefficients allows for these eligibility traces:
\[E_0(s) = 0\] \[E_t(s) = \gamma \lambda E_{t-1}(s) + \boldsymbol{1} (S_t = s)\]which gives us these steps:
\[\delta_t \gets R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\] \[V(s) \gets V(s) + \alpha \delta_t E_t(s)\]5. Model-free control
Finding the optimal value functions without a model
On-Policy MC control
On-Policy: learning about π from experience sampled from π
There are two problems:
- It needs the model because \(\pi'(s) = \mathrm{argmax} (\mathcal{R^a_s} + \mathcal{P^a_{ss'}} V(s'))\)
- It doesn’t explore
For the first problem, we use $q$ instead of $v$.
For the second problem, we use the $\varepsilon$-greedy policy:
\[\pi(a|s) = \begin{cases} \epsilon / m + 1 - \epsilon & \text{if } \; \displaystyle{\mathop{argmax}_{a \in \mathcal{A}}} (Q(s, a)) = a^* \\ \epsilon / m & \text{otherwise} \end{cases}\]To speed things up, we can change from:
loop(evaluate q with many MC episodes of running π; the ε-greedy step of making a new π)
to:
loop(evaluate q with just one MC episode of running π; the ε-greedy step of making a new π)
To make it simpler, the π is implicit and is always calculated from q.
On-Policy TD Learning
Use TD instead of MC. Keep nudging q by updating a guess towards a (better) guess, at every step, not just every episode.
It is known as Sarsa, and a step is like this:
\[\text{Nudge } \, Q(S,A) \, \text{ towards } \, R + \gamma Q(S',A')\]where we are in a state S and considering taking the action A, and we want to know the R and then the value of the next action we would take.
Instead of this 1-step SARSA, we could have 2-step etc., or a λ version (including the eligibility traces).
Off-policy learning
Off-Policy: learning about π from experience sampled from some behaviour policy μ
Importance sampling: estimating the expectation of a different distribution, by multiplying by the ratio, and that corrects for the change between your distributions.
\[\begin{aligned} \mathbb{E}_{X \sim P} [f(X)] &= \sum P(X)f(X) \\ &= \sum Q(X) \frac{P(X)}{Q(X)} f(X) \\ &= \mathbb{E}_{X \sim Q} \left[ \frac{P(X)}{Q(X)} f(X) \right] \end{aligned}\]In MC, we could compute a \(G_t^{\pi / \mu}\) using lots of ratios, but it ends up with very high variance.
Therefore, we have to use TD for off-policy learning, where we nudge \(V(S_t)\) towards this target:
\[\frac{\pi (A_t|S_t)}{\mu (A_t|S_t)} (R_{t+1} + \gamma V(S_{t+1}))\]Alternatively, we use Q-learning (which is specific to TD(0)):
$\mu$ is the behaviour policy for selecting our next action $A_{t+1}$.
But we are also going to consider an alternative successor action $A’$ we might have taken had we been following our target policy $\pi$.
\[A_{t+1} \sim \mu (\cdot | S_t)\] \[A' \sim \pi (\cdot | S_t)\]We nudge $Q(S_t, A_t)$ towards this target:
\[R_{t+1} + \gamma Q(S_{t+1}, A')\]The common case is where $\pi$ is greedy w.r.t. $Q(s,a)$ and $\mu$ is ε-greedy w.r.t. $Q(s,a)$.
In other words, the $A’$ is taken greedily and the $A_{t+1}$ is sampled ε-greedily.
\[\begin{align} R_{t+1} + \gamma Q(S_{t+1}, A') &= R_{t+1} + \gamma Q(S_{t+1}, \mathrm{argmax} \; Q(S_{t+1}, a')) \\ &= R_{t+1} + \mathrm{max} \; \gamma Q(S_{t+1}, a') \end{align}\]