Reinforcement Learning [Part 1]

Recap

In the previous post we introduced the basic ingredients of the agent-environment interaction: states, actions, rewards, and the reward hypothesis.

We remarked that states and rewards are environment-related random variables: the agent has no way to interfere with the reward mechanism or to modify the state transition that results from one of its actions.
Actions are the only domain entirely under the agent's responsibility - specifying the probability distribution of $A_t$ conditioned on all the possible values of $S_t, \, A_{t-1}, \, \dots, S_{1}$ for every $t\in\mathbb{N}$ is exactly equivalent to a full specification of the agent's behaviour - and we shall take a closer look at this in the present post.

In other words, we are going to formally introduce the concept of policy.

A bit of notation

The history of our system up to time $t$, denoted $h_t$, is nothing else than the sequence of states and actions observed so far, i.e. $$\begin{equation} h_t:=(s_1, a_1, s_2, a_2, \dots, a_{t-1}, s_t) \end{equation}$$ where $s_i\in\mathcal{S}$ for all $i\in\{1,\dots, t\}$ and $a_i\in\mathcal{A}(s_i)$ for all $i\in\{1,\dots, t-1\}$.
We will use $H_t$ (capital $H$) to denote history as a random variable.

To simplify we shall assume, from now on, that the action space does not depend on the current state of the system - it does not make a substantial difference, but it allows the use of a cleaner notation.
This implies that $h_t$ belongs to $$\begin{align} \mathcal{H}_1 &:= \mathcal{S} \\ \mathcal{H}_t &:= \mathcal{S}\times \mathcal{A} \times \dots \times \mathcal{S} = \mathcal{S}\times \underbrace{(\mathcal{A}\times\mathcal{S})\times\dots\times(\mathcal{A}\times\mathcal{S})}_{t-1\text{ times}} \quad\quad \text{if }t\in\{2, 3, \dots\} \end{align}$$ or, using a little bit of recursion, $$\begin{equation}\begin{split} \mathcal{H}_1 &= \mathcal{S} \\ \mathcal{H}_t &=\mathcal{H}_{t-1}\times \mathcal{A}\times \mathcal{S} \quad\quad \text{if }t\in\{2, 3, \dots\} \end{split}\end{equation}$$

We can thus denote the history of our system up to time $t$ as $$\begin{equation} h_t = (h_{t-1}, a_{t-1}, s_t) \end{equation}$$ This will turn out to be a useful notation in upcoming computations.
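To make the notation concrete, here is a tiny Python sketch (the state and action labels are made up) that stores a history as a flat tuple and extends it exactly as in the recursion above:

```python
# A history h_t is just a flat tuple (s_1, a_1, s_2, ..., a_{t-1}, s_t).
# The state/action names below are purely illustrative.

def extend_history(h_prev, action, next_state):
    """Build h_t = (h_{t-1}, a_{t-1}, s_t) out of the previous history."""
    return h_prev + (action, next_state)

h1 = ("s_start",)                                   # h_1 = (s_1)
h2 = extend_history(h1, "a_left", "s_corridor")     # h_2 = (s_1, a_1, s_2)
h3 = extend_history(h2, "a_right", "s_exit")        # h_3 = (s_1, a_1, s_2, a_2, s_3)
print(h3)  # ('s_start', 'a_left', 's_corridor', 'a_right', 's_exit')
```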

Decision rules

As we anticipated in the recap, we want to come up with a flexible way to specify (and study) $$\begin{equation} \mathbb{P}(A_t=a_t \,|\, S_t=s_t, A_{t-1}=a_{t-1}, \dots, S_1=s_1) \end{equation}$$ Using our brand-new history notation we can rewrite that expression as $$\begin{equation} \mathbb{P}(A_t=a_t \,|\, H_t = h_t) \end{equation}$$ which is nothing else than the probability of our agent performing action $a_t$ at time $t$, knowing what has happened so far in terms of states and actions.
We can formalize it as a function $d_t$ from $\mathcal{H}_t$ to $\mathcal{P}(\mathcal{A})$, the space of probability distributions over $\mathcal{A}$: $$\begin{equation} d_t : \mathcal{H}_t \to \mathcal{P}(\mathcal{A}) \end{equation}$$ We shall denote by $b_{d_t}(a\,|\,h_t)$ or $b_{d_t(h_t)}(a)$ the probability of taking action $a$ if our system history up to time $t$ is equal to $h_t$, i.e. $$\begin{equation} \mathbb{P}(A_t=a_t \,|\, H_t = h_t) = b_{d_t(h_t)}(a_t) \end{equation}$$ $d_t$ is usually called a decision rule because it models how the agent reacts according to its experience in the current simulation.
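In code, a decision rule is just a function taking a history and returning a probability distribution over actions. Here is a minimal Python sketch; the action space and the particular rule are invented for illustration:

```python
import random

ACTIONS = ["left", "right"]  # toy, state-independent action space (invented)

def d_t(history):
    """A toy decision rule d_t: map a history h_t to a distribution over ACTIONS.
    Here the agent prefers 'right' more and more as the history gets longer."""
    p_right = min(0.9, 0.5 + 0.1 * (len(history) // 2))
    return {"left": 1.0 - p_right, "right": p_right}

def sample_action(dist):
    """Draw an action a with probability b_{d_t(h_t)}(a)."""
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

h2 = ("s1", "left", "s2")
print(d_t(h2))                  # {'left': 0.4, 'right': 0.6}
print(sample_action(d_t(h2)))   # 'left' or 'right', drawn according to d_t(h2)
```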

Policies

A policy is the recipe underlying the behaviour of our agent - what it is programmed to do under the set of circumstances it is currently experiencing. In other words, how the agent is going to act at each time instant $t\in\{1,\dots\}$.
Decision rules provide a quick and painless way to formally define it: a policy $\pi$ is nothing more than a sequence of decision rules, i.e. $$\begin{equation} \pi = (d_1, d_2, d_3, \dots) \end{equation}$$ with $d_i:\mathcal{H}_i\to\mathcal{P}(\mathcal{A})$ for each $i\in\{1, 2,\dots\}$.
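Building on the toy decision rules sketched in the previous section, a policy can be represented in Python as an indexed family of such rules; everything below (rules, thresholds, labels) is invented and only meant to mirror the definition:

```python
import random

ACTIONS = ["left", "right"]

def uniform_rule(history):
    """A decision rule that ignores the history: choose uniformly at random."""
    return {a: 1.0 / len(ACTIONS) for a in ACTIONS}

def always_right(history):
    """A degenerate (deterministic) decision rule."""
    return {"left": 0.0, "right": 1.0}

# A policy pi = (d_1, d_2, d_3, ...): uniform for the first two steps, then always 'right'.
def pi(t):
    return uniform_rule if t <= 2 else always_right

def act(t, history):
    """Sample A_t according to b_{d_t(h_t)} under policy pi."""
    dist = pi(t)(history)
    actions, probs = zip(*dist.items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act(1, ("s1",)))                               # random at t = 1
print(act(3, ("s1", "left", "s2", "right", "s3")))   # always 'right' from t = 3 onwards
```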

Now that we have a proper mathematical definition of a policy, we can give a rigorous formulation of the reward hypothesis, which we have already stated in the previous post:

That all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (called reward).

In particular, we can shed some light on the expression "expected value of the cumulative sum of a received scalar signal".

Returns

How do we measure the performance of a policy?
In a Reinforcement Learning context, every action is followed by a reward (positive or negative). To measure the overall performance of a policy we need to combine all these rewards together - this is usually called return in the literature.

The most common reward criterion is the one mentioned in the reward hypothesis: the cumulative return.
If $R_{t+1}$ denotes the reward obtained after action $A_t$ we can define the cumulative return at time $t$, $G_t$, as $$\begin{equation} G_t := \sum_{k=t+1}^{T}R_k \end{equation}$$ where $T$ is the random variable denoting the end time of our simulation (e.g. the total number of moves in a chess game).

This definition is perfectly fine if $T$ is almost surely finite, but it becomes problematic for those simulations in which it makes sense to potentially have $T=+\infty$ - an exploration game, for example.
The issue can be promptly solved by introducing a discount rate $\gamma\in [0, 1)$, which can be used to define the discounted cumulative return: $$\begin{equation} G_t := \sum_{k=t+1}^{T}\gamma^{k-t-1}R_k \end{equation}$$ $\gamma\in[0,1)$ allows us to associate a larger weight to closer rewards, as well as to ensure that the overall return $G_t$ remains finite whenever the rewards are bounded: $$\begin{equation} |G_t| \leq \sum_{k=t+1}^{+\infty}\left|\gamma^{k-t-1}R_k\right| = \sum_{k=t+1}^{+\infty}\gamma^{k-t-1}\left|R_k\right| \leq M \sum_{k=t+1}^{+\infty}\gamma^{k-t-1} = \frac{M}{1-\gamma} < +\infty \end{equation}$$ assuming that there exists $M>0$ such that $|R_t|\leq M$ for all $t$.
We will implicitly assume that the undiscounted case $\gamma=1$ is used only when $T$ is almost surely finite and, conversely, that $\gamma<1$ whenever $T$ may be infinite.
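A quick numerical sketch of both definitions (the reward sequence is made up): with $\gamma=1$ the function below computes the plain cumulative return, while $\gamma<1$ progressively down-weights rewards that lie further in the future.

```python
def discounted_return(rewards, gamma=1.0):
    """G_t = sum_{k=t+1}^{T} gamma^(k-t-1) R_k, with rewards = [R_{t+1}, ..., R_T]."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [0.0, -1.0, 0.0, 10.0]        # hypothetical R_{t+1}, ..., R_T

print(discounted_return(rewards))        # cumulative return: 9.0
print(discounted_return(rewards, 0.9))   # discounted return: -0.9 + 10 * 0.9**3 ≈ 6.39
```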

State-value and history-value functions

Let $\pi = (d_1, d_2, \dots)$ be a policy, let $s_1\in\mathcal{S}$ be the initial state of our system, and let $G_1$ be the (discounted) return defined in the previous section.

Then "expected value of the (discounted) cumulative sum of a received scalar signal" is $$\begin{equation}\label{state-value} v_\pi^1(s_1) := \mathbb{E}_{\pi}\left[G_1:\middle|:S_1=s_1\right] \end{equation}$$ which is usually called state-value function for the policy $\pi$.
The fancy "E", $\mathbb{E}$, stands for the probabilistic expectation of a random variable - look it up if this is the first time you encounter it (link).
The $\pi$ subscript, instead, indicates that this expectation is computed under the hypothesis that the agent is following the policy $\pi$. What does this mean, mathematically?

Well, let's unroll the definition!

Expected reward

First of all, let's compute the expected value of the first reward, $R_2$, under policy $\pi$: $$\begin{equation} \mathbb{E}_{\pi}\left[R_2\,\middle|\,S_1=s_1\right] \end{equation}$$ The expectation of a random variable is nothing more than the sum of all its possible values weighted by their probability of being observed. Formally: $$\begin{equation}\label{expected_return} \mathbb{E}_{\pi}\left[R_2\,\middle|\,S_1=s_1\right] = \sum_{r\in\mathcal{R}} r\, \mathbb{P}(R_2=r\,|\, S_1=s_1) \end{equation}$$ where $\mathcal{R}$ is the set of possible rewards (which we assume to be finite or countable, in order to avoid integrals).

Using the law of total probability (link) we can decompose those probabilities with respect to the set of possible actions available to the agent ($\mathcal{A}$): $$\begin{equation} \mathbb{P}(R_2=r\,|\, S_1=s_1) = \sum_{a\in\mathcal{A}}\mathbb{P}(R_2=r\,|\,A_1=a, S_1=s_1) \, \mathbb{P}(A_1=a\,|\,S_1=s_1) \end{equation}$$ But $\mathbb{P}(A_1=a\,|\,S_1=s_1)$ is fully determined by our agent's first decision rule, $d_1$: $$\begin{equation} \mathbb{P}(A_1=a\,|\,S_1=s_1) = b_{d_1(s_1)}(a) \end{equation}$$ Thus, going back to equation \ref{expected_return}, we get: $$\begin{equation} \mathbb{E}_{\pi}\left[R_2\,\middle|\,S_1=s_1\right] = \sum_{r\in\mathcal{R}} r \sum_{a\in\mathcal{A}}\mathbb{P}(R_2=r\,|\,A_1=a, S_1=s_1) \, b_{d_1(s_1)}(a) \end{equation}$$ which, with a simple reordering, becomes $$\begin{equation} \mathbb{E}_{\pi}\left[R_2\,\middle|\,S_1=s_1\right] =\sum_{a\in\mathcal{A}} b_{d_1(s_1)}(a) \bigg(\sum_{r\in\mathcal{R}} r\, \mathbb{P}(R_2=r\,|\,A_1=a, S_1=s_1)\bigg) \end{equation}$$ Using again the definition of mathematical expectation we get $$\begin{equation}\label{agent-env} \mathbb{E}_{\pi}\left[R_2\,\middle|\,S_1=s_1\right] =\sum_{a\in\mathcal{A}} \underbrace{b_{d_1(s_1)}(a)}_{\text{agent}} \, \underbrace{\mathbb{E}[R_2\,|\,A_1=a, S_1=s_1]}_{\text{environment}} \end{equation}$$

You can clearly see the two different contributions in equation \ref{agent-env}: the policy determines the probability of choosing a particular action, while the environment is responsible for the expected reward awaiting the agent if it were to choose that specific course of action. In fact, there is no $\pi$ subscript on the last expectation!
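As a tiny numerical check of equation \ref{agent-env} (all numbers below are invented): the agent contributes the action probabilities, the environment contributes the conditional reward expectations, and the overall expectation is just their weighted sum.

```python
# E_pi[R_2 | S_1 = s_1] = sum_a  b_{d_1(s_1)}(a) * E[R_2 | A_1 = a, S_1 = s_1]

b_d1 = {"left": 0.3, "right": 0.7}                   # agent: b_{d_1(s_1)}(a)        (made up)
env_expected_reward = {"left": 1.0, "right": -2.0}   # environment: E[R_2 | a, s_1]  (made up)

expected_r2 = sum(b_d1[a] * env_expected_reward[a] for a in b_d1)
print(expected_r2)   # 0.3 * 1.0 + 0.7 * (-2.0) = -1.1 (up to floating-point rounding)
```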

Expected return

We can now derive an explicit formula for the expected discounted return under policy $\pi$.

First of all, expectation is linear - i.e. $$\begin{equation} \mathbb{E}[\alpha A+\beta B] = \alpha \mathbb{E}[A] + \beta \mathbb{E}[B] \end{equation}$$ where $A$ and $B$ are random variables with finite expectation and $\alpha$ and $\beta$ are real numbers.

Then: $$\begin{align} v_\pi^1(s_1) &= \mathbb{E}_{\pi}\left[G_1\,\middle|\,S_1=s_1\right]= \\ &=\mathbb{E}_{\pi}\left[R_2 + \gamma G_2\,\middle|\,S_1=s_1\right]= \\ &=\underbrace{\mathbb{E}_{\pi}\left[R_2\,\middle|\,S_1=s_1\right]}_{\triangle} +\gamma \, \underbrace{\mathbb{E}_\pi \left[G_2\,\middle|\,S_1=s_1\right]}_{\square}\label{decomposition} \end{align}$$

We have already derived a formula for $\triangle$ - let's focus on $\square$.
Using again the law of total probability and the definition of expected value we get: $$\begin{equation} \mathbb{E}_{\pi}\left[G_2\,\middle|\,S_1=s_1\right] = \sum_{g} g \sum_{a\in\mathcal{A}}\mathbb{P}(G_2=g\,|\,A_1=a, S_1=s_1) \, b_{d_1(s_1)}(a) \end{equation}$$ But we can do it again - this time decomposing with respect to $S_2$: $$\begin{align} \mathbb{P}(G_2=g\,|\,A_1=a, S_1=s_1) &= \sum_{s\in\mathcal{S}} \mathbb{P}(G_2=g\,|\,S_2=s, A_1=a, S_1=s_1) \, \mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1)= \\ &=\sum_{s\in\mathcal{S}} \mathbb{P}\big(G_2=g\,|\,H_2=(s_1, a, s)\big) \, \mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1) \end{align}$$ Putting everything together we get: $$\begin{align} \mathbb{E}_{\pi}\left[G_2\,\middle|\,S_1=s_1\right] &= \sum_{g} g \sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}} \mathbb{P}\big(G_2=g\,|\,H_2=(s_1, a, s)\big) \, b_{d_1(s_1)}(a)\, \mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1) = \\ &= \sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}}\bigg[ \sum_{g} g \, \mathbb{P}\big(G_2=g\,|\,H_2=(s_1, a, s)\big)\bigg] \, b_{d_1(s_1)}(a)\, \mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1)= \\ &=\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}} \underbrace{\mathbb{E}_\pi\big[G_2\,|\,H_2=(s_1, a, s)\big]}_{\bigstar} \, \underbrace{b_{d_1(s_1)}(a)}_{\text{agent}} \, \underbrace{\mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1)}_{\text{environment}} \end{align}$$

Let's take a step back and look at what we got here: $\bigstar$ is not that different from the state-value function we have defined in equation \ref{state-value}: the only difference is that we are computing the expected discounted cumulative return starting from $t=2$ instead of $t=1$, as well as conditioning on $H_2$ instead of $S_1=H_1$.

We can formalize this intuition by introducing the history-value function for policy $\pi$ at time $t$: $$\begin{equation} v^t_\pi(h_t):= \mathbb{E}_{\pi}\left[G_t\,\middle|\,H_t=h_t\right] \end{equation}$$

Plugging this new object into our previous computations we obtain: $$\begin{equation}\label{step-forward} \mathbb{E}_{\pi}\left[G_2\,\middle|\,S_1=s_1\right] =\sum_{a\in\mathcal{A}}\sum_{s\in\mathcal{S}} \underbrace{v_\pi^2\big((s_1, a, s)\big)}_{\bigstar} \, \underbrace{b_{d_1(s_1)}(a)}_{\text{agent}} \, \underbrace{\mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1)}_{\text{environment}} \end{equation}$$ Thanks to equations \ref{decomposition}, \ref{agent-env} and \ref{step-forward} we get our general formula for the expected cumulative discounted return under policy $\pi$: $$\begin{equation}\label{bellman} v_\pi^1(s_1) = \sum_{a\in\mathcal{A}} \underbrace{b_{d_1(s_1)}(a)}_{\text{agent}} \, \bigg( \underbrace{\mathbb{E}[R_2\,|\,A_1=a, S_1=s_1]}_{\text{environment}} + \gamma \, \sum_{s\in\mathcal{S}} v_\pi^2\big((s_1, a, s)\big) \, \underbrace{\mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1)}_{\text{environment}} \bigg) \end{equation}$$

This is a generalized version of what is usually called Bellman equation in the framework of Markov Decision Processes.

Bellman equation

Using the same techniques we have employed in the previous two subsections we can prove that the following equation holds for every $t\in\{1, 2,\dots\}$: $$\begin{equation} v_\pi^t(h_t) = \sum_{a\in\mathcal{A}} b_{d_t(h_t)}(a)\, \bigg( \mathbb{E}[R_{t+1}\,|\,A_t=a, H_t=h_t]+ \gamma \, \sum_{s\in\mathcal{S}} v_\pi^{t+1}\big((h_t, a, s)\big) \, \mathbb{P}(S_{t+1}=s\,|\, A_t=a, H_t=h_t)\bigg) \end{equation}$$
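For a finite horizon $T$ this recursion can be unrolled directly, working backwards from $t=T$ (where the value is zero, since no rewards remain). Below is a Python sketch under that assumption; the arguments `policy`, `expected_reward` and `transition_probs` are hypothetical stand-ins for the "agent" and "environment" quantities appearing in the equation, not objects defined in this post.

```python
def history_value(t, h, T, policy, states, expected_reward, transition_probs, gamma=0.9):
    """v_pi^t(h_t) obtained by unrolling the Bellman recursion up to a finite horizon T.

    policy[t]               -- the decision rule d_t: history -> {action: probability}
    expected_reward(h, a)   -- plays the role of E[R_{t+1} | A_t = a, H_t = h]
    transition_probs(h, a)  -- plays the role of {s: P(S_{t+1} = s | A_t = a, H_t = h)}
    """
    if t >= T:                               # v_pi^T(h_T) = 0: no rewards left to collect
        return 0.0
    value = 0.0
    for a, p_a in policy[t](h).items():      # agent: b_{d_t(h_t)}(a)
        if p_a == 0.0:
            continue
        trans = transition_probs(h, a)       # environment: distribution of S_{t+1}
        future = sum(
            trans.get(s, 0.0) * history_value(t + 1, h + (a, s), T, policy, states,
                                              expected_reward, transition_probs, gamma)
            for s in states
        )
        value += p_a * (expected_reward(h, a) + gamma * future)
    return value
```

Note how each recursive call extends the history by one action-state pair: the amount of work grows very quickly with the horizon, and for a potentially infinite $T$ the recursion never bottoms out at all - which is exactly the issue discussed below.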

You can clearly spot the issue here: to compute $v_\pi^1$ we need to know $v_\pi^2$, which in turn requires $v_\pi^3$... there is no end in sight, assuming that $T$ is potentially infinite.
Things would simplify significantly if we were to assume that the environment and the agent are Markovian and stationary - $S_t$ and $R_{t}$ depend only on $A_{t-1}$ and $S_{t-1}$, while the agent bases its choice of $A_t$ only on the value of $S_t$, without taking into account the current time instant $t$ - i.e. $\pi=(d_1, d_1, \dots)$.
With these further hypotheses it can be proved (and we will, in a later episode) that equation \ref{bellman} becomes: $$\begin{equation} v_\pi^1(s_1) = \sum_{a\in\mathcal{A}} b_{d_1(s_1)}(a)\, \bigg( \mathbb{E}[R_2\,|\,A_1=a, S_1=s_1] + \gamma \, \sum_{s\in\mathcal{S}} v_\pi^1(s) \, \mathbb{P}(S_2=s\,|\, A_1=a, S_1=s_1) \bigg) \end{equation}$$

A closed relation!

We can actually craft an algorithm to find $v_\pi^1$ and we can use its value to compare different policies.
A policy's value function, in fact, is the expected cumulative discounted return obtained by following that strategy: if $v_{\tau}^1(s) \geq v_{\pi}^1(s)$ for every $s\in\mathcal{S}$ we can conclude that $\tau$ is at least as good a policy as $\pi$.
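One simple such algorithm is plain fixed-point iteration on the closed relation above - essentially what the literature calls iterative policy evaluation. Here is a minimal Python sketch; the two-state environment, its numbers and the policy are all invented for illustration.

```python
# Iterative policy evaluation for a Markovian, stationary policy (toy example).
# P[s][a][s2] = P(S_{t+1}=s2 | A_t=a, S_t=s),  R[s][a] = E[R_{t+1} | A_t=a, S_t=s],
# b[s][a] = b_{d_1(s)}(a).  Every number below is made up.

states  = ["s0", "s1"]
actions = ["stay", "move"]

P = {"s0": {"stay": {"s0": 1.0, "s1": 0.0}, "move": {"s0": 0.2, "s1": 0.8}},
     "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "move": {"s0": 0.7, "s1": 0.3}}}
R = {"s0": {"stay": 0.0, "move": 1.0},
     "s1": {"stay": 2.0, "move": -1.0}}
b = {"s0": {"stay": 0.5, "move": 0.5},
     "s1": {"stay": 0.9, "move": 0.1}}

gamma = 0.9
v = {s: 0.0 for s in states}   # initial guess for v_pi

for _ in range(1000):          # apply the Bellman update until it (approximately) stops changing
    v = {s: sum(b[s][a] * (R[s][a] + gamma * sum(P[s][a][s2] * v[s2] for s2 in states))
                for a in actions)
         for s in states}

print(v)   # approximate value v_pi(s) of each state under the policy b
```

With two such value functions in hand, comparing policies reduces to checking the inequality above state by state.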
We might then ask ourselves if there exists an optimal policy - a policy whose expected cumulative discounted return is greater than or equal to the one associated with any other policy. Does it exist? Can we find it within the subset of Markovian stationary policies? Is there a practical algorithm to do so?

These, and many other related questions, will be the main topic of our next episode.

Reinforcement Learning series index