PILCO is one of the most important Model-Based Reinforcement Learning algorithms and has drawn a lot of attention from researchers. It uses a Gaussian Process as the function approximator instead of the mainstream Neural Network. The loss is then back-propagated to update the parameters of the controller.
This algorithm was the PhD topic of its author, Marc Deisenroth, and there are several related papers that explain the details of the algorithm. I think PILCO offers a way to model the uncertainty of the world model by using covariance matrices.
I am writing this blog based on the original paper [Deisenroth, M., & Rasmussen, C. E. (2011)] and the GitHub repository https://github.com/nrontsis/PILCO.
For nearly all the letters and symbols, you can find corresponding variables in that repository.
I hope it helps with your research. Please bear with my English; I am trying to improve it.
Happy Hacking! ╭(●`∀´●)╯╰(●’◡’●)╮
Part 1. Gaussian Process.
Here comes PILCO, which uses a Gaussian Process (GP) as the world model. Before we get into PILCO, we need to understand the Gaussian Process first.
Point 1. Gaussian Process is a data-driven learning algorithm (function approximator)
Instead of learning by updating the parameters of a parametric function (like a neural network), a GP makes predictions based on how close an unknown point is to the known points. This closeness is measured by a kernel.
Another name for these methods is non-parametric learning methods.
Point 2. Gaussian Process generalizes to unseen points with a kernel
A kernel is a measure of how close two data points are. It is used as the covariance function of the Multivariate Gaussian distribution that models the data, i.e., it describes the covariance between the Gaussian process random variables.
Commonly used kernels (from the GPyTorch library):
For example, if the data consists of two points, the kernel (covariance) matrix should have shape (2, 2): one entry for every pair of points.
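To make this concrete, here is a minimal numpy sketch of an RBF (squared exponential) kernel matrix; the specific lengthscale/output-scale values and the three input points are just made-up examples:

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, sigma_f=1.0):
    # squared exponential kernel: k(x, x') = sigma_f^2 * exp(-||x - x'||^2 / (2 * lengthscale^2))
    sqdist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sqdist / lengthscale**2)

X = np.array([[1.0], [20.0], [100.0]])   # three known 1-D inputs
K = rbf_kernel(X, X)                     # (3, 3) covariance matrix: one entry per pair of points
```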
Point 3. Gaussian Process Prediction Intuition
In case you are confused, first things first: there is no explicit function between x and y. All a GP needs to predict a new data point is the distance from this new point to the other known points. Once it has such a distance measure, the predicted value is linked to the corresponding y values.
Intuition Example (Not how the GP works):
We know the following correspondences:
when x = 1, y = 5
when x = 20, y = 47
when x = 100, y = 207.
Now, we want to know what y* is when x* = 25.
First, we compute the distances in x space: (25 - 1 = 24), (25 - 20 = 5), (100 - 25 = 75). The total distance is (24 + 5 + 75 = 104).
When we predict, we want to say that since x* = 25 is closest to x = 20, y* is most likely close to 47. But we also want to take the other points into consideration. So one thing we can do is take a weighted sum of the known y values, where the weights are the inverted and normalized distances.
y* = 5 * (104 - 24)/(2 * 104) + 47 * (104 - 5)/(2 * 104) + 207 * (104 - 75)/(2 * 104) ≈ 53.15
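A tiny numpy check of this toy weighted-sum prediction (again, this is only the intuition, not how a real GP computes things):

```python
import numpy as np

x_known = np.array([1.0, 20.0, 100.0])
y_known = np.array([5.0, 47.0, 207.0])
x_star = 25.0

dist = np.abs(x_star - x_known)                    # distances in x space: 24, 5, 75
weights = (dist.sum() - dist) / (2 * dist.sum())   # inverted and normalized, they sum to 1
y_star = weights @ y_known                         # weighted sum of the known ys
print(y_star)                                      # ~53.15
```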
A Gaussian Process is a distribution over functions. The weights are decided by the kernel, and x only affects the kernel, i.e., the weights. The predicted y is essentially a weighted sum of the known y values; there is no explicit function linking x and y.
For the Gaussian Process, the formal justification is the Multivariate Gaussian Theorem (conditioning of a joint Gaussian).
Assume the joint distribution of the dependent variables (the known Y and the unknown Y*) is modelled by a Multivariate Gaussian Distribution:
where K is the kernel matrix between the Xs in the known data;
where K* is the kernel matrix between the Xs in the known data and the X*s in the unknown data;
where K** is the kernel matrix between the X*s in the unknown data.
The Y* we want to predict then follows a Gaussian Distribution with:
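In case the equation images are hard to read, my rendering of this standard result (conditioning a joint Gaussian; with observation noise, K becomes K + σ²n I) is:

```latex
\begin{bmatrix} Y \\ Y_* \end{bmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},
\begin{bmatrix} K & K_* \\ K_*^\top & K_{**} \end{bmatrix}\right),
\qquad
Y_* \mid Y \sim \mathcal{N}\!\left(K_*^\top K^{-1} Y,\;
K_{**} - K_*^\top K^{-1} K_* \right)
```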
Point 4. Hyperparameters and training of Gaussian Process.
As we can see from the above, there is no "training" in the usual sense. Training a Gaussian Process refers to updating the hyperparameters (lengthscale, variance, likelihood) via Maximum Likelihood Optimization, i.e., maximizing the marginal likelihood of the data.
For an RBF kernel:
Lengthscale: the “l” in the formula
Output scale (variance): the “σf” in the first part
Noise (likelihood): the “σn” in the second part; it controls the noise modelling.
Higher “l” values lead to smoother functions and therefore to coarser approximations of the training data. Lower “l” values make functions more wiggly with wide confidence intervals between training data points. “σf” controls the vertical variation of functions drawn from the GP. This can be seen by the wide confidence intervals outside the training data region in the right figure of the second row. “σn” represents the amount of noise in the training data. Higher “σn” values make more coarse approximations which avoids overfitting to noisy data. — http://krasserm.github.io/2018/03/19/gaussian-processes/
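As a rough sketch of what this "training" looks like, the snippet below fits the three hyperparameters by minimizing the negative log marginal likelihood with scipy. The kernel helper is the same one as in the Point 2 sketch, and the toy data are the three points from the intuition example; nothing here is from the PILCO repository:

```python
import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, lengthscale=1.0, sigma_f=1.0):   # same SE kernel as in the Point 2 sketch
    sqdist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sqdist / lengthscale**2)

def neg_log_marginal_likelihood(log_params, X, y):
    # optimize in log space so lengthscale, sigma_f, sigma_n stay positive
    lengthscale, sigma_f, sigma_n = np.exp(log_params)
    K = rbf_kernel(X, X, lengthscale, sigma_f) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha = K^{-1} y
    # 0.5 * y^T K^{-1} y + 0.5 * log|K| + (n/2) * log(2*pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

X = np.array([[1.0], [20.0], [100.0]])
y = np.array([5.0, 47.0, 207.0])
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), args=(X, y))
lengthscale, sigma_f, sigma_n = np.exp(res.x)            # the "trained" hyperparameters
```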
Point 5. Drawbacks
- The prediction involves the matrix inverse K^(-1) (in the term K* K^(-1)), which takes O(n³) to compute.
- Because it is data-driven and keeps all the training data, the learning is offline.
Part 2. PILCO
Now, let's get into PILCO.
In general, there are three parts involved in PILCO:
(1) A Gaussian Process world model.
(2) A loss function that computes how far away a state is from the target.
(3) A controller that proposes an action.
What we want is a controller that achieves high reward (or a low loss). That involves taking the derivative of the loss function with respect to the controller's parameters. Such a derivative can be computed automatically, once we link everything up, with mainstream off-the-shelf libraries like TensorFlow or PyTorch.
One thing worth noting in PILCO is that, instead of taking a specific value, the function approximator takes a distribution as input. In other words, the system processes distributions.
Note: the world model is different from the transition probability.
1. The current state “x_{t-1}” and action “u_{t-1}” are concatenated to form the input. The model “f” takes this input and predicts the next state “x_t”.
2. The next state follows a Gaussian Distribution, defined by its mean and covariance. The world model actually predicts the difference between the current state “x_{t-1}” and the next state “x_t”, given the action “u_{t-1}”. The covariance of the next-state prediction is provided by the world model; it indicates how sure the world model is about this prediction.
3. The kernel that the Gaussian Process uses is a squared exponential (SE) kernel. The “Λ” is a diagonal matrix of the squared lengthscales, so its inverse (the reciprocals) appears in the exponent. The α² is the signal variance.
There is no noise term here for now; it appears in Equation (7).
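My rendering of this SE kernel (Eq. (6) of the paper, as I read it; Λ collects the squared lengthscales and there is no noise term yet):

```latex
k(\tilde{x}, \tilde{x}') = \alpha^2 \exp\!\left(-\tfrac{1}{2}(\tilde{x} - \tilde{x}')^\top \Lambda^{-1} (\tilde{x} - \tilde{x}')\right),
\qquad \Lambda = \mathrm{diag}(\ell_1^2, \dots, \ell_D^2)
```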
4. In the paper, they also write out the Gaussian Process prediction, which is similar to what we had above. The lower-case k is the kernel vector between the training inputs and the point x we want to predict. The noise term is taken into consideration via “σε”.
Remember, β is defined as (K + σ²ε I)^(-1) y. It will show up in the code.
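A minimal numpy sketch of this prediction (the hyperparameter values are placeholders, and the names are mine, not the repository's):

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, sigma_f=1.0):    # same SE kernel helper as before
    sqdist = np.sum(X1**2, axis=1)[:, None] + np.sum(X2**2, axis=1)[None, :] - 2 * X1 @ X2.T
    return sigma_f**2 * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_predict(X, y, X_star, lengthscale=1.0, sigma_f=1.0, sigma_eps=0.1):
    K = rbf_kernel(X, X, lengthscale, sigma_f)            # K: kernel of the training inputs
    k_star = rbf_kernel(X, X_star, lengthscale, sigma_f)  # k: kernel between training and query inputs
    A = K + sigma_eps**2 * np.eye(len(X))
    beta = np.linalg.solve(A, y)                          # beta = (K + sigma_eps^2 I)^{-1} y
    mean = k_star.T @ beta                                # predictive mean
    cov = rbf_kernel(X_star, X_star, lengthscale, sigma_f) - k_star.T @ np.linalg.solve(A, k_star)
    return mean, cov
```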
5. Get the distribution of the input (the concatenated state and action).
In order to predict the next state x_t, the distribution of the concatenated input p(x_{t−1}, u_{t−1}) is needed. Assume that this distribution is Gaussian. The mean of the concatenated state-action is easy to compute; the covariance is the hard part. We need the covariances of both x and u (plus their cross-covariance), then assemble them into one big covariance matrix.
p(x_{t−1}, u_{t−1}) = N(μ̃_{t−1}, Σ̃_{t−1}), with Σ̃_{t−1} the joint covariance of x_{t−1} and u_{t−1}
However, the distribution of u_{t-1} depends on the controller p(u|x, θ). To get the distribution of u_{t-1}, they integrate out the influence of the state x. The distribution of the state x itself does not depend on anything else, so we can just use it.
p(u) = ∫ p(x, u) dx = ∫ p(u|x, θ) p(x) dx
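A small sketch of what "assemble them into one big covariance matrix" means; all the dimensions and values below are hypothetical placeholders:

```python
import numpy as np

mu_x, Sigma_x = np.zeros(2), np.eye(2)         # distribution of the state x_{t-1} (2-D here)
mu_u, Sigma_u = np.zeros(1), 0.1 * np.eye(1)   # distribution of the action u_{t-1} (1-D here)
Sigma_xu = np.zeros((2, 1))                    # cross-covariance cov[x_{t-1}, u_{t-1}]

# joint Gaussian over the concatenated input (x_{t-1}, u_{t-1})
mu_tilde = np.concatenate([mu_x, mu_u])
Sigma_tilde = np.block([[Sigma_x,    Sigma_xu],
                        [Sigma_xu.T, Sigma_u]])
```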
6. With the world model “f”, we can compute the transition probability p(f(x_{t−1})|x_{t−1}).
The distribution of the state difference ∆_t comes from a weighted sum (an integral) over the transition probability: we integrate out the random variable x_{t−1} to get p(∆_t).
Computing the exact predictive distribution p(∆t) in Eq. (9) is analytically intractable. Therefore, we approximate p(∆t) by a Gaussian using exact moment matching.
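My reading of the intractable integral in Eq. (9), which the moment matching then approximates by a Gaussian:

```latex
p(\Delta_t) = \int p\big(f(\tilde{x}_{t-1}) \mid \tilde{x}_{t-1}\big)\, p(\tilde{x}_{t-1})\, \mathrm{d}\tilde{x}_{t-1},
\qquad \tilde{x}_{t-1} := (x_{t-1}, u_{t-1})
```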
7. From the difference to the real next state.
For the mean, we just add the difference. For the covariance, the cross-covariance between the state and the difference is computed first and then added. The way of computing this cross-covariance is provided in his other work [Deisenroth, M. P. (2010)].
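In my notation (following the update described in the paper, as I read it):

```latex
\mu_t = \mu_{t-1} + \mu_\Delta, \qquad
\Sigma_t = \Sigma_{t-1} + \Sigma_\Delta
         + \operatorname{cov}[x_{t-1}, \Delta_t]
         + \operatorname{cov}[\Delta_t, x_{t-1}]
```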
8. Mean prediction
Now it's time to have a look at how to implement it. As Equation (9) shows, the current state & action x̃_{t-1} are integrated out (Equation (13)). The “mf” is the expectation of the GP prediction, which refers back to the β around Equation (7). The “qa” here plays the role of the k* transpose in Equation (7).
Here, each dimension “a” of the output has its own GP. For example, in MountainCar, the state is position and velocity, and the action is concatenated to them; the prediction is the next position and the next velocity (one GP each).
Further, for each entry of “qa” (i.e., of k*), substitute the kernel formula (6); we end up with a formula going directly from the input “x” to the output “q”, with no k left.
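Below is my numpy sketch of this per-dimension mean prediction, based on my reading of the moment-matching mean formula. β_a, Λ_a, α²_a and the input distribution (μ̃, Σ̃) are assumed to be given; the names are mine, not the repository's:

```python
import numpy as np

def predicted_mean_dim_a(X, beta_a, Lambda_a, alpha2_a, mu_tilde, Sigma_tilde):
    # mu_Delta_a = beta_a^T q_a, where each q_ai is the SE kernel k_a(x_i, .)
    # averaged over the Gaussian input distribution N(mu_tilde, Sigma_tilde):
    #   q_ai = alpha_a^2 * |Sigma_tilde @ Lambda_a^{-1} + I|^{-1/2}
    #          * exp(-1/2 * (x_i - mu_tilde)^T (Sigma_tilde + Lambda_a)^{-1} (x_i - mu_tilde))
    D = X.shape[1]
    det_term = 1.0 / np.sqrt(np.linalg.det(Sigma_tilde @ np.linalg.inv(Lambda_a) + np.eye(D)))
    S_inv = np.linalg.inv(Sigma_tilde + Lambda_a)
    nu = X - mu_tilde                               # differences between training inputs and the input mean
    quad = np.einsum('nd,de,ne->n', nu, S_inv, nu)  # quadratic form, one value per training point
    q_a = alpha2_a * det_term * np.exp(-0.5 * quad)
    return beta_a @ q_a                             # predictive mean for output dimension a
```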
9. Covariance Prediction
For the covariance matrix prediction, they distinguish between diagonal elements “aa” and off-diagonal elements “ab”, where “a” and “b” index the rows and columns of the target covariance matrix.
The law of iterated variances (from Wikipedia):
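In symbols, the standard statement is:

```latex
\operatorname{var}[Y] = \mathbb{E}\big[\operatorname{var}[Y \mid X]\big] + \operatorname{var}\big[\mathbb{E}[Y \mid X]\big]
```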
We distinguish between diagonal elements and off-diagonal elements: Using the law of iterated variances, we obtain for target dimensions a, b = 1, . . . , D
The off-diagonal terms do not contain the additional term E_{x̃_{t−1}}[cov_f[∆_a, ∆_b | x̃_{t−1}]] because of the conditional independence assumption of the GP models: different target dimensions do not covary for a given x̃_{t−1}.
The computation of μ_a and μ_b comes from the previous section, the mean prediction.
Since the diagonal and off-diagonal entries share a common part, we first compute the terms that are common to both, using the law of iterated expectation (a different law from the one above: E(X) = E(E(X|Y))):
The law of iterated expectation (see the end for the reference):
The probability of rain tomorrow can be calculated by considering both cases (it rained today / it did not rain today) in turn. To use specific numbers, suppose that:
The probability of rain today is 70%
If it rains today, it will rain tomorrow with probability 30%
If it does not rain today, it will rain tomorrow with probability 90%
In this case, the probability of rain tomorrow is
0.7 * 0.3 + 0.3 * 0.9 = 0.21 + 0.27 = 0.48 = 48%
The math here is a little bit annoying. I need to do more Googling.
╮(๑•́ ₃•̀๑)╭
Things should be clearer now. We would like to do the same thing as we did for the mean prediction: replace the mean “m” with β, and then “k (or q)” with the kernel formula.
Then, for each entry of Q, use the kernel formula (6) to get the detailed computation.
Now we have σ_ab. The rest of Equation (17) can be done by:
10. Loss function (Policy Evaluation)
The loss function is defined by the distance between the current state and the target state.
The policy is evaluated with the expected return. To do this, set the controller to evaluation mode, then propagate the state distributions forward through the world model. The expected return is computed as the sum, over the horizon, of the cost of each state weighted by the state distribution (i.e., the expected cost).
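My reading of the objective (the paper uses a saturating cost whose width I denote σ_c; x_target is the goal state):

```latex
J(\theta) = \sum_{t=0}^{T} \mathbb{E}_{x_t}\big[c(x_t)\big], \qquad
c(x) = 1 - \exp\!\left(-\frac{\lVert x - x_{\mathrm{target}}\rVert^2}{2\sigma_c^2}\right)
```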
11. Policy Improvement
After linking the world model, the controller, and the loss function together, the parameters of the controller can be updated by propagating the loss back to the controller parameters via the Chain Rule.
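To illustrate just this "backprop through the linked system into the controller" idea, here is a toy, self-contained sketch (not PILCO itself: the dynamics are a hand-written deterministic stand-in for the GP world model, and the cost is a plain squared distance):

```python
import torch

theta = torch.tensor(0.0, requires_grad=True)      # controller parameter
optimizer = torch.optim.Adam([theta], lr=0.1)

def controller(x):
    return theta * x                                # a linear policy u = theta * x

def world_model(x, u):
    return 0.9 * x + 0.5 * u                        # stand-in differentiable dynamics

target = torch.tensor(0.0)
for step in range(200):
    x = torch.tensor(1.0)                           # initial state
    loss = torch.tensor(0.0)
    for t in range(10):                             # roll out over the horizon
        u = controller(x)
        x = world_model(x, u)
        loss = loss + (x - target) ** 2             # accumulate cost along the rollout
    optimizer.zero_grad()
    loss.backward()                                 # chain rule back into theta
    optimizer.step()
```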
References:
Deisenroth, M., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11) (pp. 465–472).
Leclercq, F. (2018). Bayesian optimization for likelihood-free cosmological inference. Physical Review D, 98(6), 063511.
Ebden, M. (2008). Gaussian Processes for Regression: A Quick Introduction. [online] Available at: http://www.robots.ox.ac.uk/~mebden/reports/GPtutorial.pdf
Law of Iterated Expectation | Brilliant Math & Science Wiki. (2020). Brilliant. https://brilliant.org/wiki/law-of-iterated-expectation/
Deisenroth, M. P. (2010). Efficient Reinforcement Learning using Gaussian Processes. KIT Scientific Publishing. ISBN 978-3-86644-569-7.