Stochastic Policy for Continuous Control
Example Continuous Action Space: A robotic arm
The action space \(\mathcal{A}\) is continuous:
Actions are 2-dim vectors.
Policy Network for Continuous Control
Univariate Normal Distribution
Assume there is one degree of freedom, i.e., \(\mathcal{A} \subset \mathbb{R}\).
Let \(\mu\) (mean) and \(\sigma\) (std) be functions of \(s\).
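Let the policy function be the PDF of the normal distribution:
\(\pi(a \mid s) = \dfrac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\dfrac{(a - \mu)^2}{2\sigma^2} \right)\), where \(\pi\) under the square root is the constant \(3.14\ldots\), not the policy.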
Multivariate Normal Distribution
Let the number of degrees of freedom be \(d\), i.e., action \(\mathbf{a}\) is \(d\)-dim.
Let \(𝛍, 𝛔: \mathcal{S} \mapsto \mathbb{R}^d\) be functions of \(s\).
Let \(\mu_i\) and \(\sigma_i\) be the \(i\)-th elements of \(𝛍(s)\) and \(𝛔(s)\), respectively.
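Let the policy function be the PDF of the multivariate normal distribution with independent components:
\(\pi(\mathbf{a} \mid s) = \prod_{i=1}^{d} \dfrac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\dfrac{(a_i - \mu_i)^2}{2\sigma_i^2} \right)\).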
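As a rough illustration of the product-form density described above (independent normal components with means \(\mu_i\) and stds \(\sigma_i\)), the following sketch evaluates the policy density at a given action. The function name `gaussian_policy_density` and the plain lists standing in for \(𝛍(s)\) and \(𝛔(s)\) are assumptions made for this example, not part of the original notes.

```python
import math

def gaussian_policy_density(action, mean, std):
    """Density of a Gaussian policy with independent components.

    `mean` and `std` stand in for mu(s) and sigma(s); here they are
    plain lists of floats rather than neural network outputs.
    """
    density = 1.0
    for a_i, mu_i, sigma_i in zip(action, mean, std):
        norm = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_i)
        density *= norm * math.exp(-((a_i - mu_i) ** 2) / (2.0 * sigma_i ** 2))
    return density

# Evaluate a 2-dim action at the mean with unit std in both dimensions.
p = gaussian_policy_density([0.0, 0.0], [0.0, 0.0], [1.0, 1.0])
```

At the mean with unit std, each dimension contributes a factor of \(1 / \sqrt{2\pi}\), so the density factorizes over the \(d\) dimensions.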
But \(𝛍\) and \(𝛔\) (which are functions of \(s\)) are unknown, so we need function approximation:
Approximate the mean, \(𝛍(s)\), by the neural network, \(𝛍(s; 𝛉^{\mu})\).
Approximate the std, \(𝛔(s)\), by the neural network, \(𝛔(s; 𝛉^{\sigma})\). (Approximating the std directly works poorly in practice.) A better practice is to approximate the log variance, \(\rho_i = \ln \sigma_i^2\), for \(i = 1, \ldots, d\):
Approximate \(𝛒\) by the neural network, \(𝛒(s; 𝛉^{\rho})\).
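The notes do not say why approximating the std directly works poorly. One plausible reason, stated here as an assumption, is that a std must be positive while a network output is unconstrained. The sketch below shows that any real-valued log variance maps back to a valid positive std; the helper name `std_from_log_var` is hypothetical.

```python
import math

def std_from_log_var(rho):
    """Recover sigma from rho = ln(sigma^2), i.e. sigma = exp(rho / 2)."""
    return math.exp(0.5 * rho)

# Any real-valued rho, negative or positive, gives a strictly positive std.
stds = [std_from_log_var(rho) for rho in (-20.0, -1.0, 0.0, 3.0)]
```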
Structure:
Continuous Control
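Observe state \(s\).
Compute the mean and log variance using the neural networks: \(\hat{𝛍} = 𝛍(s; 𝛉^{\mu})\) and \(\hat{𝛒} = 𝛒(s; 𝛉^{\rho})\).
Compute the variance \(\hat{\sigma}_i^2 = \exp(\hat{\rho}_i)\), for \(i = 1, \ldots, d\).
Randomly sample the action \(\mathbf{a}\) by \(a_i \sim \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2)\), for \(i = 1, \ldots, d\).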
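The control steps above can be sketched as follows, assuming the network outputs are already available as plain lists. The values `mean_hat` and `log_var_hat` are made-up placeholders for the network outputs, and `sample_action` is a hypothetical helper name.

```python
import math
import random

def sample_action(mean, log_var, rng=random):
    """Sample a_i ~ N(mu_i, sigma_i^2) with sigma_i^2 = exp(rho_i)."""
    action = []
    for mu_i, rho_i in zip(mean, log_var):
        sigma_i = math.sqrt(math.exp(rho_i))  # std recovered from the log variance
        action.append(rng.gauss(mu_i, sigma_i))
    return action

rng = random.Random(0)       # fixed seed so the sketch is reproducible
mean_hat = [0.5, -1.0]       # placeholder for the network output mu(s)
log_var_hat = [0.0, -2.0]    # placeholder for the network output rho(s)
action = sample_action(mean_hat, log_var_hat, rng)
```

Each coordinate is sampled independently, which matches the product form of the policy density.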
Training Policy Network
Auxiliary network (for computing policy gradient).
Policy gradient methods:
REINFORCE,
Actor-Critic.
Auxiliary Network
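The auxiliary network is used to compute the policy gradient, so it must be differentiable with respect to the parameters.
The policy network is:
\(\pi(\mathbf{a} \mid s; 𝛉^{\mu}, 𝛉^{\rho}) = \prod_{i=1}^{d} \dfrac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left( -\dfrac{(a_i - \mu_i)^2}{2\sigma_i^2} \right)\).
The natural log of the policy network is (with \(\sigma_i^2 = \exp(\rho_i)\)):
\(\ln \pi(\mathbf{a} \mid s; 𝛉^{\mu}, 𝛉^{\rho}) = \sum_{i=1}^{d} \left[ -\ln \sigma_i - \dfrac{(a_i - \mu_i)^2}{2\sigma_i^2} \right] + \text{const}\), where \(\text{const} = -\frac{d}{2} \ln(2\pi)\) does not depend on the parameters.
Let \(𝛉 = (𝛉^{\mu}, 𝛉^{\rho})\). Substituting \(\ln \sigma_i = \rho_i / 2\) gives:
\(\ln \pi(\mathbf{a} \mid s; 𝛉) = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right] + \text{const}\).
Let \(f(s, \mathbf{a};𝛉) = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right]\). \(f(s, \mathbf{a};𝛉)\) is the auxiliary network; it equals \(\ln \pi\) up to an additive constant.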
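A small numerical check can make the relation between \(f\) and the log density concrete. The sketch below computes \(f\) from given values of \(\mu_i\) and \(\rho_i\) and compares it with the log of the Gaussian density computed directly; the two should differ only by the constant \(-\frac{d}{2} \ln(2\pi)\). The function names and the test values are assumptions chosen for illustration.

```python
import math

def aux_f(action, mean, log_var):
    """f = sum_i [ -rho_i / 2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    return sum(
        -rho_i / 2.0 - (a_i - mu_i) ** 2 / (2.0 * math.exp(rho_i))
        for a_i, mu_i, rho_i in zip(action, mean, log_var)
    )

def log_density(action, mean, log_var):
    """Log of the Gaussian density with independent components, from the PDF."""
    total = 0.0
    for a_i, mu_i, rho_i in zip(action, mean, log_var):
        var_i = math.exp(rho_i)
        total += -0.5 * math.log(2.0 * math.pi * var_i)
        total -= (a_i - mu_i) ** 2 / (2.0 * var_i)
    return total

# Arbitrary illustrative values with d = 2.
a, mu, rho = [0.3, -0.2], [0.0, 0.1], [0.5, -1.0]
gap = log_density(a, mu, rho) - aux_f(a, mu, rho)  # expected: -(d / 2) * ln(2 * pi)
```

Because the gap is a constant, \(f\) and \(\ln \pi\) have the same gradient with respect to \(𝛉\).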
Structure:
\(f\) depends on the parameters of the convolutional and dense layers through \(𝛍\) and \(𝛒\), so its gradient is obtained by backpropagation through those layers. The gradient, \(\dfrac{\partial f}{\partial 𝛉}\), can be automatically computed by PyTorch or TensorFlow.
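To see what automatic differentiation computes at the output of the networks, it may help to write the partial derivatives of \(f\) with respect to \(\mu_i\) and \(\rho_i\) by hand: \(\partial f / \partial \mu_i = (a_i - \mu_i) / \exp(\rho_i)\) and \(\partial f / \partial \rho_i = -\frac{1}{2} + (a_i - \mu_i)^2 / (2 \exp(\rho_i))\). The sketch below checks these against central finite differences. It covers only the derivatives with respect to the network outputs, not the full chain rule back to \(𝛉\), and all names and values are assumptions made for this example.

```python
import math

def aux_f(action, mean, log_var):
    """f = sum_i [ -rho_i / 2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    return sum(-r / 2.0 - (a - m) ** 2 / (2.0 * math.exp(r))
               for a, m, r in zip(action, mean, log_var))

def aux_f_grad(action, mean, log_var):
    """Analytic partial derivatives of f with respect to mu_i and rho_i."""
    d_mean = [(a - m) / math.exp(r) for a, m, r in zip(action, mean, log_var)]
    d_log_var = [-0.5 + (a - m) ** 2 / (2.0 * math.exp(r))
                 for a, m, r in zip(action, mean, log_var)]
    return d_mean, d_log_var

def central_difference(fn, x, i, eps=1e-6):
    """Numerical derivative of fn with respect to the i-th entry of x."""
    hi, lo = list(x), list(x)
    hi[i] += eps
    lo[i] -= eps
    return (fn(hi) - fn(lo)) / (2.0 * eps)

a, mu, rho = [0.3, -0.2], [0.0, 0.1], [0.5, -1.0]
d_mean, d_log_var = aux_f_grad(a, mu, rho)
num_d_mean = [central_difference(lambda m: aux_f(a, m, rho), mu, i) for i in range(2)]
num_d_log_var = [central_difference(lambda r: aux_f(a, mu, r), rho, i) for i in range(2)]
```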
We have built three neural networks:
\(𝛍(s; 𝛉^{\mu})\) computes the mean.
\(𝛒(s; 𝛉^{\rho})\) computes the log variance.
(\(𝛍\) and \(𝛒\) are for controlling the agent.)
Auxiliary network, \(f(s, \mathbf{a}; 𝛉)\), helps with training.
We will use \(\dfrac{\partial f}{\partial 𝛉}\) for computing policy gradient.
Policy Gradient Methods
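Stochastic policy gradient: \(\mathbf{g}(\mathbf{a}) = \dfrac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, \mathbf{a})\). Here \(\dfrac{\partial f}{\partial 𝛉}\) equals \(\dfrac{\partial \ln \pi(\mathbf{a} \mid s; 𝛉)}{\partial 𝛉}\), because \(f\) and \(\ln \pi\) differ only by an additive constant.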
\(\dfrac{\partial f}{\partial 𝛉}\) can be automatically computed by PyTorch or TensorFlow, but \(Q_\pi(s, \mathbf{a})\) is unknown. Approximate \(Q_\pi(s, \mathbf{a})\):
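REINFORCE: approximates \(Q_\pi(s_t, \mathbf{a}_t)\) by the observed return, \(u_t\).
Update policy network by: \(𝛉 \leftarrow 𝛉 + \beta \cdot \dfrac{\partial f(s_t, \mathbf{a}_t; 𝛉)}{\partial 𝛉} \cdot u_t\), where \(\beta\) is the learning rate.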
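Actor-critic: approximates \(Q_\pi\) by the value network, \(q(s, \mathbf{a}; \mathbf{w})\).
Update policy network by: \(𝛉 \leftarrow 𝛉 + \beta \cdot \dfrac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot q(s, \mathbf{a}; \mathbf{w})\), where \(\beta\) is the learning rate.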
Update value network, \(q(s, \mathbf{a}; \mathbf{w})\), by TD learning.
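The REINFORCE-style update can be sketched on a deliberately simplified policy. The assumptions here are not from the notes: the action is 1-dim, the mean and log variance are free parameters \(𝛉 = (\mu, \rho)\) that do not depend on the state (so no neural network or backpropagation is involved), the return is discounted with a factor `gamma`, and `beta` is a learning rate. All function names are hypothetical.

```python
import math

def discounted_returns(rewards, gamma):
    """u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... (assumed form)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def grad_f(action, mu, rho):
    """Gradient of f with respect to theta = (mu, rho) for a 1-dim action."""
    inv_var = math.exp(-rho)
    d_mu = (action - mu) * inv_var
    d_rho = -0.5 + 0.5 * (action - mu) ** 2 * inv_var
    return d_mu, d_rho

def reinforce_update(theta, actions, rewards, gamma=0.99, beta=0.01):
    """Apply theta <- theta + beta * (df/dtheta) * u_t for each step t."""
    mu, rho = theta
    returns = discounted_returns(rewards, gamma)
    for a_t, u_t in zip(actions, returns):
        d_mu, d_rho = grad_f(a_t, mu, rho)
        mu += beta * d_mu * u_t
        rho += beta * d_rho * u_t
    return mu, rho

# One short made-up episode: three actions and the rewards they received.
theta = reinforce_update((0.0, 0.0), [0.4, -0.1, 0.7], [1.0, 0.0, 2.0])
```

An actor-critic variant of this sketch would replace `u_t` with the value network's estimate \(q(s_t, \mathbf{a}_t; \mathbf{w})\) and update \(\mathbf{w}\) separately by TD learning.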