Stochastic Policy for Continuous Control

Example Continuous Action Space: A robotic arm

  • The action space \(\mathcal{A}\) is continuous:

\[ \mathcal{A} = [0^\circ, 360^\circ] \times [0^\circ, 180^\circ]. \]
  • Actions are 2-dim vectors.

Policy Network for Continuous Control

Univariate Normal Distribution

  • Assume there is a single degree of freedom, i.e., \(\mathcal{A} \subset \mathbb{R}\).

  • Let \(\mu\) (mean) and \(\sigma\) (std) be functions of \(s\).

  • Let the policy function be the PDF of the normal distribution:

\[ \pi(a \mid s) = \underset{\mathcal{N}(\mu, \sigma^2)}{\underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} \cdot \exp\!\left( - \frac{(a - \mu)^2}{2\sigma^2} \right)}}. \]
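The density above can be evaluated directly. The following is a minimal sketch; the name `normal_pdf` and the numeric values standing in for \(\mu(s)\) and \(\sigma(s)\) are illustrative assumptions, not part of the notes.

```python
import math

def normal_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at the action a."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical values standing in for mu(s) and sigma(s) at some state s.
mu_s, sigma_s = 90.0, 15.0
density_at_mean = normal_pdf(90.0, mu_s, sigma_s)
density_off_mean = normal_pdf(120.0, mu_s, sigma_s)
```

Actions close to the mean get a higher density, so they are sampled more often; \(\sigma\) controls how far the policy explores around \(\mu\).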

Multivariate Normal Distribution

  • Let the number of degrees of freedom be \(d\), i.e., the action \(a\) is \(d\)-dimensional.

  • Let \(𝛍, 𝛔: \mathcal{S} \to \mathbb{R}^d\) be functions of \(s\).

  • Let \(\mu_i\) and \(\sigma_i\) be the \(i\)-th elements of \(𝛍(s)\) and \(𝛔(s)\), respectively.

  • Let the policy function be the PDF of a multivariate normal distribution with independent components:

\[ \pi(a \mid s) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \cdot \exp\!\left(- \frac{(a_i - \mu_i)^2}{2\sigma_i^2} \right). \]
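Because the components are independent, the joint density is just the product of \(d\) univariate densities. A minimal sketch, assuming made-up values for \(𝛍(s)\) and \(𝛔(s)\) and a hypothetical helper name `diag_normal_pdf`:

```python
import math

def diag_normal_pdf(a, mu, sigma):
    """Product of d univariate normal densities (independent components)."""
    density = 1.0
    for a_i, mu_i, sigma_i in zip(a, mu, sigma):
        coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_i)
        density *= coeff * math.exp(-((a_i - mu_i) ** 2) / (2.0 * sigma_i ** 2))
    return density

# Hypothetical 2-dim action for the robotic arm, with made-up mu(s) and sigma(s).
action = [100.0, 45.0]
mu_s = [90.0, 60.0]
sigma_s = [15.0, 10.0]
pi_a = diag_normal_pdf(action, mu_s, sigma_s)
```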

But \(𝛍\) and \(𝛔\) (which are functions of \(s\)) are unknown, so we need function approximation:

  • Approximate the mean, \(𝛍(s)\), by the neural network, \(𝛍(s; 𝛉^{\mu})\).

  • Approximate the std, \(𝛔(s)\), by the neural network, \(𝛔(s; 𝛉^{\sigma})\). (Approximating the std directly works poorly in practice.)

  • A better practice is to approximate the log variance:

\[ \rho_i = \ln \sigma_i^2, \quad \text{for } i = 1, \cdots, d. \]
  • Approximate \(𝛒\) by the neural network, \(𝛒(s; 𝛉^{\rho})\).
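One way to see the appeal of the log variance: any real number is a valid \(\rho_i\), and the std recovered from it is always positive. A small sketch, with `std_from_log_var` and the sample values being illustrative assumptions:

```python
import math

def std_from_log_var(rho):
    """Recover sigma_i from rho_i = ln(sigma_i^2), i.e. sigma_i = exp(rho_i / 2)."""
    return [math.exp(0.5 * rho_i) for rho_i in rho]

# Hypothetical raw network outputs; negative values are perfectly valid.
rho_hat = [-3.0, 0.0, 2.0]
sigma_hat = std_from_log_var(rho_hat)
```

A network output layer without an activation can therefore produce \(𝛒\) directly, with no constraint to enforce.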

Structure:

\[ \text{state }s \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{→} & 𝛍(s; 𝛉^{\mu}) \\ \overset{\mathrm{Dense2}}{→} & 𝛒(s; 𝛉^{\rho}) \end{aligned} \right. \]
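The two-headed structure can be sketched in plain Python. This is only an illustration under stated assumptions: a dense layer with `tanh` stands in for the Conv feature extractor, the layer sizes and initialization are arbitrary, and the names `dense`, `init_layer`, and `policy_forward` are hypothetical.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer: returns W x + b, with W given as a list of rows."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def policy_forward(state, params):
    """Shared features feed two separate heads: the mean and the log variance."""
    # Assumption: a dense + tanh layer replaces the Conv feature extractor.
    feature = [math.tanh(h) for h in dense(state, *params["shared"])]
    mu = dense(feature, *params["dense1"])    # head 1: mean
    rho = dense(feature, *params["dense2"])   # head 2: log variance
    return mu, rho

rng = random.Random(0)
state_dim, feature_dim, action_dim = 4, 8, 2   # hypothetical sizes
params = {
    "shared": init_layer(feature_dim, state_dim, rng),
    "dense1": init_layer(action_dim, feature_dim, rng),
    "dense2": init_layer(action_dim, feature_dim, rng),
}
mu_hat, rho_hat = policy_forward([0.5, -1.0, 0.2, 0.0], params)
```

Both heads read the same feature vector, so the feature extractor is trained by gradients coming from both \(𝛍\) and \(𝛒\).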

Continuous Control

  • Observe state \(s\).

  • Compute mean and log variance using the neural network:

\[ \hat{𝛍} = 𝛍(s; 𝛉^{\mu}) \quad \text{and} \quad \hat{𝛒} = 𝛒(s; 𝛉^{\rho}). \]
  • Compute

\[ \hat{\sigma}_i^2 = \exp(\hat{\rho}_i), \quad \text{for all } i = 1, \cdots, d. \]
  • Randomly sample action \(a\) by

\[ a_i \sim \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2), \quad \text{for all } i = 1, \cdots, d. \]
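The control steps above can be sketched as follows, assuming the network outputs \(\hat{𝛍}\) and \(\hat{𝛒}\) are already available. The function name `sample_action` and the numeric values are illustrative assumptions.

```python
import math
import random

def sample_action(mu_hat, rho_hat, rng):
    """Draw a_i ~ N(mu_i, exp(rho_i)) independently for each dimension."""
    action = []
    for mu_i, rho_i in zip(mu_hat, rho_hat):
        sigma_i = math.sqrt(math.exp(rho_i))   # sigma_i^2 = exp(rho_i)
        action.append(rng.gauss(mu_i, sigma_i))
    return action

rng = random.Random(0)
mu_hat = [90.0, 60.0]                                   # hypothetical network outputs
rho_hat = [math.log(15.0 ** 2), math.log(10.0 ** 2)]    # log variances
a = sample_action(mu_hat, rho_hat, rng)
```

Note that a normal distribution has unbounded support, so a sample can fall outside a bounded action space such as the robotic arm's; how to handle that case is not covered in these notes.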

Training Policy Network

  1. Auxiliary network (for computing policy gradient).

  2. Policy gradient methods:

  • REINFORCE,

  • Actor-Critic.

Auxiliary Network

The auxiliary network is used to compute the policy gradient, so it must be differentiable with respect to the network parameters.

  • The policy network is:

\[ \pi(a \mid s; 𝛉^{\mu}, 𝛉^{\rho}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \cdot \exp\!\left(- \frac{(a_i - \mu_i)^2}{2\sigma_i^2} \right). \]
  • The natural log of the policy network is (using \(\sigma_i^2 = \exp(\rho_i)\)):

\[ \begin{aligned} \ln \pi(a \mid s; 𝛉^{\mu}, 𝛉^{\rho}) & = \sum_{i=1}^{d} \left[- \ln \sigma_i - \frac{(a_i - \mu_i)^2}{2\sigma_i^2} \right] + \text{const} \\ & = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right] + \text{const}. \end{aligned} \]

Let \(𝛉 = (𝛉^{\mu}, 𝛉^{\rho})\). Then:

\[ \ln \pi(\mathbf{a} \mid s; 𝛉) = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right] + \text{const}. \]

Let \(f(s, \mathbf{a};𝛉) = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right]\), so that \(\ln \pi(\mathbf{a} \mid s; 𝛉) = f(s, \mathbf{a};𝛉) + \text{const}\). We call \(f(s, \mathbf{a};𝛉)\) the auxiliary network.
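As a sanity check, \(f\) can be compared numerically with the exact log-density. For the product of \(d\) normal densities used here, the dropped constant is \(-\frac{d}{2}\ln(2\pi)\). The sketch below uses made-up values for \(\mathbf{a}\), \(𝛍\), and \(𝛒\); the helper names `aux_f` and `log_pi` are hypothetical.

```python
import math

def aux_f(a, mu, rho):
    """f = sum_i [ -rho_i / 2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    return sum(-0.5 * rho_i - (a_i - mu_i) ** 2 / (2.0 * math.exp(rho_i))
               for a_i, mu_i, rho_i in zip(a, mu, rho))

def log_pi(a, mu, rho):
    """Exact log-density of the policy, including the constant term."""
    total = 0.0
    for a_i, mu_i, rho_i in zip(a, mu, rho):
        sigma_i = math.exp(0.5 * rho_i)
        total -= math.log(math.sqrt(2.0 * math.pi) * sigma_i)
        total -= (a_i - mu_i) ** 2 / (2.0 * sigma_i ** 2)
    return total

a, mu, rho = [1.0, -0.5], [0.8, 0.1], [0.3, -1.2]   # made-up values
const = -0.5 * len(a) * math.log(2.0 * math.pi)
gap = log_pi(a, mu, rho) - aux_f(a, mu, rho)        # equals const up to rounding
```

Since the constant does not depend on \(𝛉\), \(f\) and \(\ln \pi\) have the same gradient with respect to \(𝛉\).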

Structure:

\[ \begin{aligned} \text{state }s \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{→} & 𝛍(s; 𝛉^{\mu}) \\ \overset{\mathrm{Dense2}}{→} & 𝛒(s; 𝛉^{\rho}) \end{aligned} \right\} & \underset{↑}{f(s, \mathbf{a};𝛉)}\\ & \text{action }\mathbf{a} \end{aligned} \]

\(f\) depends on the parameters of the Conv and Dense layers through \(𝛍\) and \(𝛒\), so backpropagation carries its gradient back to all of them. The gradient, \(\dfrac{\partial f}{\partial 𝛉}\), can be automatically computed by PyTorch or TensorFlow.
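To make the first step of that backpropagation concrete, the partial derivatives of \(f\) with respect to the network outputs are \(\partial f / \partial \mu_i = (a_i - \mu_i) / \exp(\rho_i)\) and \(\partial f / \partial \rho_i = -\tfrac{1}{2} + (a_i - \mu_i)^2 / (2 \exp(\rho_i))\). The sketch below checks these against finite differences. It covers only the derivatives with respect to \(𝛍\) and \(𝛒\); the chain rule through the layers down to \(𝛉\) is what an autodiff framework would handle. All names and values are illustrative assumptions.

```python
import math

def aux_f(a, mu, rho):
    """f = sum_i [ -rho_i / 2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    return sum(-0.5 * rho_i - (a_i - mu_i) ** 2 / (2.0 * math.exp(rho_i))
               for a_i, mu_i, rho_i in zip(a, mu, rho))

def grad_f(a, mu, rho):
    """Analytic partial derivatives of f w.r.t. the network outputs mu and rho."""
    d_mu = [(a_i - mu_i) / math.exp(rho_i)
            for a_i, mu_i, rho_i in zip(a, mu, rho)]
    d_rho = [-0.5 + (a_i - mu_i) ** 2 / (2.0 * math.exp(rho_i))
             for a_i, mu_i, rho_i in zip(a, mu, rho)]
    return d_mu, d_rho

def numeric_grad(fn, x, eps=1e-6):
    """Central finite differences of fn at the point x (a list of floats)."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((fn(up) - fn(down)) / (2.0 * eps))
    return grad

a, mu, rho = [1.0, -0.5], [0.8, 0.1], [0.3, -1.2]   # made-up values
d_mu, d_rho = grad_f(a, mu, rho)
num_mu = numeric_grad(lambda m: aux_f(a, m, rho), mu)
num_rho = numeric_grad(lambda r: aux_f(a, mu, r), rho)
```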

  • We have built three neural networks:

\[ 𝛍(s; 𝛉^{\mu}), \quad 𝛒(s; 𝛉^{\rho}), \quad f(s, \mathbf{a}; 𝛉). \]
  • \(𝛍(s; 𝛉^{\mu})\) computes the mean.

  • \(𝛒(s; 𝛉^{\rho})\) computes the log variance.

(\(𝛍\) and \(𝛒\) are used to control the agent.)

  • Auxiliary network, \(f(s, \mathbf{a}; 𝛉)\), helps with training.

  • We will use \(\dfrac{\partial f}{\partial 𝛉}\) to compute the policy gradient.

Policy Gradient Methods

Stochastic policy gradient:

\[ \begin{aligned} & \mathbf{g}(\mathbf{a}) = \frac{\partial \ln \pi(\mathbf{a} \mid s; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, \mathbf{a}) \\ ⇒ & \mathbf{g}(\mathbf{a}) = \frac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, \mathbf{a}). \end{aligned} \]

\(\dfrac{\partial f}{\partial 𝛉}\) can be automatically computed by PyTorch or TensorFlow, but \(Q_\pi(s, \mathbf{a})\) is unknown. There are two ways to approximate \(Q_\pi(s, \mathbf{a})\):

  1. REINFORCE: approximates \(Q_\pi(s_t, \mathbf{a}_t)\) by the observed return:

\[ u_t = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \gamma^3 \cdot r_{t+3} + \cdots \]
  • Update policy network by:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial f(s_t, \mathbf{a}_t; 𝛉)}{\partial 𝛉} \cdot u_t. \]
  2. Actor-critic: approximates \(Q_\pi\) by the value network, \(q(s, \mathbf{a}; \mathbf{w})\).

  • Update policy network by:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot q(s, \mathbf{a}; \mathbf{w}). \]
  • Update the value network, \(q(s, \mathbf{a}; \mathbf{w})\), by TD learning.
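The REINFORCE variant of the updates above can be sketched in a few lines. To keep the example self-contained, it makes a strong simplifying assumption: the policy ignores the state and the parameters are the mean and log variance themselves, \(𝛉 = (\mu, \rho)\), for a 1-dim action, so \(\partial f / \partial 𝛉\) is available in closed form. The episode data, step size, and function names are made up for illustration.

```python
import math

def discounted_returns(rewards, gamma):
    """u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def grad_f(a, mu, rho):
    """Partial derivatives of f w.r.t. mu and rho for a 1-dim action."""
    d_mu = (a - mu) / math.exp(rho)
    d_rho = -0.5 + (a - mu) ** 2 / (2.0 * math.exp(rho))
    return d_mu, d_rho

def reinforce_step(theta, a_t, u_t, beta):
    """theta <- theta + beta * (df/dtheta) * u_t, with theta = (mu, rho)."""
    mu, rho = theta
    d_mu, d_rho = grad_f(a_t, mu, rho)
    return (mu + beta * d_mu * u_t, rho + beta * d_rho * u_t)

rewards = [1.0, 0.0, 2.0]      # made-up episode rewards
actions = [0.4, -0.1, 0.9]     # made-up actions taken at t = 0, 1, 2
returns = discounted_returns(rewards, gamma=0.5)

theta = (0.0, 0.0)             # hypothetical initial (mu, rho)
for a_t, u_t in zip(actions, returns):
    theta = reinforce_step(theta, a_t, u_t, beta=0.01)
```

The actor-critic update has the same form, with \(u_t\) replaced by the value network's estimate \(q(s, \mathbf{a}; \mathbf{w})\).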