Stochastic Policy for Continuous Control

Example Continuous Action Space: A robotic arm

The action space \(\mathcal{A}\) is continuous:

\[ \mathcal{A} = [0^\circ, 360^\circ] \times [0^\circ, 180^\circ]. \]

Each action is a 2-dimensional vector.

Policy Network for Continuous Control

Univariate Normal Distribution

Assume there is a single degree of freedom, i.e., \(\mathcal{A} \subset \mathbb{R}\).
Let the mean \(\mu\) and the standard deviation \(\sigma\) be functions of the state \(s\).
Let the policy function be the PDF of the normal distribution:

\[ \pi(a \mid s) = \underset{\mathcal{N}(\mu, \sigma^2)}{\underbrace{\frac{1}{\sqrt{2\pi}\,\sigma} \cdot \exp\!\left( - \frac{(a - \mu)^2}{2\sigma^2} \right)}}. \]
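As a minimal sketch of the formula above, the density can be evaluated directly. The numeric values for the mean and std are hypothetical stand-ins for \(\mu(s)\) and \(\sigma(s)\) at some state; they are not taken from the notes.

```python
import math

def normal_pdf(a, mu, sigma):
    """Density of N(mu, sigma^2) at a, i.e. pi(a | s) once mu(s) and sigma(s) are fixed."""
    coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coef * math.exp(-((a - mu) ** 2) / (2.0 * sigma ** 2))

# Hypothetical outputs of mu(s) and sigma(s) for one state s (in degrees).
mu_s, sigma_s = 90.0, 15.0

density_at_mean = normal_pdf(90.0, mu_s, sigma_s)        # action equal to the mean
density_one_std_away = normal_pdf(105.0, mu_s, sigma_s)  # action one std above the mean
```

Actions near the mean receive a higher density, so they are sampled more often.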

Multivariate Normal Distribution

Let the number of degrees of freedom be \(d\), i.e., the action \(a\) is \(d\)-dimensional.
Let \(𝛍, 𝛔: \mathcal{S} \to \mathbb{R}^d\) be functions of the state \(s\).
Let \(\mu_i\) and \(\sigma_i\) be the \(i\)-th elements of \(𝛍(s)\) and \(𝛔(s)\), respectively.
Let the policy function be the PDF of a multivariate normal distribution with independent components (diagonal covariance):

\[ \pi(a \mid s) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \cdot \exp\!\left(- \frac{(a_i - \mu_i)^2}{2\sigma_i^2} \right). \]
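The product form can be sketched in a few lines: the joint density is the product of \(d\) univariate densities. The mean and std vectors below are hypothetical values for the 2-dimensional robotic-arm example, chosen only for illustration.

```python
import math

def diag_gaussian_pdf(a, mu, sigma):
    """pi(a | s) as a product of d independent univariate normal densities."""
    density = 1.0
    for a_i, mu_i, sigma_i in zip(a, mu, sigma):
        coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_i)
        density *= coef * math.exp(-((a_i - mu_i) ** 2) / (2.0 * sigma_i ** 2))
    return density

# Hypothetical values of mu(s) and sigma(s) for one state (d = 2, in degrees).
mu_s = [180.0, 90.0]
sigma_s = [20.0, 10.0]

# Density of the action that equals the mean in every coordinate.
p = diag_gaussian_pdf([180.0, 90.0], mu_s, sigma_s)
```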

But \(𝛍\) and \(𝛔\) (which are functions of \(s\)) are unknown, so we need function approximation:

Approximate the mean, \(𝛍(s)\), by a neural network, \(𝛍(s; 𝛉^{\mu})\).
Do not approximate the std, \(𝛔(s)\), directly by a neural network, \(𝛔(s; 𝛉^{\sigma})\); this works poorly in practice.
A better practice is to approximate the log variance, which may take any real value while \(\sigma_i\) stays positive:

\[ \rho_i = \ln \sigma_i^2, \quad \text{for } i = 1, \cdots, d. \]

Approximate \(𝛒\) by the neural network, \(𝛒(s; 𝛉^{\rho})\).

Structure:

\[ \text{state }s \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{→} & 𝛍(s; 𝛉^{\mu}) \\ \overset{\mathrm{Dense2}}{→} & 𝛒(s; 𝛉^{\rho}) \end{aligned} \right. \]
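A rough sketch of the two-headed structure, in plain Python without a deep learning library. It assumes the convolutional layers have already produced a feature vector, so only the two dense heads are modelled; `PolicyNetwork`, `dense`, `init_dense`, and the layer sizes are hypothetical names and values, not part of the notes.

```python
import random

def dense(weights, bias, x):
    """Affine layer: one output per row of `weights`."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def init_dense(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary choice for this sketch)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class PolicyNetwork:
    """Two heads on a shared feature vector: Dense1 -> mean, Dense2 -> log variance."""

    def __init__(self, feature_dim, action_dim, seed=0):
        rng = random.Random(seed)
        self.dense1 = init_dense(action_dim, feature_dim, rng)  # parameters theta^mu
        self.dense2 = init_dense(action_dim, feature_dim, rng)  # parameters theta^rho

    def forward(self, features):
        mu = dense(*self.dense1, features)   # mean, one entry per action dimension
        rho = dense(*self.dense2, features)  # log variance, unconstrained real values
        return mu, rho

net = PolicyNetwork(feature_dim=4, action_dim=2)
mu_hat, rho_hat = net.forward([0.5, -1.0, 0.25, 2.0])  # hypothetical feature vector
```

Both heads read the same feature vector, which mirrors the shared Conv stage in the diagram.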

Continuous Control

Observe state \(s\).
Compute mean and log variance using the neural network:

\[ \hat{𝛍} = 𝛍(s; 𝛉^{\mu}) \quad \text{and} \quad \hat{𝛒} = 𝛒(s; 𝛉^{\rho}). \]

Compute

\[ \hat{\sigma}_i^2 = \exp(\hat{\rho}_i), \quad \text{for all } i = 1, \cdots, d. \]

Randomly sample action \(a\) by

\[ a_i \sim \mathcal{N}(\hat{\mu}_i, \hat{\sigma}_i^2), \quad \text{for all } i = 1, \cdots, d. \]
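The sampling steps above can be sketched as follows. The values of `mu_hat` and `rho_hat` are hypothetical network outputs; in a real agent they would come from the policy network evaluated at the observed state.

```python
import math
import random

def sample_action(mu_hat, rho_hat, rng):
    """Draw a_i ~ N(mu_i, sigma_i^2) with sigma_i^2 = exp(rho_i), independently per dimension."""
    action = []
    for mu_i, rho_i in zip(mu_hat, rho_hat):
        sigma_i = math.sqrt(math.exp(rho_i))  # std recovered from the log variance
        action.append(rng.gauss(mu_i, sigma_i))
    return action

rng = random.Random(0)                      # fixed seed so the sketch is reproducible
mu_hat = [180.0, 90.0]                      # hypothetical predicted means (degrees)
rho_hat = [math.log(25.0), math.log(4.0)]   # hypothetical log variances (variances 25 and 4)

a = sample_action(mu_hat, rho_hat, rng)
```

A normal sample is unbounded, so it can fall outside \(\mathcal{A}\); the notes do not say how such actions are handled, and this sketch leaves them as sampled.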

Training Policy Network

Training the policy network involves two parts:
An auxiliary network (for computing the policy gradient).
A policy gradient method, either of:

REINFORCE,
Actor-Critic.

Auxiliary Network

The auxiliary network is used to compute the policy gradient, so it must be differentiable with respect to the network parameters.

The policy network is:

\[ \pi(a \mid s; 𝛉^{\mu}, 𝛉^{\rho}) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi}\,\sigma_i} \cdot \exp\!\left(- \frac{(a_i - \mu_i)^2}{2\sigma_i^2} \right). \]

The natural log of the policy is (using \(\sigma_i^2 = \exp(\rho_i)\)):

\[ \begin{aligned} \ln \pi(a \mid s; 𝛉^{\mu}, 𝛉^{\rho}) & = \sum_{i=1}^{d} \left[- \ln \sigma_i - \frac{(a_i - \mu_i)^2}{2\sigma_i^2} \right] + \text{const} \\ & = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right] + \text{const}. \end{aligned} \]

Let \(𝛉 = (𝛉^{\mu}, 𝛉^{\rho})\). Then:

\[ \ln \pi(\mathbf{a} \mid s; 𝛉) = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right] + \text{const}. \]

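Define \(f(s, \mathbf{a};𝛉) = \sum_{i=1}^{d} \left[ - \frac{\rho_i}{2} - \frac{(a_i - \mu_i)^2}{2 \cdot \exp(\rho_i)} \right]\). This \(f(s, \mathbf{a};𝛉)\) is the auxiliary network. It differs from \(\ln \pi(\mathbf{a} \mid s; 𝛉)\) only by a constant, so the two have the same gradient with respect to \(𝛉\).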
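A small numerical check of the relation between \(f\) and \(\ln \pi\): the sketch below evaluates both for hypothetical values of the action, mean, and log variance, and recovers the constant, which for a \(d\)-dimensional action is \(-\tfrac{d}{2}\ln(2\pi)\). The function names are illustrative.

```python
import math

def aux_f(a, mu, rho):
    """f = sum_i [ -rho_i / 2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    return sum(-0.5 * r - (x - m) ** 2 / (2.0 * math.exp(r))
               for x, m, r in zip(a, mu, rho))

def log_policy(a, mu, rho):
    """ln pi(a | s), computed term by term from the normal PDF."""
    total = 0.0
    for x, m, r in zip(a, mu, rho):
        sigma = math.exp(0.5 * r)  # sigma_i = exp(rho_i / 2)
        total += -math.log(math.sqrt(2.0 * math.pi) * sigma)
        total += -((x - m) ** 2) / (2.0 * sigma ** 2)
    return total

# Hypothetical action, mean, and log variance (d = 2).
a, mu, rho = [1.0, -0.5], [0.2, 0.1], [0.3, -1.2]

# The difference does not depend on a, mu, or rho.
const = log_policy(a, mu, rho) - aux_f(a, mu, rho)
```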

Structure:

\[ \begin{aligned} \text{state }s \overset{\mathrm{Conv}}{→} \mathrm{feature~vector} \left\{\begin{aligned} \overset{\mathrm{Dense1}}{→} & 𝛍(s; 𝛉^{\mu}) \\ \overset{\mathrm{Dense2}}{→} & 𝛒(s; 𝛉^{\rho}) \end{aligned} \right\} & \underset{↑}{f(s, \mathbf{a};𝛉)}\\ & \text{action }\mathbf{a} \end{aligned} \]

\(f\) depends on the parameters of the Conv and Dense layers, so its gradient reaches those parameters through backpropagation. The gradient, \(\dfrac{\partial f}{\partial 𝛉}\), can be computed automatically by PyTorch or TensorFlow.
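To make the first link of that chain concrete, the sketch below writes out the gradient of \(f\) with respect to the network outputs, \(\partial f / \partial \mu_i = (a_i - \mu_i) / \exp(\rho_i)\) and \(\partial f / \partial \rho_i = -\tfrac{1}{2} + (a_i - \mu_i)^2 / (2\exp(\rho_i))\), and compares it with central finite differences. An autodiff library would chain these through the Dense and Conv layers; that part is not shown. The input values and helper names are hypothetical.

```python
import math

def aux_f(a, mu, rho):
    """f = sum_i [ -rho_i / 2 - (a_i - mu_i)^2 / (2 * exp(rho_i)) ]."""
    return sum(-0.5 * r - (x - m) ** 2 / (2.0 * math.exp(r))
               for x, m, r in zip(a, mu, rho))

def aux_f_grad(a, mu, rho):
    """Analytic gradient of f with respect to the outputs mu and rho."""
    d_mu = [(x - m) / math.exp(r) for x, m, r in zip(a, mu, rho)]
    d_rho = [-0.5 + (x - m) ** 2 / (2.0 * math.exp(r)) for x, m, r in zip(a, mu, rho)]
    return d_mu, d_rho

def central_diff(fn, vec, i, eps=1e-6):
    """Numerical partial derivative of fn with respect to vec[i]."""
    up, down = list(vec), list(vec)
    up[i] += eps
    down[i] -= eps
    return (fn(up) - fn(down)) / (2.0 * eps)

# Hypothetical action, mean, and log variance (d = 2).
a, mu, rho = [1.0, -0.5], [0.2, 0.1], [0.3, -1.2]

d_mu, d_rho = aux_f_grad(a, mu, rho)
num_d_mu = [central_diff(lambda v: aux_f(a, v, rho), mu, i) for i in range(len(mu))]
num_d_rho = [central_diff(lambda v: aux_f(a, mu, v), rho, i) for i in range(len(rho))]
```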

We have built three neural networks:

\[ 𝛍(s; 𝛉^{\mu}), \quad 𝛒(s; 𝛉^{\rho}), \quad f(s, \mathbf{a}; 𝛉). \]

\(𝛍(s; 𝛉^{\mu})\) computes the mean.
\(𝛒(s; 𝛉^{\rho})\) computes the log variance.

(\(𝛍\) and \(𝛒\) are used for controlling the agent.)

The auxiliary network, \(f(s, \mathbf{a}; 𝛉)\), helps with training.
We will use \(\dfrac{\partial f}{\partial 𝛉}\) to compute the policy gradient.

Policy Gradient Methods

Stochastic policy gradient:

\[ \begin{aligned} & \mathbf{g}(\mathbf{a}) = \frac{\partial \ln \pi(\mathbf{a} \mid s; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, \mathbf{a}) \\ ⇒ & \mathbf{g}(\mathbf{a}) = \frac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot Q_\pi(s, \mathbf{a}). \end{aligned} \]

\(\dfrac{\partial f}{\partial 𝛉}\) can be computed automatically by PyTorch or TensorFlow, but \(Q_\pi(s, \mathbf{a})\) is unknown. There are two ways to approximate \(Q_\pi(s, \mathbf{a})\):

REINFORCE: approximates \(Q_\pi(s_t, \mathbf{a}_t)\) by the observed return:

\[ u_t = r_t + \gamma \cdot r_{t+1} + \gamma^2 \cdot r_{t+2} + \gamma^3 \cdot r_{t+3} + \cdots \]

Update the policy network by:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot u_t. \]
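A minimal sketch of the REINFORCE update, under a strong simplifying assumption: the policy is state-independent and its parameters are the mean and log variance themselves, so \(\partial f / \partial 𝛉\) is the analytic gradient of \(f\) with respect to \(𝛍\) and \(𝛒\). With a real policy network, autodiff would supply this gradient instead. The rewards, step size, and function names are hypothetical.

```python
import math

def discounted_returns(rewards, gamma):
    """u_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def aux_f_grad(a, mu, rho):
    """Gradient of f with respect to mu and rho."""
    d_mu = [(x - m) / math.exp(r) for x, m, r in zip(a, mu, rho)]
    d_rho = [-0.5 + (x - m) ** 2 / (2.0 * math.exp(r)) for x, m, r in zip(a, mu, rho)]
    return d_mu, d_rho

def reinforce_step(mu, rho, a, u_t, beta):
    """theta <- theta + beta * (df/dtheta) * u_t, with theta = (mu, rho) in this toy setting."""
    d_mu, d_rho = aux_f_grad(a, mu, rho)
    new_mu = [m + beta * g * u_t for m, g in zip(mu, d_mu)]
    new_rho = [r + beta * g * u_t for r, g in zip(rho, d_rho)]
    return new_mu, new_rho

u = discounted_returns([1.0, 0.0, 2.0], gamma=0.5)  # [1.5, 1.0, 2.0]
new_mu, new_rho = reinforce_step(mu=[0.0], rho=[0.0], a=[1.0], u_t=u[0], beta=0.1)
```

With a positive return, the mean moves toward the sampled action; a negative return would push it away.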

Actor-critic: approximates \(Q_\pi\) by a value network, \(q(s, \mathbf{a}; \mathbf{w})\).

Update the policy network by:

\[ 𝛉 \leftarrow 𝛉 + \beta \cdot \frac{\partial f(s, \mathbf{a}; 𝛉)}{\partial 𝛉} \cdot q(s, \mathbf{a}; \mathbf{w}). \]

Update the value network, \(q(s, \mathbf{a}; \mathbf{w})\), by TD learning.
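The notes do not spell out the TD step, so the following is only one possible sketch. It assumes a linear critic \(q(s, \mathbf{a}; \mathbf{w}) = \mathbf{w}^\top \phi(s, \mathbf{a})\) over a hypothetical feature vector \(\phi\), and a SARSA-style TD target \(r_t + \gamma \cdot q(s_{t+1}, \mathbf{a}_{t+1}; \mathbf{w})\); a real value network would be a neural network trained with autodiff.

```python
def q_value(w, phi):
    """Linear critic: q(s, a; w) = w . phi(s, a)."""
    return sum(w_j * p_j for w_j, p_j in zip(w, phi))

def td_update(w, phi, reward, phi_next, gamma, alpha):
    """One TD step on w, holding the TD target fixed."""
    td_target = reward + gamma * q_value(w, phi_next)
    td_error = q_value(w, phi) - td_target
    # Gradient of 0.5 * td_error^2 with respect to w is td_error * phi.
    new_w = [w_j - alpha * td_error * p_j for w_j, p_j in zip(w, phi)]
    return new_w, td_error

# Hypothetical critic weights and features for (s_t, a_t) and (s_{t+1}, a_{t+1}).
w = [0.5, -0.5]
phi, phi_next = [1.0, 0.0], [0.0, 1.0]

new_w, delta = td_update(w, phi, reward=1.0, phi_next=phi_next, gamma=0.9, alpha=0.1)
```

Here the critic underestimates the TD target, so the weight on the active feature is nudged upward.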
