______

|_ _| \| | _ \ ____| \/ /

| || | | | |_) )____|) (

|___|_|\|/___|_/\_\

+dwb===================dwb+

Essays by サイバー環状線


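A First Look at Deep Reinforcement Learning

Terminology

In probability theory, lowercase letters usually denote observed values and uppercase letters denote random variables. \(p(x)\) is the probability density function (PDF) in the continuous case, or the probability mass function in the discrete case. For \(X \in \chi\):

For a continuous distribution, the expectation is \(\mathbb{E}(f(X)) = \int_{\chi} p(x) \cdot f(x) dx\).
For a discrete distribution, the expectation is \(\mathbb{E}(f(X)) = \sum_{x \in \chi} p(x) \cdot f(x)\).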
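The two expectation formulas can be checked numerically. The following is a minimal sketch, not part of the original notes: the fair die, the Uniform(0, 1) density, and the helper names `expectation_discrete` and `expectation_continuous` are all assumptions chosen for illustration, and the integral is only approximated by a midpoint Riemann sum.

```python
def expectation_discrete(pmf, f):
    """Sum p(x) * f(x) over the support of a discrete distribution."""
    return sum(p * f(x) for x, p in pmf.items())


def expectation_continuous(pdf, f, lo, hi, n=10_000):
    """Approximate the integral of p(x) * f(x) over [lo, hi] with a midpoint Riemann sum."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx  # midpoint of the i-th sub-interval
        total += pdf(x) * f(x)
    return total * dx


# Discrete example (assumed): a fair six-sided die, p(x) = 1/6.
die = {x: 1 / 6 for x in range(1, 7)}
mean_die = expectation_discrete(die, lambda x: x)  # mathematically 3.5


# Continuous example (assumed): X ~ Uniform(0, 1), f(x) = x^2, so E[f(X)] = 1/3.
def uniform_pdf(x):
    return 1.0


mean_sq = expectation_continuous(uniform_pdf, lambda x: x * x, 0.0, 1.0)
```

The only difference between the two helpers is whether the weighted values are summed over a finite support or integrated over an interval.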

State \(s\)

The current state of the environment.

Action \(a\)

The action that is taken.

Agent

The entity that takes actions; here this generally means the AI being controlled.

Policy \(\pi\)

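Based on the observed state, the policy decides which action the agent takes.

The policy function \(\pi: (s, a) \mapsto [0, 1]\):

\[ \pi(a | s) = \mathbb{P}(A = a | S = s) \]

\(\mathbb{P}(A = a | S = s)\) is the conditional probability of \(A = a\) given \(S = s\) (a probability density when the actions are continuous).

\(\pi\) can be deterministic (in a given state, one particular action is always taken) or stochastic (in a given state, each action is taken with some probability).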
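To make the deterministic and stochastic cases concrete, here is a minimal sketch that stores \(\pi(a | s)\) as a lookup table and samples from it. The state names, action names, and probabilities are hypothetical, as is the helper `sample_action`; none of them come from the original notes.

```python
import random

# Hypothetical policy table: policy[s][a] = pi(a | s). Each row sums to 1.
policy = {
    "low": {"left": 0.2, "right": 0.8},   # stochastic in state "low"
    "high": {"left": 1.0, "right": 0.0},  # deterministic in state "high"
}


def sample_action(policy, state, rng):
    """Draw one action a ~ pi(. | state)."""
    actions = list(policy[state])
    weights = [policy[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is reproducible
samples = [sample_action(policy, "low", rng) for _ in range(10_000)]
freq_right = samples.count("right") / len(samples)  # should be close to 0.8
```

A deterministic policy is just the special case where one action in a row has probability 1.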

Reward \(R\)

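The reward given to the agent for taking an action. A reward can be positive or negative. The larger the reward, the more strongly it encourages the agent to take the corresponding action; a small or negative reward discourages it.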

State Transition

\[ \mathrm{oldState} \overset{\mathrm{action}}{\rightarrow} \mathrm{newState} \]

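This process is called a state transition. Because the transition may be random, it is expressed as a conditional probability: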

\[ p(s' | s, a) = \mathbb{P}(S' = s' | S = s, A = a) \]

The basic loop of an AI playing a game

Observe state \(s_1\);
Take action \(a_1\);
Observe the new state \(s_2\) and receive reward \(r_1\);
Take action \(a_2\);
...
The game ends. The trajectory, a sequence of (state, action, reward) triples, is:

\[ s_1, a_1, r_1, s_2, a_2, r_2, \dots , s_T, a_T, r_T \]
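The loop above, together with a random state transition \(p(s' | s, a)\), can be sketched on a toy problem. Everything in this example is an assumption made for illustration: a five-cell corridor with states 0 to 4, a move that succeeds with probability 0.8, rewards of +1 and -1 at the two ends, and a uniformly random policy.

```python
import random


def step(state, action, rng):
    """Toy transition p(s' | s, a): the move succeeds with probability 0.8,
    otherwise the state is unchanged. Reaching cell 4 or cell 0 ends the game."""
    next_state = state + action if rng.random() < 0.8 else state
    if next_state == 4:
        reward = 1.0
    elif next_state == 0:
        reward = -1.0
    else:
        reward = 0.0
    done = next_state in (0, 4)
    return next_state, reward, done


def rollout(rng, max_steps=1000):
    """Play one game and record the trajectory as (state, action, reward) triples."""
    state, trajectory = 2, []
    for _ in range(max_steps):
        action = rng.choice([-1, 1])  # uniformly random policy
        next_state, reward, done = step(state, action, rng)
        trajectory.append((state, action, reward))
        state = next_state
        if done:
            break
    return trajectory


trajectory = rollout(random.Random(0))
```

Each recorded triple pairs the state that was observed with the action taken in it and the reward that followed, matching the ordering \(s_1, a_1, r_1, s_2, \dots\) above.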

Rewards and Returns

Returns (aka Cumulative Future Reward)

Denote the return at time \(t\) by \(U_t\):

\[ U_t = R_t + R_{t + 1} + R_{t + 2} + R_{t + 3} + \dots \]

However, a reward at a future time \(t + n\) is usually less valuable than one at time \(t\), so future rewards need to be discounted.

Discounted Returns (aka Cumulative Discounted Future Reward)

\[ U_t = R_t + \gamma R_{t + 1} + \gamma^2 R_{t + 2} + \gamma^3 R_{t + 3} + \dots \]

Here \(\gamma \in [0, 1]\) is the discount rate. It follows that, given \(s_t\), \(U_t\) depends on the random variables \(A_t, A_{t + 1}, A_{t + 2}, A_{t + 3}, \dots \) and \(S_{t + 1}, S_{t + 2}, \dots \) .
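Once a game has finished and the rewards are known numbers, the discounted return can be computed directly. The sketch below assumes a finite episode and uses the recursion \(U_t = R_t + \gamma U_{t + 1}\), which follows from the definition; the function names and the reward list are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    u = 0.0
    for r in reversed(rewards):  # work backwards: U_t = R_t + gamma * U_{t+1}
        u = r + gamma * u
    return u


def all_returns(rewards, gamma):
    """Return the list [U_0, U_1, ..., U_{T-1}] for one finished episode."""
    returns, u = [], 0.0
    for r in reversed(rewards):
        u = r + gamma * u
        returns.append(u)
    return returns[::-1]


rewards = [1.0, 0.0, 2.0]                  # hypothetical rewards of a 3-step episode
u0 = discounted_return(rewards, 0.5)       # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
undiscounted = discounted_return(rewards, 1.0)  # gamma = 1 gives the plain sum, 3.0
```

With \(\gamma = 1\) the discounted return reduces to the undiscounted return, and with a smaller \(\gamma\) later rewards count for less.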

Value Functions

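A value function assigns a score to something.

Action-value Function for a policy \(\pi \)

The action-value function scores the action \(a_t\):

\[ Q_{\pi }(s_t, a_t) = \mathbb{E}(U_t | S_t = s_t, A_t = a_t) \]

\(Q_{\pi }\) depends on \(\pi , s_t, a_t\). It does not depend on \(A_{t + 1}, A_{t + 2}, A_{t + 3}, \dots \) or \(S_{t + 1}, S_{t + 2}, \dots \), because the expectation averages over these future random variables.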

Optimal Action-value Function

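The optimal action-value function is used to judge which action is best in state \(s_t\):

\[ Q^*(s_t, a_t) = \underset{\pi }{\mathrm{max}}Q_{\pi }(s_t, a_t) \]

\(Q^*\) does not depend on \(\pi\), because the maximum is taken over all policies.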

State-value Function

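The state-value function can be used to judge how good the current situation of the game is:

\[ V_{\pi }(s_t) = \mathbb{E}_A(Q_{\pi }(s_t, A)) \]

Here the expectation is taken over \(A \sim \pi (\cdot | s_t)\). \(\mathbb{E}_S(V_{\pi }(S))\) can be used to evaluate how good the policy \(\pi \) is.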
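For a discrete action space, the expectation in \(V_{\pi }\) is a probability-weighted sum of action values. The sketch below assumes a single state with three actions; the numbers in `q_values` and `pi` and the helper `state_value` are hypothetical and only illustrate the formula.

```python
# Hypothetical values for one fixed state s_t with three actions.
q_values = {"left": 1.0, "stay": 0.0, "right": 4.0}  # Q_pi(s_t, a)
pi = {"left": 0.5, "stay": 0.25, "right": 0.25}       # pi(a | s_t), sums to 1


def state_value(q_values, pi):
    """V_pi(s_t) = sum over a of pi(a | s_t) * Q_pi(s_t, a)."""
    return sum(pi[a] * q_values[a] for a in q_values)


v = state_value(q_values, pi)  # 0.5 * 1 + 0.25 * 0 + 0.25 * 4 = 1.5
```

Shifting more probability onto the high-value action raises \(V_{\pi }(s_t)\), which is why the state value reflects both the situation and the policy.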

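How the AI controls the agent

Two approaches:

Suppose a good policy \(\pi (a | s)\) is already available

Observe state \(s_t\);
Sample an action at random: \(a_t \sim \pi (\cdot | s_t)\).

Suppose the Optimal Action-value Function \(Q^*(s, a)\) is already available

Observe state \(s_t\);
Choose the action that maximizes \(Q^*\): \(a_t = \mathrm{argmax}_a Q^*(s_t, a)\).

The task of reinforcement learning is to learn one of the two: \(\pi (a | s)\) or \(Q^*(s, a)\).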
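The two approaches differ only in how the action is picked once the state has been observed. A minimal sketch, assuming the policy and \(Q^*\) for the observed state are given as plain dictionaries; the names `act_from_policy`, `act_from_q`, and the numbers are illustrative, not from the original notes.

```python
import random


def act_from_policy(pi_s, rng):
    """Approach 1: sample a_t ~ pi(. | s_t) from a dict of action probabilities."""
    actions = list(pi_s)
    return rng.choices(actions, weights=[pi_s[a] for a in actions], k=1)[0]


def act_from_q(q_s):
    """Approach 2: a_t = argmax_a Q*(s_t, a) from a dict of action values."""
    return max(q_s, key=q_s.get)


pi_s = {"left": 0.1, "right": 0.9}   # hypothetical pi(. | s_t)
q_s = {"left": -0.3, "right": 1.2}   # hypothetical Q*(s_t, .)

sampled = act_from_policy(pi_s, random.Random(0))  # random, "right" most of the time
greedy = act_from_q(q_s)                           # always the highest-valued action
```

The policy-based choice is random by construction, while the value-based choice is deterministic for a given \(Q^*\).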

Gym 入门

Gym is an open source Python library for developing and comparing reinforcement learning algorithms by providing a standard API to communicate between learning algorithms and environments, as well as a standard set of environments compliant with that API. ... To install the base Gym library, use pip install gym. This does not include dependencies for all families of environments (there's a massive number, and some can be problematic to install on certain systems). You can install these dependencies for one family like pip install gym[atari] or use pip install gym[all] to install all dependencies. We support Python 3.7, 3.8, 3.9 and 3.10 on Linux and macOS. We will accept PRs related to Windows, but do not officially support it.
Github@openai/gym

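Code example:

# Written against the classic Gym API (gym < 0.26):
# reset() returns only the state and step() returns four values.
import gym
env = gym.make('CartPole-v0') # the argument names the problem to solve, here Cart Pole
state = env.reset()

for t in range(100):
    env.render() # opens a window showing the Cart Pole problem
    print(state)

    action = env.action_space.sample() # one random action
    state, reward, done, info = env.step(action)

    if done: # done is True when the episode ends (the game is won or lost)
        print('Finished')
        break

env.close() # release the window and other resources

Here env supplies the state and the reward.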

Written on 2026-3-31
