Multi-Agent Reinforcement Learning

Setting

Multi-agent systems commonly fall under one of the following 4 relationship settings:

  1. Fully cooperative: Agents collaborate to optimize a common return.

  2. Fully competitive: One agent's gain is the other agent's loss.

  3. Mixed cooperative & competitive: The setting has both cooperative and competitive relationships.

  4. Self-interested: Agents are self-interested. Their rewards may or may not conflict.

Terminology

State, Action, State Transition

  • There are \(n\) agents.

  • Let \(S\) be the state.

  • Let \(A^i\) be the \(i\)-th agent's action.

  • State transition:

\[ p(s' ∣ s, a^1, ⋯, a^n) = ℙ(S' = s' ∣ S = s, A^1 = a^1, ⋯, A^n = a^n). \]
  • The next state \(S'\) depends on all the agents' actions.

Rewards

  • Let \(R^i\) be the reward received by the \(i\)-th agent.

  • Fully cooperative: \(R^1 = R^2 = ⋯ = R^n\)

  • Fully competitive: \(R^1 ∝ -R^2\)

  • \(R^i\) depends on \(A^i\) as well as all the other agents' actions \(\{A^j\}_{j ≠ i}\).
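To make the transition and reward definitions above concrete, here is a minimal sketch of a toy environment step. The function `step`, the integer state, the noise model, and the reward rule are all made up for illustration; the point is only that the next state and every reward depend on all the agents' actions.

```python
import random

def step(state, actions, rng, cooperative=True):
    """Toy transition: S' depends on S and on every agent's action.

    state is an int, actions is a list with one int per agent.
    """
    noise = rng.choice([-1, 0, 1])           # randomness in p(s' | s, a^1, ..., a^n)
    next_state = state + sum(actions) + noise
    team_reward = -abs(next_state)           # hypothetical goal: stay close to 0
    if cooperative:
        rewards = [team_reward] * len(actions)    # R^1 = R^2 = ... = R^n
    else:
        rewards = [team_reward, -team_reward]     # two agents only: R^1 = -R^2
    return next_state, rewards

rng = random.Random(0)
s_coop, r_coop = step(0, [1, -1, 2], rng)                   # three cooperating agents
s_comp, r_comp = step(0, [1, 1], rng, cooperative=False)    # two competing agents
```

In the cooperative call all three rewards are identical; in the competitive call the two rewards sum to zero.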

Returns

  • Let \(R_t^i\) be the reward received by the \(i\)-th agent at time \(t\).

  • Return (of the \(i\)-th agent):

\[ U_t^i = R_t^i + R_{t+1}^i + R_{t+2}^i + R_{t+3}^i + ⋯ \]
  • Discounted return (of the \(i\)-th agent):

\[ U_t^i = R_t^i + γ R_{t+1}^i + γ^2 R_{t+2}^i + γ^3 R_{t+3}^i + ⋯ \]

Here, \(γ ∈ [0,1]\) is the discount rate.
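As a small illustration of the definition, the sketch below computes the discounted return for each agent from a finite reward sequence. The function name `discounted_return` and the reward values are hypothetical, and the infinite sum is truncated to the rewards that are given.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma^k * rewards[k] for one agent's reward sequence."""
    u = 0.0
    # Accumulate backwards using U_t = R_t + gamma * U_{t+1}.
    for r in reversed(rewards):
        u = r + gamma * u
    return u

# Hypothetical rewards R_t^i, R_{t+1}^i, R_{t+2}^i for two agents.
rewards_per_agent = {
    1: [1.0, 0.0, 2.0],
    2: [-1.0, 0.0, -2.0],
}
gamma = 0.5
returns = {i: discounted_return(rs, gamma) for i, rs in rewards_per_agent.items()}
```

Setting `gamma = 1.0` recovers the undiscounted return.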

Policy Network

  • Each agent has its own policy network: \(π(a^i ∣ s; 𝛉^i).\)

  • Policy networks can be exchangeable: \(𝛉^1 = 𝛉^2 = ⋯ = 𝛉^n\)

  • Example: self-driving cars can all use the same policy.

  • Policy networks can be nonexchangeable: \(𝛉^i ≠ 𝛉^j.\)

  • Example: soccer players have different roles, e.g., striker, defender, goalkeeper.

Uncertainty in the Return

  • The reward \(R_t^i\) depends on \(S_t\) and \(A_t^1, A_t^2, ⋯, A_t^n\).

  • Uncertainty in \(S_t\) is from the state transition, \(p\).

  • Uncertainty in \(A_t^i\) is from the policy network: \(π(⋅ ∣ S_t; 𝛉^i)\)

  • The return: \(U_t^i = ∑\limits_{k=0}^{∞} γ^k ⋅ R_{t+k}^i\) depends on:

  • all the future states: \(\{S_t, S_{t+1}, S_{t+2}, ⋯\}\);

  • all the future actions: \(\{A_t^i, A_{t+1}^i, A_{t+2}^i, ⋯\}\), for all \(i = 1, ⋯, n\).

State-Value Function

  • State-value of the \(i\)-th agent:

\[ V^i(s_t; 𝛉^1, ⋯, 𝛉^n) = \mathbb{E}\big[U_t^i \mid S_t = s_t\big]. \]
  • The expectation is taken w.r.t. all the future actions and states except \(S_t\).

  • Randomness in actions: \(A_t^j \sim \pi(\cdot \mid s_t; 𝛉^j)\), for all \(j = 1, ⋯, n\).

(That is why the state-value \(V^i\) depends on \(𝛉^1, ⋯, 𝛉^n\).)

  • One agent’s state-value, \(V^i(s; 𝛉^1, ⋯, 𝛉^n)\), depends on all the agents’ policies.

  • If any agent changes its policy, then all of \(V^1, ⋯, V^n\) can change.

  • Example: soccer game.

  • A striker improves his policy, while everyone else’s policies are fixed.

  • His teammates’ state-values all increase.

  • The opposing players’ state-values all decrease.
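The soccer example can be checked numerically in a heavily simplified form. The sketch below assumes a one-step game with a single state (so the return equals the immediate reward), tabular policies given directly as action probabilities, and a made-up scoring rule; `state_value`, `team_reward`, and `opponent_reward` are hypothetical names.

```python
from itertools import product

def state_value(reward_fn, policies):
    """Expected one-step reward under independent policies.

    policies[j][a] is the probability that agent j picks action a.
    """
    value = 0.0
    for joint in product(*(range(len(p)) for p in policies)):
        prob = 1.0
        for j, a in enumerate(joint):
            prob *= policies[j][a]
        value += prob * reward_fn(joint)
    return value

def team_reward(joint):
    striker, _teammate, defender = joint
    # Hypothetical rule: a goal needs a good shot (1) and a missed block (0).
    return 1.0 if striker == 1 and defender == 0 else 0.0

def opponent_reward(joint):
    return -team_reward(joint)

before = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
after = [[0.2, 0.8], [0.5, 0.5], [0.5, 0.5]]   # only the striker's policy changes

v_teammate = (state_value(team_reward, before), state_value(team_reward, after))
v_opponent = (state_value(opponent_reward, before), state_value(opponent_reward, after))
```

Although only the striker's policy changed, the teammate's value rises and the opponent's value falls, which is the dependence of each \(V^i\) on all of \(𝛉^1, ⋯, 𝛉^n\).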

Convergence

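Convergence: no agent can obtain a larger expected return by improving its policy. If none of the agents can find a better policy, training has converged and can be stopped.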

Recall single-agent policy learning:

  • Policy network: \(\pi(a ∣ s; 𝛉)\).

  • State-value function: \(V(s; 𝛉)\).

  • \(J(𝛉) = 𝔼_S \left[ V(S; 𝛉) \right]\) evaluates how good the policy is.

  • Learn the policy network’s parameter, \(𝛉\), by \(\max\limits_{𝛉} \; J(𝛉)\).

  • Convergence: \(J(𝛉)\) stops increasing.

With multiple agents, the criterion for convergence is the Nash equilibrium:

  • While all the other agents’ policies remain the same, the \(i\)-th agent cannot get a better expected return by changing its own policy.

  • Every agent is playing a best-response to the other agents’ policies.

  • Nash equilibrium indicates convergence because no one has any incentive to deviate.
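The best-response condition above can be tested directly in a small case. The sketch below is a simplification: it assumes a one-shot game with two agents whose policies are deterministic single actions, and it uses prisoner's-dilemma payoffs chosen for illustration. `is_pure_nash` is a hypothetical helper, not part of any library.

```python
def is_pure_nash(payoffs, joint_action):
    """Check that no agent gains by unilaterally changing its action.

    payoffs[i][a0][a1] is agent i's payoff in a two-agent game.
    """
    a0, a1 = joint_action
    n0, n1 = len(payoffs[0]), len(payoffs[0][0])
    # Agent 0 may deviate while agent 1 keeps a1.
    best0 = max(payoffs[0][b][a1] for b in range(n0))
    # Agent 1 may deviate while agent 0 keeps a0.
    best1 = max(payoffs[1][a0][b] for b in range(n1))
    return payoffs[0][a0][a1] >= best0 and payoffs[1][a0][a1] >= best1

# Prisoner's dilemma: action 0 = cooperate, action 1 = defect.
payoffs = [
    [[-1, -3], [0, -2]],   # agent 0
    [[-1, 0], [-3, -2]],   # agent 1
]
both_cooperate = is_pure_nash(payoffs, (0, 0))
both_defect = is_pure_nash(payoffs, (1, 1))
```

Here mutual defection satisfies the condition while mutual cooperation does not, since either agent could gain by switching to defection on its own.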

Difficulty of MARL

Having multiple agents makes training harder. Directly applying single-agent algorithms works poorly and may not converge, because:

  • The \(i\)-th agent’s policy network: \(\pi(a^i \mid s; 𝛉^i)\).

  • The \(i\)-th agent’s state-value function: \(V^i(s; 𝛉^1, ⋯, 𝛉^n)\).

  • Objective function:

\[ J^i(𝛉^1, ⋯, 𝛉^n) = 𝔼_S \left[ V^i(S; 𝛉^1, ⋯, 𝛉^n) \right] \]
  • Learn the policy network’s parameter, \(𝛉^i\), by

\[ \max_{𝛉^i} \; J^i(𝛉^1, ⋯, 𝛉^n) \]

Note that the objectives \(\max_{𝛉^i} \; J^i(𝛉^1, ⋯, 𝛉^n)\) differ from agent to agent:

\[ \begin{matrix} \\ \text{The 1st agent solves:} & \max\limits_{𝛉^1} \; J^1(\underline{𝛉^1}, 𝛉^2, ⋯, 𝛉^n). \\ \text{The 2nd agent solves:} & \max\limits_{𝛉^2} \; J^2(𝛉^1, \underline{𝛉^2}, ⋯, 𝛉^n). \\ \vdots \\ \text{The }n\text{th agent solves:} & \max\limits_{𝛉^n} \; J^n(𝛉^1, 𝛉^2, ⋯, \underline{𝛉^n}). \end{matrix} \]

Each agent has its own objective function. There is no common objective, and each agent updates only its own parameters.

This may prevent training from ever converging, because one agent's policy update changes the objective functions of all the other agents. For example:

  • The \(i\)-th agent found \(𝛉_\star^i = \arg\max_{𝛉^i} \; J^i(𝛉^1, \cdots, 𝛉^n)\).

  • Now, another agent changes its policy.

  • So \(𝛉_\star^i\) is no longer the best policy of the \(i\)-th agent. The \(i\)-th agent has to find a new \(𝛉^i\).

  • The other agents’ objective functions will change, and therefore they will change their policies...

In this way all the agents keep changing their policies, and training may never converge.
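A minimal sketch of this cycling, assuming the game of matching pennies with deterministic actions and agents that take turns playing an exact best response (a stand-in for "find the best \(𝛉^i\) given the others"). The helper names `payoff` and `best_response` are made up for this example.

```python
def payoff(i, a0, a1):
    """Matching pennies: agent 0 wins on a match, agent 1 on a mismatch."""
    match = 1 if a0 == a1 else -1
    return match if i == 0 else -match

def best_response(i, a0, a1):
    """Agent i's best action while the other agent's action stays fixed."""
    if i == 0:
        return max((0, 1), key=lambda b: payoff(0, b, a1))
    return max((0, 1), key=lambda b: payoff(1, a0, b))

joint = (0, 0)
history = [joint]
for t in range(8):
    i = (t + 1) % 2          # agents take turns: agent 1 first, then agent 0, ...
    a0, a1 = joint
    b = best_response(i, a0, a1)
    joint = (b, a1) if i == 0 else (a0, b)
    history.append(joint)
```

The joint action visits all four combinations and returns to its starting point every four updates, so the sequence of policies never settles.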

Centralized and Decentralized

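Because one agent's policy update changes the objective functions of all the other agents, the agents need a way to communicate. Communication can be organized in two ways, centralized and decentralized, which gives the following architectures: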

  • Fully decentralized: Every agent uses its own observations and rewards to learn its policy. Agents do not communicate.

  • Fully centralized: The agents send everything to the central controller. The controller makes decisions for all the agents.

The agents themselves only carry out actions and do not make decisions; decision making is the central controller's job.

  • Centralized training with decentralized execution: A central controller is used during training. The controller is disabled after training.

Partial Observations

MARL usually assumes partial observations, because an agent can often observe only a local part of the state rather than the global state.

  • An agent may or may not have full knowledge of the state, \(s\).

  • Let \(o^i\) be the \(i\)-th agent’s observation.

  • Partial observation: \(o^i \ne s\).

  • Full observation: \(o^1 = ⋯ = o^n = s\).
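The distinction between \(o^i\) and \(s\) can be illustrated with a toy observation function. The sketch below assumes a one-dimensional grid state and an agent that sees only cells within a fixed radius of its position; `observe`, the grid values, and the radius are all hypothetical.

```python
def observe(state, agent_pos, radius):
    """Return the part of a 1-D state that one agent can see.

    state is a list of cell values; the agent sees the cells within
    `radius` of its own position.
    """
    lo = max(0, agent_pos - radius)
    hi = min(len(state), agent_pos + radius + 1)
    return state[lo:hi]

state = [0, 1, 0, 0, 2, 0, 3]
o_partial = observe(state, agent_pos=1, radius=1)            # a local window
o_full = observe(state, agent_pos=3, radius=len(state))      # the whole state
```

With a small radius the observation is a strict part of the state (partial observation); with a radius that covers the grid it equals the state (full observation).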

Fully Decentralized

Training (exactly the same as training in single-agent RL):

\[ \begin{matrix} \begin{matrix} \\ \text{Agent} 1 & \text{Agent} 2 & ⋯ & \text{Agent} n \\a^1 ↑↓ o^1, r^1 & a^2 ↑↓ o^2, r^2 & & a^n ↑↓ o^n, r^n \end{matrix} \\ \text{Environment} \end{matrix} \]

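After training, each agent makes decisions with its own policy network. During both training and inference, the agents do not communicate with each other.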

\[ \begin{matrix} \begin{matrix} \\ a ^ 1 ∼ π(⋅ ∣ o ^ 1; 𝛉 ^ 1) & a ^ 2 ∼ π(⋅ ∣ o ^ 2; 𝛉 ^ 2) & & a ^ n ∼ π(⋅ ∣ o ^ n; 𝛉 ^ n) \\ \text{Agent} 1 & \text{Agent} 2 & ⋯ & \text{Agent} n \\a^1 ↑↓ o^1 & a^2 ↑↓ o^2 & & a^n ↑↓ o^n \end{matrix} \\ \text{Environment} \end{matrix} \]
  • The \(i\)-th agent has a policy network (actor): \(\pi(a^i ∣ o^i; 𝛉^i)\).

  • The \(i\)-th agent has a value network (critic): \(q(o^i, a^i; \mathbf{w}^i)\).

  • Agents do not share observations and actions.

  • Train the policy and value networks in the same way as the single-agent setting.

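  • This does not work well: the other agents’ policies keep changing during training, so, as discussed under Difficulty of MARL, single-agent training may not converge.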

Fully Centralized

Centralized Training:

\[ \text{Central Controller} \left\{ \begin{matrix} \underset{a^1}{\overset{o^1, r^1}{⇆}} & \text{Agent} 1 ↔ \\ \underset{a^2}{\overset{o^2, r^2}{⇆}} & \text{Agent} 2 ↔ \\ ⋯ \\ \underset{a^n}{\overset{o^n, r^n}{⇆}} & \text{Agent} n ↔ \end{matrix} \right\} \text{Environment} \]

Centralized Execution:

\[ \overset{π(a ^ i ∣ o ^ 1, ⋯, o ^ n; 𝛉 ^ i) \text{ for all } i = 1, 2, ⋯, n} {\text{Central Controller}} \left\{ \begin{matrix} \underset{a^1}{\overset{o^1, r^1}{⇆}} & \text{Agent} 1 ↔ \\ \underset{a^2}{\overset{o^2, r^2}{⇆}} & \text{Agent} 2 ↔ \\ ⋯ \\ \underset{a^n}{\overset{o^n, r^n}{⇆}} & \text{Agent} n ↔ \end{matrix} \right\} \text{Environment} \]

For example, applying the centralized approach to the actor-critic method:

  • Let \(𝐚 = [a^1, a^2, ⋯, a^n]\) contain all the agents’ actions.

  • Let \(𝐨 = [o^1, o^2, ⋯, o^n]\) contain all the agents’ observations.

  • The central controller knows \(𝐚\), \(𝐨\), and all the rewards.

  • The controller has \(n\) policy networks and \(n\) value networks:

  • Policy network (actor) for the \(i\)-th agent: \(\pi(a^i ∣ 𝐨; 𝛉^i)\).

  • Value network (critic) for the \(i\)-th agent: \(q(𝐨, 𝐚; 𝐰^i)\).

  • Centralized Training: Training is performed by the controller.

  • The controller knows all the observations, actions, and rewards.

  • Train \(\pi(a^i ∣ 𝐨; 𝛉^i)\) using policy gradient.

  • Train \(q(𝐨, 𝐚; 𝐰^i)\) using the TD algorithm.

  • Centralized Execution: Decisions are made by the controller.

  • For all \(i\), the \(i\)-th agent sends its observation, \(o^i\), to the controller.

  • The controller knows \(𝐨 = [o^1, o^2, ⋯, o^n]\).

  • For all \(i\), the controller samples an action \(a^i ∼ π(⋅ ∣ 𝐨; 𝛉^i)\) and sends \(a^i\) to the \(i\)-th agent.

The drawback of the fully centralized approach is slow execution:

  • All the agents send their observations to the central controller.

  • The central controller makes decisions, \(\mathbf{a} = [a^1, a^2, \cdots, a^n]\), and sends \(a^i\) to the \(i\)-th agent.

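(Aside: at this point in the lecture, 王树森 remarked "the central authority personally makes the plans and personally gives the commands...", and as a result this episode lost its AI-generated Chinese subtitles, which 環状線 found very amusing. The remark occurs at 8:54 in this episode's video.)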
  • Communication and synchronization cost time.

  • Real-time decision making is impossible.
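The centralized execution loop described above can be sketched as follows. This is an illustration under assumptions, not the lecture's implementation: the policies are linear-softmax functions of the concatenated observations, and `softmax`, `policy_probs`, `centralized_step`, and all parameter values are hypothetical.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def policy_probs(theta, features):
    """pi(. | features; theta) for a linear-softmax policy.

    theta[a] is the weight vector of action a.
    """
    scores = [sum(w * x for w, x in zip(row, features)) for row in theta]
    return softmax(scores)

def centralized_step(thetas, observations, rng):
    """The controller sees every o^i and samples every a^i."""
    joint_obs = [x for o in observations for x in o]   # o = [o^1, ..., o^n]
    actions = []
    for theta in thetas:                               # one policy per agent
        probs = policy_probs(theta, joint_obs)
        actions.append(rng.choices(range(len(probs)), weights=probs)[0])
    return actions                                     # a^i is sent back to agent i

rng = random.Random(0)
observations = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]    # o^1, o^2, o^3
# Hypothetical parameters: 2 actions per agent, 6 joint-observation features.
thetas = [[[0.1 * (i + 1)] * 6, [0.0] * 6] for i in range(3)]
actions = centralized_step(thetas, observations, rng)
```

No action can be sampled until every observation has arrived at the controller, which is where the communication and synchronization delay comes from.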

Centralized Training with Decentralized Execution

For example, for the actor-critic method:

  • Each agent has its own policy network (actor): \(π(a^i ∣ o^i; 𝛉^i)\).

  • The central controller has \(n\) value networks (critics): \(q(𝐨, 𝐚; 𝐰^i)\).

  • Centralized Training: During training, the central controller knows all the agents’ observations, actions, and rewards.

  • Decentralized Execution: During execution, the central controller and its value networks are not used.

Centralized Training:

\[ \left. \begin{matrix} \text{Critic }1 \\ \text{Critic }2 \\ ⋯ \\ \text{Critic }n \end{matrix} \right\} ← \overset{\{a ^ i, o ^ i, r ^ i\}^n_{i = 1}} {\text{Central Controller}} \left\{ \begin{matrix} \overset{a^1 ,o^1, r^1}{←} & \text{Agent} 1 & \underset{a^1}{\overset{o^1, r^1}{⇆}} \\ \overset{a^2 ,o^2, r^2}{←} & \text{Agent} 2 & \underset{a^2}{\overset{o^2, r^2}{⇆}} \\ & ⋯ \\ \overset{a^n ,o^n, r^n}{←} & \text{Agent} n & \underset{a^n}{\overset{o^n, r^n}{⇆}} \end{matrix} \right\} \text{Environment} \]
  • The central controller trains the critics, \(q(\mathbf{o}, \mathbf{a}; 𝐰^i)\), for all \(i\).

  • To update \(𝐰^i\), the TD algorithm takes as inputs:

  • All the actions: \(\mathbf{a} = [a^1, a^2, \cdots, a^n]\).

  • All the observations: \(\mathbf{o} = [o^1, o^2, \cdots, o^n]\).

  • The \(i\)-th reward: \(r^i\).

\[ \overset{\{a ^ i, o ^ i, r ^ i\}^n_{i = 1}} {\text{Central Controller}} → \left\{ \begin{matrix} \text{Critic }1 & \overset{q ^ 1 = q(𝐨, 𝐚; 𝐰 ^ 1)}{→} \text{Actor }1 \\ \text{Critic }2 & \overset{q ^ 2 = q(𝐨, 𝐚; 𝐰 ^ 2)}{→} \text{Actor }2 \\ ⋯ \\ \text{Critic }n & \overset{q ^ n = q(𝐨, 𝐚; 𝐰 ^ n)}{→} \text{Actor }n \end{matrix} \right. \]
  • Each agent locally trains the actor, \(\pi(a^i \mid o^i; 𝛉^i)\), using policy gradient.

  • To update \(𝛉^i\), the policy gradient algorithm takes as input \((a^i, o^i, q^i)\).
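The inputs of the two updates can be made explicit in code. The sketch below is one possible instantiation, not the lecture's exact algorithm: it assumes a linear critic on the concatenated observations and actions, a SARSA-style TD target built from the next joint observation and action, a linear-softmax actor, and actions encoded as floats for the critic. All function names and numbers are hypothetical.

```python
import math

def critic_q(w, joint_obs, joint_act):
    """Linear critic q(o, a; w) on the concatenated inputs."""
    x = joint_obs + joint_act
    return sum(wi * xi for wi, xi in zip(w, x))

def td_update(w, o, a, r_i, o_next, a_next, gamma, lr):
    """One TD step for critic i: needs all of o and a, but only r^i."""
    target = r_i + gamma * critic_q(w, o_next, a_next)
    delta = critic_q(w, o, a) - target          # TD error
    x = o + a
    return [wi - lr * delta * xi for wi, xi in zip(w, x)]

def actor_probs(theta, o_i):
    scores = [sum(t * x for t, x in zip(row, o_i)) for row in theta]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def actor_update(theta, o_i, a_i, q_i, lr):
    """One policy-gradient step for actor i from (a^i, o^i, q^i)."""
    probs = actor_probs(theta, o_i)
    new_theta = []
    for b, row in enumerate(theta):
        # d log pi(a_i | o_i) / d theta[b] = (1[b == a_i] - pi(b)) * o_i
        coeff = (1.0 if b == a_i else 0.0) - probs[b]
        new_theta.append([t + lr * q_i * coeff * x for t, x in zip(row, o_i)])
    return new_theta

# Two agents, one observation feature each; joint action encoded as floats.
o, a = [1.0, 0.5], [1.0, 0.0]
o_next, a_next = [0.0, 0.0], [0.0, 0.0]
w1 = td_update([0.0] * 4, o, a, r_i=1.0, o_next=o_next, a_next=a_next,
               gamma=0.9, lr=0.1)
theta1 = actor_update([[0.0], [0.0]], o_i=[1.0], a_i=1, q_i=2.0, lr=0.5)
```

The critic update consumes the joint \((𝐨, 𝐚)\) and so has to run at the controller, while the actor update touches only agent \(i\)'s own observation and action plus the scalar \(q^i\) it receives.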

After training, the central controller is no longer needed; each agent interacts with the environment independently.

Decentralized Execution:

\[ \left. \begin{matrix} a ^ 1 ∼ π(⋅ ∣ o ^ 1; 𝛉 ^ 1) & \text{Agent} 1 & \underset{a^1}{\overset{o^1}{⇆}} \\ a ^ 2 ∼ π(⋅ ∣ o ^ 2; 𝛉 ^ 2) & \text{Agent} 2 & \underset{a^2}{\overset{o^2}{⇆}} \\ & ⋯ \\ a ^ n ∼ π(⋅ ∣ o ^ n; 𝛉 ^ n) & \text{Agent} n & \underset{a^n}{\overset{o^n}{⇆}} \end{matrix} \right\} \text{Environment} \]

Parameter Sharing

  • Policy networks: \(π(a^i \mid o^i; 𝛉^i)\), for \(i = 1, 2, ⋯, n\).

  • Value networks: \(q(𝐨, 𝐚; 𝐰^i)\), for \(i = 1, 2, ⋯, n\).

  • Trainable parameters: \(\{𝛉^i, 𝐰^i\}_{i=1}^n\).

  • Parameter sharing: \(𝛉^i = 𝛉^j\) and \(𝐰^i = 𝐰^j\), for some \(i\) and \(j\).

So, should parameters be shared? It depends on whether the agents are exchangeable. If they are, parameters can be shared; otherwise they cannot.
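In code, parameter sharing amounts to several agents referring to the same parameter object. The sketch below assumes parameters stored as nested Python lists; `build_policies` and `make_params` are hypothetical helpers, and the car and soccer-player agents echo the earlier examples.

```python
def build_policies(n_agents, make_params, share):
    """Return one parameter object per agent.

    With share=True every agent refers to the same object, i.e.
    theta^1 = ... = theta^n; otherwise each agent gets its own copy.
    """
    if share:
        shared = make_params()
        return [shared for _ in range(n_agents)]
    return [make_params() for _ in range(n_agents)]

def make_params():
    return [[0.0, 0.0], [0.0, 0.0]]   # hypothetical 2x2 weight matrix

cars = build_policies(4, make_params, share=True)       # exchangeable agents
players = build_policies(3, make_params, share=False)   # distinct roles

cars[0][0][0] = 1.0      # an update made through one car ...
players[0][0][0] = 1.0   # ... versus an update to one player only
```

After these two assignments every car sees the updated weight, whereas the other players' parameters are unchanged.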

Written on 2026-4-9