This article, published in 2001, draws on the information-theoretic understanding of the Fisher Information Matrix and has had a huge influence on later Policy Gradient work, so it is well worth covering. For the basics of Policy Gradient, see here.

Policy Gradient’s Average Reward Form:

$$\nabla_\theta\eta(\theta) = \sum_{s,a}d^\pi(s)\nabla_\theta\pi_\theta(s,a)Q^\pi(s,a)$$
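
As a purely illustrative sketch of this expression (not from the paper), the snippet below evaluates it exactly for a small tabular softmax policy; the names `policy_gradient`, `d_pi`, and `Q_pi` are my own, and the state distribution and action values are assumed to be given, e.g. estimated elsewhere.

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def policy_gradient(theta, d_pi, Q_pi):
    """Evaluate sum_{s,a} d^pi(s) * grad_theta pi_theta(s,a) * Q^pi(s,a)
    for a tabular softmax policy pi_theta(a|s) = softmax(theta[s])[a].

    theta: (S, A) logits, d_pi: (S,) state distribution, Q_pi: (S, A) action values.
    """
    pi = softmax(theta)                                  # (S, A)
    grad = np.zeros_like(theta)
    for s in range(theta.shape[0]):
        # d pi(s, a) / d theta[s, b] = pi(s, a) * (1{a == b} - pi(s, b))
        jac = np.diag(pi[s]) - np.outer(pi[s], pi[s])    # (A, A), symmetric
        grad[s] = d_pi[s] * (jac @ Q_pi[s])              # sums over actions a
    return grad
```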

We know that for a parameterized policy $\pi_\theta$, a small change in $\theta$ can lead to a large change in the policy, so policy improvement is only guaranteed when the change in $\theta$ is very small. How, then, do we define "small"? And how small is small enough?

To define "small", the straightforward option is to measure the change in the parameters themselves, e.g. $|\delta \theta|$; a better option is to measure the distance between the policy distributions before and after the update, e.g. $D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta})$. With the latter, policy optimisation becomes a constrained optimisation problem:

$$\text{find} \quad \delta\theta \quad \\\text{maximize} \quad \eta(\pi_{\theta+\delta\theta}) \\\text{subject to} \quad D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \leq \epsilon$$

To solve such a constrained problem, we use the method of Lagrange multipliers:

$$L(\delta\theta, \gamma) = \eta(\pi_{\theta+\delta\theta}) - \gamma(D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})-\epsilon)$$

Setting the derivatives to zero:

$$\nabla_{\delta\theta}L(\delta\theta,\gamma)= \nabla_{\delta\theta} \eta(\pi_{\theta+\delta\theta})-\gamma\nabla_{\delta\theta}D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) = 0 \\ \nabla_{\gamma}L(\delta\theta,\gamma)= \epsilon - D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) =0$$

Next, we solve these equations approximately.

First, take a first-order Taylor expansion of $\eta(\pi_{\theta+\delta\theta})$:

$$\eta(\pi_{\theta+\delta\theta})\approx\eta(\theta)+\delta\theta^T\nabla_\theta\eta(\pi_\theta) \\\nabla_{\delta\theta} \eta(\pi_{\theta+\delta\theta}) \approx \nabla_\theta\eta(\pi_\theta)$$

Similarly, we take a second-order Taylor expansion of $D_{KL}$ around $\delta\theta = 0$:

$$D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \approx D_{KL}(\pi_\theta||\pi_\theta)+\delta\theta^T\nabla_{\delta\theta} D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})\big|_{\delta\theta=0}+\frac{1}{2}\delta\theta^T\nabla_{\delta\theta}^2D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})\big|_{\delta\theta=0}\,\delta\theta$$

Recall that for two distributions $P(x)$ and $Q(x)$, the KL divergence is:

$$\begin{aligned}D_{KL}(P||Q) &= \int_{-\infty}^{\infty} P(x)\log(\frac{P(x)}{Q(x)})dx \\&=\mathbb{E}_{x\sim P(x)}[\log\frac{P(x)}{Q(x)}]\end{aligned}$$
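
As a tiny numerical illustration of this definition (two made-up discrete distributions of my own):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)) for discrete P, Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))   # > 0; note kl_divergence(p, p) == 0 and KL is not symmetric
```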

From this definition,

$$\begin{aligned}\nabla_\theta D_{KL}(\pi_{\theta_0}||\pi_{\theta})\big|_{\theta=\theta_0}&= \nabla_\theta\mathbb{E}_{a\sim\pi_{\theta_0}}[\log \frac{\pi_{\theta_0}}{\pi_{\theta}}]\Big|_{\theta=\theta_0} \\&= - \mathbb{E}_{a\sim\pi_{\theta_0}}[\nabla_\theta \log\pi_{\theta}]\big|_{\theta=\theta_0} \\&= - \mathbb{E}_{a\sim\pi_{\theta_0}}[\frac{\nabla_\theta \pi_{\theta}}{\pi_{\theta}}]\Big|_{\theta=\theta_0} \\&= -\int_{\mathcal{A}} \pi_{\theta_0}\frac{\nabla_\theta\pi_\theta|_{\theta=\theta_0}}{\pi_{\theta_0}} \\&= -\nabla_\theta \int_{\mathcal{A}}\pi_\theta\,\Big|_{\theta=\theta_0} \\&= 0\end{aligned}$$

Similarly,

$$\begin{aligned}\nabla_\theta^2 D_{KL}(\pi_{\theta_0}||\pi_{\theta})\big|_{\theta=\theta_0}&= -\mathbb{E}_{a\sim\pi_{\theta_0}}[\nabla^2_\theta\log \pi_{\theta}]\big|_{\theta=\theta_0} \\&= -\mathbb{E}_{a\sim\pi_{\theta_0}}[\nabla_\theta (\frac{\nabla_\theta \pi_{\theta}}{\pi_{\theta}})]\Big|_{\theta=\theta_0} \\&= -\mathbb{E}_{a\sim\pi_{\theta_0}}[\frac{\nabla^2_\theta \pi_{\theta}\,\pi_{\theta} - \nabla_\theta\pi_\theta \nabla_\theta\pi_\theta^T}{\pi_{\theta}^2}]\Big|_{\theta=\theta_0} \\&= \mathbb{E}_{a\sim \pi_{\theta_0}}[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T]\big|_{\theta=\theta_0}\end{aligned}$$

where the term containing $\nabla^2_\theta\pi_\theta$ drops out in the last step because $\mathbb{E}_{a\sim\pi_{\theta_0}}[\frac{\nabla^2_\theta\pi_\theta}{\pi_{\theta_0}}]\big|_{\theta=\theta_0} = \nabla^2_\theta\int_{\mathcal{A}}\pi_\theta = 0$.

We denote the Fisher Information Matrix by

$$F_\theta = \mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T]$$

Therefore

$$D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \approx \frac{1}{2}\delta\theta^TF_\theta\delta\theta \\ \nabla_{\delta\theta}D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) \approx F_\theta\delta\theta$$
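
This quadratic approximation is easy to check numerically. Below is a sketch of mine (not from the paper) for a single-state categorical softmax policy: the exact KL between $\pi_\theta$ and $\pi_{\theta+\delta\theta}$ is compared with $\frac{1}{2}\delta\theta^TF_\theta\delta\theta$, where the hypothetical helper `fisher_softmax` computes $F_\theta$ directly from its definition.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fisher_softmax(theta):
    """F_theta = E_{a~pi}[grad log pi(a) grad log pi(a)^T] for a categorical
    softmax policy, where grad_theta log pi(a) = e_a - pi."""
    pi = softmax(theta)
    eye = np.eye(len(theta))
    return sum(pi[a] * np.outer(eye[a] - pi, eye[a] - pi) for a in range(len(theta)))

rng = np.random.default_rng(0)
theta = rng.normal(size=5)
dtheta = 1e-2 * rng.normal(size=5)

pi_old, pi_new = softmax(theta), softmax(theta + dtheta)
kl_exact = float(np.sum(pi_old * np.log(pi_old / pi_new)))
kl_quad = 0.5 * dtheta @ fisher_softmax(theta) @ dtheta
print(kl_exact, kl_quad)   # the two agree up to O(|dtheta|^3)
```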

Substituting these approximations into $\nabla L$ gives:

$$\nabla_{\delta\theta}L(\delta\theta,\gamma)=\nabla_\theta\eta(\pi_\theta) - \gamma F_\theta\delta\theta = 0 \\ \nabla_{\gamma}L(\delta\theta,\gamma) = \frac{1}{2}\delta\theta^TF_\theta\delta\theta - \epsilon = 0$$

Then

$$\delta\theta = \frac{1}{\gamma}F^{-1}_\theta\nabla_\theta\eta(\pi_\theta) \\$$
$$\frac{1}{\gamma^2}\nabla_\theta\eta(\pi_\theta)^TF_\theta^{-1}F_\theta F_\theta^{-1}\nabla_\theta \eta(\pi_\theta) = \frac{1}{\gamma^2}\nabla_\theta \eta(\pi_\theta)^TF_\theta^{-1}\nabla_\theta \eta(\pi_\theta) = 2\epsilon$$
$$\delta\theta = \sqrt{\frac{2\epsilon}{\nabla_\theta \eta(\pi_\theta)^TF_\theta^{-1}\nabla_\theta \eta(\pi_\theta)}} F_\theta^{-1}\nabla_\theta\eta(\pi_\theta)$$

Writing the standard policy gradient as $g = \nabla_\theta \eta(\pi_\theta)$, we obtain the

Natural policy gradient:

$$g_N = \sqrt{\frac{2\epsilon}{g^TF_\theta^{-1}g}}F_\theta^{-1}g$$
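
In code, the closed-form update above looks like the following sketch; `g` and `F` would come from estimators such as the ones sketched earlier, and the damping term is my own numerical safeguard, not part of the derivation.

```python
import numpy as np

def natural_gradient_step(g, F, epsilon=0.01, damping=1e-8):
    """Compute g_N = sqrt(2*eps / (g^T F^{-1} g)) * F^{-1} g.

    Solves F x = g instead of forming F^{-1} explicitly; the small damping
    term keeps the solve well-posed when F is nearly singular."""
    F_damped = F + damping * np.eye(F.shape[0])
    x = np.linalg.solve(F_damped, g)                  # x = F^{-1} g
    step_size = np.sqrt(2.0 * epsilon / float(g @ x))
    return step_size * x
```

In practice (for example in TRPO-style implementations) the linear solve is usually replaced by conjugate gradient so that $F_\theta$ never has to be formed or inverted explicitly.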

So we see that both the step size and the step direction of the natural policy gradient depend heavily on the Fisher information matrix, which makes understanding the Fisher information matrix essential. For a detailed analysis of Fisher information, see here; below is a brief summary.

We derived above that:

$$\begin{aligned}F_\theta &= \mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T] \\&= \nabla^2_{\theta'} D_{KL}(\pi_\theta||\pi_{\theta'})\big|_{\theta'=\theta}\end{aligned}$$

We call the first derivative of the policy log likelihood the score function:

$$S_\theta = \nabla_\theta \log \pi_\theta$$

Then:

$$F_\theta = \mathbb{E}_{a\sim\pi_\theta}[S_\theta S_\theta^T]$$

When $\theta \in \mathbb{R}$ (a single scalar parameter):

$$Var(S_\theta) = \mathbb{E}[S_\theta^2] - \mathbb{E}[S_\theta]^2 \\\mathbb{E}[S_\theta] = \int_\mathcal{A}\pi_\theta\nabla_\theta\log\pi_\theta = \int_\mathcal{A}\nabla_\theta\pi_\theta = 0$$

Therefore

$$F_\theta = Var(S_\theta)$$
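
A quick sanity check of this identity, using a hypothetical one-parameter Bernoulli policy $\pi_\theta(a=1)=\sigma(\theta)$, whose score is $a-\sigma(\theta)$ and whose Fisher information is $\sigma(\theta)(1-\sigma(\theta))$:

```python
import numpy as np

theta = 0.7
p = 1.0 / (1.0 + np.exp(-theta))         # pi_theta(a=1) = sigmoid(theta)

rng = np.random.default_rng(0)
a = rng.binomial(1, p, size=200_000)     # actions sampled from the policy

score = a - p                            # d/dtheta log pi_theta(a) = a - sigmoid(theta)
print(np.var(score))                     # Monte Carlo Var(S_theta)
print(p * (1 - p))                       # analytic Fisher information; the two match
```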

For a batch of $N$ collected state–action pairs, the log likelihood is a sum over samples, so the score function is additive:

$$S_\theta = \nabla_\theta \log \pi_\theta(s_0, a_0)+\dots+\nabla_\theta \log \pi_\theta(s_N, a_N)$$

That is, the Fisher information equals the variance of the score function. As we collect more and more data points and $N$ grows, the variance of $S_\theta$ clearly grows, and so does $F_\theta$. Put the other way around:

Fisher information measures how much data we have collected, in the sense of how accurate our estimate is.

In deriving $\nabla^2_\theta D_{KL}$, we also implicitly proved that:

$$F_\theta = -\mathbb{E}_{a\sim \pi_\theta}[\nabla^2_\theta \log\pi_\theta]$$
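
Continuing the categorical softmax example from the earlier sketch (again my own illustration, not the paper's), the negative expected Hessian of the log likelihood reproduces exactly the same matrix as the outer-product form:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

theta = np.array([0.2, -1.0, 0.5, 1.3])
pi = softmax(theta)
eye = np.eye(len(theta))

# For a categorical softmax policy, hess_theta log pi(a) = -(diag(pi) - pi pi^T)
# for every action a, so -E_a[hess_theta log pi(a)] = diag(pi) - pi pi^T.
neg_expected_hessian = np.diag(pi) - np.outer(pi, pi)

# Outer-product (score) form: F = E_a[(e_a - pi)(e_a - pi)^T].
F = sum(pi[a] * np.outer(eye[a] - pi, eye[a] - pi) for a in range(len(theta)))

print(np.allclose(F, neg_expected_hessian))   # True: both expressions give the same matrix
```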

In other words,

Fisher information measures the curvature of log likelihood function.

And when $\theta$ is a vector, $\theta \in \mathbb{R}^d$ with $d > 1$, the same reasoning gives

$$F_\theta =\mathrm{Cov}(S_\theta)$$

With this in mind, let us revisit the natural policy gradient:

$$g_N = \alpha F_\theta^{-1}g$$

In other words, the natural policy gradient is a correction of the standard policy gradient, and the correction is governed by the covariance matrix of the policy's score function. Only when $F_\theta = I$ is no correction applied. When two components of $\theta$ are strongly positively correlated, the correction shrinks the gradient along the corresponding direction, and in the opposite case enlarges it. In other words:

Natural policy gradient corrects the standard policy gradient according to the correlations of its parameters, as measured in action space, so that the policy effectively iterates in a decorrelated (orthogonal) space.
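
A toy numerical illustration of this last point (entirely my own construction): with a strongly correlated score covariance, $F_\theta^{-1}g$ shrinks the component of the gradient along the shared direction and amplifies the component along the decorrelated one.

```python
import numpy as np

# A made-up 2-parameter example: the scores of the two parameters are highly
# positively correlated, so F has large off-diagonal entries.
F = np.array([[1.0, 0.9],
              [0.9, 1.0]])

g_shared = np.array([1.0, 1.0])           # gradient along the correlated (shared) direction
g_ortho = np.array([1.0, -1.0])           # gradient along the decorrelated direction

print(np.linalg.solve(F, g_shared))       # ~[0.53, 0.53]: the shared direction is shrunk
print(np.linalg.solve(F, g_ortho))        # ~[10., -10.]: the orthogonal direction is amplified
```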