这篇发表于2001年的文章,涉及到Information Theory里对Fisher Information Matrix的理解,对Policy Gradient的后作影响巨大,不得不提。关于Policy Gradient的基础,请移步这里

Policy Gradient's Average Reward Form:

θη(θ)=s,adπ(s)θπθ(s,a)Qπ(s,a)\nabla_\theta\eta(\theta) = \sum_{s,a}d^\pi(s)\nabla_\theta\pi_\theta(s,a)Q^\pi(s,a)

我们知道,对于一个含参数的policy πθ\pi_\thetaθ\theta 的很小改变,可能会导致policy的很大改变,所以只有当θ\theta的变化量非常小时,policy的提升才有保证。那么如何定义小?多小算小呢?

对于如何定义小的问题,比较直观的办法是计算 θ\theta 的变化量,比如其δθ|\delta \theta|;而更好的办法是计算更新前后的policy 的分布的距离,比如DKL(πθπθ+δθ)D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta})。这样,我们就可以将policy optimization转化为一个带约束条件的优化问题,即:

findδθmaximizeη(πθ+δθ)subject toDKL(πθπθ+δθ)ϵ\text{find} \quad \delta\theta \quad \\ \text{maximize} \quad \eta(\pi_{\theta+\delta\theta}) \\ \text{subject to} \quad D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \leq \epsilon

解决这种问题,必然用拉格朗日法:

L(δθ,γ)=η(πθ+δθ)γ(DKL(πθπθ+δθ)ϵ)L(\delta\theta, \gamma) = \eta(\pi_{\theta+\delta\theta}) - \gamma(D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})-\epsilon)

对其求导求零点:

δθL(δθ,γ)=δθη(πθ+δθ)γδθDKL(πθπθ+δθ)=0\nabla_{\delta\theta}L(\delta\theta,\gamma)= \nabla_{\delta\theta} \eta(\pi_{\theta+\delta\theta})-\gamma\nabla_{\delta\theta}D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) = 0

γL(δθ,γ)=ϵDKL(πθπθ+δθ)=0\nabla_{\gamma}L(\delta\theta,\gamma)= \epsilon - D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) =0

接下来,我们用近似法求解上式。

首先对η(πθ+δθ)\eta(\pi_{\theta+\delta\theta})进行一阶泰勒展开:

η(πθ+δθ)η(θ)+δθTθη(πθ)\eta(\pi_{\theta+\delta\theta})\approx\eta(\theta)+\delta\theta^T\nabla_\theta\eta(\pi_\theta)

因此

δθη(πθ+δθ)θη(πθ)\nabla_{\delta\theta} \eta(\pi_{\theta+\delta\theta}) \approx \nabla_\theta\eta(\pi_\theta)

同样我们可以对DKLD_{KL}进行二阶泰勒展开:

DKL(πθπθ+δθ)DKL(πθπθ)+δθTθDKL(πθπθ+δθ)+12δθTθ2DKL(πθπθ+δθ)δθD_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \approx D_{KL}(\pi_\theta||\pi_\theta)+\delta\theta^T\nabla_\theta D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})+\frac{1}{2}\delta\theta^T\nabla_\theta^2D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})\delta\theta

我们知道对于两个分布P(x),Q(x)P(x),Q(x),其KLKL散度:

DKL(PQ)=P(x)log(P(x)Q(x))dx=ExP(x)[logP(x)Q(x)]\begin{aligned} D_{KL}(P||Q) &= \int_{-\infty}^{\infty} P(x)\log(\frac{P(x)}{Q(x)})dx \\ &=\mathbb{E}_{x\sim P(x)}[\log\frac{P(x)}{Q(x)}] \end{aligned}

那么

θDKL(πθ0πθ)=θEaπθ0[logπθ0πθ]=Eaπθ0[logπθ]=Eaπθ0[1πθ0πθ]=Aπθ01πθ0πθ=Aπθ=0\begin{aligned} \nabla_\theta D_{KL}(\pi_{\theta_0}||\pi_{\theta})&= \nabla_\theta\mathbb{E}_{a\sim\pi_{\theta_0}}[\log \frac{\pi_{\theta_0}}{\pi_{\theta}}] \\ &= - \mathbb{E}_{a\sim\pi_{\theta_0}}[\nabla \log\pi_{\theta}] \\ &= - \mathbb{E}_{a\sim\pi_{\theta_0}}[\frac{1}{\pi_{\theta_0}}\nabla \pi_{\theta}] \\ &= -\int_{\mathcal{A}} \pi_{\theta_0}\frac{1}{\pi_{\theta_0}}\nabla\pi_\theta \\ &= -\nabla \int_{\mathcal{A}}\pi_\theta \\ &= 0 \end{aligned}

同理

θ2DKL(πθ0πθ)=Eaπθ0[θ2logπθ]=Eaπθ0[θ(πθπθ0)]=Eaπθ0[θ2πθπθ0θπθθπθTπθ02]=Eaπθ0[θlogπθθlogπθT]\begin{aligned} \nabla_\theta^2 D_{KL}(\pi_{\theta_0}||\pi_{\theta}) &= -\mathbb{E}_{a\sim\pi_{\theta_0}}[\nabla^2_\theta\log \pi_{\theta}] \\ &= -\mathbb{E}_{a\sim\pi_{\theta_0}}[\nabla_\theta (\frac{\nabla \pi_{\theta}}{\pi_{\theta_0}})] \\ &= -\mathbb{E}_{a\sim\pi_{\theta_0}}[\frac{\nabla^2_\theta \pi_{\theta} \pi_{\theta_0} - \nabla_\theta\pi_\theta \nabla_\theta\pi_\theta^T}{\pi_{\theta_0}^2}] \\ &= \mathbb{E}_{a\sim \pi_{\theta_0}}[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T] \end{aligned}

我们记Fisher Information Matrix

Fθ=Eaπθ[θlogπθθlogπθT]F_\theta = \mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T]

因此

DKL(πθπθ+δθ)12δθTFθδθD_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \approx \frac{1}{2}\delta\theta^TF_\theta\delta\theta

δθDKL(πθπθ+δθ)Fθδθ\nabla_{\delta\theta}D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) \approx F_\theta\delta\theta

我们将上式代入L\nabla L可得:

δθL(δθ,γ)=θη(πθ)γFθδθ=0\nabla_{\delta\theta}L(\delta\theta,\gamma)=\nabla_\theta\eta(\pi_\theta) - \gamma F_\theta\delta\theta = 0

γL(δθ,γ)=12δθTFθδθϵ=0\nabla_{\gamma}L(\delta\theta,\gamma) = \frac{1}{2}\delta\theta^TF_\theta\delta\theta - \epsilon = 0

那么

δθ=1γFθ1θη(πθ)\delta\theta = \frac{1}{\gamma}F^{-1}_\theta\nabla_\theta\eta(\pi_\theta) \\

1γ2θη(πθ)TFθ1FθFθ1θη(πθ)=1γ2θη(πθ)TFθ1θη(πθ)=2ϵ\frac{1}{\gamma^2}\nabla_\theta\eta(\pi_\theta)^TF_\theta^{-1}F_\theta F_\theta^{-1}\nabla_\theta \eta(\pi_\theta) = \frac{1}{\gamma^2}\nabla_\theta \eta(\pi_\theta)^TF_\theta^{-1}\nabla_\theta \eta(\pi_\theta) = 2\epsilon

δθ=2ϵθη(πθ)TFθ1θη(πθ)Fθ1θη(πθ)\delta\theta = \sqrt{\frac{2\epsilon}{\nabla_\theta \eta(\pi_\theta)^TF_\theta^{-1}\nabla_\theta \eta(\pi_\theta)}} F_\theta^{-1}\nabla_\theta\eta(\pi_\theta)

我们记standard policy gradient g=θη(πθ)g = \nabla_\theta \eta(\pi_\theta),那么

Natural policy gradient

gN=2ϵgTFθ1gFθ1gg_N = \sqrt{\frac{2\epsilon}{g^TF_\theta^{-1}g}}F_\theta^{-1}g


所以我们发现,natural policy gradient的step size及step direction都和fisher information matrix关系很大,那么理解fisher information matrix就变得至关重要。关于费雪信息的详细分析,建议移步这里,我这里简单的说明一下。

我们前面推导出以下:

Fθ=Eaπθ[θlogπθθlogπθT]θ2DKL(π0πθ)\begin{aligned} F_\theta &= \mathbb{E}_{a\sim\pi_\theta}[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T] \\ &\approx \nabla^2_\theta D_{KL}(\pi_0||\pi_\theta) \end{aligned}

我么称policy log likelihood的一阶导数为score function:

Sθ=θlogπθS_\theta = \nabla_\theta \log \pi_\theta

那么:

Fθ=Eaπθ[SθTSθ]F_\theta = \mathbb{E}_{a\sim\pi_\theta}[S_\theta^TS_\theta]

θR\theta \in \mathbb{R}时:

Var(Sθ)=E[Sθ2]E2[Sθ]E[Sθ]=Aθlogπθ=0Var(S_\theta) = \mathbb{E}[S_\theta^2] - \mathbb{E}^2[S_\theta] \\ \mathbb{E}[S_\theta] = \int_\mathcal{A}\nabla_\theta\log\pi_\theta = 0 \\

所以:

Fθ=Var(Sθ)F_\theta = Var(S_\theta)

而:

Sθ=θlogπθ(s0,a0)++θlogπθ(sN,aN)S_\theta = \nabla_\theta \log \pi_\theta(s_0, a_0)+\dots+\nabla_\theta \log \pi_\theta(s_N, a_N)

也就是说,fisher information,等价于score function的方差。随着我们收集的数据点越来越多,NN 越来越大,很显然SθS_\theta的方差会越来越大,所以FθF_\theta会越来越大。反过来说

Fisher information measures how many data we have collected, in the sense how accurate our estimation is.

我们在推导θ2DKL\nabla^2_\theta D_{KL} 的时候间接证明了:

Fθ=Eaπθ[θ2logπθ]F_\theta = -\mathbb{E}_{a\sim \pi_\theta}[\nabla^2_\theta \log\pi_\theta]

也就是说,

Fisher information measures the curvature of log likelihood function.

而当 θRN,N>1\theta \in \mathbb{R}^N, N > 1时,同理可知

Fθ=Cov(Sθ)F_\theta = Cov(S_\theta)


这时候,我们再过来看natural policy gradient:

gN=αFθ1gg_N = \alpha F_\theta^{-1}g

也就是说,natural policy gradient是对standard policy gradient的修正,修正的方向与policy的score function的covariance matrix有关。只有当Fθ=IF_\theta = I,我们才不做修正。当 θ\theta 中的某两个元素正相关性高时候,我们的修正会其对应方向的梯度减小,反过来会增大。也就是说:

Natural policy gradient corrects standard policy gradient according to its parameter correlation in action space measure such that policy iterates in orthogonal action space.