This article, published in 2001, builds on the information-theoretic understanding of the Fisher Information Matrix and has had a huge influence on later Policy Gradient work, so it deserves a mention. For the basics of Policy Gradient, please see here.
Policy Gradient's Average Reward Form:
$$\nabla_\theta\eta(\theta) = \sum_{s,a}d^\pi(s)\,\nabla_\theta\pi_\theta(s,a)\,Q^\pi(s,a)$$

We know that for a parameterized policy $\pi_\theta$, a small change in $\theta$ can cause a large change in the policy, so a policy improvement is only guaranteed when the change in $\theta$ is very small. But how do we define "small"? How small is small enough?
To define "small", an intuitive approach is to measure the change in $\theta$ itself, e.g. $\|\delta\theta\|$; a better approach is to measure the distance between the policy distributions before and after the update, e.g. $D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta})$. Policy optimisation then becomes a constrained optimisation problem:
$$\begin{aligned}
&\text{find}\quad \delta\theta \\
&\text{maximize}\quad \eta(\pi_{\theta+\delta\theta}) \\
&\text{subject to}\quad D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \leq \epsilon
\end{aligned}$$

The natural tool for this kind of problem is the method of Lagrange multipliers:
$$L(\delta\theta, \gamma) = \eta(\pi_{\theta+\delta\theta}) - \gamma\left(D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})-\epsilon\right)$$

Setting its derivatives to zero:
$$\begin{aligned}
\nabla_{\delta\theta}L(\delta\theta,\gamma) &= \nabla_{\delta\theta}\,\eta(\pi_{\theta+\delta\theta})-\gamma\,\nabla_{\delta\theta}D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) = 0 \\
\nabla_{\gamma}L(\delta\theta,\gamma) &= \epsilon - D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) = 0
\end{aligned}$$

Next, we solve these equations approximately.
First, take a first-order Taylor expansion of $\eta(\pi_{\theta+\delta\theta})$ around $\delta\theta = 0$:
$$\begin{aligned}
\eta(\pi_{\theta+\delta\theta}) &\approx \eta(\pi_\theta)+\delta\theta^T\nabla_\theta\eta(\pi_\theta) \\
\nabla_{\delta\theta}\,\eta(\pi_{\theta+\delta\theta}) &\approx \nabla_\theta\eta(\pi_\theta)
\end{aligned}$$

Similarly, take a second-order Taylor expansion of $D_{KL}$:
$$D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) \approx D_{KL}(\pi_\theta||\pi_\theta)+\delta\theta^T\nabla_{\delta\theta} D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})\big|_{\delta\theta=0}+\frac{1}{2}\delta\theta^T\nabla_{\delta\theta}^2 D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta})\big|_{\delta\theta=0}\,\delta\theta$$

where the derivatives are taken with respect to $\delta\theta$ and evaluated at $\delta\theta = 0$. Recall that for two distributions $P(x)$ and $Q(x)$, the KL divergence is:
$$\begin{aligned}
D_{KL}(P||Q) &= \int_{-\infty}^{\infty} P(x)\log\frac{P(x)}{Q(x)}\,dx \\
&= \mathbb{E}_{x\sim P(x)}\!\left[\log\frac{P(x)}{Q(x)}\right]
\end{aligned}$$

Then
$$\begin{aligned}
\nabla_\theta D_{KL}(\pi_{\theta_0}||\pi_{\theta})&= \nabla_\theta\,\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\log \frac{\pi_{\theta_0}}{\pi_{\theta}}\right] \\
&= -\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\nabla_\theta \log\pi_{\theta}\right] \\
&= -\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\frac{1}{\pi_{\theta_0}}\nabla_\theta \pi_{\theta}\right] \\
&= -\int_{\mathcal{A}} \pi_{\theta_0}\,\frac{1}{\pi_{\theta_0}}\nabla_\theta\pi_\theta \\
&= -\nabla_\theta \int_{\mathcal{A}}\pi_\theta \\
&= 0
\end{aligned}$$

Here, in line with the Taylor expansion above, the derivatives are evaluated at $\theta = \theta_0$, which is why $\frac{1}{\pi_\theta}\nabla_\theta\pi_\theta$ can be written as $\frac{1}{\pi_{\theta_0}}\nabla_\theta\pi_\theta$. Similarly,
$$\begin{aligned}
\nabla_\theta^2 D_{KL}(\pi_{\theta_0}||\pi_{\theta})&= -\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\nabla^2_\theta\log \pi_{\theta}\right] \\
&= -\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\nabla_\theta \left(\frac{\nabla_\theta \pi_{\theta}}{\pi_{\theta_0}}\right)\right] \\
&= -\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\frac{\nabla^2_\theta \pi_{\theta}\, \pi_{\theta_0} - \nabla_\theta\pi_\theta \nabla_\theta\pi_\theta^T}{\pi_{\theta_0}^2}\right] \\
&= \mathbb{E}_{a\sim \pi_{\theta_0}}\!\left[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T\right]
\end{aligned}$$

where the first term drops out because $\mathbb{E}_{a\sim\pi_{\theta_0}}\!\left[\frac{\nabla^2_\theta\pi_\theta}{\pi_{\theta_0}}\right] = \int_{\mathcal{A}}\nabla^2_\theta\pi_\theta = \nabla^2_\theta\int_{\mathcal{A}}\pi_\theta = 0$, leaving only the outer product of score vectors. We denote the Fisher Information Matrix
$$F_\theta = \mathbb{E}_{a\sim\pi_\theta}\!\left[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T\right]$$

Therefore
$$\begin{aligned}
D_{KL}(\pi_\theta || \pi_{\theta+\delta\theta}) &\approx \frac{1}{2}\delta\theta^T F_\theta\,\delta\theta \\
\nabla_{\delta\theta}D_{KL}(\pi_\theta||\pi_{\theta+\delta\theta}) &\approx F_\theta\,\delta\theta
\end{aligned}$$

Substituting these approximations into $\nabla L$ gives:
$$\begin{aligned}
\nabla_{\delta\theta}L(\delta\theta,\gamma)&=\nabla_\theta\eta(\pi_\theta) - \gamma F_\theta\,\delta\theta = 0 \\
\nabla_{\gamma}L(\delta\theta,\gamma) &= \frac{1}{2}\delta\theta^T F_\theta\,\delta\theta - \epsilon = 0
\end{aligned}$$

Then
$$\delta\theta = \frac{1}{\gamma}F^{-1}_\theta\nabla_\theta\eta(\pi_\theta)$$

Substituting this into the constraint $\frac{1}{2}\delta\theta^T F_\theta\,\delta\theta = \epsilon$ gives

$$\frac{1}{\gamma^2}\nabla_\theta\eta(\pi_\theta)^T F_\theta^{-1}F_\theta F_\theta^{-1}\nabla_\theta \eta(\pi_\theta) = \frac{1}{\gamma^2}\nabla_\theta \eta(\pi_\theta)^T F_\theta^{-1}\nabla_\theta \eta(\pi_\theta) = 2\epsilon$$

$$\delta\theta = \sqrt{\frac{2\epsilon}{\nabla_\theta \eta(\pi_\theta)^T F_\theta^{-1}\nabla_\theta \eta(\pi_\theta)}}\; F_\theta^{-1}\nabla_\theta\eta(\pi_\theta)$$

Writing the standard policy gradient as $g = \nabla_\theta \eta(\pi_\theta)$, we obtain the
Natural policy gradient:
$$g_N = \sqrt{\frac{2\epsilon}{g^T F_\theta^{-1}g}}\,F_\theta^{-1}g$$

So both the step size and the step direction of the natural policy gradient depend heavily on the Fisher information matrix, which makes understanding the Fisher information matrix essential. For a detailed analysis of Fisher information, see here; below I give a brief explanation.
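Before that, as a concrete aside, here is a minimal NumPy sketch of the update above. It is not from the paper: the names `scores`, `advantages`, `epsilon` and the `damping` ridge term are illustrative assumptions, the Fisher matrix is estimated empirically from sampled score vectors, and $g$ from score-advantage products.

```python
import numpy as np

def natural_policy_gradient_step(scores, advantages, epsilon=0.01, damping=1e-3):
    """Compute a natural policy gradient step from sampled data.

    scores:     (N, d) array, each row is grad_theta log pi(a_i | s_i)
    advantages: (N,)   array, advantage (or Q-value) estimate for each sample
    epsilon:    KL trust-region radius used for the step size
    damping:    small ridge term so the empirical Fisher is invertible
    """
    # Standard policy gradient estimate: g = E[ grad log pi * advantage ]
    g = (scores * advantages[:, None]).mean(axis=0)

    # Empirical Fisher information: F = E[ grad log pi grad log pi^T ]
    F = scores.T @ scores / scores.shape[0]
    F += damping * np.eye(F.shape[0])

    # Natural gradient direction F^{-1} g (solve instead of an explicit inverse)
    nat_dir = np.linalg.solve(F, g)

    # KL-based step size: sqrt(2 * epsilon / (g^T F^{-1} g))
    step_size = np.sqrt(2.0 * epsilon / (g @ nat_dir))

    return step_size * nat_dir  # = delta_theta

# Toy usage with random data, just to show the shapes involved
rng = np.random.default_rng(0)
scores = rng.normal(size=(256, 4))
advantages = rng.normal(size=256)
print(natural_policy_gradient_step(scores, advantages))
```

In practice the explicit solve is often replaced by conjugate gradient when $\theta$ is high-dimensional, but the formula being computed is the same.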
We derived earlier that:
$$\begin{aligned}
F_\theta &= \mathbb{E}_{a\sim\pi_\theta}\!\left[\nabla_\theta\log\pi_\theta \nabla_\theta\log\pi_\theta^T\right] \\
&\approx \nabla^2_\theta D_{KL}(\pi_{\theta_0}||\pi_\theta)
\end{aligned}$$

We call the first derivative of the policy log-likelihood the score function:
$$S_\theta = \nabla_\theta \log \pi_\theta$$

Then:
$$F_\theta = \mathbb{E}_{a\sim\pi_\theta}\!\left[S_\theta S_\theta^T\right]$$

When $\theta \in \mathbb{R}$ (a scalar parameter):
$$\begin{aligned}
Var(S_\theta) &= \mathbb{E}[S_\theta^2] - \mathbb{E}^2[S_\theta] \\
\mathbb{E}[S_\theta] &= \int_\mathcal{A}\pi_\theta\,\nabla_\theta\log\pi_\theta = \int_\mathcal{A}\nabla_\theta\pi_\theta = 0
\end{aligned}$$

Therefore
$$F_\theta = Var(S_\theta)$$

Now consider the score of a whole batch of collected data,
$$S_\theta = \nabla_\theta \log \pi_\theta(s_0, a_0)+\dots+\nabla_\theta \log \pi_\theta(s_N, a_N)$$

In other words, the Fisher information equals the variance of the score function. As we collect more data points and $N$ grows, the variance of this summed score $S_\theta$ clearly grows, and so does $F_\theta$. Put the other way around:
Fisher information measures how much data we have collected, in the sense of how accurate our estimation is.
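A standard textbook example (not from this article) makes the point concrete: for $N$ i.i.d. samples $x_i \sim \mathcal{N}(\theta, \sigma^2)$ with unknown mean $\theta$ and known variance $\sigma^2$,

$$\log p(x_{1:N}\mid\theta) = -\frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i-\theta)^2 + \text{const}, \qquad S_\theta = \frac{1}{\sigma^2}\sum_{i=1}^{N}(x_i-\theta), \qquad F_\theta = Var(S_\theta) = \frac{N}{\sigma^2}$$

so the Fisher information grows linearly with the number of samples, matching the intuition above.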
When deriving $\nabla^2_\theta D_{KL}$ we also indirectly proved that:
$$F_\theta = -\mathbb{E}_{a\sim \pi_\theta}\!\left[\nabla^2_\theta \log\pi_\theta\right]$$

In other words,
Fisher information measures the curvature of the log-likelihood function.
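As a quick numerical sanity check (a softmax policy over a discrete action set is assumed here purely for illustration), the two expressions for $F_\theta$, the outer-product form $\mathbb{E}[S_\theta S_\theta^T]$ and the curvature form $-\mathbb{E}[\nabla^2_\theta\log\pi_\theta]$, can be compared directly:

```python
import numpy as np

rng = np.random.default_rng(1)
logits = rng.normal(size=5)                 # theta: logits of a softmax policy over 5 actions
probs = np.exp(logits - logits.max())
probs /= probs.sum()                        # pi_theta(a) = softmax(theta)_a

d = len(logits)
eye = np.eye(d)

# Outer-product form: F = E_{a~pi}[ grad log pi(a) grad log pi(a)^T ],
# with grad_theta log pi(a) = e_a - probs for a softmax policy.
F_outer = sum(probs[a] * np.outer(eye[a] - probs, eye[a] - probs) for a in range(d))

# Curvature form: F = -E_{a~pi}[ Hessian of log pi(a) ].
# For a softmax policy the Hessian is -(diag(probs) - probs probs^T), independent of a.
F_curv = np.diag(probs) - np.outer(probs, probs)

print(np.allclose(F_outer, F_curv))         # True: both forms agree
```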
And when $\theta \in \mathbb{R}^n$ with $n > 1$, the same argument shows
$$F_\theta = \mathrm{Cov}(S_\theta)$$

With this in mind, let us look at the natural policy gradient again:
$$g_N = \alpha F_\theta^{-1}g$$

That is, the natural policy gradient is a correction of the standard policy gradient, and the correction is governed by the covariance matrix of the policy's score function. Only when $F_\theta = I$ is no correction applied. When two elements of $\theta$ have highly positively correlated scores, the correction shrinks the gradient along the corresponding directions, and conversely enlarges it. In other words:
The natural policy gradient corrects the standard policy gradient according to the correlations of its parameters as measured in the space of action distributions, so that the policy effectively iterates along decorrelated (orthogonal) directions of that space.
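A tiny numerical illustration of this claim (the numbers are made up): take two parameters whose scores are strongly positively correlated and compare the raw gradient with the natural direction $F_\theta^{-1}g$.

```python
import numpy as np

g = np.array([1.0, 1.0])                    # standard policy gradient

F_corr = np.array([[1.0, 0.9],              # the two parameters' scores are highly correlated
                   [0.9, 1.0]])
F_id = np.eye(2)                            # uncorrelated, unit-variance scores

print(np.linalg.solve(F_corr, g))           # ~[0.526, 0.526]: the shared direction is shrunk
print(np.linalg.solve(F_id, g))             # [1.0, 1.0]: with F = I, no correction
```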