Distributional Reinforcement Learning

The papers in this branch are quite interesting. As we know, in many RL problems, once any one of $\mathcal{P}(s'|s, a)$, $\mathcal{R}(s, a)$, $\pi(s, a)$ is non-deterministic, $Q(s, a)$ must also be non-deterministic; yet in the algorithms discussed so far, the $Q(s, a)$ output by the DNN is a single deterministic value, so the target Q value inevitably oscillates during training. The papers in this section all revolve around this problem.

A Distributional Perspective on Reinforcement Learning

C51's solution is simple: treat the value distribution of $Q(s,a)$ as a categorical distribution over 51 evenly spaced support points in $[-10, 10]$, so that the network output for $Q(s,a)$ becomes a probability distribution. How, then, is the Bellman equation implemented in this setting?

$$\begin{aligned}Q^\pi(s,a) &= r(s,a) + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\sum_{a'\in\mathcal{A}}\pi(s',a')Q^\pi(s',a')\\&= r(s, a) + \gamma \mathbb{E}_{s', a'}[Q^\pi(s',a')]\end{aligned}$$

If $Q^\pi(s', a')$ is a distribution, then the RHS of the Bellman equation, $r(s, a) + \gamma Q^\pi(s', a')$, is a transformation of that distribution, as illustrated in the figure below from the original paper:

Its corresponding pseudocode is as follows:

Let us go through the pseudocode line by line. First, the algorithm fixes a value range for $Q(s,a)$, say $[-10, 10]$, and splits it evenly into 50 intervals, giving the 51 endpoints $\{-10, -9.6, -9.2, \dots, 9.6, 10\}=\{z_i\}$, which the paper calls the 51 supports $\{z_i\}$; this is also where the name C51 comes from (51 Categories). The output dimension of the Q network grows from $|\mathcal{A}|$ to $|\mathcal{A}|\times 51$, and a softmax is taken over each action's $1\times 51$ array to obtain the categorical probabilities $\{p_i(s, a)\}$, so that $Q(s, a)=\sum_i z_i p_i(s, a)$.
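
As a minimal sketch of this output head (the variable names are mine, not taken from the paper's code), the support and the expected Q value for a single action can be computed as:

import torch

num_atoms, v_min, v_max = 51, -10.0, 10.0
z = torch.linspace(v_min, v_max, num_atoms)   # 51 support points {z_i}: -10, -9.6, ..., 9.6, 10

logits = torch.randn(num_atoms)               # stand-in for the network output of one action
p = logits.softmax(dim=-1)                    # categorical probabilities {p_i(s, a)}
q_value = (z * p).sum()                       # Q(s, a) = sum_i z_i * p_i(s, a)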

Take $r = 1, \gamma = 0.99$ as an example: $\{z_i'\}=\{\mathcal{T}z_i\} = \{[r + \gamma z_i]_{-10}^{10}\} = \{-8.9, -8.504, \dots, 9.712, 10, 10, 10\}$. This is the step called computing the projection of $\mathcal{T}z_i$ onto $\{z_i\}$. Next, consider what it means to distribute the probability of $\mathcal{T}z_i$. Take $z_0' = -8.9$: it lies between $z_2 = -9.2$ and $z_3 = -8.8$, so the probability $p_0$ associated with $z_0'$ has to be split between $z_2$ and $z_3$. Since the distances from $z_0'$ to $z_2$ and $z_3$ are in the ratio $(9.2-8.9):(8.9-8.8)=3:1$, we assign $0.25$ of $p_0$ to $z_2$ and $0.75$ to $z_3$, and so on for the remaining atoms. Implementing this on a GPU is considerably more involved, however, because of the batched index bookkeeping; the sketch below shows the single-sample case.
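
The projection-and-distribution step can be vectorized. Below is a minimal single-sample sketch under my own naming (no batching, and a uniform next-state distribution used purely for illustration); the resulting m is the projected target used in the cross-entropy loss described next:

import torch

num_atoms, v_min, v_max = 51, -10.0, 10.0
delta_z = (v_max - v_min) / (num_atoms - 1)            # atom spacing, 0.4
z = torch.linspace(v_min, v_max, num_atoms)            # support {z_i}

r, gamma = 1.0, 0.99
p_next = torch.full((num_atoms,), 1.0 / num_atoms)     # placeholder for p_i(s', a*)

tz = (r + gamma * z).clamp(v_min, v_max)               # Tz_i = [r + gamma * z_i] clipped to [-10, 10]
b = (tz - v_min) / delta_z                             # fractional index of Tz_i on the support
l, u = b.floor().long(), b.ceil().long()
l[(l == u) & (u > 0)] -= 1                             # if Tz_i lands exactly on an atom, keep l != u
u[(l == u) & (l < num_atoms - 1)] += 1                 # (also handles the two boundary atoms)

m = torch.zeros(num_atoms)                             # projected target distribution
m.index_add_(0, l, p_next * (u.float() - b))           # share proportional to distance from the upper atom
m.index_add_(0, u, p_next * (b - l.float()))           # share proportional to distance from the lower atom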

After this operation, the RHS and LHS of the Bellman equation are categorical distributions over the same support, so the distance between the two distributions can be measured with the cross entropy, which serves as the loss.

Distributional Reinforcement Learning with Quantile Regression

QR-DQN makes a small modification on top of C51; the two are essentially two sides of the same coin. In C51 the support is fixed and evenly spaced and the Q network outputs the corresponding probabilities; in QR-DQN the probabilities are fixed and evenly spaced and the Q network outputs the support.

Compared with C51, QR-DQN is considerably simpler and cleaner at the code level. In the authors' standard setting, the Q network outputs an $|\mathcal{A}| \times 200$ array; each action's $1 \times 200$ array holds the 200 quantile values $\{z_i\}$ at evenly spaced quantile fractions, so that $Q(s, a) = N^{-1}\sum_{i=1}^N z_i$.
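
In code this head is just a reshape and a mean (a sketch with illustrative shapes, not the authors' implementation):

import torch

num_actions, N = 4, 200
out = torch.randn(num_actions * N)        # stand-in for the Q network output of one state
quantiles = out.view(num_actions, N)      # row a holds the N quantile values {z_i} of action a
q_values = quantiles.mean(dim=-1)         # Q(s, a) = (1/N) * sum_i z_i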

How, then, is the loss between quantile distributions computed? The figure below from the paper illustrates the idea. Unlike C51, which computes the cross entropy between categorical distributions sharing the same support, QR-DQN wants the quantile distribution to approximate the true return distribution, and measures the 1-Wasserstein distance between the two distributions' cumulative distribution functions (CDFs).

In the actual computation the authors use quantile regression, so let us briefly review what quantile regression is. Let the random variable $Y$ have CDF $F_Y(y) = P(Y \leq y)$. For a quantile fraction $\tau \in [0, 1]$, let the corresponding quantile value be $\bar y$; then $P(Y \leq \bar y) = \tau$, or equivalently $\bar y = q(\tau) = F_Y^{-1}(\tau) = \inf \{y: F_Y(y) \geq \tau \}$.

The question is: given $F_Y(y)$, how do we find the quantile value $F_Y^{-1}(\tau)$ for a fraction $\tau$? Define $L_\tau(u) = \mathbb{E}[(Y-u)(\tau-\mathbb{I}_{Y<u})]$; then

$$\begin{aligned}q(\tau) &= \arg \min_u L_\tau(u) \\&= \arg\min_u \mathbb{E}[(Y-u)(\tau-\mathbb{I}_{Y<u})] \\&= \arg \min_u \left\{ (\tau - 1)\int_{-\infty}^u(y - u)\,dF_Y(y) + \tau \int_{u}^\infty (y-u)\,dF_Y(y) \right\}\end{aligned}$$

where

$$\mathbb{I}_{Y < u} = \begin{cases}1, & Y < u \\ 0, & Y \geq u\end{cases}$$

Taking the derivative of $L_\tau(u)$ and setting it to zero gives

$$\begin{aligned}\frac{\partial L_\tau(u)}{\partial u} &= (1-\tau)\int_{-\infty}^{u}dF_Y(y) - \tau \int_{u}^{\infty}dF_Y(y) \\&= \int_{-\infty}^u dF_Y(y) - \tau \int_{-\infty}^{\infty}dF_Y(y) \\&= F_Y(u) -\tau \\&= 0\end{aligned}$$

i.e. $F_Y(u) = \tau$ and $u = F_Y^{-1}(\tau) = q(\tau)$. So by minimizing $L_\tau(u)$ we obtain the quantile value $q(\tau)$ for the fraction $\tau$.
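
As a quick numerical check of this fact (a sketch of my own, minimizing the sample version of $L_\tau(u)$ by gradient descent on draws from $\mathcal{N}(0,1)$):

import torch

torch.manual_seed(0)
y = torch.randn(100000)                               # samples of Y ~ N(0, 1)
tau = 0.9

u = torch.zeros(1, requires_grad=True)
opt = torch.optim.SGD([u], lr=0.05)
for _ in range(2000):
    opt.zero_grad()
    loss = ((y - u) * (tau - (y < u).float())).mean() # L_tau(u) = E[(Y - u)(tau - I{Y < u})]
    loss.backward()
    opt.step()

print(u.item())                      # close to 1.28, the 0.9-quantile of N(0, 1)
print(torch.quantile(y, tau).item()) # empirical 0.9-quantile of the samples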

If $F_Y(y)$ is a quantile (empirical) distribution with values $\{y_i\}$, the quantile regression loss becomes:

$$L_\tau(u) = (\tau - 1)\sum_{y_i < u}(y_i - u) + \tau \sum_{y_i\geq u}(y_i-u) = \sum_i \big|\tau - \mathbb{I}_{y_i<u}\big|\,\big|y_i-u\big|$$

Let us work through an example of the QR loss between two quantile distributions. Suppose we have a quantile distribution $\mathcal{Q}=\{-5, -2, 3, 5, 8\}$ with quantile fractions $\tau_Q=\{0.1, 0.3, 0.5, 0.7, 0.9\}$, and another quantile distribution $\mathcal{P} = \{-3, -1, 2, 4, 7\}$ with $\tau_P = \tau_Q$. The quantile regression loss between $\mathcal{P}$ and $\mathcal{Q}$ is

$$\begin{aligned}L_\tau(\mathcal{P}) =& \frac{1}{5}\big(|0.1 - 1|\times|-3+5| +|0.3-0|\times|-3+2| + |0.5-0|\times|-3-3|\\& + |0.7-0|\times|-3-5| + |0.9-0|\times|-3-8| + \\& |0.1-1|\times|-1+5| + |0.3-1|\times|-1+2| + |0.5-0|\times|-1-3|\\& + |0.7-0|\times|-1-5| + |0.9-0|\times|-1-8|+ \\& \dots+ \\& |0.1-1|\times|7+5|+|0.3-1|\times|7+2|+|0.5-1|\times|7-3|\\& +|0.7-1|\times|7-5|+|0.9-0|\times|7-8|\big) \\=& 18.8\end{aligned}$$

The corresponding code is:

import torch

p = torch.tensor([-3., -1., 2., 4., 7.])      # quantile distribution P
q = torch.tensor([-5., -2., 3., 5., 8.])      # quantile distribution Q
tau = torch.tensor([0.1, 0.3, 0.5, 0.7, 0.9]) # shared quantile fractions

diff = p.view(-1, 1) - q.view(1, -1)          # pairwise differences p_i - q_j, 5 x 5
weight = tau - (diff > 0).float()             # tau_j - I{p_i > q_j}
loss = diff.abs() * weight.abs()              # |tau_j - I| * |p_i - q_j|
loss = loss.sum(-1).mean(-1)                  # sum over j, average over i -> 18.8

Implicit Quantile Networks for Distributional Reinforcement Learning

IQN improves further on QR-DQN. In QR-DQN the quantile fractions $\tau$ are fixed and evenly spaced, so a natural question is whether $\tau$ can also be made variable. IQN's idea is to change the network itself, as shown in the figure below from the paper.

In addition to the RL state, the IQN network also takes a randomly sampled quantile fraction $\tau$ as input; this fraction is embedded into the network and combined with the features coming out of the convolution layers, and the network outputs the corresponding quantile value $q(\tau)$.

import torch
import torch.nn as nn
import numpy as np

class Network(nn.Module):
  # ......
  def forward(self, states, N):
    B = states.size(0)                     # batch size
    x = self.convs(states).view(B, -1)     # convolution layers -> flattened features

    taus = torch.rand(B, N, 1).to(x)       # sample N fractions per state from U(0, 1)
    pis = np.pi * torch.arange(1, self.num_cosines + 1).to(x).view(1, 1, -1)
    cosine = pis.mul(taus).cos().view(B * N, self.num_cosines)   # cos(pi * i * tau)

    tau_embed = self.cosine_emb(cosine).view(B, N, -1)   # embedding of the fractions
    state_embed = x.view(B, 1, -1)
    features = (tau_embed * state_embed).view(B * N, -1) # elementwise interaction

    q = self.dense(features).view(B, N, -1) # B x N x |A| quantile values
    return q, taus

I sketched the corresponding Python code above. On each forward pass the network produces a $B \times N$ set of fractions $\tau$ and the corresponding $B \times N \times |\mathcal{A}|$ quantile values; the rest is the same as QR-DQN. Many people are puzzled by the cosine embedding layer; it is really just the outcome of empirical tuning — as Figure 5 in the Appendix shows, the network simply performs better with it.

The paper spends considerable space on distortion risk measures. The authors had hoped to use a distortion function to adjust the policy's risk preference, but found that the identity distortion, i.e. the risk-neutral policy, works best overall. Let us briefly explain what a distortion risk measure is. First we need a distortion function $D: [0, 1] \rightarrow [0, 1]$ that is continuous and non-decreasing with $D(0) = 0$ and $D(1) = 1$. As before, for a random variable $Y$ with CDF $F_Y(y) = P(Y \leq y)$ and quantile function $q(\tau) = F_Y^{-1}(\tau) = \inf \{y: F_Y(y) \geq \tau \}, \tau \in [0, 1]$, the distortion risk measure is

$$\rho_D(Y) = \int_{0}^1 F_Y^{-1}(\tau)\,dD(\tau) = \int_{-\infty}^\infty y \, d\big(D \circ F_Y(y)\big)$$

Meanwhile, we know that

$$\mathbb{E}[Y] = \int_{-\infty}^\infty y \, dF_Y(y) = \int_{-\infty}^\infty y f_Y(y)\, dy$$

Comparing the two expressions, we see that the distortion risk measure is essentially the expectation taken after distorting the CDF, i.e. under $D \circ F_Y$. I have plotted the main distortion functions $D(\tau)$ mentioned in the paper below:

If the distribution being distorted is a normal distribution, the distorted distributions look like this:

Let us now look at why a convex distortion function yields a risk-seeking policy while a concave one yields a risk-averse policy. The convex $\text{Wang}(-0.75)$ gives a distorted distribution that is shifted to the right relative to the undistorted (identity) normal distribution: more weight is placed on high returns, so the policy optimistically favors actions with a large estimated upside even though that estimate may not be reliable — this is risk seeking. Conversely, the concave $\text{Wang}(0.75)$ shifts weight toward lower returns, so the policy pessimistically avoids actions whose lower tail looks bad — this is risk aversion. This is really the exploration-exploitation question again: a risk-seeking policy is more willing to explore unknown territory even though it may bring lower return. Whether the estimated return is the right signal for deciding when to explore is, however, debatable.
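
To make this concrete, here is a small numerical sketch (my own example, not from the paper) that approximates $\rho_D(Y) = \int_0^1 F_Y^{-1}(\tau)\,dD(\tau)$ for $Y \sim \mathcal{N}(0, 1)$, once with the identity distortion and once with the concave CVaR-style distortion $D(\tau) = \min(\tau/0.25, 1)$:

import torch

normal = torch.distributions.Normal(0.0, 1.0)

def distorted_expectation(D, n=10000):
    # rho_D(Y) = int_0^1 F_Y^{-1}(tau) dD(tau), approximated on a grid of tau values
    tau = torch.linspace(0.0, 1.0, n + 1)
    tau_hat = (tau[:-1] + tau[1:]) / 2                  # midpoints keep icdf away from 0 and 1
    weights = D(tau[1:]) - D(tau[:-1])                  # dD(tau) over each sub-interval
    return (normal.icdf(tau_hat) * weights).sum()

identity = lambda t: t                                  # risk-neutral
cvar_25 = lambda t: torch.clamp(t / 0.25, max=1.0)      # concave distortion (CVaR 0.25)

print(distorted_expectation(identity).item())   # ~0.0: the plain expectation
print(distorted_expectation(cvar_25).item())    # ~-1.27: mass shifted to low returns, i.e. risk averse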

Fully Parameterized Quantile Function for Distributional Reinforcement Learning

FQF improves on IQN. In IQN the quantile fractions $\tau$ are sampled randomly; FQF's idea is that $\tau$ should instead be produced from the input state. As the figure below shows, the authors argue that, compared with randomly sampled fractions, properly adjusted fractions clearly give a smaller 1-Wasserstein loss.

So the next question is how to adjust $\tau$. To answer it, we need a more systematic look at the Wasserstein distance.

For two distributions $U$ and $V$, their Wasserstein distance is given by:

$$W_p(U, V) = \left(\int_0^1\big|F_V^{-1}(\omega) - F_U^{-1}(\omega)\big|^p\,d\omega\right)^{1/p}$$

where $F_V^{-1}(\omega)$ is the inverse CDF of the distribution $V$, i.e. $F^{-1}_V(\omega) := \inf\{v \in \mathbb{R}: \omega \leq F_V(v) \}$.

The inverse CDF may sound a little convoluted here. Recall that the CDF is the integral of the PDF (probability density function), $F_X(x) = \int_{-\infty}^{x}f(t)\,dt$, and the inverse CDF is simply the inverse function of the CDF. In the figure below:

$f(x) = \mathcal{N}(0,1)$ is the standard normal density; the shaded area is $\omega$, so $F_X^{-1}(\omega) = -1$.
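
A two-line sanity check of this picture (a sketch using torch.distributions for the standard normal):

import torch

normal = torch.distributions.Normal(0.0, 1.0)
omega = normal.cdf(torch.tensor(-1.0))   # shaded area: omega = F_X(-1) ~ 0.1587
print(normal.icdf(omega).item())         # F_X^{-1}(omega) = -1.0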

As a special case, if $U = \{\theta_1, \dots, \theta_N\}$ is a quantile distribution with $\{\tau_i\}=\{F_U(\theta_i)\}$, $\tau_0 = 0$, and $p=1$, then

$$W_1(U, V) = \sum_{i=1}^N \int_{\tau_{i-1}}^{\tau_i} \big|F_V^{-1}(\omega) - \theta_i\big|\,d\omega$$

As in the CDF plot below, where $U=\{(\tau_i, \theta_i)\}$ with $\{\theta_i\} = \{-3, -2, -1, 5, 5.5\}$, $\{\tau_i\} = \{0.2, 0.4, \dots, 1.0\}$, $\tau_0 = 0$, and $V=\mathcal{N}(0, 1)$, the shaded area is exactly $W_1(U, V)$.
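
The shaded area can also be evaluated numerically. Below is a sketch of my own (a simple midpoint rule) that computes $W_1(U, V)$ for exactly this example:

import torch

normal = torch.distributions.Normal(0.0, 1.0)            # V = N(0, 1)
thetas = [-3.0, -2.0, -1.0, 5.0, 5.5]                    # support {theta_i} of U
taus = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]                    # tau_0 = 0 and the fractions {tau_i}

w1 = 0.0
m = 1000                                                 # sub-intervals per quantile segment
for i, theta in enumerate(thetas):
    # integrate |F_V^{-1}(omega) - theta_i| over [tau_{i-1}, tau_i] with the midpoint rule
    omega = torch.linspace(taus[i], taus[i + 1], m + 1)
    mid = (omega[:-1] + omega[1:]) / 2
    w1 += ((normal.icdf(mid) - theta).abs() * (omega[1:] - omega[:-1])).sum().item()
print(w1)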

The next question: if the support $\{\theta_i\}$ of the quantile distribution $U$ is adjustable, how should it be chosen to minimize $W_1$?

For any $\tau, \tau' \in [0, 1]$ with $\tau < \tau'$, and a CDF $F$ with inverse $F^{-1}$, the set of $\theta \in \mathbb{R}$ minimizing

$$L(\theta) = \int_{\tau}^{\tau'} \big|F^{-1}(\omega) - \theta\big|\, d\omega$$

is given by

$$\left\{\theta \in \mathbb{R} \,\middle|\, F(\theta) = \frac{\tau + \tau'}{2}\right\}$$

or equivalently,

$$\arg\min_\theta L(\theta) = F^{-1}\!\left(\frac{\tau+\tau'}{2}\right)$$

This part is easy to prove. First, clearly $L(\theta \mid F(\theta) < \tau) > L(\theta \mid F(\theta) = \tau)$ and $L(\theta \mid F(\theta) > \tau') > L(\theta \mid F(\theta) = \tau')$, so $\arg\min_\theta L(\theta) \in [F^{-1}(\tau), F^{-1}(\tau')]$. Then

$$\begin{aligned}L(\theta) &= \int_{\tau}^{F(\theta)}\big(\theta - F^{-1}(\omega)\big)\, d\omega + \int_{F(\theta)}^{\tau'}\big(F^{-1}(\omega) - \theta\big)\,d\omega \\&= \theta \big(F(\theta) - \tau + F(\theta) - \tau'\big) - \int_{\tau}^{F(\theta)}F^{-1}(\omega)\,d\omega - \int_{\tau'}^{F(\theta)}F^{-1}(\omega)\,d\omega\end{aligned}$$

$$\begin{aligned}\frac{\partial L(\theta)}{\partial\theta}&=2F(\theta) - \tau - \tau' + 2\theta \frac{\partial F(\theta)}{\partial\theta} - F^{-1}(F(\theta)) \frac{\partial F(\theta)}{\partial \theta} - F^{-1}(F(\theta)) \frac{\partial F(\theta)}{\partial \theta}\\&= 2F(\theta) - \tau - \tau'\end{aligned}$$

Setting the derivative to zero gives $F(\theta) = \frac{\tau + \tau'}{2}$, i.e. $\theta = F^{-1}\big(\frac{\tau+\tau'}{2}\big)$, which is exactly the claim above.
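
As a quick numerical check of this lemma (a sketch, using $F = \Phi$, the standard normal CDF, $\tau = 0.2$, $\tau' = 0.4$, and a simple grid search over $\theta$):

import torch

normal = torch.distributions.Normal(0.0, 1.0)
tau, tau_p = 0.2, 0.4

omega = torch.linspace(tau, tau_p, 2001)
mid = (omega[:-1] + omega[1:]) / 2                       # midpoint grid on [tau, tau']
d_omega = omega[1:] - omega[:-1]

thetas = torch.linspace(-2.0, 2.0, 4001)                 # candidate theta values
# L(theta) = int_tau^tau' |F^{-1}(omega) - theta| d omega, evaluated for every candidate
L = ((normal.icdf(mid).unsqueeze(0) - thetas.unsqueeze(1)).abs() * d_omega).sum(dim=1)
print(thetas[L.argmin()].item())                            # ~ -0.524
print(normal.icdf(torch.tensor((tau + tau_p) / 2)).item())  # F^{-1}(0.3) ~ -0.524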

Therefore, for the quantile distribution $U$ above with quantile fractions $\{\tau_0, \dots, \tau_N\}$, $\tau_0 = 0$, $\tau_N = 1$, $\tau_i \leq \tau_{i+1}$, and a continuous distribution $V$,

$$\arg\min_\theta W_1(U, V) = \left\{\theta_i \,\middle|\, F_V(\theta_i) = \frac{\tau_{i-1} + \tau_i}{2},\ i = 1, \dots, N\right\}$$

Writing $\hat{\tau}_i = (\tau_i + \tau_{i-1})/2$, we have

$$\min W_1 (U, V) = \sum_{i=1}^N\int_{\tau_{i-1}}^{\tau_{i}}\big|F_V^{-1}(\omega) - F_V^{-1}(\hat{\tau}_i)\big|\,d\omega$$

Now regard this minimized $W_1(U, V)$ as a function of the quantile fractions and write $L(\tau_i) = \min W_1(U, V)$; then

$$\frac{\partial L(\tau_i)}{\partial \tau_i} = 2F^{-1}_V(\tau_i) - F^{-1}_V(\hat{\tau}_i) - F^{-1}_V(\hat{\tau}_{i+1}), \qquad i = 1, \dots, N-1$$

The proof can be found in the paper's Appendix (beware that the indexing convention for $\hat\tau$ there differs slightly from ours); this is the core result of the FQF algorithm. It is also easy to see that

$$Q(s, a) = \sum_{i=1}^N(\tau_i - \tau_{i-1})\, F_V^{-1}(\hat{\tau}_i)$$

The rest of the paper follows IQN. I excerpt the corresponding code below to make the explanation easier.

class Network(nn.Module):
  ...

  def taus_prop(self, x):
    batch_size = x.size(0)
    log_probs = self.fraction_net(x).log_softmax(dim=-1) # Fraction Network: B X F => B X P
    probs = log_probs.exp()
    tau0 = torch.zeros(batch_size, 1).to(x)
    tau_1n = torch.cumsum(probs, dim=-1)

    taus = torch.cat((tau0, tau_1n), dim=-1) # B X (P + 1)
    taus_hat = (taus[:, :-1] + taus[:, 1:]).detach() / 2.0 # B X P
    entropies = probs.mul(log_probs).neg().sum(dim=-1, keepdim=True)
    return taus.unsqueeze(-1), taus_hat.unsqueeze(-1), entropies
  
  def calc_fqf_q(self, x):
    # x: raw states (B X C X H X W) or already-extracted conv features (B X F)
    if x.ndim == 4:
      convs = self.convs(x)
    else:
      convs = x
    # taus: B X (N+1) X 1, tau_hats: B X N X 1
    taus, tau_hats, _ = self.taus_prop(convs.detach())
    q_hats, _ = self.forward(convs, taus_hat=tau_hats)
    # Q(s, a) = sum_i (tau_i - tau_{i-1}) * F^{-1}(tau_hat_i)
    q = ((taus[:, 1:, :] - taus[:, :-1, :]) * q_hats).sum(dim=1)
    return q

  def forward(self, x, taus_hat=None):
    # x: raw states (B X C X H X W) or flattened conv features (B X F);
    # self.convs is assumed to return flattened B X F features
    if x.ndim == 4:
      x = self.convs(x)                    # convolution layers
    B = x.size(0)                          # batch size

    if taus_hat is None:
      _, taus_hat, _ = self.taus_prop(x.detach())

    N = taus_hat.size(1)

    pis = np.pi * torch.arange(1, self.num_cosines + 1).to(x).view(1, 1, self.num_cosines)
    cosine = pis.mul(taus_hat).cos().view(B * N, self.num_cosines)   # cos(pi * i * tau_hat)

    tau_embed = self.cosine_emb(cosine).view(B, N, -1)
    state_embed = x.view(B, 1, -1)
    features = (tau_embed * state_embed).view(B * N, -1)

    q = self.dense(features).view(B, N, -1) # B x N x |A|
    return q, taus_hat

class Agent:
  ...

  def step(self, states, next_states, actions, terminals, rewards):
    q_convs = self.model.convs(states)
    taus, tau_hats, _ = self.model.taus_prop(q_convs.detach())
    # q_hat: B X N X A
    q_hat, _ = self.model.forward(q_convs, taus_hat=tau_hats)
    q_hat = q_hat[self.batch_indices, :, actions]

    with torch.no_grad():
      q_next_convs = self.model_target.convs(next_states)
      q_next_ = self.model_target.calc_fqf_q(q_next_convs)
      a_next = q_next_.argmax(dim=-1)

      q_next, _ = self.model_target.forward(q_next_convs, taus_hat=tau_hats)
      q_next = q_next[self.batch_indices, :, a_next]
      q_target = rewards.unsqueeze(-1).add(
        self.cfg.discount ** self.cfg.n_step * (1 - terminals.unsqueeze(-1)) * q_next)

    # q_current_: B X 1 X N (prediction quantiles), q_target_: B X N X 1 (target samples)
    q_current_, q_target_ = q_hat.unsqueeze(1), q_target.unsqueeze(-1)
    diff = q_current_ - q_target_                                    # B X N X N
    # quantile Huber loss: |tau_hat - I{target < prediction}| * huber(diff)
    huber = nn.functional.smooth_l1_loss(diff, torch.zeros_like(diff), reduction='none')
    weight = (tau_hats.transpose(1, 2) - q_target_.lt(q_current_).float()).abs()
    loss = (weight * huber).sum(-1).mean(-1).mean()

    # Calculate Fraction Loss
    with torch.no_grad():
      # F_V^{-1}(tau_i) at the interior fractions tau_1 ... tau_{N-1}
      q, _ = self.model.forward(q_convs, taus_hat=taus[:, 1:-1])
      q = q[self.batch_indices, :, actions]
      # F^{-1}(tau_i) - F^{-1}(tau_hat_i), sign resolved element-wise below
      values_1 = q - q_hat[:, :-1]
      signs_1 = q.gt(torch.cat((q_hat[:, :1], q[:, :-1]), dim=1))

      # F^{-1}(tau_i) - F^{-1}(tau_hat_{i+1})
      values_2 = q - q_hat[:, 1:]
      signs_2 = q.lt(torch.cat((q[:, 1:], q_hat[:, -1:]), dim=1))
    gradients_of_taus = (torch.where(signs_1, values_1, -values_1) + torch.where(signs_2, values_2, -values_2)
      ).view(self.cfg.batch_size, self.cfg.N_fqf - 1)
    fraction_loss = (gradients_of_taus * taus[:, 1:-1, 0]).sum(dim=1).view(-1)

Everything before the fraction loss is straightforward: it simply replaces the randomly sampled fractions with the $\{\hat{\tau}_i\}$ produced by the fraction network. So let us focus on the Calculate Fraction Loss part, which is easy to follow with $\partial L(\tau_i)/\partial \tau_i$ in mind.

Here q computes $F^{-1}_V(\tau_i)$ at the interior fractions, while q_hat[:, :-1] and q_hat[:, 1:] correspond to $F^{-1}_V(\hat{\tau}_{i})$ and $F^{-1}_V(\hat{\tau}_{i+1})$; gradients_of_taus therefore computes

$$\partial L(\tau_i) /\partial \tau_i = \big(F^{-1}_V(\tau_i) -F^{-1}_V(\hat{\tau}_{i})\big) + \big(F^{-1}_V(\tau_i) -F^{-1}_V(\hat{\tau}_{i+1})\big)$$

At this point you may wonder why the code resolves the signs with torch.where instead of simply plugging in $2F^{-1}_V(\tau_i) - F^{-1}_V(\hat{\tau}_i) - F^{-1}_V(\hat{\tau}_{i+1})$ directly. The reason is simple: the $F_V^{-1}$ produced by the neural network is not strictly non-decreasing, so using the closed form directly would make the gradient inaccurate. To address this, later work proposed non-decreasing quantile networks, which we will not go into here.