This is an old paper from 1994, and a classic one on the complexity analysis of reinforcement learning algorithms. It proposes a reinforcement learning algorithm in the PAC framework that finds an $(\epsilon,\delta)$-optimal policy within a polynomial number of steps.

Basic Concepts of RL

  • States: $\mathcal{S}$

  • Actions: $\mathcal{A}$

  • Transition function: $\mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$

  • Reward function: $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow [0,1]$

  • Policy function: $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$

  • Value function: $V : \mathcal{S} \rightarrow \mathbb{R}$

  • Discount factor: $\gamma \in (0, 1]$

  • Initial state distribution: $d_0 : \mathcal{S} \rightarrow [0, 1]$

  • Trajectory: $\tau = (s_0, a_0, r_1, s_1, a_1, \dots)$ where $s_0 \sim d_0(s)$, $a_0 \sim \pi(s_0)$, $r_1 \sim \mathcal{R}(s_0,a_0)$, $s_1 \sim \mathcal{T}(s_0, a_0)$, $\dots$

  • Return: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_{t+1}$ (a small code sketch of these objects follows this list)
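
Below is a minimal Python sketch of these objects for a tabular MDP, just to fix notation. All names (`T`, `R_mean`, `pi`, `d0`, `sample_return`) are illustrative and not from the paper, and the reward is simplified to its mean value.

```python
import numpy as np

# Tabular sketch of the MDP objects above (illustrative names, not from the paper).
rng = np.random.default_rng(0)
N, K, gamma = 4, 2, 0.9                          # |S|, |A|, discount factor
T = rng.dirichlet(np.ones(N), size=(N, K))       # T[s, a, s']: transition probabilities
R_mean = rng.uniform(0.0, 1.0, size=(N, K))      # E[R(s, a)], rewards bounded in [0, 1]
pi = np.full((N, K), 1.0 / K)                    # pi[s, a]: uniform random policy
d0 = np.full(N, 1.0 / N)                         # initial state distribution

def sample_return(M: int) -> float:
    """Sample an M-step trajectory tau = (s_0, a_0, r_1, ..., s_M) and
    return its discounted return sum_t gamma^t r_{t+1}."""
    s = rng.choice(N, p=d0)
    ret = 0.0
    for t in range(M):
        a = rng.choice(K, p=pi[s])
        ret += gamma**t * R_mean[s, a]           # using the mean reward for simplicity
        s = rng.choice(N, p=T[s, a])
    return ret
```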

Optimal Policy:

$$\pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \{R(\tau)\}$$

Optimal Value:

$$V^*(s) = V^{\pi^*}(s) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau)\}$$

$$Q^*(s, a) = Q^{\pi^*}(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s, a_0=a} \{R(\tau)\}$$

Value function:

$$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau)\} = \sum_{a \in \mathcal{A}} \pi(s,a) \big\{ \mathbb{E}[\mathcal{R}(s,a)] + \gamma \sum_{s' \in \mathcal{S}} V^\pi(s')\mathcal{T}(s,a,s') \big\}$$
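
The Bellman equation above can be swept iteratively until convergence (standard iterative policy evaluation); a minimal sketch with toy tabular inputs, all names illustrative:

```python
import numpy as np

# Iterative policy evaluation: sweep the Bellman equation above until convergence.
rng = np.random.default_rng(0)
N, K, gamma = 4, 2, 0.9
T = rng.dirichlet(np.ones(N), size=(N, K))       # T[s, a, s']
R_mean = rng.uniform(size=(N, K))                # E[R(s, a)]
pi = np.full((N, K), 1.0 / K)                    # pi[s, a]

def evaluate(pi, T, R_mean, gamma, tol=1e-10):
    """V(s) <- sum_a pi(s,a) { E[R(s,a)] + gamma * sum_s' T(s,a,s') V(s') }, repeated until stable."""
    V = np.zeros(T.shape[0])
    while True:
        Q = R_mean + gamma * (T @ V)             # expected one-step reward + discounted next value
        V_new = (pi * Q).sum(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

V_pi = evaluate(pi, T, R_mean, gamma)
```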

Obviously:

For any policy $\pi$,

$$V^{\pi} \leq V^*$$

PAC Learner

Here we define the most important concept of the paper, the PAC learner (Probably Approximately Correct):

An algorithm $\mathcal{L}$ is a PAC learner if, for any environment $\mathcal{E} = (\mathcal{S, A, T, R}, d_0)$ and any $\epsilon > 0$, $\delta > 0$, $\gamma \in (0, 1]$, it produces a policy $\pi$ such that

$$\Pr \{ |V^{*}(s_0) - V^{\pi}(s_0)| \leq \epsilon \} \geq 1-\delta$$

in time polynomial in $N = |\mathcal{S}|$, $K = |\mathcal{A}|$, $r_{\max} = \max \mathcal{R}(s,a)$, $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$.

In other words, if an algorithm can, within a bounded amount of time, produce a policy $\pi$ whose value $V^{\pi}$ is within $\epsilon$ of $V^*$ with probability at least $1-\delta$, and that amount of time is polynomial in $N, K, r_{\max}, 1/\epsilon, 1/\delta, 1/(1-\gamma)$, then the algorithm is a PAC learner.
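
To make the quantifiers concrete, here is a tiny hypothetical helper (not from the paper) that empirically checks the $(\epsilon, \delta)$ criterion over independent runs of a learner, given the optimal value $V^*(s_0)$:

```python
from typing import Sequence

# Hypothetical helper (not from the paper): given V*(s0) and the values V^pi(s0)
# achieved by independently repeated runs of a learner, check whether the
# (epsilon, delta) success criterion above is met empirically.
def satisfies_pac(v_star: float, v_pi_runs: Sequence[float],
                  epsilon: float, delta: float) -> bool:
    successes = sum(abs(v_star - v) <= epsilon for v in v_pi_runs)
    return successes / len(v_pi_runs) >= 1.0 - delta
```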

PAC Learner Algorithm

  1. Record (see the sketch after this list):

    $m_t(i, a)$: the number of times action $a$ has been executed in state $i$ up to timestep $t$

    $n_{a,t}(i,j)$: the number of observed transitions from state $i$ to state $j$ under action $a$ up to timestep $t$

    $R_{a,t}(i,j)$: the total reward received on transitions from state $i$ to state $j$ under action $a$ up to timestep $t$

  2. Perform an $M$-step episode according to the exploration policy $\hat{\pi}^e$ and update the records.

  3. Once a threshold is reached, stop recording and compute the observed optimal policy $\hat{\pi}^*$;

    otherwise, repeat step 2.
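
A minimal sketch of the bookkeeping in step 1 for a tabular MDP, assuming integer state/action indices. The arrays mirror the paper's $m_t(i,a)$, $n_{a,t}(i,j)$, $R_{a,t}(i,j)$; the class and method names are mine.

```python
import numpy as np

# Bookkeeping for step 1 (illustrative names, not from the paper).
class Records:
    def __init__(self, N: int, K: int):
        self.m = np.zeros((N, K))      # m[i, a]: times action a was taken in state i
        self.n = np.zeros((K, N, N))   # n[a, i, j]: observed transitions i -> j under a
        self.R = np.zeros((K, N, N))   # R[a, i, j]: total reward on those transitions

    def update(self, i: int, a: int, r: float, j: int) -> None:
        """Record one observed step (s_t = i, a_t = a, r_{t+1} = r, s_{t+1} = j)."""
        self.m[i, a] += 1
        self.n[a, i, j] += 1
        self.R[a, i, j] += r

    def empirical_model(self):
        """Empirical transition probabilities and per-transition mean rewards."""
        T_hat = self.n / np.maximum(self.m.T[:, :, None], 1)   # (K, N, N)
        R_hat = self.R / np.maximum(self.n, 1)                 # (K, N, N)
        return T_hat, R_hat
```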

The paper then has to answer the following questions:

  1. How should the episode length $M$ be chosen?
  2. How should the exploration policy $\hat{\pi}^e$ be implemented?
  3. How should the stopping threshold be chosen?
  4. How should the optimal policy $\hat{\pi}^*$ be computed?

To answer these questions, we first need some theoretical groundwork.

Finite-Horizon Error Analysis of the Value Function

We first consider the MDP truncated at $M$ steps: $\tau = (s_0, a_0, r_1, \dots, s_M)$, $R(\tau) = \sum_{t=0}^{M-1}\gamma^t r_{t+1}$, $V_M^\pi(s_M) = 0$, $V_M^\pi(s_0) = \mathbb{E}_{\tau \sim (s_0, \pi)}\{R(\tau)\}$. Then $V^{\pi}(s_0) = \lim_{M\rightarrow \infty}V_M^{\pi}(s_0)$, and we have the following result:

For any $s_0 \in \mathcal{S}$,

$$|V^\pi(s_0) - V_M^{\pi}(s_0)| \leq \gamma^{M} r_{\max} / (1 - \gamma)$$

This is easy to prove: for the tail trajectory $\tau_{>M} = (s_M, a_M, r_{M+1}, s_{M+1}, \dots)$,

$$|V^\pi(s_0) - V_M^{\pi}(s_0)| = \gamma^M |\mathbb{E}[R(\tau_{>M})]| \leq \gamma^M v_{\max}, \quad v_{\max} = r_{\max} / (1 - \gamma)$$

We require

$$|V^\pi(s_0) - V_M^{\pi}(s_0)| \leq \gamma^M v_{\max} < \epsilon_c$$

so

$$M \geq \log_\gamma (\epsilon_c / v_{\max}) = \ln(v_{\max} / \epsilon_c) / \ln(1/\gamma)$$

Using $\ln x \leq x - 1$ for $x \in (0, 1]$ (applied to $x = \gamma$, so that $\ln(1/\gamma) \geq 1-\gamma$), it suffices to take

$$M \geq \ln(v_{\max} / \epsilon_c) / (1-\gamma)$$
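
A quick numerical check with made-up numbers ($r_{\max}=1$, $\gamma=0.9$, $\epsilon_c=0.01$) shows the simplified bound is only slightly more conservative than the exact one:

```python
import math

# Worked example with made-up numbers: r_max = 1, gamma = 0.9, epsilon_c = 0.01.
r_max, gamma, eps_c = 1.0, 0.9, 0.01
v_max = r_max / (1 - gamma)                                 # 10.0
M_exact = math.log(v_max / eps_c) / math.log(1 / gamma)     # ~65.6 steps
M_bound = math.log(v_max / eps_c) / (1 - gamma)             # ~69.1 steps (simplified bound)
print(math.ceil(M_exact), math.ceil(M_bound))               # 66 70
```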

For the $M$-step MDP with $\tau = (s_0, a_0, r_1, \dots, s_M)$, consider estimating the value function from $n$ sampled trajectories:

$$\hat{V}^{\pi}(s_0) = \frac{1}{n}\big(R(\tau_1)+R(\tau_2)+ \dots + R(\tau_n)\big), \quad \tau_1, \tau_2, \dots, \tau_n \sim (s_0, \pi)$$
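
How large $n$ must be is exactly the kind of question the analysis answers. As a generic illustration (not the paper's actual bound), a Hoeffding-style calculation for returns bounded in $[0, v_{\max}]$ looks like this:

```python
import math

# Generic Hoeffding-style sample-size calculation (an illustration, not the
# paper's exact bound): each M-step return lies in [0, v_max], so the average of
# n of them satisfies Pr{ |V_hat - V_M^pi(s0)| > eps_s } <= 2 exp(-2 n eps_s^2 / v_max^2).
def n_trajectories(v_max: float, eps_s: float, delta0: float) -> int:
    """Smallest n that drives the Hoeffding failure probability below delta0."""
    return math.ceil(v_max**2 * math.log(2.0 / delta0) / (2.0 * eps_s**2))

print(n_trajectories(v_max=10.0, eps_s=0.1, delta0=0.05))   # 18445
```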

We call $\hat{\pi}^*(s) = \arg\max_\pi \hat{V}^{\pi}(s)$ the observed optimal policy. Obviously,

$$\hat{V}^{\hat{\pi}^*}(s_t) \geq \hat{V}^{\pi}(s_t)$$

We define $\Delta^\pi(s_0) = |\hat{V}^{\pi}(s_0) - V^\pi(s_0)|$; then we have the following result:

If $\Delta^\pi(s_0) \leq \epsilon_s = \epsilon / 3$ for any $\pi$, then

$$|V^{\pi^*}(s_0) - V^{\hat{\pi}^*}(s_0)| \leq \epsilon$$

The proof is omitted here. In plain words: the value gap between the optimal policy and the observed optimal policy is no greater than three times the value-estimation error of the sampled policies.
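
A sketch of the standard argument (my reconstruction, not the paper's proof): split the gap at the empirical estimates and use the fact that $\hat{\pi}^*$ maximizes $\hat{V}$, so the middle term below is nonpositive; presumably the remaining $\epsilon/3$ of slack is what absorbs the finite-horizon truncation error $\epsilon_c$.

$$V^{\pi^*}(s_0) - V^{\hat{\pi}^*}(s_0) = \big(V^{\pi^*}(s_0) - \hat{V}^{\pi^*}(s_0)\big) + \big(\hat{V}^{\pi^*}(s_0) - \hat{V}^{\hat{\pi}^*}(s_0)\big) + \big(\hat{V}^{\hat{\pi}^*}(s_0) - V^{\hat{\pi}^*}(s_0)\big) \leq \epsilon_s + 0 + \epsilon_s \leq \epsilon$$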

Next, we state the following lemma:

For any given history of experiments, with probability no less than $1-MNK\delta_0$, we have for every policy $\pi$ that

$$\Delta_t^\pi(s) \leq \eta \hat{d}_t^\pi(s)$$

where $0 \leq t \leq M$, $N = |\mathcal{S}|$, $K = |\mathcal{A}|$.
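
(The $MNK\delta_0$ factor reads like a union bound over the $M$ timesteps, $N$ states, and $K$ actions; presumably one then picks $\delta_0 = \delta / (MNK)$ so that the overall failure probability stays below $\delta$.)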