Basic Concepts of RL

• States: $\mathcal{S}$

• Actions: $\mathcal{A}$

• Transition function: $\mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \rightarrow [0,1]$

• Reward function $\mathcal{R} : \mathcal{S} \times \mathcal{A} \times \mathbb{R} \rightarrow[0,1]$

• Policy function $\pi: \mathcal{S} \times \mathcal{A} \rightarrow [0, 1]$

• Value function $V : \mathcal{S} \rightarrow \mathbb{R}$

• Discount factor $\gamma \in (0, 1]$

• Initial state distribution $d_0 : \mathcal{S} \rightarrow [0, 1]$

• Trajectory: $\tau = (s_0, a_0, r_1, s_1, a_1, \dots)$ where $s_0 \sim d_0(s), a_0 \sim \pi(s_0), r_1 \sim \mathcal{R}(s_0,a_0), s_1 \sim \mathcal{T}(s_0, a_0), \dots$

• Return: $R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_{t+1}$
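The objects above can be made concrete with a minimal tabular sketch; the two-state, two-action MDP below and all its numbers are made up purely for illustration (rewards are treated as deterministic, equal to their expectations):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action MDP (illustrative numbers only).
N, K = 2, 2
T = np.array([[[0.9, 0.1], [0.2, 0.8]],      # T[s, a, s'] = Pr(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],                    # reward for (s, a), assumed deterministic
              [2.0, 0.0]])
d0 = np.array([1.0, 0.0])                    # initial state distribution
gamma = 0.9

def sample_trajectory(pi, M):
    """Roll out M steps: s0 ~ d0, a_t ~ pi(s_t, .), s_{t+1} ~ T(s_t, a_t, .)."""
    s = rng.choice(N, p=d0)
    traj = []
    for _ in range(M):
        a = rng.choice(K, p=pi[s])
        s_next = rng.choice(N, p=T[s, a])
        traj.append((s, a, R[s, a]))
        s = s_next
    return traj

def discounted_return(traj):
    """R(tau) = sum_t gamma^t r_{t+1}."""
    return sum(gamma**t * r for t, (_, _, r) in enumerate(traj))

uniform = np.full((N, K), 1 / K)             # pi(s, a) = 1/K for all s, a
```

Here a policy is stored as an $N \times K$ matrix of probabilities, matching $\pi : \mathcal{S} \times \mathcal{A} \rightarrow [0,1]$.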

Optimal Policy:

$\pi^* = \arg\max_\pi \mathbb{E}_{\tau \sim \pi} \{R(\tau) \}$

Optimal Value:

$V^*(s) = V^{\pi^*}(s) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau) \}$

$Q^*(s, a) = Q^{\pi^*}(s,a) = \max_\pi \mathbb{E}_{\tau \sim \pi, s_0 = s, a_0=a} \{R(\tau) \}$

Value function:

$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi, s_0 = s} \{R(\tau) \} = \sum_{a \in \mathcal{A}} \pi(s,a) \{ \mathbb{E}[\mathcal{R}(s,a)] + \gamma \sum_{s' \in \mathcal{S}} V^\pi(s')\mathcal{T}(s,a,s') \}$
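Since this Bellman equation is a contraction for $\gamma < 1$, $V^\pi$ can be found by fixed-point iteration. A sketch, reusing a hypothetical two-state MDP (all numbers illustrative):

```python
import numpy as np

# Hypothetical tabular MDP (illustrative numbers only).
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                  # E[R(s, a)]
gamma = 0.9
pi = np.full((2, 2), 0.5)                   # uniform policy pi(s, a)

def policy_evaluation(pi, tol=1e-10):
    """Iterate V <- sum_a pi(s,a) { E[R(s,a)] + gamma * sum_s' T(s,a,s') V(s') }."""
    V = np.zeros(T.shape[0])
    while True:
        # T @ V contracts over s': result[s, a] = sum_s' T[s, a, s'] V[s']
        V_new = np.einsum('sa,sa->s', pi, R + gamma * (T @ V))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```

The `einsum` computes the outer sum over actions weighted by $\pi(s,a)$, mirroring the equation term by term.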

Obviously, for any policy $\pi$,

$\quad V^{\pi} \leq V^*$

PAC Learner

An algorithm $\mathcal{L}$ is a PAC learner if, for any environment $\mathcal{E} = (\mathcal{S, A, T, R}, d_0)$ and any $\epsilon > 0, \delta > 0, \gamma \in (0, 1]$, it produces a policy $\pi$ such that:

$\Pr \{ |V^{*}(s_0) - V^{\pi}(s_0)| \leq \epsilon\}\geq 1-\delta$

in time polynomial in $N = |\mathcal{S}|$, $K=|\mathcal{A}|$, $r_{\max} = \max_{s,a} \mathcal{R}(s,a)$, $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$

PAC Learner Algorithm

1. Record:

$m_t(i, a)$: number of times action $a$ has been executed in state $i$ by timestep $t$

$n_{a,t}(i,j)$: number of observed transitions from state $i$ to $j$ under action $a$ by timestep $t$

$R_{a,t}(i,j)$: total reward received on transitions from state $i$ to $j$ under action $a$ by timestep $t$

2. Perform an $M$-step episode according to the exploration policy $\hat{\pi}^e$, and update the recorded statistics.

3. Upon reaching a threshold, stop recording and compute the locally optimal policy $\hat{\pi}^*$;

Otherwise, repeat step 2.
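The bookkeeping in steps 1–3 might be sketched as follows; the `PACRecorder` class and its array layout are illustrative assumptions, not the algorithm's canonical form:

```python
import numpy as np

class PACRecorder:
    """Record counts m(i, a), n_a(i, j), and total rewards R_a(i, j) (step 1)."""

    def __init__(self, N, K):
        self.m = np.zeros((N, K), dtype=int)     # m(i, a): visits to (i, a)
        self.n = np.zeros((K, N, N), dtype=int)  # n_a(i, j): transitions i -> j under a
        self.r = np.zeros((K, N, N))             # total reward on i -> j under a

    def record(self, i, a, reward, j):
        """Update statistics after one observed step (i, a, reward, j)."""
        self.m[i, a] += 1
        self.n[a, i, j] += 1
        self.r[a, i, j] += reward

    def empirical_model(self):
        """Empirical transition and mean-reward estimates from the counts.

        Rows with zero visits stay all-zero (no data, not a distribution).
        """
        T_hat = self.n / np.maximum(self.m.T[:, :, None], 1)  # T_hat[a, i, j]
        R_hat = self.r / np.maximum(self.n, 1)                # mean reward per (a, i, j)
        return T_hat, R_hat
```

The empirical model $(\hat{\mathcal{T}}, \hat{\mathcal{R}})$ is what the locally optimal policy $\hat{\pi}^*$ would then be computed against.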

1. How should the episode length $M$ be chosen?
2. How should the exploration policy $\hat{\pi}^e$ be implemented?
3. How should the stopping threshold be determined?
4. How should the optimal policy $\hat{\pi}^*$ be computed?

Finite-Horizon Error Analysis of the Value Function

For any $s_0 \in \mathcal{S}$,

$|V^\pi(s_0) - V_M^{\pi}(s_0) | \leq \gamma^{M}r_{\max } / (1 - \gamma)$

$|V^\pi(s_0) - V_M^{\pi}(s_0) | = \gamma^M |\mathbb{E} [ R(\tau_{>M}) ]| \leq \gamma^M v_{\max }, \quad v_{\max } = r_{\max } / (1 - \gamma)$

$|V^\pi(s_0) - V_M^{\pi}(s_0) | \leq \gamma^M v_{\max} < \epsilon_c$

$M \geq \log_\gamma (\epsilon_c / v_{\max}) = \ln(v_{\max } / \epsilon_c) / \ln(1/\gamma)$

Using $\ln x \leq x - 1$ for $x \in (0, 1]$ with $x = \gamma$, we get $\ln(1/\gamma) \geq 1 - \gamma$, so it suffices to take

$M \geq \ln(v_{\max} / \epsilon_c) /(1-\gamma)$
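As a quick numeric check (the values of $\gamma$, $r_{\max}$, $\epsilon_c$ below are illustrative assumptions), the exact horizon $\log_\gamma(\epsilon_c / v_{\max})$ can be compared with the simpler $1/(1-\gamma)$ bound:

```python
import math

# Illustrative parameters, not from the notes.
gamma, r_max, eps_c = 0.9, 1.0, 0.01
v_max = r_max / (1 - gamma)                          # = 10

M_exact = math.log(v_max / eps_c) / math.log(1 / gamma)  # log_gamma(eps_c / v_max)
M_bound = math.log(v_max / eps_c) / (1 - gamma)          # simpler sufficient bound

# Since ln(1/gamma) >= 1 - gamma, the bound is always at least the exact horizon.
```

Here the bound ($\approx 70$ steps) is only slightly looser than the exact requirement ($\approx 66$ steps), at the gain of a much simpler expression in $1/(1-\gamma)$.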

$\hat{V}^{\pi}(s_0) = \frac{1}{N}(R(\tau_1)+R(\tau_2)+ \dots+ R(\tau_N)), \quad \tau_1, \tau_2, \dots, \tau_N \sim (s_0, \pi)$
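This Monte-Carlo estimator averages $N$ truncated returns; a sketch, where the tiny MDP and the horizon $M$ are stand-in assumptions for the real environment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical MDP (illustrative numbers only); rewards assumed deterministic.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[s, a, s']
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])      # reward for (s, a)
gamma, M = 0.9, 70                          # M chosen per the horizon bound above

def truncated_return(s, pi):
    """One M-step rollout from state s under pi, returning sum_t gamma^t r_{t+1}."""
    total = 0.0
    for t in range(M):
        a = rng.choice(2, p=pi[s])
        total += gamma**t * R[s, a]
        s = rng.choice(2, p=T[s, a])
    return total

def v_hat(s0, pi, n_traj=1000):
    """V_hat(s0) = (1/N) * sum of N sampled truncated returns."""
    return np.mean([truncated_return(s0, pi) for _ in range(n_traj)])

pi = np.full((2, 2), 0.5)                   # uniform policy
```

The estimate carries both truncation error (bounded by $\gamma^M v_{\max}$) and sampling error, which is what the concentration argument below controls.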

Since $\hat{\pi}^*$ maximizes the estimated value, for any policy $\pi$,

$\hat{V}^{\hat{\pi}^*}(s_t) \geq \hat{V}^{\pi}(s_t)$

If $\Delta^\pi (s_0) = |V^\pi(s_0) - \hat{V}^\pi(s_0)| \leq \epsilon_s = \epsilon / 3$ for any $\pi$, then

$| V^{\pi^*}(s_0) - V^{\hat{\pi}^*}(s_0) | \leq \epsilon$

For any given history of experiments, with probability no less than $1-MNK\delta_0$, we have for every policy $\pi$ that

$\Delta_t^\pi(s) \leq \eta \hat{d}_t^\pi(s)$

where $0 \leq t \leq M, N = |\mathcal{S}|, K=|\mathcal{A}|$