Bellman equation in RL

2025-02-02

Bellman equation

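The Bellman expectation equation is a fundamental and important concept in reinforcement learning, but some of its details are hard to understand, especially the part involving $\mathbb{E}_{\pi}$ (the expectation with respect to the policy $\pi$). After consulting the derivations in the two articles Understanding RL: The Bellman Equations and Derivation of Bellman’s Equation, I have collected the derivation of the Bellman equation here.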

Definition

Given finite sets of states ($S$) and actions ($A$), the state transition probability is

$$p(s'|s,a) = \Pr(S_{t+1}=s' \big| S_t=s, A_t=a)$$

and the reward distribution is

$$R(r|s,a,s') = \Pr(R_{t+1}=r \big| S_t=s, A_t=a, S_{t+1}=s')$$

Notice that the reward is actually a distribution rather than a deterministic value. This is the root of some misunderstanding, because some articles only write its expectation.

Here we can combine the state transition and the reward distribution into a single joint probability

$$p(s',r|s,a) = \Pr(S_{t+1}=s', R_{t+1}=r \big| S_t=s, A_t=a)$$
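To make the link between the joint distribution and its two factors concrete, here is a minimal Python sketch. The dynamics table `joint` and the helper names `transition_prob` and `reward_dist` are hypothetical, chosen only for illustration; they cover a single fixed state-action pair.

```python
# Hypothetical joint dynamics p(s', r | s, a) for one fixed (s, a) pair.
# Keys are (next_state, reward); values are probabilities that sum to 1.
joint = {
    ("s0", 0.0): 0.1,
    ("s0", 1.0): 0.3,
    ("s1", 0.0): 0.4,
    ("s1", 5.0): 0.2,
}

def transition_prob(joint, s_next):
    """p(s' | s, a): marginalize the joint distribution over rewards."""
    return sum(p for (sn, _r), p in joint.items() if sn == s_next)

def reward_dist(joint, s_next):
    """R(r | s, a, s'): condition the joint distribution on the next state."""
    p_next = transition_prob(joint, s_next)
    return {r: p / p_next for (sn, r), p in joint.items() if sn == s_next}

p_s1 = transition_prob(joint, "s1")      # probability of landing in s1
r_given_s1 = reward_dist(joint, "s1")    # reward distribution given s' = s1
# Expected reward given s' = s1, the quantity some articles write directly.
expected_r = sum(r * p for r, p in r_given_s1.items())
```

The expected reward is a single number derived from `r_given_s1`, which is why writing only the expectation hides the fact that the reward itself is random.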

A policy $\pi(a|s)$ maps each state in $S$ to a probability distribution over the actions in $A$.

The value function is defined as

$$V_\pi(s) = \mathbb{E}_\pi\Big\{\sum_{k=0}^{\infty}{\gamma^{k}R_{t+k+1}}\big| S_t=s\Big\}$$

and the action value function as

$$q_\pi(s,a) = \mathbb{E}_\pi\Big\{\sum_{k=0}^{\infty}{\gamma^{k}R_{t+k+1}}\big| S_t=s, A_t=a\Big\}$$

The sum starts at $k=0$, so the immediate reward $R_{t+1}$ is not discounted.
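The quantity inside both expectations is a discounted sum of rewards. As a small illustration, the sketch below computes it for one finite sample of rewards, using the convention that the immediate reward $R_{t+1}$ has weight $\gamma^0 = 1$. The function name `discounted_return` and the sample values are assumptions made for this example.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma**k * rewards[k], where rewards[k] stands for R_{t+k+1}."""
    g = 0.0
    # Accumulate backwards: each step applies G = R + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 0.0, 2.0]                    # hypothetical R_{t+1}, R_{t+2}, R_{t+3}
g = discounted_return(rewards, gamma=0.5)    # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

The value function is the average of this quantity over all trajectories that start in $s$ and follow $\pi$; a single sampled return is only one draw from that distribution.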

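According to the well-known law of total expectation, applied over the possible next states and rewards along each transition, the action value function and the value function can be written as follows.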

$$q_\pi(s,a) = \sum_{s',r}{[r+\gamma V_\pi(s')]p(s',r|s,a)}$$
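For readers who want the intermediate steps, one possible route to this expression is sketched below. It introduces the symbol $G_t$ for the discounted return $\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$, which is not used in the original text, and it assumes the Markov property, so that the future return depends on the past only through $S_{t+1}$.

```latex
\begin{aligned}
q_\pi(s,a)
  &= \mathbb{E}_\pi\{G_t \mid S_t=s, A_t=a\} \\
  &= \mathbb{E}_\pi\{R_{t+1} + \gamma G_{t+1} \mid S_t=s, A_t=a\} \\
  &= \sum_{s',r} p(s',r|s,a)\,\big[r + \gamma\,\mathbb{E}_\pi\{G_{t+1} \mid S_{t+1}=s'\}\big] \\
  &= \sum_{s',r} [r + \gamma V_\pi(s')]\,p(s',r|s,a)
\end{aligned}
```

The third line is where the law of total expectation is applied: the expectation is split over every possible pair $(s', r)$, weighted by $p(s',r|s,a)$. The last line uses the definition of $V_\pi$ at time $t+1$.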

$$
\begin{aligned}
V_\pi(s) &= \sum_a{\pi(a|s) \sum_{s',r}{[r+\gamma V_\pi(s')]p(s',r|s,a)}} \\
&= \sum_a{\pi(a|s) q_\pi(s,a)}
\end{aligned}
$$
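The two equations can be checked numerically. The sketch below uses iterative policy evaluation, a standard technique that is not part of the derivation above, to find $V_\pi$ on a small MDP and then measures how closely the result satisfies the Bellman expectation equation. The MDP in `dynamics`, the policy `pi`, the discount `gamma`, and the function names are all hypothetical and chosen only for illustration.

```python
# Hypothetical 2-state, 2-action MDP.
# dynamics[s][a] lists (prob, next_state, reward), i.e. p(s', r | s, a).
dynamics = {
    0: {0: [(1.0, 0, 0.0)],
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)],
        1: [(1.0, 1, 0.5)]},
}
# Hypothetical stochastic policy: pi[s][a] = pi(a | s).
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
gamma = 0.9

def q_value(V, s, a):
    """q_pi(s, a) = sum over (s', r) of p(s', r | s, a) * [r + gamma * V(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])

def evaluate_policy(tol=1e-10, max_sweeps=10_000):
    """Repeat V(s) <- sum_a pi(a|s) q(s, a) until the largest update is below tol."""
    V = {s: 0.0 for s in dynamics}
    for _ in range(max_sweeps):
        new_V = {s: sum(pi[s][a] * q_value(V, s, a) for a in dynamics[s])
                 for s in dynamics}
        delta = max(abs(new_V[s] - V[s]) for s in dynamics)
        V = new_V
        if delta < tol:
            break
    return V

V = evaluate_policy()
# Residual of the Bellman expectation equation at the returned V.
residual = max(abs(V[s] - sum(pi[s][a] * q_value(V, s, a) for a in dynamics[s]))
               for s in dynamics)
```

Because `gamma` is below 1, each sweep shrinks the error by at least that factor, so `residual` ends up close to zero and `V` is a fixed point of the equation for $V_\pi(s)$ up to the chosen tolerance.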
