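The Bellman expectation equation is one of the most basic and important concepts in reinforcement learning, but some of its details are hard to grasp, especially the part involving $\mathbb{E}_{\pi}$ (the expectation with respect to the policy $\pi$). After working through the derivations in the two articles "Understanding RL: The Bellman Equations" and "Derivation of Bellman's Equation", I have written up the derivation of the Bellman equation here.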
Given a finite set of states $S$ and a finite set of actions $A$, the state transition probability is

$$p(s'|s,a) = \Pr(S_{t+1}=s' \big| S_t=s, A_t=a)$$

and the reward distribution is

$$R(r|s,a,s') = \Pr(R_{t+1}=r \big| S_t=s, A_t=a, S_{t+1}=s')$$

Note that the reward is a distribution rather than a deterministic value. This is the root of some misunderstandings, because some articles only write its expectation.
Here we can combine the state transition and the reward distribution into a single joint probability

$$p(s',r|s,a) = \Pr(S_{t+1}=s', R_{t+1}=r \big| S_t=s, A_t=a)$$

By the product rule this is the transition probability times the reward distribution, $p(s',r|s,a) = p(s'|s,a)\,R(r|s,a,s')$.
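As a small sketch of how the joint probability relates to the two separate distributions, the snippet below marginalizes and conditions a toy table for one fixed $(s, a)$ pair. The state names, rewards, and probabilities are invented for illustration, as are the helper names `transition_prob` and `reward_dist`.

```python
from collections import defaultdict

# Hypothetical joint distribution p(s', r | s, a) for one fixed (s, a) pair.
# Keys are (next_state, reward); values are probabilities that sum to 1.
joint = {
    ("s1", 0.0): 0.2,
    ("s1", 1.0): 0.3,
    ("s2", 5.0): 0.5,
}

def transition_prob(joint):
    """Marginalize over r: p(s'|s,a) = sum_r p(s', r | s, a)."""
    p = defaultdict(float)
    for (s_next, _r), prob in joint.items():
        p[s_next] += prob
    return dict(p)

def reward_dist(joint):
    """Condition on s': R(r|s,a,s') = p(s', r | s, a) / p(s'|s,a)."""
    p_next = transition_prob(joint)
    return {
        (s_next, r): prob / p_next[s_next]
        for (s_next, r), prob in joint.items()
    }

p_next = transition_prob(joint)        # distribution over next states
r_given_next = reward_dist(joint)      # reward distribution given each next state
```

Multiplying `p_next[s_next]` by `r_given_next[(s_next, r)]` recovers the original joint entry, which is the product rule in the other direction.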
A policy maps each state in $S$ to a probability distribution over the actions in $A$, written $\pi(a|s) = \Pr(A_t=a \big| S_t=s)$.
The state value function is defined as

$$V_\pi(s) = \mathbb{E}_\pi\Big\{\sum_{k=0}^{\infty}{\gamma^{k}R_{t+k+1}} \Big| S_t=s\Big\}$$

and the action value function as

$$q_\pi(s,a) = \mathbb{E}_\pi\Big\{\sum_{k=0}^{\infty}{\gamma^{k}R_{t+k+1}} \Big| S_t=s, A_t=a\Big\}$$

where $\gamma$ is the discount factor. The sum starts at $k=0$, so the immediate reward $R_{t+1}$ is not discounted. The subscript $\pi$ on $\mathbb{E}_\pi$ means that all actions after the conditioning event are drawn from $\pi$.
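To make the quantity inside the expectation concrete, here is a minimal sketch that computes the discounted return for one finite reward sequence, using the standard convention that the sum starts at $k=0$ so that $R_{t+1}$ is weighted by $\gamma^0 = 1$. The function name `discounted_return` and the sample rewards are made up for illustration; the value functions average this quantity over all trajectories generated by $\pi$.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    total = 0.0
    # Accumulate from the last reward backwards: G = r + gamma * G_next.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3} observed along one trajectory.
g = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```

The backward accumulation uses the same recursion, return equals immediate reward plus discounted future return, that the Bellman equation applies in expectation.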
According to the well-known law of total expectation and the state transition diagram, conditioning on the next state $s'$ and the reward $r$ gives
$$q_\pi(s,a) = \sum_{s',r}{[r+\gamma V_\pi(s')]\,p(s',r|s,a)}$$
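The intermediate steps behind this identity can be sketched as follows. Here $G_t = \sum_{k=0}^{\infty}\gamma^{k}R_{t+k+1}$ is shorthand for the discounted return (a symbol introduced only for this sketch), which satisfies $G_t = R_{t+1} + \gamma G_{t+1}$:

$$
\begin{aligned}
q_\pi(s,a) &= \mathbb{E}_\pi\{R_{t+1} + \gamma G_{t+1} \big| S_t=s, A_t=a\} \\
&= \sum_{s',r} p(s',r|s,a)\,\Big[r + \gamma\,\mathbb{E}_\pi\{G_{t+1} \big| S_t=s, A_t=a, S_{t+1}=s', R_{t+1}=r\}\Big] \\
&= \sum_{s',r} p(s',r|s,a)\,[r + \gamma V_\pi(s')]
\end{aligned}
$$

The second line is the law of total expectation over $(s', r)$. The third line relies on the Markov property: once $S_{t+1}=s'$ is known, the future return $G_{t+1}$ no longer depends on $s$, $a$, or $r$, so the inner expectation is $V_\pi(s')$.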
$$
\begin{aligned}
V_\pi(s) &= \sum_a{\pi(a|s) \sum_{s',r}{[r+\gamma V_\pi(s')]\,p(s',r|s,a)}} \\
&= \sum_a{\pi(a|s)\, q_\pi(s,a)}
\end{aligned}
$$
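One way to check these two equations numerically is to iterate the right-hand side of the $V_\pi$ equation until it stops changing (iterative policy evaluation), then compare $V_\pi(s)$ against $\sum_a \pi(a|s)\,q_\pi(s,a)$. The sketch below assumes a tiny two-state, two-action MDP whose dynamics, rewards, policy, and discount factor are all invented for illustration. The names `dynamics`, `policy`, `q_from_v`, and `evaluate_policy` are hypothetical.

```python
GAMMA = 0.9  # assumed discount factor

# Hypothetical MDP: dynamics[(s, a)] lists (probability, next_state, reward),
# i.e. the nonzero entries of p(s', r | s, a).
dynamics = {
    (0, 0): [(0.8, 0, 1.0), (0.2, 1, 0.0)],
    (0, 1): [(1.0, 1, 2.0)],
    (1, 0): [(0.5, 0, 0.0), (0.5, 1, -1.0)],
    (1, 1): [(1.0, 0, 3.0)],
}

# Hypothetical stochastic policy: policy[s][a] = pi(a|s).
policy = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.3, 1: 0.7}}

def q_from_v(v, s, a):
    """q(s,a) = sum_{s',r} [r + gamma * V(s')] * p(s', r | s, a)."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in dynamics[(s, a)])

def evaluate_policy(tol=1e-12, max_iters=10_000):
    """Apply V(s) <- sum_a pi(a|s) q(s,a) repeatedly until the update is below tol."""
    v = {s: 0.0 for s in policy}
    for _ in range(max_iters):
        new_v = {
            s: sum(pi_a * q_from_v(v, s, a) for a, pi_a in policy[s].items())
            for s in policy
        }
        delta = max(abs(new_v[s] - v[s]) for s in policy)
        v = new_v
        if delta < tol:
            break
    return v

v = evaluate_policy()
q = {(s, a): q_from_v(v, s, a) for (s, a) in dynamics}

# At the fixed point, V(s) matches the policy-weighted average of q(s, a).
for s in policy:
    weighted_q = sum(policy[s][a] * q[(s, a)] for a in policy[s])
    print(s, v[s], weighted_q)
```

Because the update is a contraction for a discount factor below 1, the loop converges regardless of the starting values; zero is used here only for convenience.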