Reinforcement Learning Note 1: Introduction


Textbook: Sutton and Barto, Reinforcement Learning: An Introduction

Bolei Zhou's RL course (in Chinese)

Coding

Framework: PyTorch

Differences from supervised learning. Supervised learning: (1) assumes samples are i.i.d. (no correlation between data points); (2) has labels. Reinforcement learning: data are not necessarily i.i.d.; there is no immediate feedback (delayed reward).

Exploration (trying new actions) & exploitation (taking the currently best-known action)

Features:

- Trial-and-error exploration
- Delayed reward
- Time matters (sequential, non-i.i.d. data)
- The agent's actions affect the subsequent data it receives

Compared with supervised learning, reinforcement learning can sometimes surpass human performance.

Possible rollout sequences: the different trajectories the agent could generate by interacting with the environment

Agent & environment

Rewards: a scalar feedback signal

Sequential decision making:

Trade-off between near-term and long-term rewards

Full observation & partial observation

RL Agent:

Components:

1. Policy: the agent's behavior function

Maps from a state/observation to an action.

Stochastic policy (probabilistic sampling): $\pi(a \mid s) = P[A_t = a \mid S_t = s]$

Deterministic policy: $a^* = \arg\max_a \pi(a \mid s)$
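The two kinds of policy can be illustrated with a tiny tabular sketch (my own illustration, not from the course; the array `pi` and the state/action sizes are made-up placeholders): sampling from $\pi(\cdot \mid s)$ gives a stochastic policy, while taking the argmax gives a deterministic one.

```python
import numpy as np

# Illustrative sketch: a tabular policy over a small discrete state/action space.
# pi[s] holds the action probabilities pi(a|s) for state s.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
pi = rng.dirichlet(np.ones(n_actions), size=n_states)  # each row sums to 1

def stochastic_action(s):
    # sample a ~ pi(.|s)
    return rng.choice(n_actions, p=pi[s])

def deterministic_action(s):
    # a* = argmax_a pi(a|s)
    return int(np.argmax(pi[s]))

s = 1
print(stochastic_action(s), deterministic_action(s))
```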

2. Value function:

The expected discounted sum of future rewards under a particular policy $\pi$.

The discount factor weights immediate vs. future rewards.

Used to quantify the goodness/badness of states and actions:

$v_{\pi}(s) \overset{\triangle}{=} E_\pi[G_t \mid S_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$

Q-function (used to select among actions):

$q_\pi(s, a) \overset{\triangle}{=} E_\pi[G_t \mid S_t = s, A_t = a]$
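As a quick numerical sketch (my own illustration, assuming a finite reward sequence and $\gamma = 0.9$), the return $G_t$ can be accumulated backwards through the rewards; $v_\pi(s)$ is then the expectation of this quantity over trajectories starting from $s$ under $\pi$.

```python
# Illustrative sketch: the discounted return G_t = sum_k gamma^k * R_{t+k+1}
# computed from a finite list of rewards.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):   # work backwards: G_t = R_{t+1} + gamma * G_{t+1}
        g = r + gamma * g
    return g

print(discounted_return([0, 0, 1]))  # 0 + 0.9*0 + 0.81*1 = 0.81
```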

3. Model

A model predicts what the environment will do next
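A minimal tabular sketch of such a model (illustrative only; `P`, `R`, and `model_step` are made-up placeholders, not a real environment): it predicts the next state via a transition distribution $P(s' \mid s, a)$ and the reward via $R(s, a)$.

```python
import numpy as np

# Illustrative sketch: a tabular model of the environment.
# P[s, a, s'] is the transition probability, R[s, a] the expected reward.
rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # each P[s, a] sums to 1
R = rng.normal(size=(n_states, n_actions))

def model_step(s, a):
    # the model "predicts what the environment will do next"
    s_next = rng.choice(n_states, p=P[s, a])
    return s_next, R[s, a]

print(model_step(0, 1))
```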

Types of RL Agents based on what the Agent Learns
Value-based agent

Explicitly learns the value function; the policy is implicit

Policy-based agent

Explicitly learns the policy; no value function

Actor-critic agent

Combines the policy and the value function

Types of RL Agents based on whether there is a Model
Model-based

Learns the environment model (state transitions) directly

Model-free

Learns the value function / policy function directly

No model of the environment is learned

Exploration and Exploitation

Exploration: trial and error (trying new actions)

Exploitation: choosing the best-known option given current knowledge
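A common way to trade these off is epsilon-greedy action selection. Here is a minimal sketch (my own example; `epsilon_greedy` and `q_values` are made-up names, assuming Q-value estimates over a discrete action set): with probability $\epsilon$ explore a random action, otherwise exploit the current best.

```python
import numpy as np

# Illustrative sketch: epsilon-greedy action selection, balancing
# exploration and exploitation given current Q-value estimates.
rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon=0.1):
    if rng.random() < epsilon:           # explore: pick a random action
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))      # exploit: pick the best-known action

print(epsilon_greedy(np.array([0.2, 0.5, 0.1])))
```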

```python
import gym  # note from the original author: this thing freezes as soon as it runs in Python...

env = gym.make('CartPole-v0')
env.reset()
env.render()                                # open the render window
action = env.action_space.sample()          # sample a random action from the action space
observation, reward, done, info = env.step(action)
env.close()
```

The snippet above is a minimal example of interacting with a gym environment.
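Building on it, a full episode with a random policy might look like the following sketch (it assumes the classic gym API where `env.step` returns four values; newer gym/gymnasium versions split `reset`/`step` return values differently):

```python
import gym

# Minimal random-policy episode loop using the classic gym API.
env = gym.make('CartPole-v0')
observation = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                   # random policy
    observation, reward, done, info = env.step(action)   # environment feedback
    total_reward += reward
print('episode return:', total_reward)
env.close()
```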

Next class: Markov Decision Processes (MDP)
