Textbook: Sutton and Barto, Reinforcement Learning: An Introduction
Course: 周博磊's Chinese-language RL lectures
coding
Framework: PyTorch
Differences from supervised learning. Supervised learning assumes: (1) samples are i.i.d. (no correlation between data points); (2) labels are available. Reinforcement learning: data are not necessarily i.i.d., and there is no immediate feedback (delayed reward).
Exploration (trying new actions) & exploitation (taking the currently best-known action)
Features:
Trial-and-error exploration
Delayed reward
Time matters (sequential, non-i.i.d. data)
The agent's actions affect the subsequent data it receives
Compared with supervised learning, reinforcement learning can sometimes surpass human behavior
possible rollout sequence
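For example (notation assumed here for illustration, not from the slides), one rollout can be written as a trajectory of observations, actions, and rewards: $\tau = (o_0, a_0, r_1, o_1, a_1, r_2, \dots)$.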
Agent & environment
Rewards: scalar feedback
sequential decision making:
Trade-off between near-term and long-term rewards
Full observation & partial observation
Policy: maps from state/observation to action
Stochastic policy (probabilistic sample): $\pi(a|s) = P[A_t = a \mid S_t = s]$
Deterministic policy: $a^* = \arg\max_a \pi(a|s)$
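As a quick illustration of the two policy types, a minimal NumPy sketch (the policy table and states here are made up for illustration, not from the course code):

```python
import numpy as np

# Hypothetical tabular policy: pi[s, a] = probability of taking action a in state s
pi = np.array([
    [0.7, 0.2, 0.1],   # state 0
    [0.1, 0.1, 0.8],   # state 1
])

rng = np.random.default_rng(0)

def stochastic_action(state):
    # Sample a ~ pi(.|s): repeated calls can return different actions
    return rng.choice(len(pi[state]), p=pi[state])

def deterministic_action(state):
    # a* = argmax_a pi(a|s): always the same action for a given state
    return int(np.argmax(pi[state]))

print(stochastic_action(0), deterministic_action(0))
```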
Value function: the expected discounted sum of future rewards under a particular policy $\pi$.
The discount factor weights immediate vs. future rewards.
Used to quantify the goodness/badness of states and actions:
$v_{\pi}(s) \overset{\triangle}{=} E_\pi[G_t \mid S_t = s] = E_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right]$
Q-function (used to select among actions):
$q_\pi(s,a) \overset{\triangle}{=} E_\pi[G_t \mid S_t = s, A_t = a]$
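To make the discounted return $G_t = \sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ concrete, a small sketch that computes it backwards over a finite reward sequence (the rewards and $\gamma$ below are made-up assumptions for illustration):

```python
def discounted_return(rewards, gamma=0.9):
    # Accumulate from the last reward backwards: G_t = R_{t+1} + gamma * G_{t+1}
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 1.0 + 0.9 * 0.0 + 0.9**2 * 2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0]))
```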
A model predicts what the environment will do next
Types of RL agents:
Value-based: explicitly learns a value function; the policy is implicit (derived from the values)
Policy-based: explicitly learns a policy; no value function
Actor-critic: combines a policy and a value function
Model-based: directly learns a model of the environment (the transition dynamics)
Model-free: directly learns a value function / policy function; no model
Exploration: trial and error, trying out new actions
Exploitation: choose the best option known so far
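A common way to balance exploration and exploitation is ε-greedy action selection; a minimal sketch (the Q-value estimates and ε below are illustrative assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng()

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon: explore (uniformly random action);
    # otherwise: exploit (action with the highest estimated Q-value).
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

print(epsilon_greedy(np.array([0.2, 0.5, 0.1])))
```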
Example:

```python
import gym  # note: this hangs for me as soon as I run it in plain Python...

env = gym.make('CartPole-v0')
env.reset()
env.render()
action = env.action_space.sample()                  # sample a random action
observation, reward, done, info = env.step(action)  # take one step in the environment
```
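As a follow-up usage sketch, a full random-policy episode using the same classic Gym step API (four return values, matching the snippet above; the API differs in newer gym/gymnasium versions):

```python
import gym

env = gym.make('CartPole-v0')
obs = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()           # random policy
    obs, reward, done, info = env.step(action)   # classic 4-tuple step API
    total_reward += reward
env.close()
print('episode return:', total_reward)
```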
Next class: Markov decision processes (MDP)