Policy-based Approach

2022-05-05 | IT Knowledge and Methods | Policy

This article introduces the basics of the policy-based approach in reinforcement learning.

Step 1: Neural Network as Actor
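
To make the idea of a network acting as the Actor concrete, here is a minimal sketch in plain Python. It assumes a discrete action space and a single linear layer followed by a softmax; the class name `LinearActor`, its sizes, and the example state are all hypothetical and only illustrate the mapping from a state to action probabilities.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class LinearActor:
    """Hypothetical one-layer actor: logits = W * state + b, then softmax."""

    def __init__(self, n_state, n_action, seed=0):
        self.rng = random.Random(seed)
        # Small random weights; these are the parameters theta of the actor.
        self.W = [[self.rng.gauss(0.0, 0.1) for _ in range(n_state)]
                  for _ in range(n_action)]
        self.b = [0.0] * n_action

    def action_probs(self, state):
        """Return one probability per action for the given state."""
        logits = [sum(w * s for w, s in zip(row, state)) + bias
                  for row, bias in zip(self.W, self.b)]
        return softmax(logits)

    def sample_action(self, state):
        """Sample an action index according to the actor's probabilities."""
        probs = self.action_probs(state)
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]


actor = LinearActor(n_state=4, n_action=3)
state = [0.5, -1.0, 0.2, 0.0]  # made-up observation
probs = actor.action_probs(state)
action = actor.sample_action(state)
```

Sampling from the output distribution, rather than always taking the most probable action, keeps the policy stochastic, which is what the sampling in the later steps relies on.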

Step 2: Goodness of Function (train some Actors)

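$\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T\}$ is a trajectory: a sequence of T states s, actions a, and rewards r. It represents one episode from start to finish.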

$R(\tau) = \sum_{t=1}^{T} r_t$ is the reward sum: the total reward of the whole episode, from start to finish.

$\theta$ is a given (fixed) set of parameters.

$\bar{R}_\theta$ is the average total reward obtained.
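Using the policy $\pi_\theta$ to play the game N times yields N trajectories $\tau^1, \tau^2, \ldots, \tau^N$, which amounts to sampling N times from the distribution $P(\tau \mid \theta)$. The average total reward can then be approximated by the sample mean: $\bar{R}_\theta = \sum_\tau R(\tau) P(\tau \mid \theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)$.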
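
The sampling idea can be sketched as a small Monte Carlo estimate: play N episodes with a fixed policy, sum the rewards of each episode, and average. The toy game below is purely an assumption for illustration (a hypothetical episode of T steps where action 1 earns reward 1 and action 0 earns nothing, with a single parameter `theta` controlling the probability of action 1); it is not taken from the article.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def play_episode(theta, T, rng):
    """Play one episode and return the trajectory as (state, action, reward) triples."""
    tau = []
    for t in range(T):
        state = t  # toy state: just the step index
        action = 1 if rng.random() < sigmoid(theta) else 0
        reward = 1.0 if action == 1 else 0.0
        tau.append((state, action, reward))
    return tau


def total_reward(tau):
    """R(tau): sum of all rewards collected in one episode."""
    return sum(r for _, _, r in tau)


def estimate_expected_reward(theta, N, T=5, seed=0):
    """Approximate the average total reward by averaging R(tau) over N sampled episodes."""
    rng = random.Random(seed)
    returns = [total_reward(play_episode(theta, T, rng)) for _ in range(N)]
    return sum(returns) / N


estimate = estimate_expected_reward(theta=0.0, N=10000)
```

In this toy setting the exact expectation is T * sigmoid(theta), so with `theta=0.0` and `T=5` the estimate should land close to 2.5, and it gets closer as N grows.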

Step 3: Pick the Best Function (find the best Actor)
Method: Gradient Ascent
That is, maximize $\bar{R}_\theta$: use gradient ascent to find the parameters that make $\bar{R}_\theta$ largest, $\theta^* = \arg\max_\theta \bar{R}_\theta$.
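
As a reference point for this step, the standard policy-gradient derivation can be sketched as follows. It assumes that $\bar{R}_\theta = \sum_\tau R(\tau) P(\tau \mid \theta)$ is the expected total reward of a trajectory $\tau$ under parameters $\theta$. The learning rate $\eta$, the sample index $n$, and the per-step notation $p(a_t^n \mid s_t^n, \theta)$ are notational assumptions of this sketch.

```latex
\begin{align*}
\theta^{\text{new}} &\leftarrow \theta^{\text{old}} + \eta \, \nabla \bar{R}_{\theta^{\text{old}}} \\
\nabla \bar{R}_\theta
  &= \sum_\tau R(\tau) \, \nabla P(\tau \mid \theta)
   = \sum_\tau R(\tau) \, P(\tau \mid \theta) \, \nabla \log P(\tau \mid \theta) \\
  &\approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \, \nabla \log P(\tau^n \mid \theta)
   = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \, \nabla \log p(a_t^n \mid s_t^n, \theta)
\end{align*}
```

The second equality uses $\nabla P = P \, \nabla \log P$. The last equality holds because only the policy terms inside $P(\tau \mid \theta)$ depend on $\theta$, so the environment's transition probabilities drop out of the gradient.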

Adding a Bias (Baseline)
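The return $R(\tau^n)$ here may always be positive, in which case every sampled action has its probability pushed up and actions that happen not to be sampled lose probability by comparison. Subtracting a bias (baseline) b from the return fixes this. The value of b can be chosen freely; one simple choice is the average of the returns, $b \approx E[R(\tau)]$.
If $R(\tau^n) - b$ is still positive, the probability of that action is increased; otherwise the probability of that action is decreased.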
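
The effect of the baseline can be sketched on a toy problem where every reward is positive. The setup below is an assumption made for illustration only: a hypothetical one-step game with three actions and fixed rewards, a softmax policy with one logit per action, and b taken as the mean return of the current batch. The function name `reinforce_with_baseline` and its default hyperparameters are likewise hypothetical.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_with_baseline(rewards, steps=200, N=20, lr=0.1, seed=0):
    """Gradient ascent on a softmax policy, weighting each sample by (R - b)."""
    rng = random.Random(seed)
    theta = [0.0] * len(rewards)  # one logit per action
    for _ in range(steps):
        probs = softmax(theta)
        # Sample N one-step "episodes" from the current policy.
        actions = rng.choices(range(len(rewards)), weights=probs, k=N)
        returns = [rewards[a] for a in actions]
        b = sum(returns) / N  # baseline: average return of this batch
        grad = [0.0] * len(theta)
        for a, R in zip(actions, returns):
            for k in range(len(theta)):
                # d/d theta_k of log softmax(theta)[a] is 1[k == a] - probs[k]
                dlogp = (1.0 if k == a else 0.0) - probs[k]
                grad[k] += (R - b) * dlogp / N
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return softmax(theta)


# All rewards are positive, yet only above-average actions gain probability.
final_probs = reinforce_with_baseline(rewards=[1.0, 2.0, 3.0])
```

Because the weight is $R - b$ rather than $R$, actions whose return falls below the batch average get a negative weight and their probability goes down, even though their raw reward is positive.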
