Source: CSDN Blog, reproducing the REINFORCE example from http://pytorch.org/docs/0.3.0/distributions.html
Author: guotong1988, posted 2018/1/5 11:16:05

from torch.distributions import Categorical

probs = policy_network(state)          # action probabilities from the policy network
m = Categorical(probs)                 # categorical distribution over actions
action = m.sample()                    # sample an action
next_state, reward = env.step(action)  # step the environment and get a reward
loss = -m.log_prob(action) * reward    # REINFORCE loss: -log pi(a|s) * reward
loss.backward()
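
For readers who want to run the pattern end to end, below is a minimal, self-contained sketch. The tiny policy network, the dummy state and reward, and the Adam optimizer are illustrative assumptions added here; they are not part of the original docs example (which targets PyTorch 0.3.0, where tensors would typically be wrapped in Variable).

# Minimal sketch of the same score-function / REINFORCE update.
# The network sizes, dummy observation, and constant reward are placeholders.
import torch
import torch.nn as nn
from torch.distributions import Categorical

policy_network = nn.Sequential(
    nn.Linear(4, 16), nn.ReLU(),
    nn.Linear(16, 2), nn.Softmax(dim=-1),
)
optimizer = torch.optim.Adam(policy_network.parameters(), lr=1e-2)

state = torch.randn(4)                # dummy observation (stand-in for env state)
probs = policy_network(state)         # action probabilities
m = Categorical(probs)
action = m.sample()                   # sampled action index
reward = torch.tensor(1.0)            # dummy reward (stand-in for env.step)

loss = -m.log_prob(action) * reward   # increase log-prob of rewarded actions
optimizer.zero_grad()
loss.backward()
optimizer.step()

Because the sampled action is discrete and non-differentiable, the gradient flows through log_prob of the Categorical distribution rather than through the sample itself; that is the point of the -log_prob(action) * reward surrogate loss.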