• Active Reinforcement learning
• Passive Reinforcement learning
• Frequency of rewards:
– E.g., chess: reinforcement is received only at the end of the game
– E.g., table tennis: each point scored can be viewed as a reward
[Figure: learning agent diagram; surviving labels: learning goals, knowledge]
§ Performance Element
§ Problem generator
§ Performance standard
• The reward is part of the input percept
• The agent must be hardwired to recognize it as a reward and not as just another sensory input
• E.g., animal psychologists have studied reinforcement
• Passive reinforcement learning
– Direct utility estimation
– Adaptive dynamic programming
– Temporal difference learning
• Active reinforcement learning
– Exploration
– Learning an action-value function
Passive reinforcement learning
• The agent's policy π is fixed
– in state s, it always executes the action π(s)
• Goal: learn how good the policy is
• The passive learning agent has
– no knowledge about the transition model T(s,a,s')
– no knowledge about the reward function R(s)
• It executes sets of trials in the environment using its policy π.
– In each trial it starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states (4,2) or (4,3).
– E.g. (the number after each state is the reward received there): (1,1) -0.04 → (1,2) -0.04 → (1,3) -0.04 → (2,3) -0.04 → (3,3) -0.04 → (3,2) -0.04 → (3,3) -0.04 → (4,3) +1
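• Use the information about rewards to learn the expected utility U^π(s):
U^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t) | π, s_0 = s ]
– The utility is the expected sum of (discounted) rewards obtained if policy π is followed, starting in s; γ is the discount factor.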
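A minimal sketch of direct utility estimation on the example trial, not a definitive implementation. It assumes no discounting (gamma = 1) and treats the number after each state as the reward received there. The function names rewards_to_go and direct_utility_estimate are made up for this illustration. Each visit to a state yields one sample of its reward-to-go, and the utility estimate is the average of those samples.

```python
def rewards_to_go(trial, gamma=1.0):
    """Return (state, observed reward-to-go) for every step of one trial."""
    samples = []
    total = 0.0
    # Walk the trial backwards so each step adds its reward to the discounted tail.
    for state, reward in reversed(trial):
        total = reward + gamma * total
        samples.append((state, total))
    samples.reverse()
    return samples


def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed rewards-to-go per state over all trials."""
    sums, counts = {}, {}
    for trial in trials:
        for state, g in rewards_to_go(trial, gamma):
            sums[state] = sums.get(state, 0.0) + g
            counts[state] = counts.get(state, 0) + 1
    return {s: sums[s] / counts[s] for s in sums}


# The example trial: (state, reward received in that state).
trial = [
    ((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
    ((3, 3), -0.04), ((3, 2), -0.04), ((3, 3), -0.04), ((4, 3), 1.0),
]

estimates = direct_utility_estimate([trial])
```

With this single trial the estimate for (1,1) is about 0.72 (seven steps of -0.04 plus the final +1). State (3,3) is visited twice, so its estimate is the average of its two samples. More trials would be averaged in the same way.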
Adaptive dynamic programming
• Idea: learn how states are connected
• Adaptive dynamic programming (ADP) agent
– learns the transition model T(s, π(s), s') of the environment
– solves the Markov decision process using a dynamic programming method
• Learning the transition model is easy in a fully observable environment
– it is a supervised learning task with input = state-action pair, output = resulting state
– the transition model can be represented as a table of probabilities
• How often do action outcomes occur? Estimate the transition probability T(s,a,s') from the frequency with which s' is reached when executing a in s.
• E.g., from state (1,3) Right is executed three times. The resulting state is (2,3) two times, so T((1,3), Right, (2,3)) is estimated to be 2/3.
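The counting estimate and the dynamic programming step can be sketched as follows. This is an illustration under assumptions, not the definitive ADP agent. The third outcome of Right in (1,3) is not given in the notes, so (1,2) is used as a hypothetical placeholder. The names estimate_transitions and evaluate_policy are invented here. evaluate_policy uses repeated sweeps of U(s) = R(s) + γ Σ T(s, π(s), s') U(s'), which is one of several possible dynamic programming methods.

```python
from collections import defaultdict


def estimate_transitions(experience):
    """Estimate T(s, a, s') as relative frequencies from observed (s, a, s') triples."""
    counts = defaultdict(lambda: defaultdict(int))
    for s, a, s_next in experience:
        counts[(s, a)][s_next] += 1
    model = {}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        model[sa] = {s_next: n / total for s_next, n in outcomes.items()}
    return model


def evaluate_policy(model, rewards, policy, gamma=1.0, sweeps=100):
    """Fixed-policy evaluation on the learned model by repeated sweeps."""
    U = {s: 0.0 for s in rewards}
    for _ in range(sweeps):
        for s in rewards:
            if s not in policy:
                # States without an action are treated as terminal here.
                U[s] = rewards[s]
                continue
            outcomes = model.get((s, policy[s]), {})
            expected = sum(p * U.get(s2, 0.0) for s2, p in outcomes.items())
            U[s] = rewards[s] + gamma * expected
    return U


# Right is executed three times in (1,3); two of the outcomes are (2,3).
# The third outcome, (1,2), is an assumption made only for this example.
experience = [
    ((1, 3), "Right", (2, 3)),
    ((1, 3), "Right", (2, 3)),
    ((1, 3), "Right", (1, 2)),
]
model = estimate_transitions(experience)
```

Here model[((1, 3), "Right")][(2, 3)] comes out as 2/3, matching the estimate in the example. In a full agent the table would be updated after every observed transition and evaluate_policy rerun on the updated model.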