Reinforcement Learning
• Active Reinforcement Learning
• Passive Reinforcement Learning
Reinforcement Learning
• Frequency of rewards:
– E.g., chess: the reinforcement is received only at the end of the game
– E.g., table tennis: each point scored can be viewed as a reward
[Figure: general model of a learning agent – Performance standard, Critic, Learning element, Performance element, Problem generator, Sensors, Actuators, Agent, Environment; arrows labelled feedback, changes, knowledge, learning goals]
• Reward: part of the input percept
• The agent must be hardwired to recognize that part of the percept as a reward and not as another sensory input
• E.g., animal psychologists have studied reinforcement in animals
Passive reinforcement learning
• Direct utility estimation
• Adaptive dynamic programming
• Temporal difference learning

Active reinforcement learning
• Exploration
• Learning an Action-Value Function
Passive Reinforcement Learning
• The agent's policy is fixed
– in state s, it always executes the action π(s)
• Goal: how good is the policy?
• The passive learning agent has
– no knowledge about the transition model T(s, a, s')
– no knowledge about the reward function R(s)
• It executes sets of trials in the environment using its policy π
– it starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states (4,2) or (4,3)
• E.g., one trial (the subscripted number is the reward received in that state):
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (3,2)-0.04 → (3,3)-0.04 → (4,3)+1
• Use the information about rewards to learn the expected utility Uπ(s):
Uπ(s) = E[ Σ_{t=0}^∞ γ^t R(S_t) ], with S_0 = s and actions chosen by π
• Utility is the expected sum of (discounted) rewards obtained if policy π is followed
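The simplest approach is direct utility estimation (listed in the outline above): every trial supplies, for each state it visits, a sample of the reward-to-go, and Uπ(s) is estimated by averaging those samples. A minimal sketch, assuming trials are recorded as hypothetical lists of (state, reward) pairs like the trial above:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U^pi(s) as the average observed (discounted) reward-to-go from s."""
    totals = defaultdict(float)   # sum of sampled rewards-to-go per state
    counts = defaultdict(int)     # number of samples per state

    for trial in trials:
        # Work backwards through the trial to accumulate the reward-to-go.
        reward_to_go = 0.0
        samples = []
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            samples.append((state, reward_to_go))
        for state, sample in samples:
            totals[state] += sample
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in counts}

# The single trial from the text (undiscounted, gamma = 1):
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((3, 2), -0.04),
         ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial]))   # this one trial gives U(1,1) ≈ 0.72
```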
Adaptive dynamic programming
• Idea: learn how states are connected
• Adaptive dynamic programming (ADP) agent
– learns the transition model T(s, π(s), s') of the environment
– solves the resulting Markov decision process using a dynamic programming method
• Learning the transition model is easy in a fully observable environment
– supervised learning task with input = state-action pair, output = resulting state
– the transition model can be represented as a table of probabilities
• Count how often each action outcome occurs: estimate the transition probability T(s, a, s') from the frequency with which s' is reached when executing a in s
• E.g., from state (1,3) Right is executed three times; the resulting state is (2,3) two of those times, so T((1,3), Right, (2,3)) is estimated to be 2/3 (see the sketch below)
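A minimal sketch of this counting scheme and of the dynamic-programming step, assuming hypothetical helpers (record_transition, estimated_T, evaluate_policy) and a fixed policy given as a dictionary; the agent counts outcomes, turns them into relative frequencies, and then iterates the Bellman equations U(s) = R(s) + γ Σ_s' T(s, π(s), s') U(s') for the fixed policy:

```python
from collections import defaultdict

# outcome_counts[(s, a)][s2] = how often s2 followed action a executed in state s
outcome_counts = defaultdict(lambda: defaultdict(int))

def record_transition(s, a, s2):
    """Update the counts after observing one transition."""
    outcome_counts[(s, a)][s2] += 1

def estimated_T(s, a, s2):
    """Estimate T(s, a, s') as a relative frequency of observed outcomes."""
    total = sum(outcome_counts[(s, a)].values())
    return outcome_counts[(s, a)][s2] / total if total else 0.0

def evaluate_policy(policy, R, states, gamma=1.0, iterations=100):
    """Iterative policy evaluation: U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U = {s: R[s] + gamma * sum(estimated_T(s, policy[s], s2) * U[s2]
                                   for s2 in states)
             for s in states}
    return U

# Example from the text: Right executed three times from (1,3), reaching (2,3) twice.
record_transition((1, 3), 'Right', (2, 3))
record_transition((1, 3), 'Right', (2, 3))
record_transition((1, 3), 'Right', (1, 3))   # hypothetical third outcome
print(estimated_T((1, 3), 'Right', (2, 3)))  # -> 0.666..., i.e. 2/3
```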