
Chapter: Artificial Intelligence

Reinforcement Learning

• Active Reinforcement learning • Passive Reinforcement learning



Reinforcement learning


•Frequency of rewards:


–E.g., chess: reinforcement received at end of game


–E.g., table tennis: each point scored can be viewed as a reward


[Figure: architecture of a learning agent. Components: Environment, Agent, Sensors, Actuators, Critic, Learning element, Performance element, Problem generator, Performance standard. Arrow labels: feedback, changes, knowledge, learning goals.]


•The reward is part of the input percept

•The agent must be hardwired to recognize that part as a reward and not as just another sensory input

•E.g., animal psychologists have studied reinforcement on animals


•Passive reinforcement learning

–Direct utility estimation

–Adaptive dynamic programming

–Temporal difference learning

•Active reinforcement learning

–Exploration

–Learning an action-value function

Passive reinforcement learning


•The agent's policy π is fixed

–in state s, it always executes the action π(s)

•Goal: how good is the policy?


•The passive learning agent has


–no knowledge about the transition model T(s,a,s')


–no knowledge about the reward function R(s)


•It executes a set of trials in the environment using its policy π.

–in each trial, it starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states (4,2) or (4,3).

•E.g., (1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (3,2)-0.04 → (3,3)-0.04 → (4,3)+1, where the number after each state is the reward received in that state


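•Use the information about rewards to learn the expected utility U^π(s) of each state s:

$$U^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t R(s_t) \mid \pi,\ s_0 = s\right]$$

–the utility is the expected sum of (discounted) rewards obtained if policy π is followed; here γ is the discount factor and s_t is the state reached at step t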
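The definition of utility can be made concrete with direct utility estimation, the first passive method listed in these notes: each visit to a state in a trial yields one sample of the reward-to-go from that state, and the estimate is the average of the samples. The following is a minimal sketch, not a definitive implementation. The function names `rewards_to_go` and `direct_utility_estimate` are illustrative, and the discount factor `gamma = 1.0` is an assumption, since the notes do not give a value.

```python
from collections import defaultdict

# The trial from the notes, as (state, reward received in that state) pairs.
trial = [
    ((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04), ((2, 3), -0.04),
    ((3, 3), -0.04), ((3, 2), -0.04), ((3, 3), -0.04), ((4, 3), 1.0),
]


def rewards_to_go(trial, gamma=1.0):
    """Return (state, discounted sum of rewards from that step onward) pairs."""
    samples = []
    total = 0.0
    # Walk backwards so each step reuses the sum of the steps after it.
    for state, reward in reversed(trial):
        total = reward + gamma * total
        samples.append((state, total))
    samples.reverse()
    return samples


def direct_utility_estimate(trials, gamma=1.0):
    """Average the observed reward-to-go samples for every visited state."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for t in trials:
        for state, sample in rewards_to_go(t, gamma):
            sums[state] += sample
            counts[state] += 1
    return {state: sums[state] / counts[state] for state in sums}


utilities = direct_utility_estimate([trial])
```

With gamma = 1, the single sample for (1,1) is 7 × (-0.04) + 1 = 0.72. State (3,3) is visited twice, giving samples 0.88 and 0.96, so its estimate is their average, 0.92. More trials would refine these averages.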


Adaptive dynamic programming


•Idea: learn how states are connected

•An adaptive dynamic programming (ADP) agent

–learns the transition model T(s, π(s), s') of the environment


–solves the Markov decision process using a dynamic programming method


•Learning the transition model is easy in a fully observable environment

–it is a supervised learning task with input = state-action pair, output = resulting state

–the transition model can be represented as a table of probabilities

•Keep track of how often each action outcome occurs, and estimate the transition probability T(s,a,s') from the frequency with which s' is reached when executing a in s.

•E.g., from state (1,3), Right is executed three times and the resulting state is (2,3) two times, so T((1,3), Right, (2,3)) is estimated to be 2/3.
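The two ADP steps described above, counting outcomes to estimate T and then solving for the utilities of the fixed policy, can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation. The names `TransitionModel` and `evaluate_policy` are hypothetical. Repeated sweeps of the fixed-policy Bellman update are used as one possible dynamic programming method, since the notes do not say which one is meant. The third outcome of Right in (1,3) is not given in the notes, so staying in (1,3) is assumed only to complete the example.

```python
from collections import defaultdict


class TransitionModel:
    """Table of estimated probabilities T(s, a, s') built from observed counts."""

    def __init__(self):
        self.n_sa = defaultdict(int)   # times action a was executed in state s
        self.n_sas = defaultdict(int)  # times that execution ended in state s'

    def observe(self, s, a, s_next):
        self.n_sa[(s, a)] += 1
        self.n_sas[(s, a, s_next)] += 1

    def prob(self, s, a, s_next):
        n = self.n_sa.get((s, a), 0)
        return self.n_sas.get((s, a, s_next), 0) / n if n else 0.0


def evaluate_policy(model, rewards, policy, terminals, gamma=1.0, sweeps=200):
    """Iterate U(s) = R(s) + gamma * sum over s' of T(s, pi(s), s') * U(s')."""
    U = {s: 0.0 for s in rewards}
    for _ in range(sweeps):
        for s in rewards:
            if s in terminals:
                U[s] = rewards[s]
            else:
                expected = sum(model.prob(s, policy[s], s2) * U[s2] for s2 in rewards)
                U[s] = rewards[s] + gamma * expected
    return U


# Example from the notes: Right is executed three times in (1,3) and reaches
# (2,3) twice. ASSUMPTION: the third execution leaves the agent in (1,3).
model = TransitionModel()
model.observe((1, 3), "Right", (2, 3))
model.observe((1, 3), "Right", (2, 3))
model.observe((1, 3), "Right", (1, 3))
p = model.prob((1, 3), "Right", (2, 3))  # estimated as 2/3

# HYPOTHETICAL two-state problem, only to exercise the evaluation step:
# (2,3) is treated as a terminal state with reward +1 for illustration.
rewards = {(1, 3): -0.04, (2, 3): 1.0}
U = evaluate_policy(model, rewards, {(1, 3): "Right"}, terminals={(2, 3)})
```

In the hypothetical two-state problem, the utility of (1,3) satisfies U = -0.04 + (2/3)(1) + (1/3)U, which gives U = 0.94, and the sweeps converge to that value. A real ADP agent would also learn R(s) from its percepts and would update the counts after every observed transition.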

