Reinforcement Learning
• Active Reinforcement Learning
• Passive Reinforcement Learning
Reinforcement Learning
• Frequency of rewards:
– E.g., chess: the reinforcement is received only at the end of the game
– E.g., table tennis: each point scored can be viewed as a reward
[Figure: general model of a learning agent – Performance standard, Critic, Learning element, Performance element, Problem generator, Sensors, Actuators, Agent, Environment; arrows labelled feedback, changes, knowledge, learning goals]
• Reward: part of the input percept
• The agent must be hardwired to recognize that part of the percept as a reward and not as another sensory input
• E.g., animal psychologists have studied reinforcement in animals
Passive reinforcement learning
• Direct utility estimation
• Adaptive dynamic programming
• Temporal difference learning

Active reinforcement learning
• Exploration
• Learning an Action-Value Function
Passive Reinforcement Learning
• The agent's policy is fixed
– in state s, it always executes the action π(s)
• Goal: how good is the policy?
• The passive learning agent has
– no knowledge about the transition model T(s, a, s')
– no knowledge about the reward function R(s)
• It executes sets of trials in the environment using its policy π
– it starts in state (1,1) and experiences a sequence of state transitions until it reaches one of the terminal states (4,2) or (4,3)
• E.g., one trial (the subscripted number is the reward received in that state):
(1,1)-0.04 → (1,2)-0.04 → (1,3)-0.04 → (2,3)-0.04 → (3,3)-0.04 → (3,2)-0.04 → (3,3)-0.04 → (4,3)+1
• Use the information about rewards to learn the expected utility Uπ(s):
Uπ(s) = E[ Σ_{t=0}^∞ γ^t R(S_t) ], with S_0 = s and actions chosen by π
• Utility is the expected sum of (discounted) rewards obtained if policy π is followed
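The simplest approach is direct utility estimation (listed in the outline above): every trial supplies, for each state it visits, a sample of the reward-to-go, and Uπ(s) is estimated by averaging those samples. A minimal sketch, assuming trials are recorded as hypothetical lists of (state, reward) pairs like the trial above:

```python
from collections import defaultdict

def direct_utility_estimation(trials, gamma=1.0):
    """Estimate U^pi(s) as the average observed (discounted) reward-to-go from s."""
    totals = defaultdict(float)   # sum of sampled rewards-to-go per state
    counts = defaultdict(int)     # number of samples per state

    for trial in trials:
        # Work backwards through the trial to accumulate the reward-to-go.
        reward_to_go = 0.0
        samples = []
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            samples.append((state, reward_to_go))
        for state, sample in samples:
            totals[state] += sample
            counts[state] += 1

    return {s: totals[s] / counts[s] for s in counts}

# The single trial from the text (undiscounted, gamma = 1):
trial = [((1, 1), -0.04), ((1, 2), -0.04), ((1, 3), -0.04),
         ((2, 3), -0.04), ((3, 3), -0.04), ((3, 2), -0.04),
         ((3, 3), -0.04), ((4, 3), +1.0)]
print(direct_utility_estimation([trial]))   # this one trial gives U(1,1) ≈ 0.72
```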
Adaptive dynamic programming
• Idea: learn how states are connected
• Adaptive dynamic programming (ADP) agent
– learns the transition model T(s, π(s), s') of the environment
– solves the resulting Markov decision process using a dynamic programming method
• Learning the transition model is easy in a fully observable environment
– supervised learning task with input = state-action pair, output = resulting state
– the transition model can be represented as a table of probabilities
• Count how often each action outcome occurs: estimate the transition probability T(s, a, s') from the frequency with which s' is reached when executing a in s
• E.g., from state (1,3) Right is executed three times; the resulting state is (2,3) two of those times, so T((1,3), Right, (2,3)) is estimated to be 2/3 (see the sketch below)
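A minimal sketch of this counting scheme and of the dynamic-programming step, assuming hypothetical helpers (record_transition, estimated_T, evaluate_policy) and a fixed policy given as a dictionary; the agent counts outcomes, turns them into relative frequencies, and then iterates the Bellman equations U(s) = R(s) + γ Σ_s' T(s, π(s), s') U(s') for the fixed policy:

```python
from collections import defaultdict

# outcome_counts[(s, a)][s2] = how often s2 followed action a executed in state s
outcome_counts = defaultdict(lambda: defaultdict(int))

def record_transition(s, a, s2):
    """Update the counts after observing one transition."""
    outcome_counts[(s, a)][s2] += 1

def estimated_T(s, a, s2):
    """Estimate T(s, a, s') as a relative frequency of observed outcomes."""
    total = sum(outcome_counts[(s, a)].values())
    return outcome_counts[(s, a)][s2] / total if total else 0.0

def evaluate_policy(policy, R, states, gamma=1.0, iterations=100):
    """Iterative policy evaluation: U(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U(s')."""
    U = {s: 0.0 for s in states}
    for _ in range(iterations):
        U = {s: R[s] + gamma * sum(estimated_T(s, policy[s], s2) * U[s2]
                                   for s2 in states)
             for s in states}
    return U

# Example from the text: Right executed three times from (1,3), reaching (2,3) twice.
record_transition((1, 3), 'Right', (2, 3))
record_transition((1, 3), 'Right', (2, 3))
record_transition((1, 3), 'Right', (1, 3))   # hypothetical third outcome
print(estimated_T((1, 3), 'Right', (2, 3)))  # -> 0.666..., i.e. 2/3
```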