-
1.4 An Extended Example: Tic-Tac-Toe
-
Here is how the tic-tac-toe problem would be approached using reinforcement
learning and approximate value functions. First we set up a table of
numbers, one for each possible state of the game. Each number will be the
latest estimate of the probability of our winning from that state. We treat
this estimate as the state's value, and the whole table is the
learned value function. State A has higher value than state B, or is
considered "better" than state B, if the current estimate of the probability
of our winning from A is higher than it is from B. Assuming we always play X
s, then for all states with three Xs in a row the probability of winning is
1, because we have already won. Similarly, for all states with three
Os in a row, or that are "filled up," the correct probability is 0, as we
cannot win from them. We set the initial values of all the other states to 0.5,
representing a guess that we have a 50% chance of winning. -
temporal-difference
- 22 more annotations...
-
-
1.3 Elements of Reinforcement Learning
-
The fourth and final element of some reinforcement learning systems is a model of the environment
-
all the reinforcement learning methods we consider in this book are
structured around estimating value functions - 6 more annotations...
-
-
1.1 Reinforcement Learning
-
Another key feature of reinforcement learning is that it
explicitly considers the whole problem of a goal-directed
agent interacting with an uncertain environment -
The agent has to exploit what it already knows in
order to obtain reward, but it also has to explore in order to make
better action selections in the future - 5 more annotations...
-
-
Markov decision process - Wikipedia, the free encyclopedia
-
thematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision maker. M
DPs are useful for studying a wide range of
>
optimization problems
>
solved via
>
dynamic programming
>
and
>
reinforcement learning
>
. MDPs were known at least as early as in the fifties (cf. Bellman 1957). Much research in the area was spawned due to
>
Ronald A. Howard
>
's book,
>
Dynam
> -
The goal is to maximize some cumulative function of the rewards, typically the discounted sum under a discounting factor γ (usually just under 1)
- 1 more annotations...
-
-
Reinforcement learning - Wikipedia, the free encyclopedia
-
direct approach
-
The direct approach is the basis for the algorithms used in Evolutionary robotics.
- 5 more annotations...
-
-
3.1 The Agent-Environment Interface
-
At each time step, the agent implements a mapping from states
to probabilities of selecting each possible action. This mapping is
called the agent's policy and is denoted
, where
is the
probability that
if
. Reinforcement learning methods specify how the agent changes its policy as
a result of its experience. The agent's goal, roughly speaking, is to maximize the
total amount of reward it receives over the long run. -

- 4 more annotations...
-
-
2.11 Conclusions
-
interval estimation
-
and the pursuit methods keep taking steps toward the
current greedy action
-
-
2.8 Reinforcement Comparison
-
The initial value of the reference reward,
, can be set
either optimistically, to encourage exploration, or according to prior
knowledge -

- 6 more annotations...
-
-
2.7 Optimistic Initial Values
-
Indeed, any method that focuses on the
initial state in any special way is unlikely to help with the general
nonstationary case -
optimistic initial values
-
-
2.5 Incremental Implementation
-
In this book we
denote the step-size parameter by the symbol
or, more generally -

- 1 more annotations...
-
-
2.3 Softmax Action Selection
-
We know of no careful comparative studies
of these two simple action-selection rules.
-
-
2.2 Action-Value Methods
-
As we
will see in the next few chapters, effective nonstationarity is the case most commonly
encountered in reinforcement learning. Even if the underlying task is stationary
and deterministic, the learner faces a set of banditlike decision tasks each
of which changes over time due to the learning process itself. Reinforcement
learning requires a balance between exploration and exploitation. -
The greedy method performs significantly worse in
the long run because it often gets stuck performing suboptimal actions - 4 more annotations...
-
-
2.1 An <img border=0 src="inimgtmp82.png" width="9" height="8">-Armed Bandit Problem
-
In this book we do not worry about balancing exploration and
exploitation in a sophisticated way; -
There are many
sophisticated methods for balancing
exploration and exploitation for particular mathematical formulations of
the
-armed bandit and related problems. However, most of these methods
make strong assumptions about stationarity and prior knowledge that are
either violated or impossible to verify in applications and in the full
reinforcement learning problem that we consider in subsequent chapters - 1 more annotations...
-
-
1.7 Bibliographical Remarks
-
The example of Phil's breakfast in this chapter was
inspired by Agre (1988). We direct the reader to Chapter
6 for references to the kind of temporal-difference method we used in the
tic-tac-toe example.
-
-
1.6 History of Reinforcement Learning
-
Some modern neural-network
textbooks use the term "trial-and-error" to describe networks that learn from
training examples because they use error information to update connection weights.
This is an understandable confusion, but it substantially misses the essential
selectional character of trial-and-error learning -
How do you distribute credit for success among the many decisions that
may have been involved in producing it? All of the methods we discuss in this book
are, in a sense, directed toward solving this problem. - 3 more annotations...
-
-
1.5 Summary
-
The concepts of value and value functions are the key
features of the reinforcement learning methods that we
consider in this book.
-
List Comments
(0)

, where