Skip to main content

  • 1.4 An Extended Example: Tic-Tac-Toe

    • Here is how the tic-tac-toe problem would be approached using reinforcement
      learning and approximate value functions. First we set up a table of
      numbers, one for each possible state of the game. Each number will be the
      latest estimate of the probability of our winning from that state. We treat
      this estimate as the state's value, and the whole table is the
      learned value function. State A has higher value than state B, or is
      considered "better" than state B, if the current estimate of the probability
      of our winning from A is higher than it is from B. Assuming we always play X
      s, then for all states with three Xs in a row the probability of winning is
      1, because we have already won. Similarly, for all states with three
      Os in a row, or that are "filled up," the correct probability is 0, as we
      cannot win from them. We set the initial values of all the other states to 0.5,
      representing a guess that we have a 50% chance of winning.
    • temporal-difference
    • 22 more annotations...
  • 1.3 Elements of Reinforcement Learning

    • The fourth and final element of some reinforcement learning systems is a model of the environment
    • all the reinforcement learning methods we consider in this book are
      structured around estimating value functions
    • 6 more annotations...
  • 1.1 Reinforcement Learning

    • Another key feature of reinforcement learning is that it
      explicitly considers the whole problem of a goal-directed
      agent interacting with an uncertain environment
    • The agent has to exploit what it already knows in
      order to obtain reward, but it also has to explore in order to make
      better action selections in the future
    • 5 more annotations...
  • Markov decision process - Wikipedia, the free encyclopedia

    • thematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision maker. M
      DPs are useful for studying a wide range of
      >
      optimization problems
      >

      solved via
      >
      dynamic programming
      >

      and
      >
      reinforcement learning
      >

      . MDPs were known at least as early as in the fifties (cf. Bellman 1957). Much research in the area was spawned due to
      >
      Ronald A. Howard
      >

      's book,
      >
      Dynam
      >
    • The goal is to maximize some cumulative function of the rewards, typically the discounted sum under a discounting factor γ (usually just under 1)
    • 1 more annotations...
  • Reinforcement learning - Wikipedia, the free encyclopedia

    • direct approach
    • The direct approach is the basis for the algorithms used in Evolutionary robotics.
    • 5 more annotations...
  • 3.1 The Agent-Environment Interface

    • At each time step, the agent implements a mapping from states
      to probabilities of selecting each possible action. This mapping is
      called the agent's policy and is denoted , where is the
      probability that if
      . Reinforcement learning methods specify how the agent changes its policy as
      a result of its experience. The agent's goal, roughly speaking, is to maximize the
      total amount of reward it receives over the long run.
    • 4 more annotations...
  • 2.11 Conclusions

    • interval estimation
    • and the pursuit methods keep taking steps toward the
      current greedy action
  • 2.8 Reinforcement Comparison

    • The initial value of the reference reward, , can be set
      either optimistically, to encourage exploration, or according to prior
      knowledge
    • 6 more annotations...
  • 2.7 Optimistic Initial Values

    • Indeed, any method that focuses on the
      initial state in any special way is unlikely to help with the general
      nonstationary case
    • optimistic initial values
  • 2.5 Incremental Implementation

    • In this book we
      denote the step-size parameter by the symbol $\alpha$ or, more generally
    • 1 more annotations...
  • 2.3 Softmax Action Selection

    • We know of no careful comparative studies
      of these two simple action-selection rules.
  • 2.2 Action-Value Methods

    • As we
      will see in the next few chapters, effective nonstationarity is the case most commonly
      encountered in reinforcement learning. Even if the underlying task is stationary
      and deterministic, the learner faces a set of banditlike decision tasks each
      of which changes over time due to the learning process itself. Reinforcement
      learning requires a balance between exploration and exploitation.
    • The greedy method performs significantly worse in
      the long run because it often gets stuck performing suboptimal actions
    • 4 more annotations...
  • 2.1 An <img border=0 src="inimgtmp82.png" width="9" height="8">-Armed Bandit Problem

    • In this book we do not worry about balancing exploration and
      exploitation in a sophisticated way;
    • There are many
      sophisticated methods for balancing
      exploration and exploitation for particular mathematical formulations of
      the -armed bandit and related problems. However, most of these methods
      make strong assumptions about stationarity and prior knowledge that are
      either violated or impossible to verify in applications and in the full
      reinforcement learning problem that we consider in subsequent chapters
    • 1 more annotations...
  • 1.7 Bibliographical Remarks

    • The example of Phil's breakfast in this chapter was
      inspired by Agre (1988). We direct the reader to Chapter
      6 for references to the kind of temporal-difference method we used in the
      tic-tac-toe example.
  • 1.6 History of Reinforcement Learning

    • Some modern neural-network
      textbooks use the term "trial-and-error" to describe networks that learn from
      training examples because they use error information to update connection weights.
      This is an understandable confusion, but it substantially misses the essential
      selectional character of trial-and-error learning
    • How do you distribute credit for success among the many decisions that
      may have been involved in producing it? All of the methods we discuss in this book
      are, in a sense, directed toward solving this problem.
    • 3 more annotations...
  • 1.5 Summary

    • The concepts of value and value functions are the key
      features of the reinforcement learning methods that we
      consider in this book.
1 - 20 of 21 Next ›
Showing 20 items per page
List Comments (0)