
yanivdll's List: Final Project

    • Here is how the tic-tac-toe problem would be approached using reinforcement learning and approximate value functions. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has higher value than state B, or is considered "better" than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are "filled up," the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.
      • An approach to tic-tac-toe via learning (see the value-table sketch after this group)

    • temporal-difference
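      • Not from the book: a minimal Python sketch of the value-table idea quoted above, combined with the tabular temporal-difference backup V(s) <- V(s) + alpha[V(s') - V(s)]. The board encoding and helper names are assumptions made for illustration.

            ALPHA = 0.1  # step-size parameter for the temporal-difference backup

            WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                         (0, 3, 6), (1, 4, 7), (2, 5, 8),
                         (0, 4, 8), (2, 4, 6)]

            def wins(board, player):
                """True if `player` has three in a row on the 9-tuple `board`."""
                return any(board[a] == board[b] == board[c] == player
                           for a, b, c in WIN_LINES)

            def initial_value(board):
                """Initial win-probability estimate for a state, as described in the excerpt."""
                if wins(board, 'X'):
                    return 1.0      # three Xs in a row: we have already won
                if wins(board, 'O') or ' ' not in board:
                    return 0.0      # lost or filled up: we cannot win from here
                return 0.5          # everything else: a 50% guess

            values = {}             # the learned value function: state -> estimated P(win)

            def value(board):
                if board not in values:
                    values[board] = initial_value(board)
                return values[board]

            def td_update(state, next_state):
                # Move the estimate for `state` a fraction ALPHA toward the estimate
                # for the state reached after our move and the opponent's reply.
                values[state] = value(state) + ALPHA * (value(next_state) - value(state))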


    • The fourth and final element of some reinforcement learning systems is a model of the environment
      • model of the environment

    • all the reinforcement learning methods we consider in this book are structured around estimating value functions


    • These examples
      • start reading HERE

    • Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment
    • The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future


    • A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as the fifties (cf. Bellman 1957). Much research in the area was spawned due to Ronald A. Howard's book, Dynamic Programming and Markov Processes.
    • The goal is to maximize some cumulative function of the rewards, typically the discounted sum under a discounting factor γ (usually just under 1)
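      • An illustrative snippet (not part of the excerpt) of the cumulative objective mentioned above: the discounted return G_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ..., with the discount factor gamma just under 1.

            def discounted_return(rewards, gamma=0.99):
                """Discounted sum of a finite reward sequence r_{t+1}, r_{t+2}, ...

                gamma < 1 keeps the infinite-horizon sum bounded and weights
                near-term rewards more heavily than distant ones."""
                g = 0.0
                for r in reversed(rewards):
                    g = r + gamma * g
                return g

            # Three steps of reward 1 with gamma = 0.9 give 1 + 0.9 + 0.81 = 2.71.
            assert abs(discounted_return([1, 1, 1], gamma=0.9) - 2.71) < 1e-9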


    • direct approach
    • The direct approach is the basis for the algorithms used in Evolutionary robotics.


    • also noted in the survey article
      - yanivdll on 2007-11-13
    • At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted $\pi_t$, where $\pi_t(s,a)$ is the probability that $a_t = a$ if $s_t = s$. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.
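      • A hedged sketch of what such a policy mapping looks like concretely, assuming a small discrete state and action space and a softmax over per-state action preferences; the state names and preference values are invented for illustration.

            import math

            # Action preferences per state; the policy pi(s, a) is their softmax.
            preferences = {
                "s0": {"left": 0.2, "right": 1.5},
                "s1": {"left": 0.0, "right": 0.0},
            }

            def policy(state):
                """Return {action: probability}, i.e. the mapping pi_t(state, .)."""
                prefs = preferences[state]
                exps = {a: math.exp(p) for a, p in prefs.items()}
                z = sum(exps.values())
                return {a: e / z for a, e in exps.items()}

            # policy("s0") favors "right"; policy("s1") is uniform because the
            # preferences are equal.
            print(policy("s0"), policy("s1"))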


    • This chapter describes the topic of learning
      - yanivdll on 2007-12-07
    • interval estimation
    • and the pursuit methods keep taking steps toward the current greedy action
      • didn't understand the pursuit methods
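      • Since the note above flags the pursuit method as unclear, here is a minimal sketch of the idea (an illustration with assumed names, not the book's code): the selection probabilities are "pursued" toward the current greedy action, i.e. the greedy action's probability is stepped toward 1 and every other action's toward 0, while the value estimates are updated as usual.

            BETA = 0.01   # pursuit step size for the probabilities
            ALPHA = 0.1   # step size for the value estimates

            def pursuit_step(q, pi, action, reward):
                """One update of a pursuit method for an n-armed bandit.

                q  : list of action-value estimates Q_t(a)
                pi : list of selection probabilities pi_t(a), summing to 1
                """
                # Update the estimate of the action just taken.
                q[action] += ALPHA * (reward - q[action])
                # Move every probability a step toward the current greedy target.
                greedy = max(range(len(q)), key=lambda a: q[a])
                for a in range(len(pi)):
                    target = 1.0 if a == greedy else 0.0
                    pi[a] += BETA * (target - pi[a])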

    • The initial value of the reference reward, $\bar{r}_0$, can be set either optimistically, to encourage exploration, or according to prior knowledge
      • This equation implements the idea that high rewards should increase the probability of reselecting the action taken, and low rewards should decrease its probability.
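      • A hedged sketch of the reinforcement-comparison update the note describes (a reconstruction, not the book's code): the preference for the action taken is raised when the received reward exceeds the reference reward and lowered when it falls below it, and the reference reward itself tracks recent rewards.

            ALPHA = 0.1   # step size for the reference-reward average
            BETA = 0.1    # step size for the preference update

            def reinforcement_comparison_step(preferences, ref_reward, action, reward):
                """Update the preference of `action` relative to the reference reward.

                `preferences` feeds a softmax policy; returns the new reference reward.
                """
                # Reward above the reference raises the preference (and hence the
                # probability of reselecting this action); reward below lowers it.
                preferences[action] += BETA * (reward - ref_reward)
                # Keep the reference reward as a recency-weighted average of rewards.
                ref_reward += ALPHA * (reward - ref_reward)
                return ref_reward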


    • Indeed, any method that focuses on the initial state in any special way is unlikely to help with the general nonstationary case
    • optimistic initial values
    • In this book we denote the step-size parameter by the symbol $\alpha$ or, more generally, by $\alpha_t(a)$
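      • To make the quoted points concrete, a minimal sketch (an illustration under assumed names, not the book's code): a constant step-size $\alpha$ yields a recency-weighted average, which suits the nonstationary case, and optimistic initial values encourage early exploration.

            N_ACTIONS = 10
            ALPHA = 0.1            # constant step-size parameter
            OPTIMISTIC_INIT = 5.0  # set above any plausible reward to encourage exploration

            # Optimistic initial action-value estimates.
            q = [OPTIMISTIC_INIT] * N_ACTIONS

            def update(action, reward):
                # Constant-alpha update: a recency-weighted average, appropriate when
                # the true action values drift over time (the nonstationary case).
                q[action] += ALPHA * (reward - q[action])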


    • We know of no careful comparative studies of these two simple action-selection rules.
    • As we will see in the next few chapters, effective nonstationarity is the case most commonly encountered in reinforcement learning. Even if the underlying task is stationary and deterministic, the learner faces a set of banditlike decision tasks each of which changes over time due to the learning process itself. Reinforcement learning requires a balance between exploration and exploitation.
    • The greedy method performs significantly worse in the long run because it often gets stuck performing suboptimal actions
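      • A short sketch of the two simple action-selection rules compared above (an illustration, assuming a list `q` of current value estimates): the greedy rule always exploits, while the epsilon-greedy rule explores with probability epsilon, which is why the pure greedy method can get stuck on a suboptimal action.

            import random

            def greedy(q):
                """Always pick the action with the highest current estimate (pure exploitation)."""
                return max(range(len(q)), key=lambda a: q[a])

            def epsilon_greedy(q, epsilon=0.1):
                """With probability epsilon pick a random action (exploration);
                otherwise pick the greedy action (exploitation)."""
                if random.random() < epsilon:
                    return random.randrange(len(q))
                return greedy(q)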


    • In this book we do not worry about balancing exploration and exploitation in a sophisticated way;
    • There are many sophisticated methods for balancing exploration and exploitation for particular mathematical formulations of the $n$-armed bandit and related problems. However, most of these methods make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications and in the full reinforcement learning problem that we consider in subsequent chapters


    • the $n$-armed bandit problem
    • The example of Phil's breakfast in this chapter was inspired by Agre (1988). We direct the reader to Chapter 6 for references to the kind of temporal-difference method we used in the tic-tac-toe example.
    • Some modern neural-network textbooks use the term "trial-and-error" to describe networks that learn from training examples because they use error information to update connection weights. This is an understandable confusion, but it substantially misses the essential selectional character of trial-and-error learning
      • Where, in fact, is the difference between trial-and-error and learning systems?

    • How do you distribute credit for success among the many decisions that may have been involved in producing it? All of the methods we discuss in this book are, in a sense, directed toward solving this problem.


    • The concepts of value and value functions are the key features of the reinforcement learning methods that we consider in this book.