yanivdll's List: Final Project

1.4 An Extended Example: Tic-Tac-Toe 24

Nov 16, 07

www.cs.ualberta.ca/...node10.html finalproject
- Here is how the tic-tac-toe problem would be approached using reinforcement learning and approximate value functions. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's value, and the whole table is the learned value function. State A has higher value than state B, or is considered "better" than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are "filled up," the correct probability is 0, as we cannot win from them. We set the initial values of all the other states to 0.5, representing a guess that we have a 50% chance of winning.
 
 An approach to tic-tac-toe through learning 
- temporal-difference
- How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?
- How well a reinforcement learning system can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience
- If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that slowly change their way of playing.
- other words
 
 vv 
- greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning
- if the step-size parameter is reduced properly over time, this method converges, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player.
- For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent.
 
 good to show to Asher Wilk 
- Tesauro's
 
 Biblio
- Gerry Tesauro (1992, 1995) combined the algorithm described above with an artificial neural network to learn to play backgammon, which has approximately $10^{20}$ states
 
 Biblio
- Although tic-tac-toe is a two-person game, reinforcement learning also applies in the case in which there is no external adversary, that is, in the case of a "game against nature."
- Second, there is a clear goal, and correct behavior requires planning or foresight that takes into account delayed effects of one's choices
- This simple example illustrates some of the key features of reinforcement learning methods. First, there is the emphasis on learning while interacting with an environment, in this case with an opponent player
- For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win
 
 Can this be compared to a cartoon: if the character made an attempt in the right direction and failed because of something silly, it will not choose the same path again, even though that is the path to the solution? 
- In the end, both evolutionary and value function methods search the space of policies, but learning a value function takes advantage of information available during the course of play.
 
 Need to get clarification from Yuval about the difference between the evolutionary approach and the RL approach 
- An evolutionary approach to this problem would directly search the space of possible policies for one with a high probability of winning against the opponent. Here, a policy is a rule that tells the player what move to make for every state of the game--every possible configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. This evaluation would then direct which policy or policies were considered next. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate a population of policies. Literally hundreds of different optimization methods could be applied. By directly searching the policy space we mean that entire policies are proposed and compared on the basis of scalar evaluations.
 
 read again 
 
 different solutions - evolutionary and RL 
- if the step-size parameter is reduced properly over time, this method converges, for any fixed opponent
- step-size parameter
- Exploratory moves do not result in any learning, but each of our other moves does, causing backups as suggested by the curved arrows and detailed in the text.
 
 ? 
- Although this is a simple problem, it cannot readily be solved in a satisfactory way through classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent.
 
 can be shown to Asher Wilk 
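A minimal Python sketch of the value table and the temporal-difference backup described in the highlights above. It is an illustration, not the book's code. The board encoding (a tuple of nine cells) and the names `initial_value`, `choose_move`, and `td_update` are assumptions made here, and so are the default values of `alpha` (the step-size parameter) and `epsilon` (the chance of an exploratory move).

```python
import random

# Board: tuple of 9 cells, each "X", "O", or " ". We always play X.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return "X" or "O" if that side has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def initial_value(board):
    """Initial estimate of our (X's) probability of winning from `board`."""
    w = winner(board)
    if w == "X":
        return 1.0          # three Xs in a row: already won
    if w == "O" or " " not in board:
        return 0.0          # three Os in a row, or filled up: cannot win
    return 0.5              # every other state: guess a 50% chance

values = {}                 # the learned value function, filled lazily

def value(board):
    """Look up the current estimate, creating it on first use."""
    if board not in values:
        values[board] = initial_value(board)
    return values[board]

def after_move(board, cell, mark="X"):
    """Return the board that results from placing `mark` in `cell`."""
    return board[:cell] + (mark,) + board[cell + 1:]

def choose_move(board, epsilon=0.1, rng=random):
    """Return (cell, exploratory): mostly greedy, occasionally random."""
    empty = [i for i, c in enumerate(board) if c == " "]
    if rng.random() < epsilon:
        return rng.choice(empty), True
    best = max(empty, key=lambda i: value(after_move(board, i)))
    return best, False

def td_update(earlier, later, alpha=0.1):
    """Back up: V(s) <- V(s) + alpha * (V(s') - V(s))."""
    values[earlier] = value(earlier) + alpha * (value(later) - value(earlier))
```

A caller would skip `td_update` whenever `choose_move` reports an exploratory move, matching the highlight that says exploratory moves do not result in any learning.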
1.3 Elements of Reinforcement Learning 8

Nov 16, 07

www.cs.ualberta.ca/...node9.html finalproject rl
- The fourth and final element of some reinforcement learning systems is a model of the environment
 
 model of the environment 
- all the reinforcement learning methods we consider in this book are structured around estimating value functions
- it is values with which we are most concerned when making and evaluating decisions
- Whereas a reward function indicates what is good in an immediate sense, a value function
 
 value function 
- reward function must necessarily be unalterable by the agent.
- reward function defines the goal in a reinforcement learning problem
 
 reward function
- A policy defines the learning agent's way of behaving at a given time.
 
 Policy 
- Beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward function, a value function, and, optionally, a model of the environment.
1.2 Examples 1

Nov 13, 07

www.cs.ualberta.ca/...node8.html finalproject rl
- These examples
 
 start reading HERE
1.1 Reinforcement Learning 7

Nov 13, 07

www.cs.ualberta.ca/...node7.html finalproject rl
- Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment
- The agent has to exploit what it already knows in order to obtain reward, but it also has to explore in order to make better action selections in the future
- to discover such actions, it has to try actions that it has not selected before
- One of the challenges that arise in reinforcement learning and not in other kinds of learning is the trade-off between exploration and exploitation
- The formulation is intended to include just these three aspects--sensation, action, and goal--in their simplest possible forms without trivializing any of them.
- Reinforcement learning is defined not by characterizing learning methods, but by characterizing a learning problem
 
 Tetris is my RL problem? 
- The learner is not told which actions to take, as in most forms of machine learning, but instead must discover which actions yield the most reward by trying them
Markov decision process - Wikipedia, the free encyclopedia 3

Nov 13, 07

en.wikipedia.org/...Markov_decision_process finalproject rl
- mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of the decision maker. MDPs are useful for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. MDPs were known at least as early as in the fifties (cf. Bellman 1957). Much research in the area was spawned due to Ronald A. Howard's book, Dynam
- The goal is to maximize some cumulative function of the rewards, typically the discounted sum under a discounting factor $\gamma$ (usually just under 1)
- A Markov Decision Process is a tuple $(S,A,P_\cdot(\cdot,\cdot),R(\cdot))$ , where
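The discounted sum mentioned in the highlight above can be sketched in a few lines of Python. The function name `discounted_return` and the default `gamma=0.9` are assumptions made for illustration; the highlight only says the discounting factor is usually just under 1.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * rewards[t] for t = 0, 1, 2, ..."""
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total
```

With `gamma` close to 1 later rewards count almost as much as immediate ones, and with `gamma` close to 0 only the first few rewards matter.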
Reinforcement learning - Wikipedia, the free encyclopedia 7

Nov 13, 07

en.wikipedia.org/...Reinforcement_Learning finalproject rl
- direct approach
- The direct approach is the basis for the algorithms used in Evolutionary robotics.
- After we have defined an appropriate return function to be maximized, we need to specify the algorithm that will be used to find the policy with the maximum return. There are two main approaches, the value function approach and the direct approach.
 
 The direct approach
- The direct approach
- the multi-armed bandit problem
 
 mentioned also in the survey article 
- It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and chess.
 
 applications for RL 
- also noted in the survey article
 - yanivdll on 2007-11-13
3.1 The Agent-Environment Interface 6

Dec 07, 07

www.cs.ualberta.ca/...node28.html finalproject
- At each time step, the agent implements a mapping from states to probabilities of selecting each possible action. This mapping is called the agent's policy and is denoted $\pi_t$, where $\pi_t(s,a)$ is the probability that $a_t = a$ if $s_t = s$. Reinforcement learning methods specify how the agent changes its policy as a result of its experience. The agent's goal, roughly speaking, is to maximize the total amount of reward it receives over the long run.
- The Agent-Environment Interface
- A complete specification of an environment defines a task, one instance of the reinforcement learning problem.
- everything outside the agent, is called the environment
- The learner and decision-maker is called the agent
- This chapter describes the topic of learning
  - yanivdll on 2007-12-07
2.11 Conclusions 2

Dec 07, 07

www.cs.ualberta.ca/...node25.html finalproject
- interval estimation
- and the pursuit methods keep taking steps toward the current greedy action
 
 didn't understand the pursuit methods
2.8 Reinforcement Comparison 8

Dec 07, 07

www.cs.ualberta.ca/...node22.html finalproject
- The initial value of the reference reward, $\bar{r}_0$, can be set either optimistically, to encourage exploration, or according to prior knowledge
- This equation implements the idea that high rewards should increase the probability of reselecting the action taken, and low rewards should decrease its probability.
- Let us denote the preference for action $a$ on play $t$ by $p_t(a)$
- But how is the learner to know what constitutes a large or a small reward? If an action is taken and the environment returns a reward of 5, is that large or small? To make such a judgment one must compare the reward with some standard or reference level, called the reference reward
- Learning methods based on this idea are called reinforcement comparison methods
- A natural choice for the reference reward is an average of previously received rewards. In other words, a reward is interpreted as large if it is higher than average, and small if it is lower than average
- A central intuition underlying reinforcement learning is that actions followed by large rewards should be made more likely to recur, whereas actions followed by small rewards should be made less likely to recur
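A rough Python sketch of a reinforcement comparison learner on a bandit task, following the highlights above: action preferences are turned into selection probabilities with a softmax, the preference of the chosen action moves by the difference between the reward and the reference reward, and the reference reward tracks an average of past rewards. The function names, the Gaussian reward model, and the step sizes `alpha` and `beta` are assumptions made for illustration.

```python
import math
import random

def softmax(preferences):
    """Turn action preferences into selection probabilities."""
    m = max(preferences)                      # subtract max for stability
    exps = [math.exp(p - m) for p in preferences]
    total = sum(exps)
    return [e / total for e in exps]

def reinforcement_comparison(true_means, plays=2000, alpha=0.1, beta=0.1,
                             initial_reference=0.0, noise_std=1.0, seed=0):
    """Return (preferences, reference_reward) after `plays` plays."""
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)
    reference = initial_reference             # the reference reward
    for _ in range(plays):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        reward = rng.gauss(true_means[action], noise_std)
        # Rewards above the reference raise the preference, below lower it.
        prefs[action] += beta * (reward - reference)
        # The reference reward is a running average of received rewards.
        reference += alpha * (reward - reference)
    return prefs, reference
```

Setting `initial_reference` above the rewards one expects corresponds to the optimistic choice mentioned in the first highlight of this section.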
2.7 Optimistic Initial Values 2

Dec 07, 07

www.cs.ualberta.ca/...node21.html finalproject
- Indeed, any method that focuses on the initial state in any special way is unlikely to help with the general nonstationary case
- optimistic initial values
2.6 Tracking a Nonstationary Problem

Nov 26, 07

www.cs.ualberta.ca/...node20.html finalproject
2.5 Incremental Implementation 3

Nov 26, 07

www.cs.ualberta.ca/...node19.html finalproject
- In this book we denote the step-size parameter by the symbol $\alpha$ or, more generally
- The update rule (2.4) is of a form that occurs frequently throughout this book. The general form is
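The general form referred to in the highlight above is, in the book, NewEstimate <- OldEstimate + StepSize [Target - OldEstimate]. A small Python sketch follows; the function names `update` and `sample_average` are assumptions made here. With a step size of 1/k on the k-th reward, the rule reproduces the ordinary average without storing past rewards.

```python
def update(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

def sample_average(rewards):
    """Average of `rewards`, computed incrementally with step size 1/k."""
    estimate = 0.0
    for k, r in enumerate(rewards, start=1):
        estimate = update(estimate, r, 1.0 / k)
    return estimate
```

Replacing `1.0 / k` with a constant step size would weight recent rewards more heavily, which is the usual choice when the task changes over time.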
ApplyingReinforcementLearningToTetris_DonaldCarr_RU_AC_ZA.pdf (application/pdf Object)

Nov 26, 07

colinfahey.com/...Tetris_DonaldCarr_RU_AC_ZA.pdf finalproject
- - yanivdll on 2007-11-26
2.3 Softmax Action Selection 1

Nov 26, 07

www.cs.ualberta.ca/...node17.html finalproject
- We know of no careful comparative studies of these two simple action-selection rules.
2.2 Action-Value Methods 6

Nov 17, 07

www.cs.ualberta.ca/...node16.html finalproject
- As we will see in the next few chapters, effective nonstationarity is the case most commonly encountered in reinforcement learning. Even if the underlying task is stationary and deterministic, the learner faces a set of banditlike decision tasks each of which changes over time due to the learning process itself. Reinforcement learning requires a balance between exploration and exploitation.
- The greedy method performs significantly worse in the long run because it often gets stuck performing suboptimal actions
- We call this the sample-average method
- We call methods using this near-greedy action selection rule $\varepsilon$-greedy methods.
- $\varepsilon$-greedy methods
- The simplest action selection rule is to select the action (or one of the actions) with highest estimated action value, that is, to select on play $t$ one of the greedy actions, $a_t^*$, for which
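A minimal Python sketch of $\varepsilon$-greedy action selection combined with the sample-average method from the highlights above. The function name, the Gaussian reward model with unit variance, and the default `epsilon=0.1` are assumptions made for illustration, not the book's experimental setup.

```python
import random

def epsilon_greedy_bandit(true_means, plays=1000, epsilon=0.1, seed=0):
    """Return (estimates, counts) after `plays` plays of a bandit task."""
    rng = random.Random(seed)
    n = len(true_means)
    estimates = [0.0] * n     # sample-average estimate of each action value
    counts = [0] * n          # how many times each action was selected
    for _ in range(plays):
        if rng.random() < epsilon:
            action = rng.randrange(n)         # explore: any action
        else:
            best = max(estimates)             # exploit: a greedy action,
            greedy = [a for a in range(n) if estimates[a] == best]
            action = rng.choice(greedy)       # ties broken at random
        reward = rng.gauss(true_means[action], 1.0)
        counts[action] += 1
        # Sample average, computed incrementally.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts
```

Calling it with `epsilon=0.0` gives the purely greedy method, which the highlight above says often gets stuck performing suboptimal actions.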
2.1 An $n$-Armed Bandit Problem 3

Nov 17, 07

www.cs.ualberta.ca/...node15.html finalproject
- In this book we do not worry about balancing exploration and exploitation in a sophisticated way;
- There are many sophisticated methods for balancing exploration and exploitation for particular mathematical formulations of the $n$-armed bandit and related problems. However, most of these methods make strong assumptions about stationarity and prior knowledge that are either violated or impossible to verify in applications and in the full reinforcement learning problem that we consider in subsequent chapters
- If you maintain estimates of the action values, then at any time there is at least one action whose estimated value is greatest. We call this a greedy action. If you select a greedy action, we say that you are exploiting your current knowledge of the values of the actions. If instead you select one of the nongreedy actions, then we say you are exploring because this enables you to improve your estimate of the nongreedy action's value. Exploitation is the right thing to do to maximize the expected reward on the one play, but exploration may produce the greater total reward in the long run. For example, suppose the greedy action's value is known with certainty, while several other actions are estimated to be nearly as good but with substantial uncertainty. The uncertainty is such that at least one of these other actions probably is actually better than the greedy action, but you don't know which one. If you have many plays yet to make, then it may be better to explore the nongreedy actions and discover which of them are better than the greedy action. Reward is lower in the short run, during exploration, but higher in the long run because after you have discovered the better actions, you can exploit them. Because it is not possible both to explore and to exploit with any single action selection, one often refers to the "conflict" between exploration and exploitation.
 
 the "conflict" between exploration and exploitation
2. Evaluative Feedback 2

Nov 17, 07

www.cs.ualberta.ca/...node14.html finalproject
- the $n$-armed bandit problem
1.7 Bibliographical Remarks 1

Nov 16, 07

www.cs.ualberta.ca/...node13.html finalproject
- The example of Phil's breakfast in this chapter was inspired by Agre (1988). We direct the reader to Chapter 6 for references to the kind of temporal-difference method we used in the tic-tac-toe example.
1.6 History of Reinforcement Learning 5

Nov 16, 07

www.cs.ualberta.ca/...node12.html finalproject
- Some modern neural-network textbooks use the term "trial-and-error" to describe networks that learn from training examples because they use error information to update connection weights. This is an understandable confusion, but it substantially misses the essential selectional character of trial-and-error learning
 
 So where exactly is the difference between trial and error and learning systems? 
- How do you distribute credit for success among the many decisions that may have been involved in producing it? All of the methods we discuss in this book are, in a sense, directed toward solving this problem.
- Law of Effect
- temporal-difference
- The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a)
1.5 Summary 1

Nov 16, 07

www.cs.ualberta.ca/...node11.html finalproject
- The concepts of value and value functions are the key features of the reinforcement learning methods that we consider in this book.