Summary of Reinforcement learning
Lingo
In reinforcement learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards.
List form
Word | Explanation or Example |
--- | --- |
Agent | For example, a program controlling a robot |
Observation | Sensor signals |
Action | Positive example: approaching the destination; negative example: wasting time |
Action Space | The set of possible actions |
Environment | The real world (or a simulation of it) |
Policy | The algorithm a software agent uses to determine its actions |
Stochastic Policy | A policy involving randomness |
Policy Search | A brute-force sweep over policies to find the "best" one |
Policy Space | The collection of all admissible policies |
Off-policy RL algorithm | Updates state and/or action values based on a different policy than the behavior policy. In Q-learning, for example, the behavior policy is typically ε-greedy, while the update policy is purely greedy. |
Genetic algorithm | Find a good policy, then let it evolve, sexually (combined with another good policy) or asexually (from itself), to find new policies |
Gradient Ascent | Like gradient descent, but searching for maxima rather than minima (e.g. maximizing rewards by tweaking policy parameters) |
Policy Gradient algorithm | If a policy is defined in such a way that it is differentiable, it can be improved using gradient ascent on the expected rewards |
OpenAI Gym | A toolkit providing many simulated environments for developing and comparing RL algorithms |
Exploration | Trying out new actions to learn more about the environment |
Exploitation | Choosing the actions that currently look best in order to collect reward |
Credit Assignment problem | The difficulty of knowing which of the preceding actions deserve credit (or blame) for a reward |
Discount Factor | How much future rewards count relative to immediate ones (γ) |
Return (of an action) | The sum of (discounted) rewards for this and the following actions |
Action Advantage | How much better or worse an action is compared to the average action in that state |
Reward | Points or penalties for one action |
Markov chains | Stochastic processes with no memory: the next state depends only on the current state |
Markov Decision Process | A process defined by states, actions and rewards, and which displays the Markov property |
Actor-Critic algorithm | Combines policy gradients (the actor) with value-function estimation (the critic) |
Terminal state | A state that ends an episode |
Q-value | Also known as the action-value: the value achieved by taking a certain action in a given state |
TD algorithm | Temporal Difference learning can be compared with Monte Carlo (MC) learning and Dynamic Programming (DP): broadly speaking, DP is breadth-first with shallow updates, MC is depth-first with deep updates, and TD is depth-first with shallow updates. Q-learning is a TD method. |
Q-learning | A TD algorithm using off-policy learning |
Approximate Q-learning | Q-learning where Q-values are approximated using function approximation |
Deep Q-Network | A network that approximates Q-values |
Deep Q-Learning | Q-learning where Q-values are approximated using deep learning for function approximation |
Catastrophic forgetting | When a network forgets what it learned from earlier experiences as it keeps training on new ones |
Replay Buffer | The database of past experiences used to train the function approximator |
Trajectory | A sequence of consecutive time steps (observations, actions, rewards) |
Episode | One run from an initial state to a terminal state |
Driver | An object (in TF-Agents) that runs a policy in an environment and passes the collected experiences to observers such as the replay buffer |
Objective
Maximize the expected cumulative (discounted) reward, i.e. the return.
Algorithms
REINFORCE algorithms
- Let the policy network play the game several times. At each step, compute the gradients but don't apply them yet.
- After several episodes, compute each action's advantage.
- If an action's advantage is positive, it was probably good: apply the gradients computed earlier to make it more likely in the future. If it is negative, apply the opposite gradients to make the action less likely in the future. In other words, multiply each gradient vector by the corresponding action's advantage.
- Compute the mean of all the resulting gradient vectors, and use it to perform a gradient descent step (a sketch follows this list).
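A minimal sketch of these steps, assuming a CartPole-style task with two actions, a tiny Keras policy network, and the old 4-value Gym step API (none of these choices are prescribed above):

```python
import numpy as np
import tensorflow as tf
import gym  # assumption: classic CartPole, old-style Gym API

env = gym.make("CartPole-v1")
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="elu", input_shape=[4]),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of action 0 (left)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = tf.keras.losses.binary_crossentropy

def play_one_step(env, obs):
    # Compute (but do not apply) the gradients that would make the sampled action more likely.
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = tf.random.uniform([1, 1]) > left_proba
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))  # old Gym API (4 values)
    return obs, reward, done, grads

def discounted_returns(rewards, gamma):
    returns = np.array(rewards, dtype=np.float64)
    for step in range(len(returns) - 2, -1, -1):
        returns[step] += gamma * returns[step + 1]
    return returns

for iteration in range(150):
    all_rewards, all_grads = [], []
    for episode in range(10):                       # play several episodes
        obs, rewards, grads_list = env.reset(), [], []
        for step in range(200):
            obs, reward, done, grads = play_one_step(env, obs)
            rewards.append(reward)
            grads_list.append(grads)
            if done:
                break
        all_rewards.append(rewards)
        all_grads.append(grads_list)
    # Action advantages: discounted returns, normalized across all episodes.
    all_returns = [discounted_returns(r, gamma=0.95) for r in all_rewards]
    flat = np.concatenate(all_returns)
    all_advantages = [(ret - flat.mean()) / flat.std() for ret in all_returns]
    # Weight each gradient by its action's advantage, average, then apply.
    mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads.append(tf.reduce_mean([
            float(advantage) * all_grads[ep][step][var_index]
            for ep, advantages in enumerate(all_advantages)
            for step, advantage in enumerate(advantages)], axis=0))
    optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```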
TD Algorithm (Temporal Difference Learning)
The TD algorithm tries to learn the optimal value function.
$$V_{k+1}(s) \leftarrow (1 - \alpha)\,V_k(s) + \alpha\left(r + \gamma\,V_k(s')\right)$$

In this equation, $\alpha$ is the learning rate, $r$ is the reward, $\gamma$ is the discount factor and $V_k(s)$ is the current estimate of the value function at state $s$. The quantity $r + \gamma\,V_k(s')$ is called the TD target, and $\delta_k(s, r, s') = r + \gamma\,V_k(s') - V_k(s)$ is called the TD error.
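As a concrete illustration, here is a tabular TD(0) sketch of this update on a toy random-walk chain; the environment, the uniformly random behavior and the hyperparameters are assumptions made for the example:

```python
import random

n_states = 7                       # states 0..6; 0 and 6 are terminal (assumed toy chain)
V = [0.0] * n_states               # value estimates V_k(s)
alpha, gamma = 0.1, 0.95

def step(s):
    """Random-walk dynamics: move left or right; reward 1 only when reaching state 6."""
    s_next = s + random.choice([-1, 1])
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next in (0, n_states - 1)
    return s_next, reward, done

for episode in range(10_000):
    s, done = n_states // 2, False
    while not done:
        s_next, r, done = step(s)
        td_target = r + gamma * V[s_next]
        td_error = td_target - V[s]
        V[s] += alpha * td_error    # V(s) <- (1-alpha) V(s) + alpha (r + gamma V(s'))
        s = s_next
```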
Q-learning
The Q-learning algorithm is an adaptation of the Q-value iteration algorithm to the situation where the transition probabilities and rewards are initially unknown. Q-learning works by watching an agent play (e.g. randomly) and gradually improving its estimates of the Q-values. Once done, the optimal policy is chosen greedily.
$$Q_{k+1}(s, a) \leftarrow (1 - \alpha)\,Q_k(s, a) + \alpha\left(r + \gamma\,\max_{a'} Q_k(s', a')\right)$$
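A tabular sketch of this update, assuming a hypothetical environment object with reset() and step() methods and an ε-greedy behavior policy (the update itself stays greedy):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=5000, alpha=0.05, gamma=0.95, epsilon=0.1):
    """env is a hypothetical object: reset() -> s, step(a) -> (s_next, r, done)."""
    Q = defaultdict(lambda: [0.0] * n_actions)      # Q_k(s, a), initialized to 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy exploration.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[s][a_])
            s_next, r, done = env.step(a)
            # Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s', a'))
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q   # the greedy policy w.r.t. Q is then used for control
```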
Approximate Q-learning
Instead of maintaining a separate Q-value estimate for each state-action pair (s, a), we parametrize the mapping from (s, a) to Q(s, a) using a class of models, and use learning to find the best model in that class. If the model is a neural network, we usually call this Deep Q-Learning.
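A minimal Deep Q-Learning flavoured sketch of this idea: a small Keras network Q_θ(s, ·) with one output per action. The state/action dimensions and the layer sizes are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

n_state_dims, n_actions = 4, 2         # e.g. a CartPole-like task (assumed)

q_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="elu", input_shape=[n_state_dims]),
    tf.keras.layers.Dense(32, activation="elu"),
    tf.keras.layers.Dense(n_actions),  # one Q-value estimate per action
])

def greedy_action(state):
    """state: 1-D array of length n_state_dims; returns the argmax action."""
    q_values = q_model(state[np.newaxis])   # shape (1, n_actions)
    return int(tf.argmax(q_values[0]))
```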
Instability fix for Q-learning
Recall Brian Anderson's seminar last fall: there needs to be a natural order of time constants in an adaptive scheme.
Sampling rate > system dynamics > model updates > changes to physical parameters
That is, you must sample fast enough to capture the system dynamics, but you must update your parameter estimates at a slower rate than the dynamics of the system. You must also update your model parameters at a faster rate than they change in real life.
This idea shows up in the Double DQN, which uses two models: the online model selects the best actions for the next states, while a slower-moving target model estimates the Q-Values of those selected actions when forming the TD target.
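A sketch of that target computation, assuming online and target models that map a batch of states to per-action Q-values, and float32 tensors rewards and dones of shape (batch,):

```python
import tensorflow as tf

def double_dqn_targets(online_model, target_model, rewards, next_states, dones, gamma=0.99):
    # Online model chooses the argmax action for each next state ...
    next_q_online = online_model(next_states)                # (batch, n_actions)
    best_next_actions = tf.argmax(next_q_online, axis=1)
    # ... but the target model provides the Q-value of that chosen action.
    next_q_target = target_model(next_states)                # (batch, n_actions)
    mask = tf.one_hot(best_next_actions, next_q_target.shape[1])
    best_next_q = tf.reduce_sum(next_q_target * mask, axis=1)
    # TD targets; terminal transitions get no bootstrap term. gamma is an assumed discount.
    return rewards + gamma * (1.0 - dones) * best_next_q
```

The target model is usually a clone of the online model whose weights are copied over only every few thousand training steps, which provides the slower time constant discussed above.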
Prioritized Experience Replay
Sample experiences based on the magnitude of their estimation (TD) error, rather than uniformly. Watch out: the training data is now biased towards these high-error transitions, so you need to down-weight them during training (importance-sampling weights).
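A sketch of proportional prioritized sampling with importance-sampling weights; the exponents zeta (priority strength) and beta (bias correction) are assumed hyperparameters:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, zeta=0.6, beta=0.4):
    # Priorities proportional to the magnitude of the TD error (plus a small constant).
    priorities = (np.abs(td_errors) + 1e-6) ** zeta
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    # Down-weight over-sampled transitions so training is unbiased again.
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```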
Architecture - Q-learning using TF-agents
Pseudo-implementation of Deep Q-Learning using the TF-Agents library
- Import suite_atari, AtariPreprocessing and FrameStack4. Load the environment with FrameStack4 and AtariPreprocessing applied (page 648).
- Convert it to a TFPyEnvironment (page 648).
- Specify a QNetwork (page 650).
- Specify an agent, e.g. DqnAgent (page 652).
- Create a replay buffer, e.g. a TFUniformReplayBuffer from tf_uniform_replay_buffer (page 654).
- Specify metrics (page 655).
- Specify a collect driver, e.g. DynamicStepDriver (page 656).
- Precollect random data and create a tf.data dataset object (page 658).
- Write a training loop (page 661).
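A condensed sketch of this pipeline (metrics omitted), assuming the standard TF-Agents class locations and made-up hyperparameters such as the learning rate, buffer size and iteration counts; see the referenced pages for the full version:

```python
import tensorflow as tf
from tf_agents.environments import suite_atari
from tf_agents.environments.atari_preprocessing import AtariPreprocessing
from tf_agents.environments.atari_wrappers import FrameStack4
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.networks.q_network import QNetwork
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.policies.random_tf_policy import RandomTFPolicy

# 1. Load the Atari environment with preprocessing and frame stacking.
env = suite_atari.load(
    "BreakoutNoFrameskip-v4",
    gym_env_wrappers=[AtariPreprocessing, FrameStack4])
tf_env = TFPyEnvironment(env)

# 2. Q-network: normalize pixels, then conv + dense layers, one output per action.
q_net = QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    preprocessing_layers=tf.keras.layers.Lambda(
        lambda obs: tf.cast(obs, tf.float32) / 255.0),
    conv_layer_params=[(32, (8, 8), 4), (64, (4, 4), 2), (64, (3, 3), 1)],
    fc_layer_params=[512])

# 3. DQN agent (the target network is handled internally by the agent).
agent = DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=2.5e-4),
    target_update_period=2000,
    train_step_counter=tf.Variable(0))
agent.initialize()

# 4. Replay buffer storing the collected trajectories.
replay_buffer = TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=100_000)

# 5. Collect driver: runs the collect policy and writes to the replay buffer.
collect_driver = DynamicStepDriver(
    tf_env, agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=4)

# 6. Precollect experience with a random policy, then build a tf.data pipeline.
initial_driver = DynamicStepDriver(
    tf_env,
    RandomTFPolicy(tf_env.time_step_spec(), tf_env.action_spec()),
    observers=[replay_buffer.add_batch],
    num_steps=20_000)
initial_driver.run()

dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

# 7. Training loop: collect a few steps, sample a batch, train the agent.
time_step = None
policy_state = agent.collect_policy.get_initial_state(tf_env.batch_size)
for iteration in range(50_000):
    time_step, policy_state = collect_driver.run(time_step, policy_state)
    trajectories, buffer_info = next(iterator)
    train_loss = agent.train(trajectories)
```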