Summary of Reinforcement Learning

Lingo

In reinforcement learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards.

List form

Agent: For example, a program controlling a robot.
Observation: Sensor signals.
Action: What the agent does at each step, e.g. the signals the robot program sends to its motors.
Action Space: The set of possible actions the agent can choose from.
Environment: The world the agent interacts with, e.g. the real world or a simulation.
Policy: The algorithm a software agent uses to determine its actions.
Stochastic Policy: A policy involving randomness.
Policy Search: Searching for a good policy, in the simplest case a brute-force sweep over policies to find the "best" one.
Policy Space: The collection of all admissible policies.
Off-policy RL algorithm: State and/or action values are updated based on a different policy than the behavior policy. In Q-learning, for example, the behavior policy is normally ε-greedy while the update policy is purely greedy.
Genetic algorithm: Find good policies, then let them evolve, sexually (combined with another good policy) or asexually (mutated from itself), to produce new candidate policies.
Gradient Ascent: Like gradient descent, but searching for maxima rather than minima (e.g. maximize rewards by tweaking policy parameters).
Policy Gradient algorithm: If a policy is defined in such a way that it is differentiable, it can be improved using gradient ascent on the expected reward.
OpenAI Gym: A toolkit of simulated environments (Atari games, physics simulations, etc.) for developing and comparing RL algorithms.
Exploration: Trying new actions to learn more about the environment.
Exploitation: Choosing actions that are already known to give high reward.
Credit Assignment problem: Determining which earlier actions deserve credit (or blame) for a reward that arrives later.
Discount Factor: How much future rewards count relative to immediate ones (the factor γ).
Return (of an action): The sum of (discounted) rewards collected from this action and the following ones.
Action Advantage: How much better (or worse) an action is compared to the average action in that state.
Reward: Points or penalties for one action. In the robot example: positive when approaching the destination, negative when wasting time.
Q-value: Also known as the action-value: the value achieved by taking a certain action (in a given state).
Markov chain: A memoryless stochastic process, where the next state depends only on the current state.
Markov Decision Process: A process defined with states, actions and rewards which displays the Markov property.
Actor-Critic algorithm: Combines a policy (the actor) with a learned value function (the critic) that evaluates the actor's actions.
Terminal state: A state in which the episode ends.
TD algorithm: Temporal Difference learning, which can be compared against Monte Carlo (MC) learning and Dynamic Programming (DP). Broadly speaking, DP is breadth-first with shallow updates, MC is depth-first with deep updates, and TD is depth-first with shallow updates. Q-learning is a TD method.
Q-learning: A TD algorithm using off-policy learning.
ε-greedy policy: With probability ε take a random action, otherwise the greedy (best-known) action. ε is commonly started high (e.g. 1.0) and annealed to a small final value (e.g. 0.05 or 0.01) during training.
Approximate Q-learning: Q-learning where Q-values are approximated using function approximation.
Deep Q-Network: A neural network that approximates Q-values.
Deep Q-Learning: Q-learning where Q-values are approximated using deep learning for function approximation.
Catastrophic forgetting: When a network, while training on new experiences, forgets what it learned from earlier ones.
Replay Buffer: The database of stored experiences used to train the function approximator.
Trajectory: The sequence of states, actions and rewards generated as the agent interacts with the environment.
Episode: One run of the agent from a start state to a terminal state.
Driver: In TF-Agents, the object that runs a policy in the environment for a number of steps or episodes and passes the collected trajectories on to observers (e.g. the replay buffer and metrics).

Objective

Maximize the cumulative (discounted) reward over time.

Algorithms

REINFORCE algorithms

  1. Let the neural network policy play the game several times. At each step, compute the gradients that would make the chosen action even more likely, but don't apply them yet.
  2. After several episodes, compute each action's advantage.
  3. If an action's advantage is positive, it was probably good: apply the gradients computed earlier to make the action more likely in the future. If it is negative, apply the opposite gradients to make the action less likely. In other words, multiply each gradient vector by the corresponding action's advantage.
  4. Compute the mean of all resulting gradient vectors, and use it to perform a gradient descent step (see the sketch below).
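
Below is a minimal sketch of these four steps on CartPole, assuming the classic OpenAI Gym API (reset() returns an observation, step() returns four values) and TensorFlow/Keras; the network size, hyperparameters, and episode counts are illustrative assumptions, not prescriptions.

```python
# Minimal REINFORCE sketch for CartPole. Assumes the classic OpenAI Gym API
# (env.reset() -> obs, env.step(a) -> obs, reward, done, info) and TF/Keras.
import numpy as np
import gym
import tensorflow as tf
from tensorflow import keras

env = gym.make("CartPole-v1")
model = keras.Sequential([
    keras.layers.Dense(32, activation="elu", input_shape=[4]),
    keras.layers.Dense(1, activation="sigmoid"),   # P(action = 0, i.e. push left)
])
optimizer = keras.optimizers.Adam(learning_rate=0.01)
loss_fn = keras.losses.binary_crossentropy

def play_one_step(env, obs, model):
    # Step 1: compute the gradients that would make the sampled action more
    # likely, but do not apply them yet.
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)      # sample an action
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

def discount_and_normalize(all_rewards, gamma=0.95):
    # Step 2: turn raw rewards into normalized returns, used as advantages.
    all_returns = []
    for rewards in all_rewards:
        returns = np.array(rewards, dtype=np.float32)
        for t in range(len(returns) - 2, -1, -1):
            returns[t] += gamma * returns[t + 1]
        all_returns.append(returns)
    flat = np.concatenate(all_returns)
    return [(r - flat.mean()) / flat.std() for r in all_returns]

for iteration in range(50):
    all_rewards, all_grads = [], []
    for episode in range(10):                      # play several episodes
        rewards, grads_seq, obs = [], [], env.reset()
        for step in range(200):
            obs, reward, done, grads = play_one_step(env, obs, model)
            rewards.append(reward)
            grads_seq.append(grads)
            if done:
                break
        all_rewards.append(rewards)
        all_grads.append(grads_seq)
    all_advantages = discount_and_normalize(all_rewards)
    # Steps 3-4: weight each stored gradient by its action's advantage,
    # average over all steps of all episodes, and apply the result.
    mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads.append(tf.reduce_mean(
            [advantage * step_grads[var_index]
             for episode_grads, advantages in zip(all_grads, all_advantages)
             for step_grads, advantage in zip(episode_grads, advantages)],
            axis=0))
    optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```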

TD Algorithm (Temporal Difference Learning)

The TD algorithm tries to learn the optimal state-value function from the transitions and rewards the agent actually observes, using the following update:

V_{k+1}(s) \leftarrow (1-\alpha)\,V_k(s) + \alpha\big(r + \gamma V_k(s')\big)

In this equation, α is the learning rate, r is the reward, r + γ·V_k(s') is called the TD target, δ_k(s, r, s') = r + γ·V_k(s') - V_k(s) is called the TD error, γ is the discount factor, and V_k(s) is the current estimate of the value function at state s.
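
As a concrete illustration of this update, here is a tiny tabular TD(0) sketch on a made-up three-state chain; the environment, rewards, and hyperparameters are invented for the example.

```python
# Tabular TD(0) on a toy 3-state chain: state 0 -> 1 -> 2 (terminal),
# with reward 1.0 on reaching the terminal state. Purely illustrative.
V = {0: 0.0, 1: 0.0, 2: 0.0}      # current value estimates, state 2 is terminal
alpha, gamma = 0.1, 0.9           # learning rate and discount factor

def step(s):
    s_next = s + 1
    reward = 1.0 if s_next == 2 else 0.0
    return s_next, reward

for episode in range(1000):
    s = 0
    while s != 2:
        s_next, r = step(s)
        td_target = r + gamma * V[s_next]
        td_error = td_target - V[s]
        V[s] += alpha * td_error   # same as (1 - alpha)*V[s] + alpha*td_target
        s = s_next

print(V)   # V[1] approaches 1.0 and V[0] approaches gamma * 1.0 = 0.9
```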

 

Q-learning

The Q-learning algorithm is an adaptation of the Q-value iteration algorithm to the situation where the transition probabilities and rewards are initially unknown. Q-learning works by watching an agent play (e.g., randomly) and gradually improving its estimates of the Q-values. Once done, the optimal policy is chosen greedily.

Q_{k+1}(s,a) := (1-\alpha)\,Q_k(s,a) + \alpha\big(r + \gamma \max_{a'} Q_k(s', a')\big)
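
A minimal tabular sketch of this update rule, assuming the classic OpenAI Gym API and the FrozenLake-v1 environment; the ε-greedy behavior policy, hyperparameters, and episode count are illustrative assumptions.

```python
# Tabular Q-learning with an epsilon-greedy behavior policy and a greedy
# (off-policy) update. Assumes the classic Gym API (4-tuple step return).
import numpy as np
import gym

env = gym.make("FrozenLake-v1", is_slippery=False)
n_states, n_actions = env.observation_space.n, env.action_space.n
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.95, 0.1

for episode in range(5000):
    s = env.reset()
    done = False
    while not done:
        # Behavior policy: epsilon-greedy.
        if np.random.rand() < epsilon:
            a = env.action_space.sample()
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done, info = env.step(a)
        # Update policy: greedy over the next state's Q-values.
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next]))
        s = s_next

greedy_policy = np.argmax(Q, axis=1)   # once done, act greedily w.r.t. Q
```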

Approximate Q-learning

Instead of estimating a separate Q-value for each state-action pair (s, a), we parametrize the relationship using a class of models and use learning to find the best model in this class. If we use neural networks as the model class, we usually call it Deep Q-Learning.
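
As a sketch of what such a function approximator could look like: a small Keras network that maps a state to one Q-value per action. The CartPole-like dimensions and layer sizes are assumptions for illustration.

```python
# A small Q-network: input = state, output = one Q-value estimate per action.
# Training regresses q_network(s)[a] toward r + gamma * max_a' q_network(s')[a'].
from tensorflow import keras

n_inputs, n_actions = 4, 2           # assumed state and action dimensions
q_network = keras.Sequential([
    keras.layers.Dense(32, activation="elu", input_shape=[n_inputs]),
    keras.layers.Dense(32, activation="elu"),
    keras.layers.Dense(n_actions),   # no activation: Q-values are unbounded
])
```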

Instability fix for Q-learning

Recall Brian Anderson's seminar last fall: there is (or rather, there needs to be) a natural ordering of the time constants of an adaptive scheme.

Sampling rate > system dynamics > model updates > changes to physical parameters

That is, you must sample fast enough to capture the system dynamics, but you must update your parameter estimates at a slower rate than the dynamics of the system. You must also update your model parameters at a faster rate than they change in real life.

This is implemented in Double DQN, where you use two models: the online model selects the best action for each next state, and the target model, whose weights are updated much less often, estimates the Q-values of these best actions.
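
A sketch of how the Double DQN target could be computed; the names online_model and target_model and the batch layout are assumptions for illustration, not an established API.

```python
# Double DQN target: the online model picks the best next action, the
# slowly-updated target model estimates its Q-value.
import tensorflow as tf

def double_dqn_targets(rewards, next_states, dones, online_model, target_model,
                       n_actions, gamma=0.99):
    # rewards and dones are float32 tensors/arrays of shape [batch_size].
    best_next_actions = tf.argmax(online_model(next_states), axis=1)
    next_mask = tf.one_hot(best_next_actions, n_actions)
    next_q_values = tf.reduce_sum(target_model(next_states) * next_mask, axis=1)
    return rewards + gamma * (1.0 - dones) * next_q_values

# The target model changes slowly: copy the online weights only occasionally,
# e.g. every few thousand training steps:
#   target_model.set_weights(online_model.get_weights())
```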

Prioritized Experience Replay

Sample transitions based on the magnitude of their estimation (TD) error, rather than uniformly. Watch out: the training data will now be biased towards these high-error transitions, so you need to down-weight them during training to compensate.
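
A minimal sketch of prioritized sampling with importance-sampling correction; the priority exponent alpha and the correction exponent beta are commonly used knobs assumed here, not taken from the notes.

```python
# Sample transitions in proportion to their TD-error magnitude, and return
# importance-sampling weights that down-weight over-sampled transitions.
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-3):
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()          # normalize for numerical stability
    return indices, weights

# Usage sketch: multiply each sampled transition's loss by its weight before
# averaging, so the gradient is (approximately) unbiased again.
```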

 

Architecture - Q-learning using TF-agents

arch.png

Pseudoimplementation of Deep Q-learning using the TF-Agents library

  1. Import suite_atari, AtariPreprocessing and FrameStack4. Load the environment and apply the AtariPreprocessing and FrameStack4 wrappers. See page 648
  2. Convert to TFPyEnvironment, page 648
  3. Specify a QNetwork, page 650
  4. Specify an agent, e.g. DqnAgent, page 652
  5. Create a replay buffer, e.g. tf_uniform_replay_buffer, page 654
  6. Specify Metrics, page 655
  7. Specify a collect driver, e.g. DynamicStepDriver, page 656
  8. Precollect random data and generate a tf.data Dataset object, page 658
  9. Write a training loop, page 661 (a condensed code sketch of steps 1-9 follows below)
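
The condensed sketch below strings the nine steps together, closely following the book's Breakout example on the pages cited above. Exact module paths, hyperparameters, and the constant epsilon (the book decays it with a schedule) may differ between TF-Agents versions, so treat it as an outline rather than a drop-in implementation.

```python
import tensorflow as tf
from tensorflow import keras
from tf_agents.environments import suite_atari
from tf_agents.environments.atari_preprocessing import AtariPreprocessing
from tf_agents.environments.atari_wrappers import FrameStack4
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.networks.q_network import QNetwork
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer
from tf_agents.metrics import tf_metrics
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.policies.random_tf_policy import RandomTFPolicy

# 1-2. Load the Atari environment with preprocessing + frame stacking, wrap it.
env = suite_atari.load("BreakoutNoFrameskip-v4",
                       max_episode_steps=27000,
                       gym_env_wrappers=[AtariPreprocessing, FrameStack4])
tf_env = TFPyEnvironment(env)

# 3. Q-network: scale pixels to [0, 1], then conv + dense layers.
q_net = QNetwork(
    tf_env.observation_spec(), tf_env.action_spec(),
    preprocessing_layers=keras.layers.Lambda(
        lambda obs: tf.cast(obs, tf.float32) / 255.0),
    conv_layer_params=[(32, (8, 8), 4), (64, (4, 4), 2), (64, (3, 3), 1)],
    fc_layer_params=[512])

# 4. DQN agent (constant epsilon here; the book decays it with a schedule).
train_step = tf.Variable(0)
agent = DqnAgent(
    tf_env.time_step_spec(), tf_env.action_spec(), q_network=q_net,
    optimizer=keras.optimizers.RMSprop(learning_rate=2.5e-4, rho=0.95,
                                       epsilon=1e-5, centered=True),
    target_update_period=2000,
    td_errors_loss_fn=keras.losses.Huber(reduction="none"),
    gamma=0.99, epsilon_greedy=0.01, train_step_counter=train_step)
agent.initialize()

# 5. Replay buffer storing the collected trajectories.
replay_buffer = TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=100_000)

# 6. Training metrics.
train_metrics = [tf_metrics.NumberOfEpisodes(), tf_metrics.AverageReturnMetric()]

# 7. Collect driver: runs the collect policy a few steps per training iteration.
collect_driver = DynamicStepDriver(
    tf_env, agent.collect_policy,
    observers=[replay_buffer.add_batch] + train_metrics,
    num_steps=4)

# 8. Pre-fill the buffer with a random policy, then sample it as a tf.data Dataset.
DynamicStepDriver(
    tf_env, RandomTFPolicy(tf_env.time_step_spec(), tf_env.action_spec()),
    observers=[replay_buffer.add_batch], num_steps=20_000).run()
dataset = replay_buffer.as_dataset(sample_batch_size=64, num_steps=2,
                                   num_parallel_calls=3).prefetch(3)

# 9. Training loop: collect experience, sample a batch, train the agent.
iterator = iter(dataset)
time_step = None
policy_state = agent.collect_policy.get_initial_state(tf_env.batch_size)
for iteration in range(10_000):
    time_step, policy_state = collect_driver.run(time_step, policy_state)
    trajectories, buffer_info = next(iterator)
    train_loss = agent.train(trajectories)
```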

Other useful hints

QNetwork class (external link)

qnet.png