Summary of Reinforcement learning
Lingo
In reinforcement learning, a software agent makes observations and takes actions within an environment, and in return it receives rewards.
List form
Word | Explanation or Example |
--- | --- |
Agent | For example, a program controlling a robot |
Observation | Sensor signals |
Action | Positive example: approaching the destination; negative example: wasting time |
Action Space | The set of possible actions |
Environment | The real world (or a simulation of it) |
Policy | The algorithm a software agent uses to determine its actions |
Stochastic Policy | A policy involving randomness |
Policy Search | A brute-force sweep over policies to find the "best" one |
Policy Space | The collection of all admissible policies |
Off-policy RL algorithm | Updates state and/or action values based on a different policy than the behavior policy. In Q-learning, for example, the behavior policy is typically ε-greedy, while the update policy is purely greedy. |
Genetic algorithm | Find a good policy, then let it evolve, sexually (combined with another good policy) or asexually (from itself), to find new policies |
Gradient Ascent | Like gradient descent, but searching for maxima rather than minima (e.g. maximizing rewards by tweaking policy parameters) |
Policy Gradient algorithm | If a policy is defined in such a way that it is differentiable, it can be improved using gradient ascent on the expected rewards |
OpenAI Gym | A toolkit providing many simulated environments for developing and comparing RL algorithms |
Exploration | Trying out new actions to learn more about the environment |
Exploitation | Choosing the actions that currently look best in order to collect reward |
Credit Assignment problem | The difficulty of knowing which of the preceding actions deserve credit (or blame) for a reward |
Discount Factor | How much future rewards count relative to immediate ones (γ) |
Return (of an action) | The sum of (discounted) rewards for this and the following actions |
Action Advantage | How much better or worse an action is compared to the average action in that state |
Reward | Points or penalties for one action |
Markov chains | Stochastic processes with no memory: the next state depends only on the current state |
Markov Decision Process | A process defined by states, actions and rewards, and which displays the Markov property |
Actor-Critic algorithm | Combines policy gradients (the actor) with value-function estimation (the critic) |
Terminal state | A state that ends an episode |
Q-value | Also known as the action-value: the value achieved by taking a certain action in a given state |
TD algorithm | Temporal Difference learning can be compared with Monte Carlo (MC) learning and Dynamic Programming (DP): broadly speaking, DP is breadth-first with shallow updates, MC is depth-first with deep updates, and TD is depth-first with shallow updates. Q-learning is a TD method. |
Q-learning | A TD algorithm using off-policy learning |
Approximate Q-learning | Q-learning where Q-values are approximated using function approximation |
Deep Q-Network | A network that approximates Q-values |
Deep Q-Learning | Q-learning where Q-values are approximated using deep learning for function approximation |
Catastrophic forgetting | When a network forgets what it learned from earlier experiences as it keeps training on new ones |
Replay Buffer | The database of past experiences used to train the function approximator |
Trajectory | A sequence of consecutive time steps (observations, actions, rewards) |
Episode | One run from an initial state to a terminal state |
Driver | An object (in TF-Agents) that runs a policy in an environment and passes the collected experiences to observers such as the replay buffer |
Objective
Maximize the expected cumulative (discounted) reward, i.e. the return.
Algorithms
REINFORCE algorithms
- Let the policy network play the game several times. At each step, compute the gradients but don't apply them yet.
- After several episodes, compute each action's advantage.
- If an action's advantage is positive, it was probably good: apply the gradients computed earlier to make it more likely in the future. If it is negative, apply the opposite gradients to make the action less likely in the future. In other words, multiply each gradient vector by the corresponding action's advantage.
- Compute the mean of all the resulting gradient vectors, and use it to perform a gradient descent step (a sketch follows this list).
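A minimal sketch of these steps, assuming a CartPole-style task with two actions, a tiny Keras policy network, and the old 4-value Gym step API (none of these choices are prescribed above):

```python
import numpy as np
import tensorflow as tf
import gym  # assumption: classic CartPole, old-style Gym API

env = gym.make("CartPole-v1")
model = tf.keras.Sequential([
    tf.keras.layers.Dense(5, activation="elu", input_shape=[4]),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability of action 0 (left)
])
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)
loss_fn = tf.keras.losses.binary_crossentropy

def play_one_step(env, obs):
    # Compute (but do not apply) the gradients that would make the sampled action more likely.
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = tf.random.uniform([1, 1]) > left_proba
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, info = env.step(int(action[0, 0].numpy()))  # old Gym API (4 values)
    return obs, reward, done, grads

def discounted_returns(rewards, gamma):
    returns = np.array(rewards, dtype=np.float64)
    for step in range(len(returns) - 2, -1, -1):
        returns[step] += gamma * returns[step + 1]
    return returns

for iteration in range(150):
    all_rewards, all_grads = [], []
    for episode in range(10):                       # play several episodes
        obs, rewards, grads_list = env.reset(), [], []
        for step in range(200):
            obs, reward, done, grads = play_one_step(env, obs)
            rewards.append(reward)
            grads_list.append(grads)
            if done:
                break
        all_rewards.append(rewards)
        all_grads.append(grads_list)
    # Action advantages: discounted returns, normalized across all episodes.
    all_returns = [discounted_returns(r, gamma=0.95) for r in all_rewards]
    flat = np.concatenate(all_returns)
    all_advantages = [(ret - flat.mean()) / flat.std() for ret in all_returns]
    # Weight each gradient by its action's advantage, average, then apply.
    mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        mean_grads.append(tf.reduce_mean([
            float(advantage) * all_grads[ep][step][var_index]
            for ep, advantages in enumerate(all_advantages)
            for step, advantage in enumerate(advantages)], axis=0))
    optimizer.apply_gradients(zip(mean_grads, model.trainable_variables))
```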
TD Algorithm (Temporal Difference Learning)
The TD algorithm tries to learn the optimal value function.
$$V_{k+1}(s) \leftarrow (1 - \alpha)\,V_k(s) + \alpha\left(r + \gamma\,V_k(s')\right)$$

In this equation, $\alpha$ is the learning rate, $r$ is the reward, $\gamma$ is the discount factor and $V_k(s)$ is the current estimate of the value function at state $s$. The quantity $r + \gamma\,V_k(s')$ is called the TD target, and $\delta_k(s, r, s') = r + \gamma\,V_k(s') - V_k(s)$ is called the TD error.
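As a concrete illustration, here is a tabular TD(0) sketch of this update on a toy random-walk chain; the environment, the uniformly random behavior and the hyperparameters are assumptions made for the example:

```python
import random

n_states = 7                       # states 0..6; 0 and 6 are terminal (assumed toy chain)
V = [0.0] * n_states               # value estimates V_k(s)
alpha, gamma = 0.1, 0.95

def step(s):
    """Random-walk dynamics: move left or right; reward 1 only when reaching state 6."""
    s_next = s + random.choice([-1, 1])
    reward = 1.0 if s_next == n_states - 1 else 0.0
    done = s_next in (0, n_states - 1)
    return s_next, reward, done

for episode in range(10_000):
    s, done = n_states // 2, False
    while not done:
        s_next, r, done = step(s)
        td_target = r + gamma * V[s_next]
        td_error = td_target - V[s]
        V[s] += alpha * td_error    # V(s) <- (1-alpha) V(s) + alpha (r + gamma V(s'))
        s = s_next
```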
Q-learning
The Q-learning algorithm is an adaptation of the Q-value iteration algorithm to the situation where the transition probabilities and rewards are initially unknown. Q-learning works by watching an agent play (e.g. randomly) and gradually improving its estimates of the Q-values. Once done, the optimal policy is chosen greedily.
$$Q_{k+1}(s, a) \leftarrow (1 - \alpha)\,Q_k(s, a) + \alpha\left(r + \gamma\,\max_{a'} Q_k(s', a')\right)$$
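A tabular sketch of this update, assuming a hypothetical environment object with reset() and step() methods and an ε-greedy behavior policy (the update itself stays greedy):

```python
import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=5000, alpha=0.05, gamma=0.95, epsilon=0.1):
    """env is a hypothetical object: reset() -> s, step(a) -> (s_next, r, done)."""
    Q = defaultdict(lambda: [0.0] * n_actions)      # Q_k(s, a), initialized to 0
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # Behavior policy: epsilon-greedy exploration.
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[s][a_])
            s_next, r, done = env.step(a)
            # Q(s,a) <- (1-alpha) Q(s,a) + alpha (r + gamma * max_a' Q(s', a'))
            Q[s][a] += alpha * (r + gamma * max(Q[s_next]) - Q[s][a])
            s = s_next
    return Q   # the greedy policy w.r.t. Q is then used for control
```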
Approximate Q-learning
Instead of maintaining a separate Q-value estimate for each state-action pair (s, a), we parametrize the mapping from (s, a) to Q(s, a) using a class of models, and use learning to find the best model in that class. If the model is a neural network, we usually call this Deep Q-Learning.
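A minimal Deep Q-Learning flavoured sketch of this idea: a small Keras network Q_θ(s, ·) with one output per action. The state/action dimensions and the layer sizes are assumptions for illustration:

```python
import numpy as np
import tensorflow as tf

n_state_dims, n_actions = 4, 2         # e.g. a CartPole-like task (assumed)

q_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="elu", input_shape=[n_state_dims]),
    tf.keras.layers.Dense(32, activation="elu"),
    tf.keras.layers.Dense(n_actions),  # one Q-value estimate per action
])

def greedy_action(state):
    """state: 1-D array of length n_state_dims; returns the argmax action."""
    q_values = q_model(state[np.newaxis])   # shape (1, n_actions)
    return int(tf.argmax(q_values[0]))
```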
Instability fix for Q-learning
Recall Brian Anderson's seminar last fall: there needs to be a natural order of time constants in an adaptive scheme.
Sampling rate > system dynamics > model updates > changes to physical parameters
That is, you must sample fast enough to capture the system dynamics, but you must update your parameter estimates at a slower rate than the dynamics of the system. You must also update your model parameters at a faster rate than they change in real life.
This idea shows up in the Double DQN, which uses two models: the online model selects the best actions for the next states, while a slower-moving target model estimates the Q-Values of those selected actions when forming the TD target.
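A sketch of that target computation, assuming online and target models that map a batch of states to per-action Q-values, and float32 tensors rewards and dones of shape (batch,):

```python
import tensorflow as tf

def double_dqn_targets(online_model, target_model, rewards, next_states, dones, gamma=0.99):
    # Online model chooses the argmax action for each next state ...
    next_q_online = online_model(next_states)                # (batch, n_actions)
    best_next_actions = tf.argmax(next_q_online, axis=1)
    # ... but the target model provides the Q-value of that chosen action.
    next_q_target = target_model(next_states)                # (batch, n_actions)
    mask = tf.one_hot(best_next_actions, next_q_target.shape[1])
    best_next_q = tf.reduce_sum(next_q_target * mask, axis=1)
    # TD targets; terminal transitions get no bootstrap term. gamma is an assumed discount.
    return rewards + gamma * (1.0 - dones) * best_next_q
```

The target model is usually a clone of the online model whose weights are copied over only every few thousand training steps, which provides the slower time constant discussed above.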
Prioritized Experience Replay
Sample experiences based on the magnitude of their estimation (TD) error, rather than uniformly. Watch out: the training data is now biased towards these high-error transitions, so you need to down-weight them during training (importance-sampling weights).
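A sketch of proportional prioritized sampling with importance-sampling weights; the exponents zeta (priority strength) and beta (bias correction) are assumed hyperparameters:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, zeta=0.6, beta=0.4):
    # Priorities proportional to the magnitude of the TD error (plus a small constant).
    priorities = (np.abs(td_errors) + 1e-6) ** zeta
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(td_errors), size=batch_size, p=probs)
    # Down-weight over-sampled transitions so training is unbiased again.
    weights = (len(td_errors) * probs[indices]) ** (-beta)
    weights /= weights.max()
    return indices, weights
```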
Architecture - Q-learning using TF-agents
Pseudo-implementation of Deep Q-Learning using the TF-Agents library
- Import suite_atari, AtariPreprocessing and FrameStack4. Load the environment with FrameStack4 and AtariPreprocessing applied (page 648).
- Convert it to a TFPyEnvironment (page 648).
- Specify a QNetwork (page 650).
- Specify an agent, e.g. DqnAgent (page 652).
- Create a replay buffer, e.g. a TFUniformReplayBuffer from tf_uniform_replay_buffer (page 654).
- Specify metrics (page 655).
- Specify a collect driver, e.g. DynamicStepDriver (page 656).
- Precollect random data and create a tf.data dataset object (page 658).
- Write a training loop (page 661).
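A condensed sketch of this pipeline (metrics omitted), assuming the standard TF-Agents class locations and made-up hyperparameters such as the learning rate, buffer size and iteration counts; see the referenced pages for the full version:

```python
import tensorflow as tf
from tf_agents.environments import suite_atari
from tf_agents.environments.atari_preprocessing import AtariPreprocessing
from tf_agents.environments.atari_wrappers import FrameStack4
from tf_agents.environments.tf_py_environment import TFPyEnvironment
from tf_agents.networks.q_network import QNetwork
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.replay_buffers.tf_uniform_replay_buffer import TFUniformReplayBuffer
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver
from tf_agents.policies.random_tf_policy import RandomTFPolicy

# 1. Load the Atari environment with preprocessing and frame stacking.
env = suite_atari.load(
    "BreakoutNoFrameskip-v4",
    gym_env_wrappers=[AtariPreprocessing, FrameStack4])
tf_env = TFPyEnvironment(env)

# 2. Q-network: normalize pixels, then conv + dense layers, one output per action.
q_net = QNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    preprocessing_layers=tf.keras.layers.Lambda(
        lambda obs: tf.cast(obs, tf.float32) / 255.0),
    conv_layer_params=[(32, (8, 8), 4), (64, (4, 4), 2), (64, (3, 3), 1)],
    fc_layer_params=[512])

# 3. DQN agent (the target network is handled internally by the agent).
agent = DqnAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=2.5e-4),
    target_update_period=2000,
    train_step_counter=tf.Variable(0))
agent.initialize()

# 4. Replay buffer storing the collected trajectories.
replay_buffer = TFUniformReplayBuffer(
    data_spec=agent.collect_data_spec,
    batch_size=tf_env.batch_size,
    max_length=100_000)

# 5. Collect driver: runs the collect policy and writes to the replay buffer.
collect_driver = DynamicStepDriver(
    tf_env, agent.collect_policy,
    observers=[replay_buffer.add_batch],
    num_steps=4)

# 6. Precollect experience with a random policy, then build a tf.data pipeline.
initial_driver = DynamicStepDriver(
    tf_env,
    RandomTFPolicy(tf_env.time_step_spec(), tf_env.action_spec()),
    observers=[replay_buffer.add_batch],
    num_steps=20_000)
initial_driver.run()

dataset = replay_buffer.as_dataset(
    sample_batch_size=64, num_steps=2, num_parallel_calls=3).prefetch(3)
iterator = iter(dataset)

# 7. Training loop: collect a few steps, sample a batch, train the agent.
time_step = None
policy_state = agent.collect_policy.get_initial_state(tf_env.batch_size)
for iteration in range(50_000):
    time_step, policy_state = collect_driver.run(time_step, policy_state)
    trajectories, buffer_info = next(iterator)
    train_loss = agent.train(trajectories)
```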